- What's geneid?
- Main features
- Examples
- Training geneid
- Accuracy
- Gene predictions on genomes
- Speed
- Source code distribution
- geneid parameter files
- geneid web server
- If you encounter problems ...
- References
- Authors and acknowledgements
geneid is a program to predict genes in anonymous genomic sequences
designed with a hierarchical structure. In the first step, splice sites, start
and stop codons are predicted and scored along the sequence using
Position Weight Arrays (PWAs). In the second step, exons are built from the sites.
Exons are scored as the sum of the scores of the defining sites, plus the
log-likelihood ratio of a Markov Model for coding DNA. Finally, from the set of predicted exons,
the gene structure is assembled, maximizing the sum of the scores of the assembled exons. geneid
can also integrate predictions from multiple sources via external gff
files, and the general gene structure or model can be redefined.
The accuracy of geneid compares favorably to that of other existing tools, and
geneid is likely more efficient in terms of speed and memory usage.
Currently, geneid v1.2 analyzes the whole human genome in 3 hours (approx. 1 Gbp / hour)
on an Intel(R) Xeon CPU at 2.80 GHz.
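The hierarchy described above can be illustrated with a toy sketch. It is not geneid's implementation: the `Exon` class, the `assemble` function and all scores are hypothetical, site and coding scores are assumed to be already computed, and reading frame, strand and gene-model constraints are ignored. Each exon is scored as the sum of its site scores plus its coding score, and a simple dynamic program picks the chain of non-overlapping exons with the highest total score.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    start: int           # first position of the exon (inclusive)
    end: int             # last position of the exon (inclusive)
    site_scores: float   # sum of the scores of the two defining sites
    coding_score: float  # log-likelihood ratio from the coding model

    @property
    def score(self) -> float:
        # Exon score = site scores + coding score, as in the text above.
        return self.site_scores + self.coding_score


def assemble(exons):
    """Return (best_total, chain) for the best chain of non-overlapping exons."""
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)   # best[i]: best total using the first i exons
    take = [False] * (n + 1)  # take[i]: whether exon i is part of best[i]
    prev = [0] * (n + 1)      # prev[i]: prefix length to continue from
    for i, exon in enumerate(ordered, start=1):
        # Count earlier exons that end strictly before this exon starts.
        p = bisect_left(ends, exon.start, 0, i - 1)
        with_exon = best[p] + exon.score
        if with_exon > best[i - 1]:
            best[i], take[i], prev[i] = with_exon, True, p
        else:
            best[i] = best[i - 1]
    # Walk the back-pointers to recover the chosen exons.
    chain = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    return best[n], chain[::-1]


# Made-up candidate exons, for illustration only.
candidates = [
    Exon(1, 10, 2.0, 3.0),
    Exon(5, 20, 1.0, 1.0),
    Exon(15, 30, 4.0, 0.5),
    Exon(25, 40, 1.0, 9.0),
]
total, chain = assemble(candidates)
print(total, [(e.start, e.end) for e in chain])  # 15.0 [(1, 10), (25, 40)]
```

Sorting by end coordinate lets each exon look up its best compatible predecessor with a binary search; the real program additionally enforces the rules of the gene model when chaining exons.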
- geneid accuracy is comparable to that of other existing
"ab initio" gene prediction tools.
- geneid is very efficient in terms of speed and memory usage. In
practice, geneid can analyze chromosome-size sequences at a rate of
about 1 Gbp per hour on an Intel(R) Xeon CPU at 2.80 GHz. For the largest
human chromosome (chr1), it requires 1/2 Gbyte of RAM plus the size of the Fasta
sequence.
- geneid offers support to integrate predictions from multiple sources
(ESTs, blast HSPs) and to reannotate genomic sequences, via external gff files
together with the redefinition of the "gene model".
- geneid output can be customized to different levels of
detail, including exhaustive listing of potential signals and exons. Furthermore,
several output formats, such as gff or XML, are available.
- Parameter files are available in geneid v 1.2 for Drosophila melanogaster,
human (which can also be used for vertebrate genomes), Dictyostelium discoideum
and Tetraodon nigroviridis (which can be used for Fugu rubripes), among many others for species spanning the four "classical" kingdoms. The other currently available parameter files are listed in the section "geneid parameter files".
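Since gff is mentioned above both as an input for external evidence and as an output format, a small reader can be handy for post-processing. The sketch below assumes the general nine-column, tab-separated GFF layout; the `GffRecord` type, the `parse_gff_line` function and the sample line are made up for illustration and are not taken from actual geneid output, so the exact fields should be checked against the geneid documentation.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    group: str


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one tab-separated GFF line; return None for comments and blanks."""
    line = line.rstrip("\r\n")
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    return GffRecord(
        seqname=fields[0],
        source=fields[1],
        feature=fields[2],
        start=int(fields[3]),
        end=int(fields[4]),
        score=None if fields[5] == "." else float(fields[5]),  # "." means no score
        strand=fields[6],
        frame=fields[7],
        group=fields[8] if len(fields) > 8 else "",  # the last column is optional
    )


# Hypothetical record, not real geneid output.
sample = "chr1\tgeneid_v1.2\tInternal\t1200\t1350\t4.72\t+\t0\tgene_1"
record = parse_gff_line(sample)
print(record.feature, record.start, record.end, record.score)
```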
SAMPLES:
FORMATS:
In order to build a parameter file for geneid it is necessary to "train" the program; parameter configurations already exist for a number of eukaryotic species. Training basically consists of computing position weight matrices (PWMs) or Markov models for the splice sites and start codons, and deriving a model for coding DNA (generally a Markov model of order 4 or 5). The basic requirements for a training set are an annotation file (preferably in geneid gff format) and a set of fasta sequences corresponding to the gene models in the annotation file.
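To make the idea of computing a position weight matrix concrete, here is a minimal sketch. It is not the geneid training procedure: the function names `build_pwm` and `score_site`, the uniform background, the pseudocount of 1 and the four example sites are all assumptions chosen for illustration. Each matrix cell holds the log-odds of observing a base at a position in real sites versus the background.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    if not sites:
        raise ValueError("at least one site is required")
    if background is None:
        background = {base: 0.25 for base in BASES}  # assumed uniform background
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        # The pseudocount avoids log(0) for bases never seen at this position.
        pwm.append({
            base: math.log(((counts[base] + pseudocount) / total) / background[base])
            for base in BASES
        })
    return pwm


def score_site(pwm, site):
    """Score a candidate site as the sum of per-position log-odds."""
    if len(site) != len(pwm):
        raise ValueError("site length does not match the matrix")
    return sum(column[base] for column, base in zip(pwm, site))


# Made-up aligned sites, for illustration only.
training_sites = ["GTAAG", "GTGAG", "GTAAG", "GTATG"]
pwm = build_pwm(training_sites)
print(score_site(pwm, "GTAAG") > score_site(pwm, "CCCCC"))  # True
```

A sequence resembling the training sites gets a positive score and an unrelated one a negative score, which is the property the site prediction step relies on.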
As few as 100 gene models can be enough to build a reasonably accurate geneid parameter file, but a user would generally want as many sequences as possible (> 500) to build an optimally accurate matrix and to be able to set aside some of the gene models for testing purposes (see training document).
If a user wants to evaluate the accuracy of the newly developed parameter file, she will also require an annotation file and fasta files corresponding to the sequences in the evaluation set. However, if a user only has a limited number of gene models to train geneid with (generally < 500 sequences), she can use a "leave-one-out" strategy for evaluating the accuracy (more information in the training tutorial).
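The "leave-one-out" idea can be sketched generically: each gene model is held out once, the remaining ones are used for training, and the held-out one is used for evaluation. The `leave_one_out` function and the toy `train` and `evaluate` callables below are hypothetical stand-ins, not part of geneid; the actual protocol is the one described in the training tutorial.

```python
def leave_one_out(items, train, evaluate):
    """Train on all items but one, evaluate on the held-out item, for every item."""
    results = []
    for i, held_out in enumerate(items):
        training_set = items[:i] + items[i + 1:]
        model = train(training_set)
        results.append(evaluate(model, held_out))
    return results


# Toy stand-ins: the "model" is the mean of the training values and the
# "evaluation" is the absolute error on the held-out value.
def train(values):
    return sum(values) / len(values)


def evaluate(model, value):
    return abs(model - value)


errors = leave_one_out([100, 200, 300], train, evaluate)
print(errors)  # [150.0, 0.0, 150.0]
```

Every item is used for testing exactly once, so no gene model has to be permanently set aside, at the cost of one training run per item.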
The user can go through an example of a typical geneid "training" protocol (Training geneid for the parasite Perkinsus marinus) by following this tutorial.
This link contains the set of predicted genes
using geneid on the recently sequenced genomes (Drosophila melanogaster, Homo sapiens, Mus musculus,
Fugu rubripes or Dictyostelium discoideum) for some of their most common releases.
Because of the lack of well annotated large genomic sequences, it is
difficult to assess the accuracy of "ab initio" gene finders. We have
attempted to analyze the accuracy of geneid in a number of different
sets. We believe that in the analysis of large genomic sequences geneid may be superior to other existing tools. A side by side comparison with
genscan can be found here.
The benchmark sequence is human chromosome 1 (239 Mb), extracted from the goldenPath-UCSC assembly
(July 2003 release):
Computer: Intel(R) Xeon CPU 2.80 GHz, 4 GB RAM
CPU / real time: 1025 / 1045 secs
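As a quick check, the benchmark figures can be converted into a throughput. The helper name `throughput_gbp_per_hour` is made up for this example; the inputs are the sequence length and CPU time quoted above.

```python
def throughput_gbp_per_hour(sequence_bp, cpu_seconds):
    """Convert a sequence length and a CPU time into Gbp processed per hour."""
    return sequence_bp / cpu_seconds * 3600 / 1e9


# Human chromosome 1: 239 Mb analyzed in 1025 CPU seconds.
rate = throughput_gbp_per_hour(239e6, 1025)
print(f"{rate:.2f} Gbp/hour")  # 0.84 Gbp/hour
```

This is in line with the rate of about 1 Gbp per hour quoted elsewhere on this page.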
geneid distributions contain several directories and files compressed in
a tar.gz file. Source code and documentation files are included in the distribution,
as well as several parameter files and other extra information.
All of the files can be obtained from our ftp server:
Cumulative change log: ChangeLog
geneid v 1.4.4 (current development version):
-
geneid v 1.4.4 full distribution: source code and documentation
(documentation does not yet reflect new features; for help, type geneid -h)
[DOWNLOAD]
Note: Please, verify the check-sum file value
Type: md5sum geneid_v1.4.4.Jan_13_2011.tar.gz
-> 05c00f283a8fa996418aff0bc8db1c6d
geneid v 1.3 preview release 3 (version used for NGASP phase II category 4):
-
geneid v 1.3 full distribution: source code and documentation
(documentation does not yet reflect new features; for help, type geneid -h)
[DOWNLOAD]
Note: Please, verify the check-sum file value
Type: md5sum geneid_v1.3.Mar_30_2007.tar.gz
-> 10cad4e6ae25a57fcc6bb062692626ae
geneid v 1.3 preview release 1 (version used for NGASP phase I category 1):
-
geneid v 1.3 full distribution: source code and documentation
(documentation does not yet reflect new features; for help, type geneid -h)
[DOWNLOAD]
Note: Please, verify the check-sum file value
Type: md5sum geneid_v1.3.Dec_21_2006.tar.gz
-> 1ff0f870e5ec5a553e4603102a9d7c62
geneid v 1.2:
-
geneid v 1.2 full distribution: source code and documentation
[DOWNLOAD]
Note: Please, verify the check-sum file value
Type: md5sum geneid_v1.2.March_1_2005.tar.gz
-> 6f350210ead7e49ac76be1fd17ef91f9
-
geneid v 1.2 Solaris 64-bits distribution
(Makefiles optimized by Mithun Sridharan, Sun Microsystems GmbH)
[FULL VERSION - DOWNLOAD]
[BINARY FILE]
-
geneid v 1.2 Linux binary (gcc version 3.3.1)
[DOWNLOAD]
-
geneid v 1.2 documentation (HTML)
[READ]
Instructions to install geneid on your computer.
Old releases:
geneid v 1.1:
-
geneid v 1.1 full distribution: source code and documentation
[DOWNLOAD]
-
geneid v 1.1 Linux binary (gcc version 2.95 19990728)
[DOWNLOAD]
-
geneid v 1.1 documentation (HTML)
[DOWNLOAD]
[READ]
geneid v 1.0:
- geneid v 1.0 full distribution: source code and documentation
[DOWNLOAD]
- geneid v 1.0 binary files for some architectures
Linux,
SGI and
Solaris.
- geneid v 1.0 documentation (PostScript)
[DOWNLOAD]
geneid v 1.0 (Parallel version):
-- Requires UNIX/LINUX pthreads library --
- geneid Parallel full distribution: source code and documentation
[DOWNLOAD]
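The check-sum verification suggested for the distributions above is normally done with the md5sum command. Where that command is not available, an equivalent check can be sketched with Python's standard hashlib module; the function names `md5_of_file` and `verify_checksum` are made up for this example.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected):
    """Compare a file's MD5 digest with the published value, ignoring case."""
    return md5_of_file(path) == expected.strip().lower()
```

For the v 1.2 distribution, `verify_checksum("geneid_v1.2.March_1_2005.tar.gz", "6f350210ead7e49ac76be1fd17ef91f9")` should return True if the downloaded file is intact.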
geneid has been trained on several species and it is being trained
on other genomes as well. See this help
for more details about the different parts of parameter files as well as their statistical meaning.
- - The parameter files for geneid v 1.2 are not compatible with previous versions - -
- - The parameter files for geneid v 1.3 and 1.4 are not backward-compatible with previous versions; however, version 1.2 parameter files ARE forward-compatible with versions 1.3 and 1.4 - -
List of available parameter files (geneid v 1.3 and 1.4 ):
- Homo sapiens (suitable for vertebrates) (UPDATED - January 2nd, 2007)
- Drosophila melanogaster (suitable for fly and mosquito) (UPDATED - January 2nd, 2007)
- Acyrthosiphon pisum (This version of the aphid parameter file detects GC donors and requires
geneid v 1.3 and above )
List of available ANIMAL parameter files (geneid v 1.2 and above ):
List of available PROTIST parameter files (geneid v 1.2 and above ):
List of available PLANT parameter files (geneid v 1.2 and above ):
List of available FUNGI parameter files (geneid v 1.2 and above ):
List of available parameter files for OLDER VERSION OF GENEID (geneid v 1.1 ):
A geneid web server is available to submit sequences over the Internet. There is no limit to the length of the submitted sequence, other than that imposed by the Internet (except when plotting is required).
If you encounter problems using geneid, or have suggestions on how to improve it, send an e-mail to
geneid@crg.es
- E. Blanco, G. Parra and R. Guigó,
"Using geneid to Identify Genes.",
In A. Baxevanis, editor:
Current Protocols in Bioinformatics. Unit 4.3.
John Wiley & Sons Inc., New York (2002) (in press)
- E. Blanco, G. Parra, S. Castellano, J.F. Abril,
M. Burset, X. Fustero, X. Messeguer and R. Guigó
"Gene Prediction in the Post-Genomic Era."
IX th ISMB (Poster), Copenhagen, Denmark (2001)
- G. Parra, E. Blanco, and R. Guigó,
"Geneid in Drosophila",
Genome Research 10(4):511-515 (2000).
- R. Guigó,
"Assembling genes from predicted exons in linear time with dynamic programming",
Journal of Computational Biology, 5:681-702 (1998).
- R. Guigó, S. Knudsen, N. Drake, and T. F. Smith,
"Prediction of gene structure",
Journal of Molecular Biology, 226:141-157 (1992).
The current version of geneid has been written by
Enrique Blanco,
Tyler Alioto and
Roderic Guigó.
The parameter files have been constructed by
Genis Parra,
Tyler Alioto
and Francisco Camara.
With contributions from Josep F. Abril, Moises Burset and Xavier Messeguer.
This training tutorial document was prepared by:
Francisco Camara.
Copyright © 2002
geneid is distributed under the GNU General Public License.