
"Geneid in Drosophila."
G. Parra, E. Blanco, and R. Guigó.
Genome Research 10(4):511515 (2000).
In this site we describe all the data used used for geneid "training".
Data Set


We used the set of 275 multiexon genes and the set of 141 singleexon sequences provided by Martin Reese:
We have randomly embedded the sequences in the MRset in a background of artificial random intergenic DNA.
This new sequence of 5,689,209 bp was used to assess the accuracy of the predictions:
Positional Weight Matrices & Coding statistic


 Positional Weight Matrices:
 From the data set we obtain the following PWM :
 Coding statistic:
 From the data set we obtain this matrices from Markov Model of order five :
 All this parameters are also in the param.default file,
included in the last version of geneid (freely available).
geneid Predictions in Adh Region


Here you can download the predictions of the final version of geneid (v 3.0 EvoI) in gff format. For more information about how geneid works see the PAPER, submmited to Genome Research.
Example of geneid annotation


GeneId predictions obtained in the region 462500477500 from the Adh sequence, compared with the annotation in the standard std3 set.
In a first step, GeneId identifies all possible donor (blue) and acceptor (yellow) sites, start codons (green) and stop codons (red) using Positions Weight Matrices (Sites file).
In the second step, GeneId builds all exons compatible with this site. Exons are scored as the sum of the scores of the defining sites, plus the score of their potential measured according with a Markov Model of order five.
In the figure, the coding potential is displayed along the DNA sequence, regions strong in red are more likely to be coding than region strong in blue (Markov vector file).
The final gene structure is generated maximizing the sum of the scores of the assembled exons.
The predicted exons for this sequence can be found (here) and the Std3 annotation for this region (here).
The figure has been obtained using the gff2ps program.

