A side-by-side comparison of geneid and genscan
Because of the lack of well annotated large genomic sequences, it is
difficult to assess the accuracy of "ab initio" gene finders. We have
attempted to analyze the accuracy of geneid on a number of different
data sets. On this page, we present the results alongside those of
genscan.
Accuracy measures are the usual ones (see the page
Accuracy of Gene Identification Programs on this server for a description).
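As a reminder of what the column headers in the tables stand for, the following is a minimal sketch of how these measures are commonly computed in gene-finder evaluation. It assumes the conventional definitions (Sp is the fraction of predictions that are correct, CC is the correlation coefficient over per-base counts, ME and WE are the fractions of missing and wrong exons); the page referenced above remains the authoritative description. The function names and the representation of exons as (start, end) tuples are illustrative assumptions, not part of geneid or genscan.

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Nucleotide-level Sn, Sp and CC from per-base counts.

    tp: coding bases predicted as coding
    fp: non-coding bases predicted as coding
    tn: non-coding bases predicted as non-coding
    fn: coding bases predicted as non-coding
    Assumes every count combination in the denominators is non-zero.
    """
    sn = tp / (tp + fn)  # share of real coding bases that were found
    sp = tp / (tp + fp)  # share of predicted coding bases that are real
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom
    return sn, sp, cc


def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_accuracy(actual, predicted):
    """Exon-level Sn, Sp, SnSp, ME and WE.

    An exon counts as correct only if both boundaries match exactly.
    Assumes both exon lists are non-empty.
    """
    actual, predicted = set(actual), set(predicted)
    correct = actual & predicted
    sn = len(correct) / len(actual)
    sp = len(correct) / len(predicted)
    # Missing exons: real exons that no predicted exon overlaps at all.
    missing = [a for a in actual if not any(overlaps(a, p) for p in predicted)]
    # Wrong exons: predicted exons that overlap no real exon at all.
    wrong = [p for p in predicted if not any(overlaps(p, a) for a in actual)]
    return {
        "Sn": sn,
        "Sp": sp,
        "SnSp": (sn + sp) / 2,
        "ME": len(missing) / len(actual),
        "WE": len(wrong) / len(predicted),
    }
```

Note that a predicted exon that overlaps a real one without matching both boundaries is neither correct nor wrong under these definitions, which is why exon Sn and ME (or Sp and WE) do not add up to 1 in the tables.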
ACCURACY IN DIFFERENT DATA SETS
- h178 This is a set of well annotated human
sequences, the sort of dataset on which genefinders are typically
evaluated. This set contains only single gene sequences, is biased
towards short genes, and is likely to have been included in the
training sets of genefinders. Therefore, it is unclear how well results
on this set can be extrapolated to large genomic sequences. genscan
outperforms geneid on this set, although geneid tends to predict fewer
false positives.
         nucleotide        exon
         Sn    Sp    CC    Sn    Sp    SnSp  ME    WE
geneid   0.89  0.90  0.89  0.66  0.74  0.70  0.18  0.08
genscan  0.97  0.86  0.91  0.83  0.75  0.79  0.06  0.14
- h178Art To simulate large genomic sequences, the single gene
sequences in h178 have been embedded in simulated intergenic
DNA. Thus, the 178 single gene sequences have been collapsed into 42
multigene sequences of about 200 Kb each. Although preliminary results
seem to indicate that the actual accuracy of genefinders can be better
estimated on this data set than on the original set of single gene sequences,
it is unclear how realistic this procedure is, and therefore how well
results on this set can be extrapolated to large genomic sequences.
While genscan outperformed geneid on the set of single gene sequences,
geneid outperforms genscan here. genscan, however, is still superior in
sensitivity.
         nucleotide        exon
         Sn    Sp    CC    Sn    Sp    SnSp  ME    WE
geneid   0.89  0.79  0.84  0.67  0.61  0.64  0.18  0.25
genscan  0.94  0.64  0.78  0.68  0.45  0.57  0.11  0.42
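The embedding step used to build h178Art can be pictured with the sketch below. It is only an illustration: this page does not say how the intergenic DNA was simulated, so the uniform random bases, the fixed spacer length, and the function name embed_genes are all assumptions. The part that matters in any such procedure is keeping track of each gene's offset, so that the original annotation coordinates can be shifted into the multigene sequence.

```python
import random


def embed_genes(genes, spacer_len, seed=0):
    """Join single-gene sequences with random spacer DNA between them.

    Returns the combined sequence and, for each gene, its 0-based start
    offset in that sequence. Uniform random bases are a placeholder for
    "simulated intergenic DNA" (an assumption, not the original method).
    """
    rng = random.Random(seed)
    parts, offsets, pos = [], [], 0
    for gene in genes:
        spacer = "".join(rng.choices("ACGT", k=spacer_len))
        parts.append(spacer)
        pos += spacer_len
        offsets.append(pos)  # the gene starts right after this spacer
        parts.append(gene)
        pos += len(gene)
    # Trailing spacer so the last gene is also flanked on both sides.
    parts.append("".join(rng.choices("ACGT", k=spacer_len)))
    return "".join(parts), offsets


def shift_exons(exons, offset):
    """Move (start, end) exon coordinates of one gene into the combined sequence."""
    return [(start + offset, end + offset) for start, end in exons]
```

With the offsets in hand, an exon annotated at (start, end) in a single gene sequence maps to (start + offset, end + offset) in the multigene sequence, which is what makes it possible to score predictions on the artificial sequences against the original annotation.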
- h800 This is a set of 800 human single gene sequences present in
EMBL release 56 but not in EMBL release 50. Therefore, these
sequences are less likely to have been included in the training sets
of current genefinders (including the current version of geneid). It
is unclear, however, how well annotated these sequences are. In
particular, they may include unannotated genes, and genscan may have
already been used to annotate some of them. genscan and geneid
perform comparably: genscan is more sensitive, while geneid produces
fewer false positives.
         nucleotide        exon
         Sn    Sp    CC    Sn    Sp    SnSp  ME    WE
geneid   0.89  0.82  0.85  0.59  0.63  0.61  0.19  0.18
genscan  0.95  0.77  0.85  0.71  0.58  0.64  0.09  0.28
- h800Art The 800 human sequences above, embedded in simulated intergenic DNA and collapsed into 195 multigene sequences.
geneid outperforms genscan on this set.
         nucleotide        exon
         Sn    Sp    CC    Sn    Sp    SnSp  ME    WE
geneid   0.90  0.76  0.83  0.60  0.52  0.56  0.19  0.32
genscan  0.93  0.61  0.77  0.62  0.36  0.49  0.12  0.50
- SGR-C1 This is a set of 25 real genomic sequences of about 100 Kb each from chromosome 1, annotated at the Sanger Center. It is unclear how exhaustive the annotation is, and it is likely that genscan has been used to obtain it. genscan slightly outperforms geneid on this set.
         nucleotide        exon
         Sn    Sp    CC    Sn    Sp    SnSp  ME    WE
geneid   0.81  0.34  0.57  0.56  0.34  0.45  0.28  0.57
genscan  0.91  0.33  0.61  0.68  0.29  0.48  0.15  0.64
- Chromosome 22 We ran geneid on the sequence
of chromosome 22 and compared the predictions with two different
annotations. The released genscan predictions were also compared
with these two sets of annotations. The completeness of the annotation
of the chromosome 22 sequence is unclear, and genscan may have
already been used to annotate it. geneid and genscan perform
comparably on this set.
annotation 1
         nucleotide        exon
         Sn    Sp    CC    Sn    Sp    SnSp  ME    WE
geneid   0.23  0.30  0.25  0.56  0.22  0.39  0.19  0.66
genscan  0.23  0.31  0.26  0.58  0.19  0.38  0.14  0.69
annotation 2
         nucleotide        exon
         Sn    Sp    CC    Sn    Sp    SnSp  ME    WE
geneid   0.27  0.43  0.34  0.53  0.32  0.43  0.21  0.50
genscan  0.28  0.45  0.35  0.54  0.28  0.41  0.17  0.55