GENEID IN DROSOPHILA
G
EN´IS P
ARRA, E
NRIQUE B
LANCO AND R
ODERIC G
UIGÓ
{gparra,eblanco,rguigo}@imim.es
geneid is a program to predict genes in anonymous genomic sequences
designed with a hierarchical structure. In the first step, splice sites, start
and stop codons are predicted and scored along the sequence using
Position Weight Matrices (PWMs). In the
second step, exons are built from the sites. Exons are scored as the
sum of the scores of the defining sites, plus the the log-likelihood
ratio of a Markov Model for coding DNA.
In the last step, from the set of predicted exons,
the gene structure is assembled, maximizing the sum of
the scores of the assembled exons. In this paper we describe the
obtention of PWMs for sites, and the Markov Model of coding DNA
in Drosophila Melanogaster. We also compare other models of coding
DNA with the Markov Model. Finally, we present and discuss the
the results obtained when geneid is used to predict genes in the Adh
region. These results show that the accuracy of geneid ppredictions
compares currently with that of other existing tools, but that geneid is likely to
be more efficient in terms of speed and memory usage.
geneid is available at http://www1.imim.es/
eblanco/GeneId.
1. Introduction
geneid (Guigó et al., 1992) was one of the first programs to predict full
exonic structures of vertebrate genes in anonymous DNA sequences. geneid was designed with a hierarchical structure: First, gene defining
signals (splice sites, start and stop codons) were predicted along the
query DNA sequence. Next, potential exons were constructed from these
sites, and finally the optimal scoring gene prediction was assembled
from the exons. In the original geneid the scoring function to optimize
was rather heuristic: the sequence sites were predicted and scored
using Frequency Matrices (PWMs), a number of coding statistics
were computed on the predicted exons, and each exon was scored as a
function of the scores of the exon defining sites and of the coding
statistics. To estimate the coefficients of this function a neural
network was used. An exhaustive search of the space of possible gene
assemblies was performed to rank predicted genes according with an
score obtained through a complex function of the scores of the
assembled exons.
During the past years geneid had some usage, mostly through a now
nonfunctional e-mail server at Boston University (
geneid@darwin.bu.edu) and through a WWW server at the Institut
Municipal d'Investigació Mèdica (
http://www1.imim.es/geneid.html). During this period, however, there
has been substantial developments in the field of computational gene
identification see Claverie, 1997, and
Burge and Karlin, 1998 for recent reviews, and the original geneid has
became clearly inferior to other existing tools. Therefore, we started
some time ago developing an improved version of the geneid program, at
least as accurate as other existing tools, but much more efficient
handling very large genomic sequences both in terms of speed and usage
of memory. This new version maintains the hierarchical structure
(signal to exon to gene) in the original geneid, but we have simplified
the scoring schema and furnished it with a probabilistic meaning:
Scores for both, exons defining signals and protein coding potential
are computed as log-likelihood ratios, which for a given predicted
exon are summed up into the exon score, in consequence also a
log-likelihood ratio. Then, a dynamic programming algorithm
(Guigó, 1998) is used to search the space of predicted exons to
assemble the gene structure (in the general case, multiple genes in
both strands) maximizing the sum of the scores of the assembled exons,
which can also be assumed to be a log-likelihood ratio. Execution time
in this new version of geneid grows linearly with the size of the
input sequence, currently at about two MegaBases per minute in a
Pentium III (500 Mhz) running linux. The amount of memory required is
also proportional to the length of the sequence, about one MegaByte
per MegaBase plus a constant amount of about 15 MegaBytes,
irrespective of the length of the sequence. In the practice, thus,
geneid is able to analyze sequences of virtually any length, for
instance chromosome size sequences.
In this paper we describe the ``training '' of geneid to predict genes
in the genome of Drosophila melanogaster. In the context of
geneid ``training'' means essentially computing PWMs for splice sites
and start codons, and deriving a model of coding DNA, which, in this
case, is a Markov Model of order five similar to the models introduced
by Borodovsky and McIninch (1993). Thus, in the next sections, we first
describe the training data set used, in particular our attempt to
recreate a more realistic scenario to train and test geneid by
generating semi-artificial large genomic contigs from single-gene DNA
sequences, and we briefly describe the main features of geneid for
D. melanogaster. Then, we present the results obtained in the training data set
when different schemas are used to compute scores for sites and coding
potential, and the results obtained on the D. melanogaster Adh region when the
optimal scoring schema in the training set is used to predict genes in
this region. Finally, we describe the reasons for the poor
performance of geneid in GASP1, and we discuss future developments.
As a set of known D. melanogaster gene encoding sequences, we have merged
the sets of 275 multi- and 141 single-exon sequences provided by
Martin Reese
( ftp://genome.lbl.gov/pub/genesets/Drosophila/) into
a unique data set, the MR-set. From the MR-set we inferred PWMs for
splice sites and start codons, and the Markov Model of order five for
coding regions. The MR-set contains only single gene sequences. To
assess the accuracy of the predictions in a more realistic scenario,
we have randomly embedded the sequences in the MR-set
in a background of artificial random intergenic DNA
as described in Guigó et al. (2000). Thus, a single sequence of
5,689,206 bp embedding the 416 genes in the MR-set has been used to
evaluate the accuracy of the predictions. The sequence, and the
coordinates of the embedded exons are available at http://www1.imim.es/
gparra/GASP1.
As outlined, geneid for D. melanogaster uses PWMs to predict potential splice
sites, and start codons. Potential sites are scored as log-likelihood
ratios. From the set of predicted sites (which includes, in addition,
all potential stop codons), the set is built of all potential
exons. Exons are scored as the sum of the scores of the defining
sites, plus the log-likelihood ratio of the Markov Model for coding
sequences. Finally, the gene structure is assembled from the set of
predicted exons, maximizing the sum of the scores of the assembled
exons. The procedure is illustrated in Figure 1, which shows
the geneid predictions in a small region of the Adh sequence.
Actual splice sites, and start codons were extracted from the
MR-set.
The MR-set contains 757 donor sites. From them, a frequency matrix
was derived from position -3 to +6 around the exon-intron
boundary, being the position 0 the first position in the intron.
is the probability of observing
nucleotide
(
) at position
(
) in an actual donor site. The positional frequency
of
nucleotides in the region -3 to +6 around all dinucleotides GT was
also computed (being the position 0, the position corresponding to the
nucleotide G in the GT dinucleotides.) Then, a Position Weight Matrix
for donor sites,
, was calculated as
 |
(1) |
PWMs for Acceptor sites,
, and Start codons,
, were obtained in
a similar way. These matrices can be obtained from http://www1.imim.es/
gparra/GASP1.
PWMs can be used to score each potential donor site (GT),
acceptor site (AG) and start codon (ATG) along a given sequence.
The score of a potential donor site
within the
sequence is computed as
 |
(2) |
This is the log-likelihood ratio of the probability of
observing this particular sequence
in an actual site versus
the probability of observing
in any false GT site.
Similar scores are computed for Acceptor Sites (
)
and Start Codons (
)
geneid distinguishes four types of ``exons'':
- Initial
- ORFs defined by an start codon and a donor site.
- Internal
- ORFs defined by an acceptor site and a donor site.
- Terminal
- ORFs defined by an acceptor site and a stop codon.
- Single
- ORFs defined by an start codon and a stop codon. This
corresponds to intronless genes.
geneid constructs all potential exons which are compatible with the
predicted sites. (Actually, only the five highest scoring donor sites
within frame are considered for each start codon and acceptor site).
All exon and intron sequences were extracted from the MR multi-exon data set.
A Markov Model of order five was estimated to model both exon, and
intron sequences. That is, we estimated the probability distribution
of each nucleotide given the pentanucleotide preceding it in
exon and intron sequences.
From the exon sequences we estimated this probability for each of the 3
possible frames, building the transition probability
matrices
,
,
.
is the observed probability of finding
hexamer
with
in codon position j,
given that pentamer
is with
in codon position j.
An initial probability matrix,
, was estimated from the
observed pentamer frequencies at each codon position.
From the intron sequences a single transition matrix was computed
, as well as a single initial probability matrix,
.
Then, for each hexamer
and frame
a log-likelihood ratio was computed:
 |
(3) |
as well as for each pentamer
and frame
 |
(4) |
The distributions
and
can be obtained from http://www1.imim.es/
gparra/GASP1.
Then, given a sequence
of length
in frame
,
the coding potential of the sequence is defined as
 |
(5) |
where
is the subsequence of
starting in position
and ending in position
.
The score of a potential exon,
,
defined by sites
(start/acceptor) and
(donor/stop) is computed
as
 |
(6) |
This score can be assumed to be the log-likelihood ratio of
the probability of finding such sites and sequence composition
given an actual exon over the probability of finding it
on a random sequence bounded by AG and GT dinucleotides. Actually,
because
is the logarithm of the ratio of the probability of the
sequence under the coding model over the probability under the
non-coding model (not the under a random model),
only
approximates such a log-likelihood ratio.
geneid predicts gene structures--which can be multiple genes in both
strands-- as sequences of frame-compatible non-overlapping
exons. A minimum intron length of 40bp, and a minimum
intergenic distance of 300bp are enforced. If a gene structure
is
a sequence of exons,
, a natural
scoring function is
 |
(7) |
(g) can be approximately
interpreted as the log-likelihood ratio of the
probability of the defining sites and the hexamer composition of
the resulting product given a gene sequence, over this probability given
a non-gene sequence. In geneid, the gene structure predicted for a given sequence
is the gene maximizing
, among all those gene structures that can be
assembled from the set of predicted exons for the sequence.
Actually, because the number of approximations made, in the practice
the simple sum of log likelihood ratios does not
produce necessarily genes with the ``right'' number of exons (if
tends
to be positive, the genes tend to have a large number of exons, if
tends to be negative, the genes tend to have an small number of
exons), and the score of the exons is corrected by adding a constant
. Thus, given an exon
, the actual score of
is
 |
(8) |
To estimate this constant, a simple optimization procedure was
performed. Genes were predicted in the training semi-artificial
genomic sequence for different values of
, and the value was
chosen that maximized the Correlation Coefficient between
the actual and predicted coding nucleotides. This value was found
to be
.
We tested two additional models of coding DNA before deciding for a
Markov Model of order five: A Codon Usage Model, and a model which
combined a Markov Model of order 1 of the translated amino acid
sequence and a Codon Preference Model see
Guigó, 1999 for details on these models. In both cases,
log-likelihood ratios were obtained in a similar way to the Markov
Model log-likelihood ratios (see Materials and Methods). For instance,
in the case of the Codon Usage Model, for each triplet
, we
estimated the probabilities of the codon
in coding sequences,
and the probability of the triplet in non-coding sequences,
, and built the log-likelihood ratio
. Then, given a sequence
of length
in
frame
(that is,
form a codon), the coding potential of
the sequence is computed as
. The models were inferred from the MR-set, as
the Markov Model was, and tested on the MR-set sequences embedded in the
large artificial genomic ``contig''. To test the models, genes were
predicted using geneid but exons were scored using only the scores
derived under the coding DNA model (that is, the scores from the exon
defining sites were ignored). Predictions were compared with the
annotated genes, and the usual measures of accuracy were computed (see
Reese et al., in this issue). Results are shown in Table
1. For comparison, we also show the results when only the
scores of the sites are used to score the exons. As it is possible to
see the Markov Model of order five produces more accurate results than
the other models, and, thus, it was chosen to be used
in geneid to predict the genes in the Adh region. As described above,
geneid scores the exons as the sum of the scores of the sites and the
Markov Model score. Results under this scoring schema, the one
effectively used to predict genes in the Adh region, are also given
in Table 1.
Table 2 shows the results when geneid with the parameters estimated
above, is used to predict genes in the Adh region. Both, the results
originally submitted to GASP1, and the results obtained with the
currently available version of geneid are given (see discussion). In
addition, we provide information on execution time, and memory
requirements of geneid to analyze the Adh region.
The detailed exon coordinates of the predictions by geneid can be found
at http://www1.imim.es/
gparra/GASP1.
The results presented in the previous section indicate that the
current version of geneid shows an accuracy, as measured by the GASP1
contest, comparable to the accuracy of the programs based on
Hidden Markov Models (HMMs), which in GASP1 exhibited the highest
accuracy. In favour of geneid play the simplicity and modularity of its
structure, which, as a consequence, is likely to make the program more
efficient in terms of speed and memory usage. Indeed, in geneid the gene
identification problem is stated as a one-dimensional chaining
problem for which more efficient algorithms may be designed than for an
aligment problem, as gene identification is
implicitally formulated in HMMs. Against geneid play the somehow less
rigurous probabilistic treatement of the scoring schema. For instance,
we are currently unable to justify the ``magic number'' (
, see
Materials and Methods) which needs to be added to the exon scores to
obtain accurate predictions.
geneid submitted rather poor predictions to GASP1 (see Table
2). When GASP1 was launched we were in the early stages of
redesigning geneid and we certainly overestimated our possibilities of
successfully meeting the deadline. Rush in an attempt to submit the
predictions on time was to blame for two mistakes in the predictions
submitted initially (submitted(1) in Table 2): First, there
was a bug which caused predictions in the reverse strand to be
incorret (in fact, predicted genes in the reverse strand could not be
meaningfully translated into proteins); Second, the minimum
intergenenic length was set to 5,000 bp --far too high a number,
given that this is likely to be the average intergenic length in the
D. melanogaster genome. Thus, there was little room for geneid to fit the genes
along the Adh sequence. We discovered these problems before the GASP1
results were made public, and send a new set of predictions
(submitted(2) in Table 2), but it was too late for the
compfly team to include them among the results published inititally. A
more serious error concerning the geneid scoring schema was discovered
after GASP1. Because we knew that the likelihoods that we computed
were only approximate, we assumed that the contribution of the sites
and the Markov Models score to the exon score should be
assymetrical. Thus, the score of an exon
,
was computed as
 |
(9) |
instead as in equation (6). A complex procedure was set up to chose
to maximize the average Relative Rank within the set of predicted
exons in the MR-set. The Relative Rank of the set of exons predicted
for a given sequence is the inverse of the ratio of the sum of the
ranks of the actual exons among the predicted ones when sorted by
score, over the minimum sum of ranks for such set of actual exons.
The optimal
in the MR dataset was found to be
, meaning
that most of the contribution to the exon score came from the sites
score. After GASP1 we realized that the optimization criteria was
wrong, because it did not distinguish between false predicted exons
overlapping actual exons, and predicted exons fully contained within
non-coding sequences. Thus, we decided to optimize
at the gene prediction level
as we did with
(See Materials and Methods), and at the same time
that
. The optimal
was found to be now,
, instead.
This is the scoring schema used to obtain the current predicitions (Table
2). Actually, the gain of using this weighted sum to score the
exons, instead of the scoring function in equation (6) is only
marginal, and we are considering reverting to this, more natural, function.
Although currently fully functional, we are still further developing
geneid. Our short term plans include, among others, to
train geneid to predict genes in the human and the Arabidopsis
genomes, and to include the possibility of incorporating the results
of database searches--both EST and proteins--in the geneid prediction
schema, which can be done rather naturally. The possbility of
including external evidence to ``force'' known genes or exons into the
prediction is already included in the working
version of geneid. This may be useful for reannotation of very large
genomic sequences. Finally, the current structure of geneid can be highly
paralelized, and we are also working in this direction.
We thank Josep F. Abril and Moisès Burset for helpful discussions
and constant encouragement. This
work was supported by grant from ``Plan Nacional de I+D'',
BIO98-0443-C02-01, ``Ministerio de Educación y Ciencia
(Spain)''.
-
Borodovsky, M. and McIninch, J. (1993).
- Genmark: Parallel gene recognition for both DNA strands.
Computer and Chemistry, 17:123-13.
-
Burge, C. B. and Karlin, S. (1998).
- Finding the genes in genomic DNA.
Current Opinion in Structural Biology, 8:346-354.
-
Claverie, J. M. (1997).
- Computational methods for the identification of genes in vertebrate
genomic sequences.
Human Molecular Genetics, 6:1735-1744.
-
Guigó, R. (1998).
- Assembling genes from predicted exons in linear time with dynamic
programming.
Journal of Computational Biology, 5:681-702.
-
Guigó, R. (1999).
- DNA composition, codon usage and exon prediction.
In Bishop, M., editor, Nucleic Acid and Protein Databases,
pages 53-80. Academic Press.
-
Guigó, R., Agarwal, P., Abril, J., Burset, M., and Fickett, J. (2000).
- Gene prediction accuracy in large DNA sequences, submitted.
-
Guigó, R., Knudsen, S., Drake, N., and Smith, T. F. (1992).
- Prediction of gene structure.
Journal of Molecular Biology, 226:141-157.
Table 1:
Testing different models of coding DNA in the training
semi-artificial genomic sequence. CU is a Codon Usage Model. DIA+CP is a
combination of a Markov Model of order 1 of the translated amino acid
sequence and of a Codon Preference Model. MM-5 is a Markov Model of order
5. Genes have been predicted using geneid but in each case exons have
been scored on the basis solely of the coding DNA model--ignoring the
contribution of the exon defining sites. Predicted genes have been
compared with the annotated ones, and the usual measures of accuracy
computed. For comparison purposes, results obtained when exons are
scored as a function only of the scores of the defining sites are also
given (Sites-PWM). Finally, we report the results on accuracy when the
exons are scored as the sum of the Markov Model score and the scores
of the exon defining sites. This is the scoring schema used by geneid when
attempting to predict genes in the Adh region.
| |
Base level |
Exon level |
|
|
|
|
| |
Sn |
Sp |
CC |
Sne |
Spe |
SnSp |
ME |
WE |
|
|
|
|
| Sites-PWM |
0.23 |
0.65 |
0.37 |
0.17 |
0.13 |
0.15 |
0.72 |
0.79 |
|
|
|
|
| CU |
0.91 |
0.88 |
0.88 |
0.46 |
0.43 |
0.45 |
0.21 |
0.27 |
|
|
|
|
| DIA+CP |
0.91 |
0.88 |
0.89 |
0.46 |
0.46 |
0.46 |
0.23 |
0.25 |
|
|
|
|
| MM-5 |
0.93 |
0.90 |
0.91 |
0.54 |
0.51 |
0.52 |
0.18 |
0.24 |
|
|
|
|
| PWM & MM-5 |
0.92 |
0.92 |
0.92 |
0.75 |
0.71 |
0.73 |
0.12 |
0.18 |
|
|
|
|
|
Table 2:
Accuracy of geneid in the Adh region. The std1 annotation data
set was used to evaluate sensitivity, and the std3 annotation data set
to evaluate specificity, as in GASP1 (see Reese et al., in this
issue). Discrepancies between the accuracy of the submitted
predictions--both the initial ones (1) and the corrected (2), and the
accuracy of the predictions obtained with the current version of geneid are due to a number of errors during the process of generating the
submitted predictions (see discussion). The decrease in the amount of
memory required to obtain the predictions is due to
algorithmic developments occurred after GASP1.
| |
Base level |
Exon level |
22cm CPU Time |
22cm Memory |
| |
Sn |
Sp |
Sn |
Sp |
ME |
WE |
|
|
| |
(Std1) |
(Std3) |
(Std1) |
(Std3) |
(Std1) |
(Std3) |
|
|
| geneid, Submitted(1) |
0.48 |
0.84 |
0.27 |
0.29 |
54.
4 |
47.9 |
74 s |
500 MB |
| geneid, Submitted(2) |
0.86 |
0.82 |
0.59 |
0.34 |
21.
0 |
48.0 |
74 s |
500 MB |
| geneid, Current |
0.96 |
0.92 |
0.70 |
0.62 |
11.0 |
17.0 |
83 s |
18.11 MB |
|
Figure:
Predictions obtained by geneid in the region 462500-477500
from the Adh sequence, compared with the annotation in the standard
std3 set. In a first step, geneid identifies and scores all
possible donor (blue) and acceptor (yellow) sites, start codons
(green) and stop codons (red) using Position Weight Matrices--the
height of the corresponding spike in the figure proportional to the
site score. 4,704 sites were generated along this 15,000 bp region by
geneid, only the highest scoring ones displayed here. In a second step,
geneid builds all exons compatible with these sites. 11,967 exons were
built in this particular region (not displayed in the figure). Exons
are scored as the sum of the scores of the defining sites, plus the
score of their coding potential measured according with a Markov Model
of order five. In the figure, the coding potential is displayed along
the DNA sequence ( MM_score). Regions strong in red are more
likely to be coding than regions strong in blue. From the set of
predicted exons, the gene structure is generated maximizing the sum of
the scores of the assembled exons. In the figure, exons assembled into
the predicted genes are drawn with a height proportional to its
score. A two color code is used to indicate frame compatibility: two
adjacent exons are frame compatible if the right half of the upstream
exon (the remainder) matches the color of the left half of the
downstream exon (the frame). The figure has been obtained using the
gff2ps program (available at
http://www1.imim.es/
jabril/GFFTOOLS/GFF2PS.html). The
input GFF and the configuration files required for gff2ps to
generate this figure can be found at http://www1.imim.es/
gparra/GASP1.
|
This document was generated using the
LaTeX2HTML translator Version 99.1 release (March 30, 1999)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -show_section_numbers main
The translation was initiated by Roderic Guigo Serra on 2000-02-07
Footnotes
- ... Guig\'o
- To whom
correspondence should be addressed
Roderic Guigo Serra
2000-02-07