next_group up previous


GENEID IN DROSOPHILA

GEN´IS PARRA, ENRIQUE BLANCO AND RODERIC GUIGÓ

{gparra,eblanco,rguigo}@imim.es

Abstract

geneid is a program to predict genes in anonymous genomic sequences designed with a hierarchical structure. In the first step, splice sites, start and stop codons are predicted and scored along the sequence using Position Weight Matrices (PWMs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the the log-likelihood ratio of a Markov Model for coding DNA. In the last step, from the set of predicted exons, the gene structure is assembled, maximizing the sum of the scores of the assembled exons. In this paper we describe the obtention of PWMs for sites, and the Markov Model of coding DNA in Drosophila Melanogaster. We also compare other models of coding DNA with the Markov Model. Finally, we present and discuss the the results obtained when geneid is used to predict genes in the Adh region. These results show that the accuracy of geneid ppredictions compares currently with that of other existing tools, but that geneid is likely to be more efficient in terms of speed and memory usage. geneid is available at http://www1.imim.es/$\sim $eblanco/GeneId.


1. Introduction

geneid (Guigó et al., 1992) was one of the first programs to predict full exonic structures of vertebrate genes in anonymous DNA sequences. geneid was designed with a hierarchical structure: First, gene defining signals (splice sites, start and stop codons) were predicted along the query DNA sequence. Next, potential exons were constructed from these sites, and finally the optimal scoring gene prediction was assembled from the exons. In the original geneid the scoring function to optimize was rather heuristic: the sequence sites were predicted and scored using Frequency Matrices (PWMs), a number of coding statistics were computed on the predicted exons, and each exon was scored as a function of the scores of the exon defining sites and of the coding statistics. To estimate the coefficients of this function a neural network was used. An exhaustive search of the space of possible gene assemblies was performed to rank predicted genes according with an score obtained through a complex function of the scores of the assembled exons.

During the past years geneid had some usage, mostly through a now nonfunctional e-mail server at Boston University ( geneid@darwin.bu.edu) and through a WWW server at the Institut Municipal d'Investigació Mèdica ( http://www1.imim.es/geneid.html). During this period, however, there has been substantial developments in the field of computational gene identification see Claverie, 1997, and Burge and Karlin, 1998 for recent reviews, and the original geneid has became clearly inferior to other existing tools. Therefore, we started some time ago developing an improved version of the geneid program, at least as accurate as other existing tools, but much more efficient handling very large genomic sequences both in terms of speed and usage of memory. This new version maintains the hierarchical structure (signal to exon to gene) in the original geneid, but we have simplified the scoring schema and furnished it with a probabilistic meaning: Scores for both, exons defining signals and protein coding potential are computed as log-likelihood ratios, which for a given predicted exon are summed up into the exon score, in consequence also a log-likelihood ratio. Then, a dynamic programming algorithm (Guigó, 1998) is used to search the space of predicted exons to assemble the gene structure (in the general case, multiple genes in both strands) maximizing the sum of the scores of the assembled exons, which can also be assumed to be a log-likelihood ratio. Execution time in this new version of geneid grows linearly with the size of the input sequence, currently at about two MegaBases per minute in a Pentium III (500 Mhz) running linux. The amount of memory required is also proportional to the length of the sequence, about one MegaByte per MegaBase plus a constant amount of about 15 MegaBytes, irrespective of the length of the sequence. In the practice, thus, geneid is able to analyze sequences of virtually any length, for instance chromosome size sequences.

In this paper we describe the ``training '' of geneid to predict genes in the genome of Drosophila melanogaster. In the context of geneid ``training'' means essentially computing PWMs for splice sites and start codons, and deriving a model of coding DNA, which, in this case, is a Markov Model of order five similar to the models introduced by Borodovsky and McIninch (1993). Thus, in the next sections, we first describe the training data set used, in particular our attempt to recreate a more realistic scenario to train and test geneid by generating semi-artificial large genomic contigs from single-gene DNA sequences, and we briefly describe the main features of geneid for D. melanogaster. Then, we present the results obtained in the training data set when different schemas are used to compute scores for sites and coding potential, and the results obtained on the D. melanogaster Adh region when the optimal scoring schema in the training set is used to predict genes in this region. Finally, we describe the reasons for the poor performance of geneid in GASP1, and we discuss future developments.

2. Materials and Methods

2.1 Data sets

As a set of known D. melanogaster gene encoding sequences, we have merged the sets of 275 multi- and 141 single-exon sequences provided by Martin Reese
( ftp://genome.lbl.gov/pub/genesets/Drosophila/) into a unique data set, the MR-set. From the MR-set we inferred PWMs for splice sites and start codons, and the Markov Model of order five for coding regions. The MR-set contains only single gene sequences. To assess the accuracy of the predictions in a more realistic scenario, we have randomly embedded the sequences in the MR-set in a background of artificial random intergenic DNA as described in Guigó et al. (2000). Thus, a single sequence of 5,689,206 bp embedding the 416 genes in the MR-set has been used to evaluate the accuracy of the predictions. The sequence, and the coordinates of the embedded exons are available at http://www1.imim.es/ $\sim $gparra/GASP1.

2.2 geneid

As outlined, geneid for D. melanogaster uses PWMs to predict potential splice sites, and start codons. Potential sites are scored as log-likelihood ratios. From the set of predicted sites (which includes, in addition, all potential stop codons), the set is built of all potential exons. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of the Markov Model for coding sequences. Finally, the gene structure is assembled from the set of predicted exons, maximizing the sum of the scores of the assembled exons. The procedure is illustrated in Figure 1, which shows the geneid predictions in a small region of the Adh sequence.

2.2.1 Predicting and scoring sites

Actual splice sites, and start codons were extracted from the MR-set.

2.2.1.1 donor sites

The MR-set contains 757 donor sites. From them, a frequency matrix $P$ was derived from position -3 to +6 around the exon-intron boundary, being the position 0 the first position in the intron. $P_{ij}$ is the probability of observing nucleotide $i$ ( $i \in \{A,C,G,T\}$) at position $j$ ( $j \in \{-3,
\cdots , +6\}$) in an actual donor site. The positional frequency $Q$ of nucleotides in the region -3 to +6 around all dinucleotides GT was also computed (being the position 0, the position corresponding to the nucleotide G in the GT dinucleotides.) Then, a Position Weight Matrix for donor sites, $D$, was calculated as


\begin{displaymath}
D_{ij} = \log (\frac{P_{ij}}{Q_{ij}})
\end{displaymath} (1)

PWMs for Acceptor sites, $A$, and Start codons, $S$, were obtained in a similar way. These matrices can be obtained from http://www1.imim.es/ $\sim $gparra/GASP1.

PWMs can be used to score each potential donor site (GT), acceptor site (AG) and start codon (ATG) along a given sequence. The score of a potential donor site $S = s_1 s_2 \ldots s_{10}$ within the sequence is computed as

\begin{displaymath}
L_D (S) = \sum_{i=1}^{10} D_{s_ii}
\end{displaymath} (2)

This is the log-likelihood ratio of the probability of observing this particular sequence $S$ in an actual site versus the probability of observing $S$ in any false GT site. Similar scores are computed for Acceptor Sites ($L_A$) and Start Codons ($L_B$)

2.2.2 Predicting and Scoring Exons

geneid distinguishes four types of ``exons'':
Initial
ORFs defined by an start codon and a donor site.
Internal
ORFs defined by an acceptor site and a donor site.
Terminal
ORFs defined by an acceptor site and a stop codon.
Single
ORFs defined by an start codon and a stop codon. This corresponds to intronless genes.
geneid constructs all potential exons which are compatible with the predicted sites. (Actually, only the five highest scoring donor sites within frame are considered for each start codon and acceptor site).

2.2.2.1 Coding Potential

All exon and intron sequences were extracted from the MR multi-exon data set. A Markov Model of order five was estimated to model both exon, and intron sequences. That is, we estimated the probability distribution of each nucleotide given the pentanucleotide preceding it in exon and intron sequences. From the exon sequences we estimated this probability for each of the 3 possible frames, building the transition probability matrices $F^1$, $F^2$, $F^3$. $F^j(s_1 s_2 s_3 s_4 s_5 s_6)$ is the observed probability of finding hexamer $s_1 s_2 s_3 s_4 s_5 s_6$ with $s_1$ in codon position j, given that pentamer $s_1 s_2 s_3 s_4 s_5$ is with $s_1$ in codon position j. An initial probability matrix, $I^j$, was estimated from the observed pentamer frequencies at each codon position. From the intron sequences a single transition matrix was computed $F_0$, as well as a single initial probability matrix, $I_0$. Then, for each hexamer $h$ and frame $j$ a log-likelihood ratio was computed:
\begin{displaymath}
LF^j (h) = \log \frac{F^j(h)}{F_0(h)}
\end{displaymath} (3)

as well as for each pentamer $p$ and frame $j$
\begin{displaymath}
LI^j (p) = \log \frac{I^j(h)}{I_0(h)}
\end{displaymath} (4)

The distributions $F$ and $I$ can be obtained from http://www1.imim.es/ $\sim $gparra/GASP1.

Then, given a sequence $S$ of length $l$ in frame $j$, the coding potential of the sequence is defined as

\begin{displaymath}
L_M (S) = LI^j (S_{1..5}) + \sum_{i=1}^{l-5} LF^j (S_{i..i+5})
\end{displaymath} (5)

where $S_{i..j}$ is the subsequence of $S$ starting in position $i$ and ending in position $j$.

The score of a potential exon, $S$, $L_E(S)$ defined by sites $s_a$ (start/acceptor) and $s_d$ (donor/stop) is computed as


\begin{displaymath}
L_E(S) = L_A(s_a) + L_D(s_d) + L_M (S)
\end{displaymath} (6)

This score can be assumed to be the log-likelihood ratio of the probability of finding such sites and sequence composition given an actual exon over the probability of finding it on a random sequence bounded by AG and GT dinucleotides. Actually, because $L_M$ is the logarithm of the ratio of the probability of the sequence under the coding model over the probability under the non-coding model (not the under a random model), $L_E$ only approximates such a log-likelihood ratio.

2.2.3 Assembling Genes

geneid predicts gene structures--which can be multiple genes in both strands-- as sequences of frame-compatible non-overlapping exons. A minimum intron length of 40bp, and a minimum intergenic distance of 300bp are enforced. If a gene structure $g$ is a sequence of exons, $e_1 , e_2 , \cdots , e_n$, a natural scoring function is
\begin{displaymath}
L_G (g) = L_E(e_1)+L_E(e_2)+ \cdots + L_E(e_n)
\end{displaymath} (7)

$L_G$ (g) can be approximately interpreted as the log-likelihood ratio of the probability of the defining sites and the hexamer composition of the resulting product given a gene sequence, over this probability given a non-gene sequence. In geneid, the gene structure predicted for a given sequence is the gene maximizing $L_G (g)$, among all those gene structures that can be assembled from the set of predicted exons for the sequence. Actually, because the number of approximations made, in the practice the simple sum of log likelihood ratios does not produce necessarily genes with the ``right'' number of exons (if $L_E$ tends to be positive, the genes tend to have a large number of exons, if $L_E$ tends to be negative, the genes tend to have an small number of exons), and the score of the exons is corrected by adding a constant $IW$. Thus, given an exon $e$, the actual score of $e$ is
\begin{displaymath}
L_E^*(e) = L_E (e) + IW
\end{displaymath} (8)

To estimate this constant, a simple optimization procedure was performed. Genes were predicted in the training semi-artificial genomic sequence for different values of $IW$, and the value was chosen that maximized the Correlation Coefficient between the actual and predicted coding nucleotides. This value was found to be $IW=-7$.

3. Results

3.1 Training geneid

We tested two additional models of coding DNA before deciding for a Markov Model of order five: A Codon Usage Model, and a model which combined a Markov Model of order 1 of the translated amino acid sequence and a Codon Preference Model see Guigó, 1999 for details on these models. In both cases, log-likelihood ratios were obtained in a similar way to the Markov Model log-likelihood ratios (see Materials and Methods). For instance, in the case of the Codon Usage Model, for each triplet $s$, we estimated the probabilities of the codon $s$ in coding sequences, $U(s)$ and the probability of the triplet in non-coding sequences, $U_0(s)$, and built the log-likelihood ratio $LU (s) = \log
\frac{G(s)}{G_0(s)}$. Then, given a sequence $S$ of length $l$ in frame $0$ (that is, $S_1S_2S_3$ form a codon), the coding potential of the sequence is computed as $L_C (S) = \sum_{i=1, 4, 7, \ldots}^{l-2}
LU(S_iS_{i+1}S_{i+2})$. The models were inferred from the MR-set, as the Markov Model was, and tested on the MR-set sequences embedded in the large artificial genomic ``contig''. To test the models, genes were predicted using geneid but exons were scored using only the scores derived under the coding DNA model (that is, the scores from the exon defining sites were ignored). Predictions were compared with the annotated genes, and the usual measures of accuracy were computed (see Reese et al., in this issue). Results are shown in Table 1. For comparison, we also show the results when only the scores of the sites are used to score the exons. As it is possible to see the Markov Model of order five produces more accurate results than the other models, and, thus, it was chosen to be used in geneid to predict the genes in the Adh region. As described above, geneid scores the exons as the sum of the scores of the sites and the Markov Model score. Results under this scoring schema, the one effectively used to predict genes in the Adh region, are also given in Table 1.

3.2 Results in the Adh Region

Table 2 shows the results when geneid with the parameters estimated above, is used to predict genes in the Adh region. Both, the results originally submitted to GASP1, and the results obtained with the currently available version of geneid are given (see discussion). In addition, we provide information on execution time, and memory requirements of geneid to analyze the Adh region. The detailed exon coordinates of the predictions by geneid can be found at http://www1.imim.es/$\sim $gparra/GASP1.

4. Discussion

The results presented in the previous section indicate that the current version of geneid shows an accuracy, as measured by the GASP1 contest, comparable to the accuracy of the programs based on Hidden Markov Models (HMMs), which in GASP1 exhibited the highest accuracy. In favour of geneid play the simplicity and modularity of its structure, which, as a consequence, is likely to make the program more efficient in terms of speed and memory usage. Indeed, in geneid the gene identification problem is stated as a one-dimensional chaining problem for which more efficient algorithms may be designed than for an aligment problem, as gene identification is implicitally formulated in HMMs. Against geneid play the somehow less rigurous probabilistic treatement of the scoring schema. For instance, we are currently unable to justify the ``magic number'' ($IW$, see Materials and Methods) which needs to be added to the exon scores to obtain accurate predictions.

geneid submitted rather poor predictions to GASP1 (see Table 2). When GASP1 was launched we were in the early stages of redesigning geneid and we certainly overestimated our possibilities of successfully meeting the deadline. Rush in an attempt to submit the predictions on time was to blame for two mistakes in the predictions submitted initially (submitted(1) in Table 2): First, there was a bug which caused predictions in the reverse strand to be incorret (in fact, predicted genes in the reverse strand could not be meaningfully translated into proteins); Second, the minimum intergenenic length was set to 5,000 bp --far too high a number, given that this is likely to be the average intergenic length in the D. melanogaster genome. Thus, there was little room for geneid to fit the genes along the Adh sequence. We discovered these problems before the GASP1 results were made public, and send a new set of predictions (submitted(2) in Table 2), but it was too late for the compfly team to include them among the results published inititally. A more serious error concerning the geneid scoring schema was discovered after GASP1. Because we knew that the likelihoods that we computed were only approximate, we assumed that the contribution of the sites and the Markov Models score to the exon score should be assymetrical. Thus, the score of an exon $S$, $L_E(S)$ was computed as

\begin{displaymath}
L_E(S) = (w * (L_A(s_a) + L_D(s_d))) + (1-w) * L_M (S)
\end{displaymath} (9)

instead as in equation (6). A complex procedure was set up to chose $w$ to maximize the average Relative Rank within the set of predicted exons in the MR-set. The Relative Rank of the set of exons predicted for a given sequence is the inverse of the ratio of the sum of the ranks of the actual exons among the predicted ones when sorted by score, over the minimum sum of ranks for such set of actual exons. The optimal $w$ in the MR dataset was found to be $w=0.95$, meaning that most of the contribution to the exon score came from the sites score. After GASP1 we realized that the optimization criteria was wrong, because it did not distinguish between false predicted exons overlapping actual exons, and predicted exons fully contained within non-coding sequences. Thus, we decided to optimize $w$ at the gene prediction level as we did with $IW$ (See Materials and Methods), and at the same time that $IW$. The optimal $w$ was found to be now, $W=0.4$, instead. This is the scoring schema used to obtain the current predicitions (Table 2). Actually, the gain of using this weighted sum to score the exons, instead of the scoring function in equation (6) is only marginal, and we are considering reverting to this, more natural, function.

Although currently fully functional, we are still further developing geneid. Our short term plans include, among others, to train geneid to predict genes in the human and the Arabidopsis genomes, and to include the possibility of incorporating the results of database searches--both EST and proteins--in the geneid prediction schema, which can be done rather naturally. The possbility of including external evidence to ``force'' known genes or exons into the prediction is already included in the working version of geneid. This may be useful for reannotation of very large genomic sequences. Finally, the current structure of geneid can be highly paralelized, and we are also working in this direction.

Aknowledgments

We thank Josep F. Abril and Moisès Burset for helpful discussions and constant encouragement. This work was supported by grant from ``Plan Nacional de I+D'', BIO98-0443-C02-01, ``Ministerio de Educación y Ciencia (Spain)''.

Bibliography

Borodovsky, M. and McIninch, J. (1993).
Genmark: Parallel gene recognition for both DNA strands.
Computer and Chemistry, 17:123-13.

Burge, C. B. and Karlin, S. (1998).
Finding the genes in genomic DNA.
Current Opinion in Structural Biology, 8:346-354.

Claverie, J. M. (1997).
Computational methods for the identification of genes in vertebrate genomic sequences.
Human Molecular Genetics, 6:1735-1744.

Guigó, R. (1998).
Assembling genes from predicted exons in linear time with dynamic programming.
Journal of Computational Biology, 5:681-702.

Guigó, R. (1999).
DNA composition, codon usage and exon prediction.
In Bishop, M., editor, Nucleic Acid and Protein Databases, pages 53-80. Academic Press.

Guigó, R., Agarwal, P., Abril, J., Burset, M., and Fickett, J. (2000).
Gene prediction accuracy in large DNA sequences, submitted.

Guigó, R., Knudsen, S., Drake, N., and Smith, T. F. (1992).
Prediction of gene structure.
Journal of Molecular Biology, 226:141-157.


Table 1: Testing different models of coding DNA in the training semi-artificial genomic sequence. CU is a Codon Usage Model. DIA+CP is a combination of a Markov Model of order 1 of the translated amino acid sequence and of a Codon Preference Model. MM-5 is a Markov Model of order 5. Genes have been predicted using geneid but in each case exons have been scored on the basis solely of the coding DNA model--ignoring the contribution of the exon defining sites. Predicted genes have been compared with the annotated ones, and the usual measures of accuracy computed. For comparison purposes, results obtained when exons are scored as a function only of the scores of the defining sites are also given (Sites-PWM). Finally, we report the results on accuracy when the exons are scored as the sum of the Markov Model score and the scores of the exon defining sites. This is the scoring schema used by geneid when attempting to predict genes in the Adh region.
  Base level Exon level        
  Sn Sp CC Sne Spe SnSp ME WE        
Sites-PWM 0.23 0.65 0.37 0.17 0.13 0.15 0.72 0.79        
CU 0.91 0.88 0.88 0.46 0.43 0.45 0.21 0.27        
DIA+CP 0.91 0.88 0.89 0.46 0.46 0.46 0.23 0.25        
MM-5 0.93 0.90 0.91 0.54 0.51 0.52 0.18 0.24        
PWM & MM-5 0.92 0.92 0.92 0.75 0.71 0.73 0.12 0.18        



Table 2: Accuracy of geneid in the Adh region. The std1 annotation data set was used to evaluate sensitivity, and the std3 annotation data set to evaluate specificity, as in GASP1 (see Reese et al., in this issue). Discrepancies between the accuracy of the submitted predictions--both the initial ones (1) and the corrected (2), and the accuracy of the predictions obtained with the current version of geneid are due to a number of errors during the process of generating the submitted predictions (see discussion). The decrease in the amount of memory required to obtain the predictions is due to algorithmic developments occurred after GASP1.
  Base level Exon level 22cm
CPU Time
22cm
Memory
  Sn Sp Sn Sp ME WE    
  (Std1) (Std3) (Std1) (Std3) (Std1) (Std3)    
geneid, Submitted(1) 0.48 0.84 0.27 0.29 54. 4 47.9 74 s $\sim $500 MB
geneid, Submitted(2) 0.86 0.82 0.59 0.34 21. 0 48.0 74 s $\sim $500 MB
geneid, Current 0.96 0.92 0.70 0.62 11.0 17.0 83 s 18.11 MB


Figure: Predictions obtained by geneid in the region 462500-477500 from the Adh sequence, compared with the annotation in the standard std3 set. In a first step, geneid identifies and scores all possible donor (blue) and acceptor (yellow) sites, start codons (green) and stop codons (red) using Position Weight Matrices--the height of the corresponding spike in the figure proportional to the site score. 4,704 sites were generated along this 15,000 bp region by geneid, only the highest scoring ones displayed here. In a second step, geneid builds all exons compatible with these sites. 11,967 exons were built in this particular region (not displayed in the figure). Exons are scored as the sum of the scores of the defining sites, plus the score of their coding potential measured according with a Markov Model of order five. In the figure, the coding potential is displayed along the DNA sequence ( MM_score). Regions strong in red are more likely to be coding than regions strong in blue. From the set of predicted exons, the gene structure is generated maximizing the sum of the scores of the assembled exons. In the figure, exons assembled into the predicted genes are drawn with a height proportional to its score. A two color code is used to indicate frame compatibility: two adjacent exons are frame compatible if the right half of the upstream exon (the remainder) matches the color of the left half of the downstream exon (the frame). The figure has been obtained using the gff2ps program (available at http://www1.imim.es/$\sim $jabril/GFFTOOLS/GFF2PS.html). The input GFF and the configuration files required for gff2ps to generate this figure can be found at http://www1.imim.es/$\sim $gparra/GASP1.
\framebox {
\includegraphics[height=16cm,angle=-90]{Adh_462500-477500.ps}}

About this document ...

This document was generated using the LaTeX2HTML translator Version 99.1 release (March 30, 1999)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html -split 0 -show_section_numbers main

The translation was initiated by Roderic Guigo Serra on 2000-02-07


Footnotes

... Guig\'o% latex2html id marker 1024
\setcounter{footnote}{1}\fnsymbol{footnote}
To whom correspondence should be addressed

next_group up previous
Roderic Guigo Serra
2000-02-07