Genome Informatics Research Lab

  IMIM * UPF * CRG * GRIB HOME SOFTWARE SGP2 Algorithm
 
SGP2 Algorithm Description
 
 

 

sgp-2 is a program to predict genes by comparing anonymous genomic sequences from different species. It combines tblastx, a sequence similarity search program, with geneid, an "ab initio" gene prediction program (http://www.sanger.ac.uk/Software/formats/GFF/). In "assymetric" mode, genes are predicted in one sequence from one species (the target sequence), using a set of sequences (maybe only one) from the other species (the reference set). Essentially, geneid is used to predict all potential exons along the target sequence. Scores of exons are computed as log-likelihood ratios, function of the splice sites defining the exon, and of the coding bias in composition of the exon sequence as measured by a Markov Model of order five. From the set of predicted exons, geneid assembles the gene structure (eventually multiple genes in both strands) maximizing the sum of the scores of the assembled exons, using a dynamic programming chaining algorithm. When using a reference set of sequences, ideally we would like to incorporate into the scores of the candidate exons, the score of the optimal alignment at the amino acid level between the target exon sequence and the counterpart homologous sequene in the reference set. If a substitution matrix, for instance from the blosum family, is used to score the aligment, the resulting score can also be assumed to be a log-likelihood ratio: informally, the ratio between the likelihood of the aligment when the amino acid sequences code for proteins functionally related, and the likelihood of the alignment, otherwise. In principle, thus, this score could be naturally added to the geneid score for the exon. tblastx provides an appropiate shortcut to find often a good enough approximation to such an optimal aligment, and infer the corresponding score: the optimal alignment can be assumed to correspond to the maximal scoring HSP overlapping the exon. However, when dealing with the fragmentary nature of the mouse shotgun genomic sequence, often different regions of a candidate exon sequence will align optimally to different mouse shotgun sequences. Thus, in the approach used here, we identify the optimal HSPs covering each fraction of the exon, and compute separately the contribution of each HSP into the score of the exon. In the next section, we describe in detail how this computation is performed.

Scoring of candidate exons

Let $e$ be one of the candidate exons predicted by geneid along the a query DNA sequence $S$. In sgp-2 the final score of $e$, $s(e)$, is computed as

\begin{displaymath}s(e) = s_g(e) + w s_t(e)\end{displaymath}

where $s_g(e)$ is the score given by geneid to the exon $e$, and $s_t(e)$ is the score derived from the HSPs found by a tblastx search overlapping the exon $e$. Both scores are log-likelihood ratios (and we compute both base two). Assuming independence between their distribution, they could be summed up into a single likelihood ratio. However, the assumption of independence is not realistic, $s_g(e)$ depends on the probability of the sequence of $e$, assuming that $e$ codes for a protein, while $s_t(e)$ depends on the probability of the optimal aligment of $e$ with a sequence fragment of the mouse genome, assuming that both sequences code for related proteins. Obviously, these two probabilities are not independent. Their joint distribution could only be investigated-at least empirically-if the Markov Model of coding DNA used in geneid, and the substituion matrix used by tblastx were inferred from the very same set of coding sequences. Since this is not the case, we use an ad-hoc coefficient, $w$, to weight the contribution of tblastx search, $s_t(e)$ into the final exon score. We compute $s_t(e)$ in the following way. Let $h_1 \cdots h_q$ be the set of HSPs found by tblastx after comparing the query sequence $S$ against a database of DNA sequences (figure [*], A). First, we find the maximum scoring projection of the HSPs onto the query sequence. We simply register the maximum score among the scores of all HSPs covering each position, and then partition the query sequence in equally maximally scoring segments $x_1 \cdots x_r$, with scores $s_p(x_1) \cdots
s_p(x_r)$ (figure [*], B). Then, for each predicted exon $e$ (figure [*],C), we find $X_e$ the set of maximally scoring segments overlapping $e$
\begin{displaymath}
X_e = \{ x_i : x_i \cap e \ne \emptyset \}
\end{displaymath}

where $a \cap b$ denotes the overlap betwen sequence segments $a$ and $b$, and $\emptyset$ means no overlap. We compute $s_t(e)$ in the following way:
\begin{displaymath}
s_t(e) = \sum_{x \in X_e} s_p(x) \frac{\vert x \cap e\vert}{\vert x\vert}
\end{displaymath}

where $\vert a\vert$ denotes the length of sequence segment $a$. That is, each exon gets the score of the maximally scoring HSPs along the exon sequence proportional to the fraction of the HSP covering the exon. In other words, $s_t(e)$ is the integral of the maximum scoring projection function within the exon interval. Once the scores $s$ have been computed for all predicted exons in the sequence $S$, gene prediction proceeds as usual in geneid: the gene structure is assembled maximizing the sum of scores of the assembled exons.

Running sgp-2

In the practice, we run sgp-2 in the following way. Given a DNA query sequence (here, human chromosome sequences), and a database of DNA sequences (here, shotgun mouse genomic sequences), we run the query sequence against the databse using tblastx. Results of tblastx search are parsed to obtain the maximum scoring projection of the HSPs onto the query sequence using a simple perl script. The parsing includes discarding all HSPs below a given bit score cutoff, substrating this value from the score of the remaining HSPs, and weighting the resulting score by $w$ (see above). The maximum scoring projection is given to geneid in GFF format `General Feature Format' (Durbin and Haussler, http://www.sanger.ac.uk/Software/formats/GFF/). geneid uses it to rescore the exons predicted along the query sequence as explained, and assembles the corresponding optimal gene structure.

Figure: Rescoring of the exons predicted by geneid according with the results of a tblastx search
(To enlarge the figure, click right mouse button and select "View Image" from panel).
\includegraphics[bb= 87 120 502 718,clip,angle=270]{algo1.ps}

 
  Disclaimer webmaster