geneid documentation: 7. Improving predictions by using sequence homology information

geneid provides simple support for the use of homology information, either genomic homology (DNA to DNA) or protein homology (protein to translated DNA). Homology information extracted from the alignment between the input sequence and other, well-annotated sequences can be used to improve on the accuracy obtained ab initio: geneid increases the final score of the predicted exons that overlap the regions shared in such alignments (High-score Segment Pairs).

The information from alignments is read from an additional file in gff format, as a list of HSPs or SRs. geneid is instructed to read this file with the command-line option -S filename. The gff records do not need to be sorted.

The frame can either be set in column 8 of the gff format or left unspecified by using the wildcard "." when it is unknown. In the latter case, geneid generates 3 equivalent elements (one per possible frame). This is necessary when alignments have been produced between 2 genomic sequences (DNA to DNA) and therefore have no associated frame.

The group field (column 9 in gff format) can be present but is ignored when this information is stored.
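The handling of columns 8 and 9 described above can be illustrated with a minimal Python sketch. It is not geneid's own parser: the names Segment and parse_homology_record are hypothetical, the record is assumed to be tab-separated, and the three frame labels used to expand the "." wildcard are an assumption passed in as a parameter.

```python
from typing import List, NamedTuple, Sequence


class Segment(NamedTuple):
    """One HSP/SR element as kept after reading (hypothetical structure)."""
    seqname: str
    start: int
    end: int
    score: float
    strand: str
    frame: str


def parse_homology_record(
    line: str, all_frames: Sequence[str] = ("1", "2", "3")
) -> List[Segment]:
    """Parse one tab-separated gff record into one or three elements.

    Assumption: `all_frames` holds the labels used when column 8 is ".".
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a gff record needs at least 8 columns")
    # Column 9 (group), if present, is deliberately not stored.
    seqname, _source, _feature, start, end, score, strand, frame = fields[:8]
    # Unknown frame: produce one equivalent element per possible frame.
    frames = tuple(all_frames) if frame == "." else (frame,)
    return [
        Segment(seqname, int(start), int(end), float(score), strand, f)
        for f in frames
    ]
```

For an invented record such as "seq1\tblastn\thsp\t100\t250\t87.5\t+\t.", the function returns three elements that differ only in their frame; with an explicit frame in column 8 it returns a single element.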

HSPs and SRs:
Given a blast alignment, a set of High-score Segment Pairs (HSPs) is produced to represent the parts common to both sequences. Given the sequence of interest (the input to geneid), numerous pairwise alignments can be produced against sequences that are related to it for various reasons, such as containing a homologous gene (syntenic gene prediction).

geneid proportionally increases the score of predicted exons that overlap the HSPs obtained from previous pairwise alignments. To summarize the information coming from several alignments, Similarity Regions (SRs) can be used. SRs are the projection onto the input sequence of the HSPs from several alignments, each of which pairs the same original sequence with a different one. SRs are computed by always taking, at every position of the input sequence, the highest score among the HSPs covering that position.
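The projection of HSPs into SRs can be sketched as follows. This is only an illustration of the "highest score at every position" rule, not the tool used to build SR files: the function name project_hsps is hypothetical, strand and frame are ignored for brevity, and starting a new region whenever the best score changes is an assumption.

```python
from typing import List, Optional, Tuple

# (start, end, score) with 1-based inclusive coordinates on the input sequence
HSP = Tuple[int, int, float]


def project_hsps(hsps: List[HSP]) -> List[HSP]:
    """Collapse overlapping HSPs into regions carrying the best score."""
    if not hsps:
        return []
    length = max(end for _, end, _ in hsps)
    # best[pos] is the highest score seen at pos; index length + 1 is a sentinel.
    best: List[Optional[float]] = [None] * (length + 2)
    for start, end, score in hsps:
        for pos in range(start, end + 1):
            current = best[pos]
            if current is None or score > current:
                best[pos] = score

    regions: List[HSP] = []
    run_start = 0
    for pos in range(1, length + 2):
        cur = best[pos]
        prev = best[pos - 1] if pos > 1 else None
        if cur != prev:
            if prev is not None:
                # Close the region that ended at the previous position.
                regions.append((run_start, pos - 1, prev))
            run_start = pos
    return regions
```

With two invented HSPs, (10, 20, 5.0) and (15, 30, 8.0), the sketch yields the regions (10, 14, 5.0) and (15, 30, 8.0). The per-position loop favours clarity over speed; a sweep over sorted interval endpoints would scale better on long sequences.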

Frame definition in blast and geneid
It is important to understand that the frame definitions in blast and geneid differ, so a simple translation from one to the other is needed. Although this conversion is implemented inside geneid, it may be necessary to keep it in mind when analyzing and interpreting the results.

In blast, there are 3 frames (1, 2 or 3), obtained from the modulus operation according to this formula:

frame = HSP starting position modulus 3 (0 corresponds to frame 3).

For example, given a sequence whose first position is numbered 1, an HSP that starts at position 7 is said to be in frame 1 (7 modulus 3 = 1), while an HSP that starts at position 9 is said to be in frame 3 (9 modulus 3 = 0, and 0 is coded as 3).
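The formula and the two worked values above translate directly into a few lines of Python. The function name blast_frame is hypothetical and the sketch covers the forward strand only.

```python
def blast_frame(hsp_start: int) -> int:
    """Frame (1, 2 or 3) of an HSP from its 1-based starting position."""
    remainder = hsp_start % 3
    # A remainder of 0 is coded as frame 3.
    return 3 if remainder == 0 else remainder


print(blast_frame(7))  # 1
print(blast_frame(9))  # 3
```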

IMPORTANT: HSPs might appear on the negative (reverse) strand of the input sequence. In that case, the HSP coordinates must be translated into coordinates relative to the reverse reading direction before computing the frame.
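A possible way to carry out this translation is sketched below. The text does not spell out the coordinate transformation, so the sketch assumes the usual mirroring of a 1-based position p on a sequence of length L to L - p + 1, which makes the forward end of the HSP its start in the reverse reading direction. The function names are hypothetical.

```python
from typing import Tuple


def to_reverse_coords(start: int, end: int, seq_length: int) -> Tuple[int, int]:
    """Mirror 1-based forward coordinates onto the reverse reading direction.

    Assumption: position p maps to seq_length - p + 1.
    """
    return seq_length - end + 1, seq_length - start + 1


def blast_frame_reverse(start: int, end: int, seq_length: int) -> int:
    """Frame (1, 2 or 3) of a reverse-strand HSP given forward coordinates."""
    rev_start, _rev_end = to_reverse_coords(start, end, seq_length)
    remainder = rev_start % 3
    # A remainder of 0 is coded as frame 3.
    return 3 if remainder == 0 else remainder
```

For an invented HSP at forward positions 10 to 30 on a sequence of length 100, the mirrored coordinates are 71 to 91 and the resulting frame is 2 (71 modulus 3 = 2).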

In geneid, the frame is defined as the number of nucleotides (0, 1 or 2) from the first nucleotide in the exon sequence to the first nucleotide in the first COMPLETE codon translated from the genomic sequence (see section Frame and Remainder (chapter 2)).

Obviously, a method is needed to convert geneid frames (exons) into blast frames (HSPs/SRs) in order to use this information properly. The following formula makes both types of frame definition compatible.

Given an exon and an SR (both on the same strand):

blastFrame(SR) = (starting(exon) + geneidFrame(exon)) % 3;
(result 0 corresponding to blastFrame 3)

Recall that blastFrame takes the values 1, 2 or 3 and geneidFrame takes the values 0, 1 or 2.
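The conversion formula can be checked with a short Python sketch. The function names geneid_to_blast_frame and frames_compatible are hypothetical, and the sketch assumes that the exon and the SR are on the same strand and that exon coordinates are 1-based.

```python
def geneid_to_blast_frame(exon_start: int, geneid_frame: int) -> int:
    """Blast-style frame (1, 2 or 3) implied by an exon start and geneid frame."""
    if geneid_frame not in (0, 1, 2):
        raise ValueError("geneid frame must be 0, 1 or 2")
    remainder = (exon_start + geneid_frame) % 3
    # A result of 0 corresponds to blast frame 3.
    return 3 if remainder == 0 else remainder


def frames_compatible(exon_start: int, geneid_frame: int, sr_blast_frame: int) -> bool:
    """True when the exon's converted frame matches the SR's blast frame."""
    return geneid_to_blast_frame(exon_start, geneid_frame) == sr_blast_frame


print(geneid_to_blast_frame(7, 0))  # 1
print(geneid_to_blast_frame(7, 2))  # 3
```

The second call agrees with the blast definition: an exon starting at position 7 with geneid frame 2 has its first complete codon at position 9, and an HSP starting at position 9 is in blast frame 3.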

Enrique Blanco Garcia © 2001