|geneid documentation:||7. Improving predictions by using sequence homology information|
geneid provides simple support for using homology information, either genomic homology (DNA - DNA) or protein homology (Protein - translated DNA). Homology information extracted from alignments between the input sequence and other (well-annotated) sequences can be used to improve on the accuracy obtained ab initio: geneid raises the final score of the predicted exons that overlap the regions shared in such alignments (High-scoring Segment Pairs).
The information from the alignments is read from an additional file in
gff format, as a list of HSPs or SRs. geneid
is instructed to read this file with the command line option -S filename.
There is no need to sort the list of gff records.
The frame can either be set in column 8 of the gff format or
skipped by using the wildcard "." when it is unknown. In the latter case,
geneid generates 3 equivalent elements (one per possible frame).
This is necessary when the alignments have been produced between 2 genomic
sequences (DNA to DNA) and therefore have no associated frame.
The group field (column 9 in gff format) can be present, but it is ignored when this information is stored.
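The reading rules above can be illustrated with a minimal Python sketch. It is not geneid's own parser: the record type `Similarity`, the function `read_similarity_records`, the sample records and the frame labels 1, 2, 3 used to expand the "." wildcard are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Sequence


class Similarity(NamedTuple):
    """One HSP/SR record (hypothetical type, for illustration only)."""
    seqname: str
    start: int
    end: int
    score: float
    strand: str
    frame: str


def read_similarity_records(lines: Sequence[str],
                            frames: Sequence[str] = ("1", "2", "3")) -> List[Similarity]:
    """Parse gff lines; a frame of '.' is expanded into one record per frame.

    ASSUMPTION: the three possible frames are labelled 1, 2 and 3.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(None, 8)
        # Columns 1-8: seqname, source, feature, start, end, score, strand, frame.
        seqname, _source, _feature, start, end, score, strand, frame = fields[:8]
        # Column 9 (group), if present, is deliberately not stored.
        expanded = frames if frame == "." else (frame,)
        for one_frame in expanded:
            records.append(Similarity(seqname, int(start), int(end),
                                      float(score), strand, one_frame))
    return records


# Hypothetical, unsorted input: one record with a known frame, one without.
example = [
    "seq1 tblastx hsp 400 520 12.5 + .",
    "seq1 blastx hsp 100 250 35.0 + 2 group_a",
]
records = read_similarity_records(example)
```

Here the first record yields three elements (one per frame) and the second yields a single element whose group field has been dropped.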
|HSPs and SRs:|
Given a blast alignment, a set of High-scoring Segment Pairs (HSPs) is produced to represent the parts common to both sequences. Given the sequence of interest (the input to geneid), numerous pairwise alignments can be computed against sequences related to it for various reasons, such as containing a homologous gene (syntenic gene prediction).
geneid proportionally increases the score of the predicted exons
which overlap the HSPs obtained from previous pairwise alignments. To summarize
the information coming from several alignments, Similarity Regions (SRs) can
be employed. SRs are the projection onto the input sequence of the HSPs from
several alignments, each between the input sequence and another sequence.
SRs are computed by always taking, at every position of the input sequence,
the highest score among the HSPs covering that position.
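The projection described above can be sketched in Python as follows. This is an illustrative reading of the text, not the procedure geneid or its companion tools actually run: the function name `project_hsps` and the sample HSPs are hypothetical, and the sketch ignores strand and frame, which in practice would presumably be handled separately.

```python
from typing import List, Tuple

Segment = Tuple[int, int, float]  # (start, end, score), 1-based inclusive


def project_hsps(hsps: List[Segment]) -> List[Segment]:
    """Project HSPs onto the input sequence, keeping the highest score
    at every position and merging adjacent stretches with equal score."""
    # Coverage can only change at an HSP start or just after an HSP end.
    bounds = sorted({s for s, _, _ in hsps} | {e + 1 for _, e, _ in hsps})
    srs: List[Segment] = []
    for left, next_left in zip(bounds, bounds[1:]):
        right = next_left - 1
        covering = [score for s, e, score in hsps if s <= left and e >= right]
        if not covering:
            continue  # gap between HSPs: no similarity region here
        best = max(covering)
        if srs and srs[-1][1] == left - 1 and srs[-1][2] == best:
            # Extend the previous SR instead of starting a new one.
            srs[-1] = (srs[-1][0], right, best)
        else:
            srs.append((left, right, best))
    return srs


# Hypothetical HSPs from two alignments against the same input sequence.
hsps = [(10, 50, 20.0), (30, 80, 35.0), (100, 120, 5.0)]
srs = project_hsps(hsps)
```

With these made-up HSPs, positions 30 to 50 are covered twice and take the higher score, so the overlapping pair collapses into the regions 10-29 (score 20) and 30-80 (score 35).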
|Frame definition in blast and geneid|
It is important to understand that the frame is defined differently in blast and in geneid, and therefore a simple conversion from one definition to the other is needed. Although this conversion is implemented inside geneid, it is worth keeping in mind when analyzing and interpreting the new results.
In blast, there are 3 frames (1, 2 or 3),
given by the starting position of the HSP modulus 3, with a result of 0 coded as 3.
For example, given a sequence whose first position is numbered 1, an HSP which starts
at position 7 is said to be in frame 1 (7 modulus 3 = 1), while an HSP
which starts at position 9 is said to be in frame 3 (9 modulus 3 = 0,
and 0 is coded as 3).
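The two worked cases above (positions 7 and 9) can be checked with a short Python sketch. The function name `blast_frame` is hypothetical, and the sketch covers only HSPs on the forward strand.

```python
def blast_frame(start: int) -> int:
    """Blast-style frame (1, 2 or 3) of an HSP starting at a 1-based position."""
    remainder = start % 3
    # A remainder of 0 is coded as frame 3.
    return 3 if remainder == 0 else remainder


frame_at_7 = blast_frame(7)
frame_at_9 = blast_frame(9)
```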
IMPORTANT: HSPs might appear on the negative (reverse) strand of the
input sequence. In that case, the HSP coordinates must be translated into
coordinates relative to the reverse reading direction before computing the frame.
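One way to perform this translation is sketched below in Python. It assumes the usual reverse-complement mapping, in which forward position p on a sequence of length L becomes L - p + 1; the text does not spell out the mapping geneid uses, so treat this as an assumption. The function name `reverse_strand_blast_frame` is hypothetical.

```python
def reverse_strand_blast_frame(start: int, end: int, seq_length: int) -> int:
    """Blast-style frame of a reverse-strand HSP given forward coordinates.

    ASSUMPTION: forward position p maps to seq_length - p + 1 on the
    reverse strand, so the HSP begins (in reading direction) at its
    forward 'end' coordinate.
    """
    if not 1 <= start <= end <= seq_length:
        raise ValueError("expected 1 <= start <= end <= seq_length")
    reverse_start = seq_length - end + 1
    remainder = reverse_start % 3
    # As on the forward strand, a remainder of 0 is coded as frame 3.
    return 3 if remainder == 0 else remainder


# Hypothetical HSP at forward coordinates 10..30 on a 100-nucleotide sequence:
# it starts at position 71 when read along the reverse strand.
frame = reverse_strand_blast_frame(10, 30, 100)
```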
In geneid, the frame is defined as the number of nucleotides (0, 1 or 2)
from the first nucleotide in the exon sequence to the first nucleotide
in the first COMPLETE codon translated from the genomic sequence
(see section Frame and Remainder (chapter 2)).
A method is therefore needed to convert geneid frames (exons)
into blast frames (HSPs/SRs) so that this information can be used properly. The
following formula makes both types of frame definition compatible.
Given an exon and a SR (both in the same strand):
blastFrame(SR) = (starting(exon) + geneidFrame(exon)) % 3;
remembering that blastFrame takes values in [1,2,3] (a modulus of 0 is coded as 3) and geneidFrame in [0,1,2].
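The conversion formula can be sketched in Python as follows. The function names `exon_blast_frame` and `frames_compatible` are hypothetical, and coding a result of 0 as 3 is an assumption carried over from the blast frame definition given earlier; this is not geneid's source code.

```python
def exon_blast_frame(exon_start: int, geneid_frame: int) -> int:
    """Blast-style frame (1, 2 or 3) implied by an exon's start and geneid frame."""
    if geneid_frame not in (0, 1, 2):
        raise ValueError("geneid frame must be 0, 1 or 2")
    remainder = (exon_start + geneid_frame) % 3
    # ASSUMPTION: as for HSPs, a remainder of 0 is coded as frame 3.
    return 3 if remainder == 0 else remainder


def frames_compatible(exon_start: int, geneid_frame: int, sr_blast_frame: int) -> bool:
    """True when an exon and a SR on the same strand are in the same frame."""
    return exon_blast_frame(exon_start, geneid_frame) == sr_blast_frame


# An exon starting at position 6 with geneid frame 1 has its first complete
# codon at position 7, so it matches a SR in blast frame 1.
match = frames_compatible(6, 1, 1)
```

The example ties the two definitions together: skipping one nucleotide from position 6 lands on position 7, which is the position used in the blast frame example (7 modulus 3 = 1).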
Enrique Blanco Garcia © 2001