geneid documentation: 2. Signals, exons and genes



General context:
geneid is a program, designed with a hierarchical structure, to predict genes in anonymous genomic sequences. First, splice sites and start/stop codons are predicted and scored along the DNA sequence using Position Weight Arrays (PWA). Second, exons are built from the set of predicted signals. Each exon is scored as the sum of the scores of its defining sites plus the log-likelihood ratio of a Markov model that distinguishes coding from non-coding DNA. Finally, from the set of predicted exons, the gene structure that maximizes the sum of the scores of its exons is assembled.
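
The scoring scheme above can be sketched as follows. This is a simplified illustration, not geneid's implementation: it assumes a position-independent weight matrix with a uniform background (a real Position Weight Array also models dependencies between positions), and the names `site_score`, `exon_score` and `toy_weights` are hypothetical.

```python
import math

def site_score(window, weights, background=0.25):
    """Log-odds score of a candidate signal.

    weights[i][base] is an assumed probability of `base` at position i
    of a true site; `background` is a uniform null model.
    """
    return sum(math.log(weights[i][base] / background)
               for i, base in enumerate(window))

def exon_score(first_site, second_site, coding_llr):
    """Exon score: both defining site scores plus the coding log-likelihood ratio."""
    return first_site + second_site + coding_llr

# Toy two-position model that favours the dinucleotide AG (made-up values).
toy_weights = [
    {"A": 0.5, "C": 0.125, "G": 0.125, "T": 0.25},
    {"A": 0.125, "C": 0.125, "G": 0.5, "T": 0.25},
]
acceptor = site_score("AG", toy_weights)
```

A window that matches the model better than the background scores above zero, so strong sites raise the score of every exon they delimit.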


Signals:
geneid is able to detect the following signals:
  • Acceptor splice sites: The last dinucleotide of the intron must be AG.
    Acceptors occur at intron/exon boundaries.
    The following signals associated with acceptors are also detected (optional):
    1. Branch point
    2. Poly Pyrimidine Tract
  • Donor splice sites: The first dinucleotide of the intron must be GT.
    Donors occur at exon/intron boundaries.
  • Start codons (translation): The first trinucleotide of the first exon must be ATG.
    Initiation of translation from mRNA into protein.
  • Stop codons (translation): Three possible codons TAA, TAG and TGA.
    End of translation from mRNA into protein.
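
As an illustration of the mandatory motifs listed above, the following minimal sketch collects candidate positions on the forward strand only. It does not score the candidates and ignores the optional branch point and poly-pyrimidine tract; the function name `find_signals` and the 0-based coordinates are assumptions of this example, not geneid conventions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_signals(seq):
    """Return 0-based positions where each mandatory motif begins."""
    seq = seq.upper()
    signals = {"acceptor": [], "donor": [], "start": [], "stop": []}
    for i in range(len(seq) - 1):
        dinucleotide = seq[i:i + 2]
        if dinucleotide == "AG":      # last two nucleotides of an intron
            signals["acceptor"].append(i)
        elif dinucleotide == "GT":    # first two nucleotides of an intron
            signals["donor"].append(i)
        trinucleotide = seq[i:i + 3]
        if trinucleotide == "ATG":
            signals["start"].append(i)
        elif trinucleotide in STOP_CODONS:
            signals["stop"].append(i)
    return signals

candidates = find_signals("ATGGTAAGTAG")
```

Every match is only a candidate: most AG and GT dinucleotides in a genome are not real splice sites, which is why each one is scored afterwards.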


Frame and remainder:
geneid exons are Open Reading Frames delimited by signals. Frame and remainder are two values assigned to each exon, to be understood in the following way:

FRAME: The number of nucleotides (0,1,2) from the first nucleotide in the exon sequence to the first nucleotide in the first COMPLETE codon within the exon.

REMAINDER: The number of nucleotides left (0,1,2) after the last COMPLETE codon has been defined from the exon sequence, given its frame. Therefore, exon remainder is completely determined by the length of the exon and its frame.
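
Following the two definitions above, the remainder can be computed as (length - frame) MODULUS 3. The sketch below shows that arithmetic and splits an exon into its leading partial codon, its complete codons and its trailing nucleotides; the helper names `exon_remainder` and `split_codons` are hypothetical.

```python
def exon_remainder(length, frame):
    """Nucleotides left after the last complete codon, given the frame."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    if length < frame:
        raise ValueError("exon is shorter than its frame")
    return (length - frame) % 3

def split_codons(exon_seq, frame):
    """Split an exon into (leading nucleotides, complete codons, trailing nucleotides)."""
    body = exon_seq[frame:]
    usable = len(body) - len(body) % 3
    codons = [body[i:i + 3] for i in range(0, usable, 3)]
    return exon_seq[:frame], codons, body[usable:]

head, codons, tail = split_codons("GATGCCCTA", 1)
```

For the 9-nucleotide exon GATGCCCTA in frame 1, one nucleotide precedes the first complete codon, two complete codons follow (ATG, CCC), and two nucleotides (TA) are left over, so the remainder is 2.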



Exons:
geneid is able to build the following types of exons:
  • First exons: Delimited by Start and Donor.
    Frame = 0 - Remainder = 0,1,2
  • Internal exons: Delimited by Acceptor and Donor.
    Frame = 0,1,2 - Remainder = 0,1,2
  • Terminal exons: Delimited by Acceptor and Stop.
    Frame = 0,1,2 - Remainder = 0
  • Single exons (single-exon genes): Delimited by Start and Stop.
    Frame = 0 - Remainder = 0
NOTE: The character 'X' denotes unknown codons in the translation process
(such as those cases in which there are masked nucleotides, "N", within the codon sequence).
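
The frame and remainder constraints per exon type, and the 'X' convention for unknown codons, can be sketched as below. The dictionary keys, the function names and the use of '*' for stop codons are assumptions made for this example; the codon table is the standard genetic code.

```python
# Allowed (frames, remainders) for each exon type, as listed above.
ALLOWED = {
    "First":    ({0}, {0, 1, 2}),
    "Internal": ({0, 1, 2}, {0, 1, 2}),
    "Terminal": ({0, 1, 2}, {0}),
    "Single":   ({0}, {0}),
}

def is_valid_exon(exon_type, frame, remainder):
    """Check a frame/remainder pair against the table for its exon type."""
    frames, remainders = ALLOWED[exon_type]
    return frame in frames and remainder in remainders

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(codons):
    """Translate complete codons; any codon not in the table becomes 'X'."""
    return "".join(CODON_TABLE.get(codon.upper(), "X") for codon in codons)

protein = translate(["ATG", "NCC", "TAA"])
```

Here the masked codon NCC is reported as 'X' rather than guessed, so the translation keeps its length and reading frame.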


Genes:
Gene structures are series of frame-compatible exons assembled according to fixed rules and restrictions specified in a Gene Model. Two exons are frame compatible if the remainder of the first exon plus the frame of the second, MODULUS 3, is 0. Predicted exons are connected as long as the frame compatibility rule and the gene model restrictions are respected.
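
The frame compatibility rule and the idea of maximizing the summed exon scores can be illustrated with a small dynamic programming sketch. It is deliberately simplified: it is quadratic in the number of exons, works on one strand, and ignores all Gene Model restrictions (exon type order, intron lengths), so it should not be read as geneid's assembly algorithm. The `Exon` record and both function names are hypothetical.

```python
from collections import namedtuple

Exon = namedtuple("Exon", "start end frame remainder score")

def frame_compatible(upstream, downstream):
    """Remainder of the first exon plus frame of the second must be 0 modulo 3."""
    return (upstream.remainder + downstream.frame) % 3 == 0

def best_chain(exons):
    """Return (score, chain) for the highest-scoring chain of compatible exons."""
    exons = sorted(exons, key=lambda e: e.end)
    best = []  # best[i] = (score, chain) of the best chain ending at exons[i]
    for i, exon in enumerate(exons):
        top_score, top_chain = exon.score, [exon]
        for j in range(i):
            previous = exons[j]
            if previous.end < exon.start and frame_compatible(previous, exon):
                candidate = best[j][0] + exon.score
                if candidate > top_score:
                    top_score, top_chain = candidate, best[j][1] + [exon]
        best.append((top_score, top_chain))
    return max(best, key=lambda item: item[0]) if best else (0.0, [])

first = Exon(1, 100, 0, 1, 5.0)
good = Exon(200, 300, 2, 0, 4.0)   # frame 2 completes the remainder of 1
bad = Exon(200, 300, 1, 0, 6.0)    # higher score, but incompatible frame
score, chain = best_chain([first, good, bad])
```

In this example the exon with the higher individual score is left out, because only the frame-compatible pair can be joined and its combined score is larger.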




Enrique Blanco Garcia © 2003