Genome BioInformatics Research Lab



The GENCODE Pipeline


HAVANA manual annotations:

Finished genomic sequence is analysed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases and a series of ab initio gene predictions. Nucleotide sequence databases (dbEST, release of 24 August 2004; Vertebrate RNA Database, version 80 of EMBL) are searched with WUBLASTN (parameters used: 'cpus=1 E=1e-4 B=100000 Z=500000000 -hitdist=40'). Significant hits are re-aligned to the unmasked genomic sequence by EST2GENOME; only the sub-regions that include those hits, plus a border of 1000 bp on either side, are re-aligned. The UniProt (version 2.4) protein database is searched with WUBLASTX (parameters used: 'cpus=1 E=1e-4 B=100000 Z=500000000 -hitdist=40 -wordmask=seg'), and the accession numbers of significant hits are looked up in the Pfam database (version 15.0). The hidden Markov models for Pfam protein domains are aligned against the genomic sequence using Halfwise (parameters used: '-pfam -dnas -ext 2 -subs 0.0000001 -quiet -genes -gap 12 -kbyte 100000') to provide annotation of protein domains.

We also run a number of ab initio prediction algorithms: Genscan and Fgenesh for genes, tRNAscan for tRNA genes, and Eponine TSS for transcription start sites.

The annotators use the AceDB-based Otterlace interface to create and edit gene objects, which are then stored in a local database named Otter. Where predicted transcript structures from Ensembl are available, they can be viewed from within the Otterlace interface and may be used as starting templates for gene curation. Annotation in the Otter database is submitted to the EMBL/GenBank/DDBJ nucleotide database.
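The restriction of re-alignment to hit-containing sub-regions can be illustrated with a minimal Python sketch. It is not the HAVANA code: the function name, the 1-based inclusive coordinates, and the merging of overlapping windows are assumptions made for illustration. Only the 1000 bp border comes from the description above.

```python
BORDER = 1000  # bp added on either side of each hit, as stated in the text


def padded_subregions(hits, clone_length, border=BORDER):
    """Return (start, end) windows covering each hit plus a border.

    hits: iterable of (start, end) pairs, 1-based inclusive, start <= end.
    Windows are clipped to the clone and, as an assumption of this sketch,
    overlapping or adjacent windows are merged into one.
    """
    # Pad every hit and clip it to the clone boundaries.
    windows = sorted(
        (max(1, start - border), min(clone_length, end + border))
        for start, end in hits
    )
    merged = []
    for start, end in windows:
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous window: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # Hypothetical hit coordinates on a hypothetical 10,500 bp clone.
    example_hits = [(500, 800), (1500, 1700), (10000, 10200)]
    print(padded_subregions(example_hits, clone_length=10500))
```

Each returned window would then be the stretch of unmasked genomic sequence handed to the re-alignment step.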


Gene objects selected for verification come from various computational prediction methods and from HAVANA annotations. RT-PCR and RACE experiments were performed on them, using a variety of human tissues, to confirm their structure. Human cDNAs from 24 different tissues (brain, heart, kidney, spleen, liver, colon, small intestine, muscle, lung, stomach, testis, placenta, skin, peripheral blood leucocytes, bone marrow, fetal brain, fetal liver, fetal kidney, fetal heart, fetal lung, thymus, pancreas, mammary gland, prostate) were synthesized using 12 poly(A)+ RNAs from Origene, 8 from Clemente Associates/Quantum Magnetics and 4 from BD Biosciences, as described in [Reymond et al., 2002, Nature; Reymond et al., 2002, Genomics]. The relative amount of each cDNA was normalized by quantitative PCR using SYBR Green as the intercalating dye and an ABI Prism 7700 Sequence Detection System.

Predictions of human gene junctions were assayed experimentally by RT-PCR as previously described and modified [Reymond et al., 2002, Genomics; Waterston et al., 2002, Nature; Guigo et al., 2003, PNAS]. Similar amounts of Homo sapiens cDNAs were mixed with JumpStart REDTaq ReadyMix (Sigma) and 4 ng/µl primers (Sigma-Genosys) using a BioMek 2000 robot (Beckman). The first ten cycles of PCR amplification were performed with touchdown annealing temperatures decreasing from 60 to 50°C; the next 30 cycles used an annealing temperature of 50°C. Amplimers were separated on "Ready to Run" precast gels (Pharmacia) and sequenced. RACE experiments were performed with the BD SMART RACE cDNA Amplification Kit following the manufacturer's instructions (BD Biosciences).
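The touchdown annealing programme described above can be written out as a per-cycle temperature list. The following Python sketch is illustrative only: the text gives the 60 to 50°C range, the ten touchdown cycles and the 30 cycles at 50°C, but not the per-cycle decrement, so a uniform drop of 1°C per cycle is assumed here. The function name and parameters are hypothetical.

```python
def touchdown_schedule(start_temp=60.0, final_temp=50.0,
                       touchdown_cycles=10, plateau_cycles=30):
    """Return the annealing temperature (in degrees C) for each PCR cycle.

    Assumption of this sketch: the temperature drops by an equal amount
    each touchdown cycle, so that the first plateau cycle is at final_temp.
    """
    step = (start_temp - final_temp) / touchdown_cycles
    # Touchdown phase: 60, 59, ..., 51 with the default values.
    touchdown = [start_temp - step * i for i in range(touchdown_cycles)]
    # Plateau phase: constant annealing temperature.
    plateau = [final_temp] * plateau_cycles
    return touchdown + plateau


if __name__ == "__main__":
    for cycle, temp in enumerate(touchdown_schedule(), start=1):
        print(f"cycle {cycle:2d}: anneal at {temp:.1f} C")
```

With the default values the programme has 40 cycles in total; a different decrement would change only the touchdown phase.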


Ashurst JL et al. The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res 33, Database Issue:D459-65 (2005).

Guigo R et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A 100, 1140-5 (2003).

Reymond A et al. Human chromosome 21 gene expression atlas in the mouse. Nature 420, 582-6 (2002).

Reymond A et al. Nineteen additional unpredicted transcripts from human chromosome 21. Genomics 79, 824-32 (2002).

Waterston RH et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-62 (2002).
