GENCODE Dataset, v2.2


(The current full set of GENCODE annotations is available at: ftp://genome.imim.es/pub/other/gencode/data/havana-encode/current)

Reference Human Genome build: hg17/NCBI 35.

!!! READ THIS FIRST !!! Known problems of this release


1. Annotations (GFF)

- README

Reference gene set

Contains exons, introns and coding exons from transcripts belonging to the following VEGA categories:
- 'Known',
- 'Putative' and 'Novel' (only when experimentally validated),
as well as verified exon pairs from computational gene predictions.
chr. coordinates
ENCODE coordinates

For each transcript:

'exon' and 'CDS' coordinates relative to primary transcript
'exon' and 'CDS' coordinates relative to mRNA
'CDS' coordinates relative to CDS start codon (on mRNA sequence)
'CDS' coordinates relative to corresponding AA sequence start
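
How chromosome coordinates relate to the primary-transcript and mRNA coordinate systems can be sketched as below. This is only an illustration: it assumes 1-based, inclusive, GFF-style coordinates, and the function names and the toy two-exon transcript are hypothetical, not taken from the GENCODE files.

```python
def to_primary_transcript(pos, tx_start, tx_end, strand):
    """Map a 1-based chromosome position to a 1-based position on the
    primary (unspliced) transcript."""
    if not tx_start <= pos <= tx_end:
        raise ValueError("position outside transcript")
    # On '-' the transcript starts at its highest chromosome coordinate.
    return pos - tx_start + 1 if strand == "+" else tx_end - pos + 1


def to_mrna(pos, exons, strand):
    """Map a 1-based chromosome position to a 1-based position on the
    spliced mRNA. `exons` is a list of (start, end) chromosome intervals."""
    # Walk exons in transcription order: ascending on '+', descending on '-'.
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0  # number of mRNA bases contributed by earlier exons
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == "+" else end - pos
            return offset + within + 1
        offset += end - start + 1
    raise ValueError("position is intronic or outside the transcript")


# Toy transcript (hypothetical): two exons of 100 and 50 bases.
exons = [(100, 199), (300, 349)]
plus_primary = to_primary_transcript(300, 100, 349, "+")  # intron is counted
plus_mrna = to_mrna(300, exons, "+")                      # intron is skipped
minus_mrna = to_mrna(349, exons, "-")                     # first base on '-'
```

On the '+' strand, position 300 is base 201 of the primary transcript but base 101 of the mRNA, because the 100-base intron is removed by splicing.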

All transcripts

(Contains ALL ANNOTATIONS, including pseudogenes and putative genes)
chr. coordinates
ENCODE coordinates

For each transcript:

'exon' and 'CDS' coordinates relative to primary transcript
'exon' and 'CDS' coordinates relative to mRNA
'CDS' coordinates relative to CDS start codon (on mRNA sequence)
'CDS' coordinates relative to corresponding AA sequence start
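
The last two coordinate systems (relative to the CDS start codon, and relative to the amino-acid sequence) follow from the mRNA coordinates by simple arithmetic. The sketch below assumes 1-based positions and a CDS that begins at the first base of the start codon; the function names and example numbers are hypothetical, and the actual files may use a different convention.

```python
def cds_relative(mrna_pos, cds_start_on_mrna):
    """1-based position relative to the first base of the start codon."""
    if mrna_pos < cds_start_on_mrna:
        raise ValueError("position lies in the 5' UTR")
    return mrna_pos - cds_start_on_mrna + 1


def aa_position(cds_pos):
    """1-based index of the amino acid whose codon contains a 1-based
    CDS position."""
    return (cds_pos - 1) // 3 + 1


# Hypothetical example: the start codon begins at mRNA base 151.
cds_pos = cds_relative(157, 151)  # 7th base of the CDS
aa_pos = aa_position(cds_pos)     # falls in the 3rd codon
```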

Coding transcripts only

chr. coordinates
ENCODE coordinates

Projections

(from the Reference gene set)

Projected exons

ENCODE coordinates
chr. coordinates

Projected CDSs

ENCODE coordinates
chr. coordinates
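
As a rough illustration of what a projection could look like, the sketch below collapses overlapping exons from several transcripts into non-overlapping intervals on the genomic axis. This assumes that "projection" means merging overlapping features, which this file does not state explicitly; the function name and toy intervals are hypothetical.

```python
def project(intervals):
    """Merge overlapping 1-based inclusive (start, end) intervals into a
    sorted list of non-overlapping intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous projected interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy exons (hypothetical) from different transcripts of one locus.
exons = [(100, 199), (150, 249), (300, 349), (300, 320)]
projected = project(exons)
```

Here the four input exons collapse into two projected intervals, (100, 249) and (300, 349). Intervals that merely touch (for example 100-199 and 200-249) are kept separate in this sketch.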

2. Sequence sets (FASTA)

(Note: in all cases, nucleotide sequences are reverse-complemented when strand is '-')
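
For readers who want to re-derive these sequences from chromosome coordinates, reverse-complementing can be sketched as follows. The helper names are hypothetical, and the code assumes 1-based inclusive coordinates and plain A/C/G/T/N sequences.

```python
# Translation table for complementing bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract(chrom_seq, start, end, strand):
    """Extract a feature given 1-based inclusive coordinates, reading it
    in the direction of the annotated strand."""
    sub = chrom_seq[start - 1:end]
    return reverse_complement(sub) if strand == "-" else sub


forward = extract("AACCGGTT", 2, 5, "+")
reverse = extract("AACCGGTT", 2, 5, "-")
```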

Loci (README)
Primary transcripts (i.e. including introns, ~76MB file)
Spliced transcripts (mRNAs)
Exons (one FASTA record per GENCODE exon)
CDS nucleotide sequences (README)
CDS AA sequences (README)



Contact: Julien Lagarde (jlagarde at imim.es)