GENCODE Dataset, v2.2
(The current full set of GENCODE annotations is available at:
ftp://genome.imim.es/pub/other/gencode/data/havana-encode/current)
Reference Human Genome build: hg17/NCBI 35.
!!! READ THIS FIRST !!!
Known problems of this release
1. Annotations (GFF)
- README
Reference gene set
Contains exons, introns and coding exons from transcripts belonging to the following VEGA categories:
- 'Known',
- 'Putative' and 'Novel' (only when experimentally validated),
as well as verified exon pairs from computational gene predictions.
- chr. coordinates
- ENCODE coordinates
For each transcript:
- 'exon' and 'CDS' coordinates relative to primary transcript
- 'exon' and 'CDS' coordinates relative to mRNA
- 'CDS' coordinates relative to CDS start codon (on mRNA sequence)
- 'CDS' coordinates relative to corresponding AA sequence start
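The transcript-relative coordinate systems listed above can be illustrated with a small sketch. It is not the code used to build these files. It assumes GFF-style genomic coordinates (1-based, inclusive), transcript-relative positions counted 5' to 3' starting at 1, and an AA index derived from a CDS position counted from the first base of the start codon. The function names are hypothetical.

```python
def exons_in_transcript_order(exons, strand):
    """Sort genomic exons 5'->3' along the transcript (reversed on '-')."""
    return sorted(exons, reverse=(strand == "-"))


def to_primary_transcript_coords(exons, strand):
    """Exon positions relative to the unspliced (primary) transcript."""
    tx_start = min(start for start, _ in exons)
    tx_end = max(end for _, end in exons)
    result = []
    for start, end in exons_in_transcript_order(exons, strand):
        if strand == "-":
            # On the minus strand the transcript begins at the highest coordinate.
            result.append((tx_end - end + 1, tx_end - start + 1))
        else:
            result.append((start - tx_start + 1, end - tx_start + 1))
    return result


def to_mrna_coords(exons, strand):
    """Exon positions relative to the spliced mRNA (introns removed)."""
    result = []
    offset = 0
    for start, end in exons_in_transcript_order(exons, strand):
        length = end - start + 1
        result.append((offset + 1, offset + length))
        offset += length
    return result


def cds_position_to_aa_index(cds_position):
    """1-based residue index for a 1-based CDS position (assumed convention)."""
    return (cds_position - 1) // 3 + 1
```

For example, a '+' strand transcript with exons at 100-199 and 300-349 keeps its intron gap in primary-transcript coordinates but becomes one contiguous 150-base span in mRNA coordinates.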
All transcripts
(Contains ALL ANNOTATIONS, including pseudogenes and putative genes)
- chr. coordinates
- ENCODE coordinates
For each transcript:
- 'exon' and 'CDS' coordinates relative to primary transcript
- 'exon' and 'CDS' coordinates relative to mRNA
- 'CDS' coordinates relative to CDS start codon (on mRNA sequence)
- 'CDS' coordinates relative to corresponding AA sequence start
Coding transcripts only
- chr. coordinates
- ENCODE coordinates
Projections
(from the Reference gene set)
Projected exons
- ENCODE coordinates
- chr. coordinates
Projected CDSs
- ENCODE coordinates
- chr. coordinates
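This page does not define "projection". Assuming it means collapsing the exons (or CDSs) of all transcripts onto the genomic axis so that overlapping features become a single non-redundant interval, the operation could be sketched as follows. The function name and the 1-based inclusive coordinates are assumptions, and the sketch should be run separately for each sequence and strand.

```python
def project_intervals(intervals):
    """Merge overlapping (start, end) intervals into a non-redundant set.

    Intervals that merely touch (end + 1 == next start) are kept separate;
    whether the released files merge them is not stated on this page.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous projected interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]
```

For example, exons at 100-200 and 150-250 from two transcripts of the same locus would project to a single interval 100-250.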
2. Sequence sets (FASTA)
(Note: in all cases, nucleotide sequences are reverse-complemented when strand is '-')
- Loci (README)
- Primary transcripts (i.e. including introns, ~76MB file)
- Spliced transcripts (mRNAs)
- Exons (One fasta record per gencode exon)
- CDS nucleotide sequences (README)
- CDS AA sequences (README)
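The strand note above means that a sequence extracted from the forward genomic strand has to be reverse-complemented before it matches the FASTA records of a '-' strand feature. A minimal sketch, assuming plain A/C/G/T/N symbols (IUPAC ambiguity codes other than N are left unchanged) and a hypothetical function name:

```python
# Translation table for complementing nucleotides, preserving case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def oriented_sequence(forward_seq, strand):
    """Return the sequence as read 5'->3' on the given strand."""
    if strand == "-":
        return forward_seq.translate(_COMPLEMENT)[::-1]
    return forward_seq
```

For example, the forward-strand string ATGC corresponds to GCAT for a feature on the '-' strand.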
Contact: Julien Lagarde (jlagarde at imim.es)