Genes and Genomes
Written by Sergi Castellano


In what follows, we review the key concepts of eukaryotic gene and genome structure needed to understand the annotation of genomes. In this context, annotation refers to the description and location of genes and other biologically relevant features on a genomic sequence. Our main goal is to fully comprehend what genome annotation projects offer and, just as important, what they do not yet provide.

In this regard, it's worth noting that current gene prediction programs, among other bioinformatics tools, systematically ignore the complexity of eukaryotic gene structure. This complexity comes from alternatively spliced gene structures, from non-canonical signals (that affect either splicing or translation) and from promoters, the regions that control gene transcription, which are not yet well understood. Here, we give an overview of some of the limitations and future directions in the gene prediction field.

The genome is the genetic material of an organism, that is, the total DNA content of the cell. In eukaryotes, it is usually organized into a set of chromosomes, which are extremely long, highly condensed chains of DNA. In the picture below, human DNA is shown packaged into chromosome units (as seen during mitotic metaphase). Note the sister chromatids (which contain identical daughter DNA molecules), centromeres and telomeres.

Human chromosomes

It is time to introduce the three main genome browsers and check which genomes they are serving. In this practical, we will stick to the Ensembl server, but feel free to browse the others later on.


  1. How many species is Ensembl providing annotation for?
  2. What is the Trace server? hint: how are genomes sequenced?
  3. What is the difference between BLAST and SSAHA programs? hint: think of a strict blast search
Now, select human and make sure you understand the main concepts on this page. Contrary to the view reported in newspapers, the genome sequence and its corresponding annotation are highly dynamic. There are two levels at which data can be updated: 1) sequence level; and 2) annotation level.


  1. How many releases has Ensembl provided annotation for?
  2. What are the main reasons that drive the release of genome versions?
  3. How many genes, exons and nucleotides does the latest genome version have?
  4. Do you think that, over the same sequence, Ensembl, NCBI and Golden Path provide the same annotation?
  5. What is the Vega annotation? Do you think it is more reliable than the default Ensembl annotation?

DNA molecules consist of two anti-parallel chains held together by complementary base pairs that form a double helix. This structure is of major importance for computational analysis and, to a great extent, determines how it can be performed:
  1. The digital nature of the sequence (nucleotides: Adenine, Guanine, Cytosine and Thymine) permits an easy, symbolic computational representation as the letter codes A, G, C and T, respectively. It is worth knowing that Uracil (U), which takes the place of Thymine in RNA, is also written as T in sequence databases.

  2. The double nature of the DNA helix gives us two different sequences to analyze (with distinct encoded information). To handle such dual data, we introduce the concept of forward or positive (+) and reverse or negative (-) strands and elements (genes, exons, introns...). The forward strand, for us, is simply the original sequence we are working on. Note that this concept is meaningless in the cell, which makes no distinction between strands. For example, genes are transcribed from both chains.

  3. The complementary nature of the two strands (A-T, G-C base-pairing) lets us work in the computer with only one strand (forward), the other (reverse) being conceptually retrieved when needed. Usually, genome projects provide only one sequence strand (forward), and both forward and reverse elements are annotated on it, the latter being tagged as reverse. Note that the cell, in contrast, has access to the two strands at the same time.

  4. The anti-parallel nature of the double helix (due to the 5' and 3' nucleotide ends) gives polarity to the strands. By convention, DNA sequences are written from 5' to 3' (do not mix up this fact with the forward and reverse concept). The 5' region is also known as the upstream region and, correspondingly, the 3' region is called the downstream region of the sequence.

  5. The triplet nature of the genetic code (a codon is 3 nucleotides long) means that any potentially coding sequence can be translated in six different ways (frames): three on the forward strand and three more on the reverse strand. Finding the right frame, as the ribosome does (it already knows the strand), is challenging to reproduce computationally.
From DNA to chromosomes
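
The points above on complementarity and reading frames can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the toy sequence are assumptions made for this example, and they are not part of Ensembl or of any other tool used in this practical.

```python
# Complement of each nucleotide (A-T, G-C base pairing).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse strand of seq, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def six_frames(seq):
    """Split seq into codons for the 3 forward and 3 reverse frames."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Keep only complete codons (3 nucleotides each).
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


forward = "ATGGCCTAA"  # hypothetical toy sequence, not a real gene
print(reverse_complement(forward))
for name, codons in six_frames(forward).items():
    print(name, codons)
```

Only the forward strand is stored; the reverse strand is computed on demand, which mirrors how genome projects usually distribute a single strand.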

Note that, in this course, we only analyze DNA at the primary sequence level. So, let's now check it. Please select chromosome 21 in the karyotype plot.


  1. What is the difference between known and novel genes?
  2. What does SNP stand for?
  3. Why is the short arm of the chromosome (p arm) shaded? hint: is there any feature annotated in this region?
  4. Is there a correlation between repeats and the GC content?

Now, let's search for a specific gene over the whole genome. In the Find window, look up the "alcohol dehydrogenase" gene.


  1. How many entries do you get? Why?
  2. Select the first entry and display the gene report. On which chromosome is the gene?

However, before analyzing the alcohol dehydrogenase gene, we will take an overview of eukaryotic gene structure, processing and expression (see below).

Gene expression: from DNA to RNA to protein
Transcription, splicing and translation are the main processes that account for the expression of protein-coding genes. Each step is directed by sequence and structural signals. In what follows, we describe, from both the biological and the computational point of view, how these sequence motifs are used to go from DNA to RNA to the final protein product. The scheme below highlights these processing steps:

mRNA processing pathway

In the alcohol dehydrogenase gene report, locate the "Prediction Transcript" section. Note the correspondence between the levels of reported data and the processing steps:

  • Exon information (gene structure on the DNA sequence): sequence before transcription

  • Transcript information (mature mRNA sequence): sequence after splicing

  • Protein information: sequence after translation

We will follow these links in the order shown, but first, move to the text below for a more precise discussion of the eukaryotic gene structure.

Transcription starts when a region upstream of the gene (the promoter region) is activated (bound) by transcription factors. This region controls whether a gene is transcribed from the forward or the reverse strand. In any case, the strand that is actually transcribed is called the template or antisense strand, and the other the coding or sense strand.

Promoter region

In short, transcription is the copying of DNA (template strand) to RNA (pre-mRNA). However, when analyzing mRNA, cDNA or EST data, bear in mind that the mRNA to be translated is, in sequence, identical to the coding strand (coding here always refers to translation, and not to transcription). That is, the mRNA is transcribed from the strand that has its complementary sequence. In conclusion, when annotating genomes, genes are annotated in relation to their coding strand.

The copy of the template strand
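
The relation between the template strand, the coding strand and the mRNA can be checked with a small Python sketch. The function names and the toy sequence are assumptions for illustration only; strands are written 5' to 3', following the convention introduced earlier.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the opposite strand of seq, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def transcribe(template):
    """Copy a template strand (5' to 3') into its mRNA (5' to 3').

    The mRNA is complementary to the template, with U in place of T.
    """
    return reverse_complement(template).replace("T", "U")


coding = "ATGGCCTAA"                   # hypothetical coding strand
template = reverse_complement(coding)  # the strand that is actually copied
mrna = transcribe(template)

print(template)
print(mrna)
# The mRNA has the same sequence as the coding strand, apart from U for T.
print(mrna == coding.replace("T", "U"))
```

This is why annotation can be given on the coding strand: reading the coding strand directly gives the mRNA sequence.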

Gene Structure
Eukaryotic genes are short DNA stretches within a genome with a peculiar and discrete structure.

Schematic representation of a two-exon eukaryotic gene on a DNA sequence

Gene prediction programs make use of this structure to find genes on a genome. Its main characteristics are:

  • Coding and non-coding exons (UTRs)

  • Introns

  • Translation start site (ATG)

  • Splice sites (GT, donor and AG, acceptor)

  • Translation termination site (STOPs: TAG, TGA and TAA)

Follow the Exon information link.


  1. How many exons does this gene have?
  2. Are exons completely coding?
  3. What is the difference between upstream/downstream, UTRs and intron sequences?
  4. Can you spot any common pattern on the start/end of introns?
  5. What is the supporting evidence for this gene? Would you trust it?

Splicing is an RNA-processing step in which introns in the primary transcript are removed. Splicing signals, GT (donor) and AG (acceptor) in the intron region, are used to delimit exon-intron boundaries, so that exons (coding and non-coding ones) are joined together. In this way, the open reading frame sequence along with the 5' and 3' Untranslated Regions (UTRs) are ready to be processed by the ribosome.

The spliceosome complex splices out intron sequences
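
Computationally, splicing amounts to joining the exon intervals and discarding what lies between them. The following Python sketch assumes a toy two-exon gene and 0-based, end-exclusive coordinates; both the sequence and the function names are made up for this example, and the sequence uses T rather than U, as sequence databases do.

```python
def splice(pre_mrna, exons):
    """Join the exons, given as (start, end) pairs, into the mature mRNA."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def introns(exons):
    """Return the (start, end) intervals lying between consecutive exons."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def is_canonical(pre_mrna, intron):
    """Check that an intron starts with GT (donor) and ends with AG (acceptor)."""
    start, end = intron
    return pre_mrna[start:start + 2] == "GT" and pre_mrna[end - 2:end] == "AG"


# Hypothetical toy gene: exon 1, one intron, exon 2.
exon1, intron1, exon2 = "ATGGCC", "GTAAGTTTTCAG", "GATTAA"
pre_mrna = exon1 + intron1 + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron1), len(pre_mrna))]

mature = splice(pre_mrna, exons)
print(mature)
print([is_canonical(pre_mrna, i) for i in introns(exons)])
```

Real annotation files often use 1-based, inclusive coordinates, so the intervals would need converting before slicing.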

Follow the Transcript information link.


  1. Find how to highlight coding and non-coding region (UTR) in the transcript
  2. Can you think of a biological role for UTRs?
  3. Check that the start and the end of the coding region have the right signals
  4. Which of the three stop codons does this mRNA have?
  5. How many SNPs are annotated? In the UTR? In the CDS? Are they synonymous changes?

In translation, the mature mRNA sequence is decoded into a protein. Again, the ribosomal machinery is guided by several signals along the mRNA sequence to find the right open reading frame (ORF) and to know where translation should terminate.

Translation maps RNA to protein through a 3-to-1 letter code
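
As a rough illustration of this 3-to-1 mapping, the Python sketch below builds the standard genetic code and translates the first open reading frame it finds. It is a simplification under stated assumptions: it always starts at the first ATG, it uses T instead of U (as sequence databases do), and the function name and toy sequence are invented for this example.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position;
# "*" marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_first_orf(mrna):
    """Translate from the first ATG up to the first in-frame stop codon."""
    start = mrna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation terminates
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy mRNA with short 5' and 3' untranslated regions.
print(translate_first_orf("CCATGGCCGATTAAGG"))
```

Choosing the first ATG is the simplest possible rule; picking the true start site and frame is exactly the part that is hard to reproduce computationally.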

On the other hand, there are three types of transcript data:

  1. mRNA: messenger RNA

  2. cDNA: a full-length (not always) and high-quality copy of an mRNA sequence

  3. EST: expressed sequence tag. A partial and low-quality copy of an mRNA sequence

Follow the Protein information link.


  1. What is the length of the protein?
  2. Are exons of similar length?
  3. In which chromosomes are other members of this family?
  4. Is there any known protein domain annotated?
  5. Can you get the signature for this domain?
  6. Take a look at other genes with this domain
  7. What is the function of this protein? Do you think it is essential for you?

Let's now try to get a deeper insight into the biological role of this gene. We will connect to a couple of web-based resources to learn more about our gene and protein. GeneCards is a database of human genes, their products and their involvement in diseases. Connect to this database and search by symbol/alias for the approved HUGO symbol of our adh gene: AKR1A1 (as shown in the Ensembl gene report).


  1. Does it have alternative transcripts? hint: go to GeneAtlas
  2. What is the Gene Ontology?
  3. Check gene expression in human tissues. What is an electronic northern?
  4. Do you think that, given the number of clones, the electronic northern is significant?
  5. Can you tell disorders and mutations in which this gene is involved?
  6. What is its cellular location? hint: go to GeneAtlas

Finally, we will try to get all the possible identifiers for this gene across several databases. Connect to GeneLynx and do a quick search in human for the HUGO ID: AKR1A1.


  1. What is the SwissProt ID?
  2. What is the PDB ID? Which method was used to characterize this structure? At which resolution?
  3. Can you get a picture of the assumed biological protein at PDB?
  4. Can you get the metabolic pathway in which this gene is involved? hint: go to KEGG pathway

Data integration
It is about time to fully use the main feature of genome browsers: the ability to display all available information along a genomic region of interest. However, first take a look at the picture below to make sure you understand the pipeline behind these data.

Genome analysis: from sequencing to annotation

We will browse the genomic region of the adh gene we are working with. Follow this link. There are four levels of resolution:

  1. Chromosome view
    • On which arm of chromosome 1 is the adh gene?

  2. Overview
    • Is there any novel gene in this region? Any pseudogene?
    • What are the mouse and rat synteny tracks?

  3. Detailed view
    • Make sure you understand each track

  4. Basepair view
    • Take a look at the region translated in the 3 possible frames for each strand (forward and reverse)

Alternative Transcription
The transcription start site can vary within the same gene, depending on how the promoter region is activated. This results in a different pre-mRNA and, potentially, in a differentially expressed mRNA or even a distinct protein.

Alternative Splicing
Briefly, alternative splicing is an important cellular mechanism that leads to temporal and tissue-specific expression of unique mRNA products. This is accomplished by the use of alternative splice sites, which results in the differential inclusion of RNA sequences (exons) in the mature mRNA.

Alternative splicing produces unique mRNA products

In general, current gene prediction programs cannot predict alternative mRNAs in a reliable way, unless transcriptional data (mRNAs, cDNAs and ESTs) are available.
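
The differential inclusion of exons can be pictured with a short Python sketch that enumerates the mature mRNAs obtained by skipping optional (cassette) exons. The exon sequences, the choice of which exon is optional and the function name are all assumptions for illustration; real isoforms have to be supported by transcript data.

```python
from itertools import combinations


def isoforms(exons, cassette):
    """Return the mRNAs made by skipping every subset of the cassette exons.

    exons    -- list of exon sequences, in genomic order
    cassette -- indices of the exons that may be skipped
    """
    results = []
    for n in range(len(cassette) + 1):
        for skipped in combinations(cassette, n):
            kept = [seq for i, seq in enumerate(exons) if i not in skipped]
            results.append("".join(kept))
    return results


# Hypothetical three-exon gene in which the middle exon is optional.
exons = ["ATGGCC", "GATCGA", "TTTTAA"]
for mrna in isoforms(exons, cassette=[1]):
    print(mrna)
```

The number of candidate isoforms doubles with every optional exon, which gives a sense of why predicting the ones actually expressed is unreliable without mRNA, cDNA or EST evidence.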

Alternative Translation
Translation by the ribosome is a complex process. Sources of variability are:
  1. In the same mRNA, alternative translation start sites (ATG) can be used. Furthermore, translation can even start at GTG (valine), TTG (leucine) or ATT (isoleucine); these codons still specify methionine when they function as initiator codons.

  2. Alternative decoding (recoding):

    • In the ribosome, the meaning of specific codons can be redefined

    • The ribosome can alter the reading frame by switching from one overlapping reading frame to another

    • The ribosome can bypass a stretch of sequence, with or without a change in the reading frame

  3. In the same mRNA, alternative poly-A sites modify the final UTR region
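
To see how alternative start sites multiply the possible reading frames, the Python sketch below lists every start codon that is followed in frame by a stop codon. The set of initiators is the one named in the list above; the function name and the toy sequence are assumptions, and recoding events such as frameshifting or bypassing are not modelled.

```python
START_CODONS = {"ATG", "GTG", "TTG", "ATT"}  # initiator codons named in the text
STOP_CODONS = {"TAG", "TGA", "TAA"}


def candidate_orfs(mrna):
    """Return (start, end) for each start codon with an in-frame stop codon.

    Coordinates are 0-based and end-exclusive; end includes the stop codon.
    """
    orfs = []
    for start in range(len(mrna) - 2):
        if mrna[start:start + 3] not in START_CODONS:
            continue
        # Walk codon by codon, in frame with this start, until a stop.
        for i in range(start + 3, len(mrna) - 2, 3):
            if mrna[i:i + 3] in STOP_CODONS:
                orfs.append((start, i + 3))
                break
    return orfs


# Hypothetical toy mRNA with an ATG and a downstream in-frame GTG.
print(candidate_orfs("ATGAAAGTGCCCTAA"))
```

In this toy case both candidates share the same stop codon, so the second would give a shorter protein missing the first residues of the longer one.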

LINUX and genome annotation
Try to complete this practical.