Genes and Genomes

Written by Sergi Castellano


In what follows, we review the key concepts of eukaryotic gene and genome structure needed to understand the annotation of genomes. In this context, annotation refers to the description and location of genes and other biologically relevant features of a genomic sequence. Our main goal is to fully comprehend what genome annotation projects offer and, just as important, what they do not yet provide.

In this regard, it is worth noting that current gene prediction programs, among other bioinformatics tools, systematically ignore the complexity of eukaryotic gene structure. This complexity comes from alternatively spliced genes, from non-canonical signals (which affect either splicing or translation) and from promoters, the regions that control gene transcription, which are not yet well understood. Here, we give an overview of some of the limitations and future directions of the gene prediction field.


The genome is the genetic material of an organism, that is, the total amount of DNA in the cell. In eukaryotes, it is usually organized into a set of chromosomes, which are extremely long chains of DNA that are highly condensed. In the picture below, human DNA is shown packaged into chromosome units (as seen during mitotic metaphase). Note the sister chromatids (that contain identical daughter DNA molecules), centromeres and telomeres.

Human chromosomes

It is time to introduce the three main genome browsers and check which genomes they are serving. In this practical, we will stick to the Ensembl server, but feel free to browse the others later on.


  1. How many species does Ensembl provide annotation for?
  2. What is the Trace server? Hint: how are genomes sequenced?
  3. What is the EnsMart service? Hint: be patient! We will use it in the next lesson.
  4. What is the difference between the BLAST and SSAHA programs? Hint: think of a strict BLAST search.

Now, select human and make sure you understand the main concepts in this page. Contrary to the view reported in newspapers, the genome sequence and its corresponding annotations are highly dynamic. There are two levels at which data can be updated: 1) sequence level and 2) annotation level.


  1. How many releases has Ensembl provided annotation for?
  2. What are the main reasons that drive the release of genome versions?
  3. How many genes, exons and nucleotides does the last genome version have?
  4. Do you think that, over the same sequence, Ensembl, NCBI and Golden Path provide the same annotation?
  5. What is the Vega annotation? Do you think it is more reliable than the default Ensembl annotation?


DNA molecules consist of two anti-parallel chains held together by complementary base pairs that form a double helix. This structure is of major importance for computational analysis and, to a great extent, determines how it can be performed:

  1. The digital nature of the sequence (nucleotides: Adenine, Guanine, Cytosine and Thymine) permits an easy and symbolic computational representation as A, G, C and T letter codes, respectively. It is worth knowing that Uracil (U), which is in place of Thymine in RNA, is also written as T in sequence databases.

  2. The double-helical nature of DNA gives us two different sequences to analyze (with distinct encoded information). To handle such dual data, the concept of forward or positive (+) and reverse or negative (-) strands and elements (genes, exons, introns...) is introduced. The forward strand, for us, is simply the original sequence we are working on. Note that this distinction is meaningless in the cell, which treats both strands alike. For example, genes are transcribed from both strands.

  3. The complementary nature of the two strands (A-T, G-C base pairing) lets us work with only one strand (forward) in the computer, the other (reverse) being derived when needed. Genome projects usually provide only one sequence strand (forward), and both forward and reverse elements are annotated on it, the latter simply being tagged as reverse. Note, again, that the cell has access to both strands at the same time.

  4. The anti-parallel nature of the double helix (due to the 5' and 3' nucleotide ends) gives polarity to the strands. By general agreement, DNA sequences are written from 5' to 3' (do not confuse this with the forward and reverse concept). The 5' region is also known as the upstream region and, accordingly, the 3' region is called the downstream region of the sequence. Be aware that the upstream and downstream concepts have nothing to do with the cis and trans relationship between biological elements, such as transcription factors and their DNA binding sites (see below).

  5. The triplet nature of the genetic code (a codon is 3 nucleotides long) means that any potentially coding sequence can be translated in six different ways (frames): three on the forward strand and three more on the reverse strand. Finding the right frame, as the ribosome does (it already knows the strand), is a challenging process to reproduce computationally.
From DNA to chromosomes
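
The strand and frame conventions listed above can be sketched in a few lines of Python. This is only an illustration: the function names and the example sequence are made up for this purpose, and the sequence is assumed to contain only the letters A, C, G and T.

```python
# Complement of each nucleotide (A-T and G-C base pairing).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse strand of a forward sequence, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq: str) -> dict:
    """Split a sequence into codons for the three forward and three reverse frames."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Only complete codons are kept; trailing nucleotides are dropped.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


forward = "ATGGCCTAA"  # made-up example sequence
print(reverse_complement(forward))
for name, codons in six_frames(forward).items():
    print(name, codons)
```

Only the forward strand is stored; the reverse strand is derived on demand, which mirrors how genome projects distribute a single strand and tag reverse elements on it.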

Note that, in this course, we only analyze DNA at the sequence level. So, let's now check it. Please select chromosome 21 in the karyotype plot.


  1. What is the difference between known and novel genes?
  2. What does SNP stand for?
  3. Why is the chromosome short arm (p arm) shaded? Hint: is there any feature annotated in this region?
  4. Is there a correlation between repeats and the GC content?

Now, let's search for a specific gene over the whole genome. In the Find window, look up the alcohol dehydrogenase gene.


  1. How many entries do you get? Why?
  2. Do you think it is a good idea to search genes by keywords? Hint: try "alcohol dehydrogenase"
  3. Select the first entry and display the gene report. In which chromosome is the gene?

However, before analyzing the alcohol dehydrogenase gene, we will take a look at the eukaryotic gene structure, processing and expression (see below).

Gene Expression: from DNA to RNA to Protein

Transcription, splicing and translation are the main processes that account for the expression of protein coding genes. Each step is directed by sequence and structural signals. In what follows, we describe, from both the biological and the computational point of view, how these sequence motifs are used to go from DNA to RNA to the final protein product.

The schema below highlights these processing steps:

mRNA processing pathway

Locate the "Transcript Structure" section in the alcohol dehydrogenase gene report. Note the correspondence between levels of reported data and processing steps:

We will follow these links in the order shown, but first, move to the text below for a more precise discussion of the eukaryotic gene structure.


Transcription starts when a region upstream of the gene (promoter region) is activated (bound) by transcription factors. These regions control whether a gene is transcribed from the forward or the reverse strand. In either case, the strand that is actually transcribed is called the template or antisense strand, and the other is called the coding or sense strand.

Promoter region

In short, transcription is the copying of DNA (template strand) to RNA (pre-mRNA). However, when analyzing mRNA, cDNA or EST data, bear in mind that the mRNA to be translated is, in sequence, identical to the coding strand (coding here always refers to translation, and not to transcription). That is, the mRNA is transcribed from the strand that has its complementary sequence. In conclusion, when annotating genomes, genes are annotated in relation to their coding strand.
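
The relation between template strand, coding strand and mRNA can be checked with a short Python sketch. The `transcribe` function and both example sequences are made up for illustration, and the input is assumed to be upper-case A, C, G and T only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def transcribe(template_5to3: str) -> str:
    """Copy a template strand (given 5' to 3') into mRNA (written 5' to 3')."""
    # The reverse complement of the template is the coding strand.
    coding = template_5to3.translate(COMPLEMENT)[::-1]
    # RNA carries Uracil where DNA carries Thymine.
    return coding.replace("T", "U")


coding_strand = "ATGGCCTAA"    # made-up example, written 5' to 3'
template_strand = "TTAGGCCAT"  # its reverse complement, also written 5' to 3'

mrna = transcribe(template_strand)
print(mrna)
# Apart from U in place of T, the mRNA matches the coding strand.
print(mrna.replace("U", "T") == coding_strand)
```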

There are three main types of transcript data:

  1. mRNA: messenger RNA

  2. cDNA: complementary DNA. A double-stranded DNA copy, usually a fragment, of an mRNA molecule.

  3. EST: expressed sequence tag. A short, single-pass sequence read of a cDNA clone. It is typically a fragment from the 5' or the 3' end of the cDNA.
The copy of the template strand

Cis and Trans Elements

The transcription process brings us to two relevant terms in the study of gene regulation: cis and trans elements. A locus is a cis-acting element if it must be on the same DNA molecule in order to have its effect. Transcription factor binding sites are a good example of cis-acting regulatory elements. A locus is trans-acting if it can affect a second locus even when on a different DNA molecule. Transcription factors are a good example of trans-acting regulatory elements. Note, again, that these terms are distinct from the notions of upstream and downstream region explained above.

Another biological example is the so-called trans-splicing, where exons from different transcripts are spliced and joined together. That is, elements from independent sequences end up acting together in the same mature mRNA.

Gene Structure

Eukaryotic genes are short DNA stretches within a genome with a peculiar and discrete structure.

Schematic representation of a two exon eukaryotic gene on a DNA sequence

Gene prediction programs make use of this structure to find genes in a genome. The main characteristics are:

Follow the Exon information link.


  1. How many exons does this gene have?
  2. Are the exons completely coding?
  3. What is the difference between upstream/downstream, UTRs and intron sequences?
  4. Can you spot any common pattern at the start and end of introns?
  5. What is the supporting evidence for this gene? Would you trust it?


Splicing is an RNA-processing step in which introns in the primary transcript are removed. Splicing signals, GT (donor) and AG (acceptor) in the intron region, are used to delimit exon-intron boundaries, so that exons (coding and non-coding ones) are joined together. In this way, the open reading frame sequence along with the 5' and 3' Untranslated Regions (UTRs) are ready to be processed by the ribosome.

The spliceosome complex removes intron sequences (the exons are spliced together)
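
As a minimal sketch of this step, the Python function below removes introns from a transcript and checks the GT (donor) and AG (acceptor) signals. The `splice` function, the coordinate convention (0-based, end-exclusive) and the example transcript are assumptions made for illustration. Following the database convention, the transcript is written with DNA letters.

```python
def splice(pre_mrna: str, introns: list) -> str:
    """Remove introns given as (start, end) pairs, 0-based and end-exclusive."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        intron = pre_mrna[start:end]
        # Canonical introns start with GT (donor) and end with AG (acceptor).
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(f"non-canonical intron at {start}-{end}: {intron}")
        exons.append(pre_mrna[position:start])
        position = end
    exons.append(pre_mrna[position:])
    # The remaining exons are joined into the mature transcript.
    return "".join(exons)


# Made-up transcript: exon 1, one intron, exon 2.
pre_mrna = "ATGGCC" + "GTAAGTCTCAG" + "TTCTAA"
print(splice(pre_mrna, [(6, 17)]))
```

Raising an error on non-canonical introns is a deliberate simplification: real genes also use non-canonical signals, which is one reason why they are hard to predict.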

Follow the Transcript information link.


  1. How many different transcripts does this gene have?
  2. How many different proteins does this gene produce?
  3. Find out how to highlight the coding and non-coding regions (UTR) in the transcript
  4. Can you think of a biological role for UTRs?
  5. Check that start and end of the coding region have the right signals.
  6. Which of the three stop codons does this mRNA have?
  7. How many SNPs are annotated? How many in the UTR? How many in the CDS? Are they synonymous changes?


In translation, the mature mRNA sequence is translated into a protein. Again, the ribosomal machinery is guided by several signals along the mRNA sequence to find the right open reading frame (ORF) and to determine where translation should terminate.

Translation maps RNA to proteins, from a 3 letter code to a 1 letter code
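
A simplified version of this process can be written in Python. The sketch below assumes the standard genetic code, starts at the first ATG it finds and stops at the first in-frame stop codon. Real ribosomes use more signals than this, and the function name and example mRNA are made up for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_orf(mrna: str) -> str:
    """Translate from the first ATG up to the first in-frame stop codon."""
    # Accept RNA or DNA letters; only A, C, G and T/U are assumed to occur.
    seq = mrna.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":  # stop codon: TAA, TAG or TGA
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_orf("GGCATGGCCTTCTAAGG"))  # made-up mRNA with short UTRs
```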

Follow the Protein information link.


  1. What is the length of the protein?
  2. Are all exons of similar length?
  3. In which chromosomes are other members of this family?
  4. Is there any known protein domain annotated?
  5. Can you get the signature for this domain?
  6. Take a look at other genes with this domain.
  7. What is the function of this protein? Do you think this gene is essential for you?
  8. Look for the Gene Ontology in the Gene Report page.

Let's now try to get a deeper insight into the biological role of this gene. We will connect to a couple of web-based resources to learn more about our gene and protein. GeneCards is a database of human genes, their products and their involvement in diseases. Connect to this database and search GeneCards by symbol using the HUGO ID corresponding to the adh gene: AKR1A1 (as shown in the Ensembl gene report).


  1. Check the gene expression in several human tissues. What is an electronic northern?
  2. Do you think that, given the number of clones, the electronic northern is significant?
  3. Can you list any disorders and mutations in which this gene is involved?
  4. What is its cellular location? Hint: go to the GenAtlas database

Finally, we will try to get other possible identifiers for this gene in several databases. Connect to GeneLynx and do a quick search in human for the HUGO ID: AKR1A1.


  1. What is the SwissProt ID?
  2. What is the PDB ID? Which method was used to characterize this structure? At which resolution?
  3. Can you get a picture of the assumed biological protein at PDB?
  4. Can you get the metabolic pathway in which this gene is involved? Hint: go to KEGG pathway

Data Integration

It is about time to fully use the main feature of genome browsers: the ability to display all available information along a genomic region of interest. However, first take a look at the picture below to make sure you understand the pipeline behind these data.

Genome analysis: from sequencing to annotation

We will browse the genomic region of the adh gene that we are working with. From the Gene Report page, follow the link with the genomic location. There are four levels of resolution:

  1. Chromosome view
  2. Overview
  3. Detailed view
  4. Basepair view


The mouse and rat synteny tracks above call for a brief discussion of this concept. Historically, synteny means "on the same strand" (that is, on the same chromosome), and syntenic genes are those in such a disposition. Between genomes, however, syntenic regions are understood as regions in which gene order is conserved. Syntenic genes are, therefore, putative homologues that have an orthologous, paralogous or even xenologous relation. See below for a short definition of these key but often misinterpreted concepts.
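
The idea of conserved gene order can be illustrated with a deliberately naive Python sketch. It assumes that homologous genes carry the same identifier in both genomes and that they keep the same orientation; the gene names and the `conserved_blocks` function are hypothetical. Real synteny pipelines are far more tolerant of gaps, inversions and duplications.

```python
def conserved_blocks(order_a: list, order_b: list, min_genes: int = 2) -> list:
    """Return runs of genes that are consecutive, in the same order, in both genomes."""
    position_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    for gene in order_a:
        extends = (
            gene in position_b
            and current
            and position_b[gene] == position_b[current[-1]] + 1
        )
        if extends:
            current.append(gene)
        else:
            # Close the running block and start a new one if the gene is shared.
            if len(current) >= min_genes:
                blocks.append(current)
            current = [gene] if gene in position_b else []
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


# Hypothetical gene orders along one chromosome in two species.
species_a = ["g1", "g2", "g3", "g4", "g5"]
species_b = ["g4", "g5", "g1", "g2", "g3"]
print(conserved_blocks(species_a, species_b))
```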

Go to Chromoview in the Ensembl web-site by clicking on a chromosome. From here we can link to Synteny view, which offers a chromosomal view of the synteny between genomes.

Homology: Orthology, Paralogy and Xenology

Homologous genes are genes that are related through a common evolutionary ancestor. Homology is usually inferred on the basis of sequence similarity but bear in mind that, through random and convergent evolutionary processes, biological sequences can share a reasonable degree of similarity without a true evolutionary relationship. In addition, it is incorrect to say that a pair of related genes are, for example, 80% homologous, because genes are either evolutionarily related or not. On the other hand, one can speak of a percentage of sequence similarity between genes.

This is of importance, for instance, when reading BLAST output and deriving evolutionary implications. The score and the E-value are based on the similarity between sequences. However, this does not necessarily ensure a close phylogenetic relationship, although it suggests one.
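
To make the distinction concrete, the Python sketch below computes percent identity, one simple measure of sequence similarity, from a pairwise alignment. The function name, the gap symbol "-" and the example alignment are assumptions made for illustration. The number it returns describes similarity only and says nothing, by itself, about homology.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of aligned columns (no gap in either row) with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Columns with a gap in either sequence are ignored.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Made-up pairwise alignment; "-" marks a gap.
print(percent_identity("ATGGCC-TAA", "ATGACCGTAA"))
```

Other definitions divide by the full alignment length or by the length of the shorter sequence, so the denominator should always be stated when reporting such a percentage.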

Orthology, paralogy and xenology are homology subtypes. That is, they define a specific type of relationship between genes over space and time. Read these definitions carefully:

This paper discusses these concepts in more detail.

Alternative Transcription

The transcription start site can vary within the same gene, depending on how the promoter region is activated. This results in a different pre-mRNA and, potentially, in a differentially expressed mRNA or even a distinct protein.

Alternative Splicing

Briefly, alternative splicing is an important cellular mechanism that leads to temporal and tissue specific expression of unique mRNA products. This is accomplished by the usage of alternative splice sites that result in the differential inclusion of RNA sequences (exons) in the mature mRNA.

Alternative splicing produces unique mRNA products
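
The differential inclusion of exons can be sketched in Python. The example below covers only one form of alternative splicing, the skipping of whole exons; the function name, the exon sequences and the choice of which exon is optional are all hypothetical.

```python
from itertools import combinations


def exon_skipping_isoforms(exons: list, optional: set) -> list:
    """Enumerate mature mRNAs obtained by skipping any subset of the optional exons."""
    isoforms = []
    optional_indices = sorted(optional)
    for n_skipped in range(len(optional_indices) + 1):
        for skipped in combinations(optional_indices, n_skipped):
            # Keep every exon whose index is not in the skipped subset.
            kept = [exon for i, exon in enumerate(exons) if i not in skipped]
            isoforms.append("".join(kept))
    return isoforms


# Made-up exons; the exon at index 1 is treated as optional.
exons = ["ATGGCC", "GATTAC", "TTCTAA"]
for isoform in exon_skipping_isoforms(exons, optional={1}):
    print(isoform)
```

With n optional exons this enumeration yields 2^n candidate mRNAs, which hints at why predicting the isoforms actually expressed is unreliable without transcript evidence.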

In general, current gene prediction programs cannot predict alternative mRNAs in a reliable way, unless transcriptional data (mRNAs, cDNAs and ESTs) are available.

Alternative Translation

Translation by the ribosome is a complex process. Sources of variability are:

  1. In the same mRNA, alternative translation start sites (ATG) can be used. Furthermore, translation can even start at GTG (valine), TTG (leucine) or ATT (isoleucine), which still code for methionine when they function as initiator codons.

  2. Alternative decoding (recoding):

  3. In the same mRNA, alternative poly-A sites modify the 3' UTR region.
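
To see how quickly candidate start sites multiply, the Python sketch below lists every occurrence of ATG and of the alternative initiator codons named in the first item of the list above. The function name and the example sequence are made up, and finding a candidate codon does not mean it is actually used by the ribosome.

```python
# ATG plus the alternative initiator codons named in the text.
START_CODONS = ("ATG", "GTG", "TTG", "ATT")


def candidate_starts(mrna: str) -> list:
    """Return (position, codon) for every candidate initiator codon, in any frame."""
    # Accept RNA or DNA letters; positions are 0-based.
    seq = mrna.upper().replace("U", "T")
    return [
        (i, seq[i:i + 3])
        for i in range(len(seq) - 2)
        if seq[i:i + 3] in START_CODONS
    ]


print(candidate_starts("CCGTGATGATTGA"))  # made-up sequence
```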

LINUX and Genome Annotation

Try to complete this practical.

Fundació La Caixa