Genes and Genomes
Written by Sergi Castellano

Overview

In what follows, we review the key concepts of eukaryotic gene and genome structure needed to understand the annotation of genomes. In this context, annotation refers to the description and location of genes and other biologically relevant features of a genomic sequence. Our main goal is to fully comprehend what genome annotation projects offer and, just as important, what they do not yet provide.

In this regard, it is worth noting that current gene prediction programs, among other bioinformatics tools, systematically ignore the complexity of eukaryotic gene structure. This complexity arises from alternatively spliced genes, from non-canonical signals (that affect either splicing or translation) and from the regions that control gene transcription (promoters), which are not yet well understood. Here, we give an overview that shows some of the limitations and future directions in the gene prediction field.


Genomes
The genome is the genetic material of an organism, that is, the total amount of DNA in the cell. In eukaryotes, it is usually organized into a set of chromosomes, which are extremely long chains of DNA that are highly condensed. In the picture below, human DNA is shown packaged into chromosome units (as seen during mitotic metaphase). Note the sister chromatids (that contain identical daughter DNA molecules), centromeres and telomeres.

chromosome
Human chromosomes

It is time to introduce the three main genome browsers and check which genomes they are serving. In this practical, we will stick to the Ensembl server, but feel free to browse the others later on.

Questions:

  1. How many species does Ensembl provide annotation for?
  2. What is the Trace server? Hint: how are genomes sequenced?
  3. What is the difference between the BLAST and SSAHA programs? Hint: think of a strict blast search

Now, select human and make sure you understand the main concepts in this page. Contrary to the view reported in newspapers, the genome sequence and its corresponding annotations are highly dynamic. There are two levels at which data can be updated: 1) sequence level and 2) annotation level.

Questions:

  1. How many releases has Ensembl provided annotation for?
  2. What are the main reasons that drive the release of genome versions?
  3. How many genes, exons and nucleotides does the last genome version have?
  4. Do you think that, over the same sequence, Ensembl, NCBI and Golden Path provide the same annotation?
  5. What is the Vega annotation? Do you think it is more reliable than the default Ensembl annotation?

DNA
DNA molecules consist of two anti-parallel chains held together by complementary base pairs that form a double helix. This structure is of major importance for computational analysis and, to a great extent, determines how it can be performed:

  1. The digital nature of the sequence (nucleotides: Adenine, Guanine, Cytosine and Thymine) permits an easy and symbolic computational representation as A, G, C and T letter codes, respectively. It is worth knowing that Uracil (U), which replaces Thymine in RNA, is also written as T in sequence databases.

  2. The double helical nature of DNA gives us two different sequences to analyze (with distinct encoded information). To handle such dual data, we introduce the concept of forward or positive (+) and reverse or negative (-) strands and elements (genes, exons, introns...). The forward strand, for us, is simply the original sequence we are working on. Note that this distinction is a bookkeeping convention with no meaning in the cell, which treats both strands alike. For example, genes are transcribed from both strands.

  3. The complementary nature of the two strands (A-T, G-C base-pairing) allows us to work in the computer with only one strand (forward), the other (reverse) being conceptually retrieved when needed. Usually, genome projects provide only one sequence strand (forward), and both forward and reverse elements are annotated on it, the latter simply being tagged as reverse. Again, note that the cell has access to both strands at the same time.

  4. The anti-parallel nature of the double helix (due to the 5' and 3' nucleotide ends) gives polarity to the strands. There is a general agreement to write DNA sequences from 5' to 3' (do not confuse this with the forward and reverse concept). The region towards the 5' end is also known as the upstream region and, correspondingly, the region towards the 3' end is called the downstream region of the sequence. Be aware that the upstream and downstream concepts have nothing to do with the cis and trans relationship between biological elements, such as transcription factors and their DNA binding sites (see below).

  5. The triplet nature of the genetic code (a codon is 3 nucleotides long) allows any potentially coding sequence to be translated in six different ways (frames): three on the forward strand and three more on the reverse strand. Finding the right frame, which the ribosome does while already knowing the strand, is a challenging process to reproduce computationally.
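
Points 3 and 5 above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the standard genetic code and a sequence made of A, C, G and T; the function names (reverse_complement, translate, six_frames) and the toy sequence are our own choices and do not come from any genome browser or gene prediction program.

```python
from itertools import product

# Standard genetic code in the conventional TCAG codon order (assumed here).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse strand, written 5' to 3', for a forward-strand sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate consecutive codons; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Translate a sequence in three frames on each strand."""
    forward = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", forward), ("-", reverse_complement(forward))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    # Hypothetical toy sequence: ATG GCC TAA on the forward strand.
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

For the toy sequence, frame +1 reads MA* (methionine, alanine, stop), while the other five frames give unrelated peptides. Nothing in the sequence itself says which of the six is the real one, which is exactly the difficulty described in point 5.
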
DNA to chromosomes
From DNA to chromosomes

Note that, in this course, we only analyze DNA at the sequence level. So, let's now check it. Please select chromosome 21 in the karyotype plot.

Questions:

  1. What is the difference between known and novel genes?
  2. What does SNP stand for?
  3. Why is the chromosome short arm (p arm) shaded? Hint: are any features annotated in this region?
  4. Is there a correlation between repeats and the GC content?

Now, let's search for a specific gene over the whole genome. In the Find window, look up the "alcohol dehydrogenase" gene.

Questions:

  1. How many entries do you get? Why?
  2. Select the first entry and display the gene report. In which chromosome is the gene?

However, before analyzing the alcohol dehydrogenase gene, we will take a look at the eukaryotic gene structure, processing and expression (see below).


Gene expression: from DNA to RNA to protein
Transcription, splicing and translation are the main processes that account for the expression of protein coding genes. Each step is directed by sequential and structural signals. In what follows, we describe, from both the biological and the computational point of view, how these sequence motifs are used to go from DNA to RNA to the final protein product. The scheme below highlights these processing steps:


Gene
mRNA processing pathway

Locate the "Prediction Transcript" section in the alcohol dehydrogenase gene report. Note the correspondence between levels of reported data and processing steps:

  • Exon information (gene structure on the DNA sequence): sequence before transcription

  • Transcript information (mature mRNA sequence): sequence after splicing

  • Protein information: sequence after translation

We will follow these links in the order shown, but first, move to the text below for a more precise discussion of the eukaryotic gene structure.


Transcription
Transcription starts when a region upstream of the gene (promoter region) is activated (bound) by transcription factors. These regions control whether a gene is transcribed from the forward or reverse strand. In any case, the strand that is actually transcribed is called the template or antisense strand, and the other is called the coding or sense strand.

promoter region
Promoter region

In short, transcription is the copying of DNA (template strand) to RNA (pre-mRNA). However, when analyzing mRNA, cDNA or EST data, bear in mind that the mRNA to be translated is, in sequence, identical to the coding strand, apart from U replacing T (coding here always refers to translation, and not to transcription). That is, the mRNA is transcribed from the strand that has its complementary sequence. In conclusion, when annotating genomes, genes are annotated in relation to their coding strand.
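
The relationship between the template strand, the coding strand and the mRNA can be checked with a short Python sketch. The helper names (template_strand, transcribe) and the toy sequence are hypothetical, and both strands are assumed to be written 5' to 3' using only A, C, G and T.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
RNA_PAIRING = str.maketrans("ACGT", "UGCA")  # base each DNA letter pairs with in RNA


def template_strand(coding):
    """Return the template strand (5' to 3') for a coding strand given 5' to 3'."""
    return coding.upper().translate(DNA_COMPLEMENT)[::-1]


def transcribe(template):
    """Copy a template strand (5' to 3') into its RNA transcript (5' to 3')."""
    return template.upper().translate(RNA_PAIRING)[::-1]


if __name__ == "__main__":
    coding = "ATGGCCTAA"  # hypothetical coding strand
    template = template_strand(coding)
    mrna = transcribe(template)
    print(template)  # TTAGGCCAT
    print(mrna)      # AUGGCCUAA
    # Writing U back as T recovers the coding strand, as stated above.
    print(mrna.replace("U", "T") == coding)  # True
```
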

transcription
The copy of the template strand

Cis and Trans Elements
The transcription process brings us to two relevant terms in relation to the study of gene regulation: cis and trans elements. A locus is a cis-acting element if it must be on the same DNA molecule in order to have its effect. Transcription factor binding sites are a good example of cis-acting regulatory elements. A locus is trans-acting if it can affect a second locus even when on a different DNA molecule. Transcription factors are a good example of trans-acting regulatory elements. Note, again, that these terms are distinct from the notions of upstream and downstream region explained above.

Another biological example is the so-called trans-splicing, where exons from different transcripts are spliced and joined together. That is, elements from independent sequences end up acting together in the same mature mRNA.

Gene Structure
Eukaryotic genes are short DNA stretches within a genome with a peculiar and discrete structure.


Gene
Schematic representation of a two exon eukaryotic gene on a DNA sequence

Gene prediction programs make use of this structure to find genes in a genome. The main characteristics are:

  • Coding and non-coding exons (UTRs)

  • Introns

  • Translation start site (ATG)

  • Splice sites (GT, donor and AG, acceptor)

  • Translation termination site (STOPs: TAG, TGA and TAA)
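
As a rough illustration of how these signals can be checked on a proposed gene model, the Python sketch below tests a toy two-exon gene. Everything here is an assumption made for the example: the function name check_gene_signals, the toy sequence, and the convention that coding exons are given as 0-based, end-exclusive intervals on the forward strand. Real gene prediction programs score such signals statistically rather than testing them as yes/no conditions, and this sketch also ignores stop codons split across two exons.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def check_gene_signals(seq, coding_exons):
    """Check the canonical signals of a gene model on the forward strand.

    coding_exons: list of (start, end) intervals, 0-based and end-exclusive,
    covering only the coding part of the gene (no UTRs).
    """
    seq = seq.upper()
    first_start = coding_exons[0][0]
    last_end = coding_exons[-1][1]
    # Introns are the stretches between consecutive exons.
    introns = [
        seq[prev_end:next_start]
        for (_, prev_end), (next_start, _) in zip(coding_exons, coding_exons[1:])
    ]
    return {
        "start_codon": seq[first_start:first_start + 3] == "ATG",
        "stop_codon": seq[last_end - 3:last_end] in STOP_CODONS,
        "donor_sites": all(intron.startswith("GT") for intron in introns),
        "acceptor_sites": all(intron.endswith("AG") for intron in introns),
    }


# Hypothetical toy gene: exon 1, one intron (GT...AG), exon 2.
TOY_GENE = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
TOY_EXONS = [(0, 6), (18, 24)]

if __name__ == "__main__":
    print(check_gene_signals(TOY_GENE, TOY_EXONS))
```
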

Follow the Exon information link.

Questions:

  1. How many exons does this gene have?
  2. Are the exons completely coding?
  3. What is the difference between upstream/downstream, UTRs and intron sequences?
  4. Can you spot any common pattern on the start/end of introns?
  5. What is the supporting evidence for this gene? Would you trust it?

Splicing
Splicing is an RNA-processing step in which introns in the primary transcript are removed. Splicing signals, GT (donor) and AG (acceptor) in the intron region, are used to delimit exon-intron boundaries, so that exons (coding and non-coding ones) are joined together. In this way, the open reading frame sequence along with the 5' and 3' Untranslated Regions (UTRs) are ready to be processed by the ribosome.
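
Computationally, splicing amounts to cutting out the intron intervals and concatenating the exons. The Python sketch below assumes exon coordinates are known and given as 0-based, end-exclusive intervals; the names splice and introns_between and the toy transcript are hypothetical, and the sequence is written with T instead of U, as in sequence databases.

```python
def introns_between(exons):
    """Return the intron intervals implied by a sorted list of exon intervals."""
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]


def splice(primary_transcript, exons):
    """Join the exon intervals (0-based, end-exclusive) into a mature transcript."""
    return "".join(primary_transcript[start:end] for start, end in exons)


# Hypothetical primary transcript: 5' UTR "CC", coding exons, 3' UTR "TT".
PRIMARY = "CCATGGCC" + "GTAAGTTTTCAG" + "GGATAATT"
EXONS = [(0, 8), (20, 28)]

if __name__ == "__main__":
    for start, end in introns_between(EXONS):
        print("intron:", PRIMARY[start:end])  # intron: GTAAGTTTTCAG
    print("mature:", splice(PRIMARY, EXONS))  # mature: CCATGGCCGGATAATT
```

Note that the UTRs stay in the mature transcript: splicing removes introns only, and it is translation that later ignores the untranslated ends.
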

splicing
The spliceosome complex removes intron sequences (the exons are spliced together)

Follow the Transcript information link.

Questions:

  1. Find out how to highlight the coding and non-coding regions (UTR) in the transcript
  2. Can you think of a biological role for UTRs?
  3. Check that start and end of the coding region have the right signals.
  4. Which of the three stop codons does this mRNA have?
  5. How many SNPs are annotated? How many are in the UTR and how many in the CDS? Are those in the CDS synonymous changes?

Translation
In translation, the mature mRNA sequence is translated into a protein. Again, the ribosomal machinery is guided by several signals along the mRNA sequence to find the right open reading frame (ORF) and to determine where translation should terminate.
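
A very simplified version of this search can be written in Python. The sketch assumes the standard genetic code and, as a deliberate simplification, that translation starts at the first ATG of the mature mRNA and ends at the first in-frame stop codon; the names find_orf and translate_orf and the toy transcript are our own.

```python
from itertools import product

# Standard genetic code in the conventional TCAG codon order (assumed here).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_orf(mrna):
    """Return (start, end) of the ORF from the first ATG to its in-frame stop.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    Returns None if there is no ATG or no in-frame stop codon.
    """
    seq = mrna.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(seq) - 2, 3):
        if CODON_TABLE.get(seq[i:i + 3]) == "*":
            return start, i + 3
    return None


def translate_orf(mrna):
    """Translate the ORF found by find_orf, leaving out the stop codon."""
    orf = find_orf(mrna)
    if orf is None:
        return None
    start, end = orf
    seq = mrna.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(start, end - 3, 3)
    )


if __name__ == "__main__":
    transcript = "CCATGGCCGGATAATT"  # hypothetical mature mRNA with short UTRs
    print(find_orf(transcript))       # (2, 14)
    print(translate_orf(transcript))  # MAG
```
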

Gene
Translation maps RNA to proteins, from a 3 letter code to a 1 letter code

There are three types of transcript data:

  1. mRNA: messenger RNA

  2. cDNA: a full-length (not always) and high-quality copy of an mRNA sequence

  3. EST: expressed sequence tag. A partial and low-quality copy of an mRNA sequence

Follow the Protein information link.

Questions:

  1. What is the length of the protein?
  2. Are all exons of similar length?
  3. In which chromosomes are other members of this family?
  4. Is there any known protein domain annotated?
  5. Can you get the signature for this domain?
  6. Take a look at other genes with this domain.
  7. What is the function of this protein? Do you think this information is essential?

Let's now try to get a deeper insight into the biological role of this gene. We will connect to a couple of web-based resources to learn more about our gene and protein. GeneCards is a database of human genes, their products and their involvement in diseases. Connect to this database and search GeneCards by symbol/alias for the approved HUGO adh symbol: AKR1A1 (as shown in the Ensembl gene report).

Questions:

  1. Does it have alternative transcripts? Hint: go to GeneAtlas
  2. What is the Gene Ontology?
  3. Check gene expression in human tissues. What is an electronic northern?
  4. Do you think that, given the number of clones, the electronic northern is significant?
  5. Can you list any disorders and mutations in which this gene is involved?
  6. What is its cellular location? Hint: go to GeneAtlas

Finally, we will try to get all the possible identifiers for this gene across several databases. Connect to GeneLynx and do a quick search in human for the HUGO ID: AKR1A1.

Questions:

  1. What is the SwissProt ID?
  2. What is the PDB ID? Which method was used to characterize this structure? At which resolution?
  3. Can you get a picture of the assumed biological protein at PDB?
  4. Can you get the metabolic pathway in which this gene is involved? Hint: go to KEGG pathway

Data integration
It is about time to fully use the main feature of genome browsers: the ability to display all available information along a genomic region of interest. However, first take a look at the picture below to make sure you understand the pipeline behind these data.

data integration
Genome analysis: from sequencing to annotation

We will browse the genomic region of the adh gene that we are working with. Follow this link. There are four levels of resolution:

  1. Chromosome view
    • On which arm of chromosome 1 is the adh gene?

  2. Overview
    • Is there any novel gene in this region? Any pseudogene?
    • What are the mouse and rat synteny tracks?

  3. Detailed view
    • Make sure you understand each track.

  4. Basepair view
    • Take a look at the region translated in the 3 possible frames for each strand (forward and reverse).

Synteny

The mouse and rat synteny tracks above call for a brief discussion of this concept. Syntenic regions between genomes are regions in which gene order is conserved. Syntenic genes are therefore putative homologues that have an orthologous, paralogous or even xenologous relation. See below for a short definition of these key but often misinterpreted concepts.


Homology: orthology, paralogy and xenology

Homologous genes are genes that are related through a common evolutionary ancestor. Homology is usually inferred on the basis of sequence similarity but bear in mind that, through random and convergent evolutionary processes, biological sequences can share a reasonable degree of similarity without a true evolutionary relationship. In addition, it is incorrect to say that a pair of related genes are, for example, 80% homologous, because genes are either evolutionarily related or not. On the other hand, one can speak of a percentage of sequence similarity between genes.

This is of importance, for instance, when reading BLAST output and deriving evolutionary implications. The score and the E-value are based on the similarity between sequences. However, this does not necessarily ensure a close phylogenetic relationship, although it suggests one.
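
To make the distinction concrete, a percentage can be computed for similarity but not for homology. The Python sketch below computes percent identity, one simple measure of similarity, over two sequences that are assumed to be already aligned (same length, gaps written as "-"). The function name percent_identity and the toy alignment are hypothetical, and this is not how BLAST computes its score or E-value.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the aligned, non-gap columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # columns with a gap are not counted here
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Hypothetical pairwise alignment: one mismatch and one gap column.
    print(round(percent_identity("ATGGCC-TAA", "ATGACCGTAA"), 1))  # 88.9
```

A value such as 88.9% describes the alignment only. Whether the two genes are homologous remains a yes-or-no inference that the similarity supports but does not prove.
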

Orthology, paralogy and xenology are homology subtypes. That is, they define a specific type of relationship between genes over space and time. Read these definitions carefully:

  • Orthologous genes are those homologues that are present in different organisms and have evolved from a common ancestral gene by speciation.

  • Paralogous genes are present in the same organism or in different organisms and have evolved from a common ancestral gene by a gene duplication event. If this gene duplication event took place before a speciation event, there are paralogous genes in different genomes.

  • Xenologues are homologues that originated by an interspecies (horizontal) transfer of the genetic material for one of the homologues.

    This paper discusses these concepts in more detail.

Alternative Transcription
The transcription start site can vary in the same gene depending on how the promoter region is activated. This results in a different pre-mRNA and, potentially, in a differentially expressed mRNA or even a distinct protein.

Alternative Splicing
Briefly, alternative splicing is an important cellular mechanism that leads to temporal and tissue specific expression of unique mRNA products. This is accomplished by the usage of alternative splice sites that result in the differential inclusion of RNA sequences (exons) in the mature mRNA.

Alternative splicing
Alternative splicing produces unique mRNA products
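
The differential inclusion of exons can be illustrated with a small Python sketch that enumerates the mature mRNAs obtained by keeping or skipping a set of optional (cassette) exons. This models only exon skipping, one of several forms of alternative splicing, and the function name isoforms, the exon sequences and the choice of which exon is optional are all hypothetical.

```python
from itertools import product


def isoforms(exons, cassette):
    """List (kept_exon_indices, mRNA) for every inclusion pattern.

    exons: exon sequences in genomic order.
    cassette: indices of the exons that may be skipped; all others are kept.
    """
    optional = sorted(cassette)
    results = []
    for choice in product((True, False), repeat=len(optional)):
        skipped = {idx for idx, keep in zip(optional, choice) if not keep}
        kept = [i for i in range(len(exons)) if i not in skipped]
        results.append((kept, "".join(exons[i] for i in kept)))
    return results


if __name__ == "__main__":
    toy_exons = ["ATGGCC", "GGA", "TTTTAA"]  # hypothetical three-exon gene
    for kept, mrna in isoforms(toy_exons, cassette={1}):
        print(kept, mrna)
    # [0, 1, 2] ATGGCCGGATTTTAA
    # [0, 2] ATGGCCTTTTAA
```

With n optional exons there are 2^n candidate isoforms, which gives a sense of why predicting the ones actually expressed is hard without transcript evidence.
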

In general, current gene prediction programs cannot predict alternative mRNAs in a reliable way, unless transcriptional data (mRNAs, cDNAs and ESTs) are available.


Alternative Translation
Translation by the ribosome is a complex process. Sources of variability are:

  1. In the same mRNA, alternative translation start sites (ATG) can be used. Furthermore, translation can even start at GTG (valine), TTG (leucine) or ATT (isoleucine), which still code for methionine when they function as an initiator codon.

  2. Alternative decoding (recoding):

    • In the ribosome, the meaning of specific codons can be redefined

    • The ribosome can alter the reading frame by switching from one overlapping reading frame to another

    • The ribosome can bypass a stretch of sequence, with or without a change in the reading frame.

  3. In the same mRNA, alternative poly-A sites modify the 3' UTR region.
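
Point 1 above can be sketched in Python: the code lists every position where one of the initiator codons named in the text occurs and translates from a chosen start, reading the initiator as methionine whatever its usual meaning. The standard genetic code is assumed for all other codons; the names candidate_starts and translate_from and the toy sequences are hypothetical, and recoding events (point 2) are not modelled.

```python
from itertools import product

# Standard genetic code in the conventional TCAG codon order (assumed here).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Initiator codons mentioned in the text above.
INITIATORS = {"ATG", "GTG", "TTG", "ATT"}


def candidate_starts(mrna):
    """Return every position (0-based) where an initiator codon occurs."""
    seq = mrna.upper().replace("U", "T")
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] in INITIATORS]


def translate_from(mrna, start):
    """Translate from `start` to the first in-frame stop codon.

    The initiator codon is always read as methionine (M).
    """
    seq = mrna.upper().replace("U", "T")
    if seq[start:start + 3] not in INITIATORS:
        raise ValueError("no initiator codon at this position")
    protein = ["M"]
    for i in range(start + 3, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(candidate_starts("CCATGTTGA"))      # [2, 5]
    print(translate_from("GTGGCCGGATAA", 0))  # MAG, not VAG
```
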

LINUX and genome annotation
Try to complete this practical.