| 
In what follows, we review the key concepts of eukaryotic
gene and genome structure needed to understand the annotation of
genomes. In this context, annotation refers to the description and
location of genes and other biologically relevant features of a
genomic sequence. Our main goal is to fully comprehend what genome
annotation projects offer and, just as important, what they do not yet provide.
 
In this regard, it is worth noting that current gene prediction programs, among other bioinformatics tools, systematically ignore the complexity of eukaryotic gene structure. This complexity arises from alternatively spliced genes, from non-canonical signals (affecting either splicing or translation) and from the regions that control gene transcription (promoters), which are not yet well understood. Here, we give an overview of some of the limitations and future directions of the gene prediction field.
 | 
| The genome is the genetic material of an organism, that is, the total
amount of DNA in the cell. In eukaryotes, it is usually organized into
a set of chromosomes, which are extremely long chains of DNA that are highly condensed. In the picture below, human DNA is shown packaged into chromosome units (as seen during mitotic metaphase). Note the sister chromatids (that contain identical daughter DNA molecules), centromeres and telomeres. 
 
   Human chromosomes
 
It is time to introduce the three main genome browsers and check which genomes they are serving. In this practical, we will stick to the Ensembl server, but feel free to browse the others later on.
 
Questions:
   Now, select human and make sure you understand the main concepts in this page. Contrary to the view reported in newspapers, the genome sequence and its corresponding annotations are highly dynamic. There are two levels at which data can be updated: 1) sequence level and 2) annotation level. How many species does Ensembl provide annotation for? 
    What is the Trace server? Hint: how are genomes sequenced?
    What is the difference between the BLAST and SSAHA programs? Hint: think of a strict BLAST search
 
Questions:
    How many releases has Ensembl provided annotation for?
    What are the main reasons that drive the release of genome versions?
    How many genes, exons and nucleotides does the last genome
   version have?
    Do you think that, over the same sequence, Ensembl, NCBI and Golden Path provide the same annotation?
    What is the Vega annotation? Do you think it is more reliable than the default Ensembl annotation?
 | 
| DNA molecules consist of two anti-parallel chains held together by complementary base pairs that form a double helix. This structure is of major importance for computational analysis and, to a great extent, determines how it can be performed: 
The digital nature of the sequence (nucleotides: Adenine, Guanine, Cytosine and Thymine) permits an easy, symbolic computational representation as the letter codes A, G, C and T, respectively. It is worth knowing that Uracil (U), which replaces Thymine in RNA, is also written as T in sequence databases. 
 
The double-helical nature of DNA gives us two different sequences to
analyze (each with distinct encoded information). To handle such
dual data, the concepts of forward or positive (+) and
reverse or negative (-) strands and elements (genes, exons,
introns...) are introduced. The forward strand, for us, is simply the
original sequence we are working on. Note that this distinction is a
convention with no meaning in the cell, which makes no difference
between strands: for example, genes are transcribed from both chains. 
 
The complementary nature of the two strands (A-T, G-C base-pairing)
makes it possible to work in the computer with only one strand (forward), the
other (reverse) being conceptually retrieved when needed. Usually,
genome projects provide only one sequence strand (forward), and both forward
and reverse elements are annotated on it, the latter simply being
tagged as reverse. Again, note that the cell has access to both strands at the same time. 
 
The anti-parallel nature of the double helix (due to the 5' and 3' nucleotide ends) gives polarity to the strands. There is a general agreement to write DNA sequences from 5' to 3' (do not confuse this with the forward and reverse concept). The 5' region is also known as the upstream region and, accordingly, the 3' region is called the downstream region of the sequence. Be aware that the upstream and downstream concepts have nothing to do with the cis and trans relationship between biological elements, such as transcription factors and their DNA binding sites (see below). 
 
The triplet nature of the genetic code (a codon is 3 nucleotides long) makes it possible to translate any potentially coding sequence in six different ways (frames): three on the forward strand and three more on the reverse strand. The ribosome finds the right frame by itself (the strand is already fixed, since it reads the mRNA), but this process is challenging to reproduce computationally.
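
The strand and frame concepts above can be illustrated with a few lines of Python. This is only a minimal sketch, not how Ensembl or any gene predictor works internally: the function names (`reverse_complement`, `translate`, `six_frame_translation`) and the toy sequence are our own, and we assume the standard genetic code with codons the table does not know written as `X`.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse strand of a forward sequence, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate codon by codon from position 0; '*' marks a stop codon."""
    seq = seq.upper().replace("U", "T")  # databases write U as T
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to their translations."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


# Toy sequence (hypothetical, not taken from any genome).
frames = six_frame_translation("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

For this toy sequence, frame +1 reads `MA*` (start, alanine, stop), while frame -1 reads `LGH`. Only one of the six frames looks like a complete coding sequence, which is the kind of evidence a gene finder has to weigh.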
   From DNA to chromosomes
 
Note that, in this course, we only analyze DNA at the sequence level. So, let's now check it. Please select chromosome 21 in the karyotype plot.
 
Questions:
     What is the difference between known and novel genes?
     What does SNP stand for?
     Why is the short arm of the chromosome (p arm) shaded? Hint: is there any feature annotated in this region?
     Is there a correlation between repeats and the GC content? 
 
 
Now, let's search for a specific gene over the whole genome. In the Find window, look up the "alcohol dehydrogenase" gene.
 
Questions:
    
     How many entries do you get? Why?
     Select the first entry and display the gene report. In which chromosome is the gene?
	       
 
However, before analyzing the alcohol dehydrogenase gene, we
will take a look at the eukaryotic gene structure, processing and expression (see below).
 | 
| Transcription, splicing and translation are the main processes that
account for the expression of protein-coding genes. Each step is
directed by sequence and structural signals. In what follows, we
describe, from both the biological and the computational point of view, how these sequence motifs are used to go from DNA to RNA to the final protein product.
The schema below highlights these processing steps: 
 
 
   mRNA processing pathway
 
Locate the "Prediction Transcript" section in the alcohol dehydrogenase gene report. Note the correspondence between levels of reported data and processing steps:
     
         Exon information (gene structure on the DNA sequence): sequence before transcription 
 
 Transcript information (mature mRNA sequence): sequence after splicing 
 
 Protein information: sequence after translation 
     
We will follow these links in the order shown, but first, move to the text below for a more precise discussion of the eukaryotic gene structure.
 | 
| Transcription starts when a region upstream of the gene (the promoter region) is bound and activated by transcription factors. These regions control whether a gene is transcribed from the forward or the reverse strand. In either case, the strand that is actually transcribed is called the template (or antisense) strand, and the other the coding (or sense) strand. 
 
   Promoter region
 
In short, transcription is the copying of DNA (template strand) to RNA (pre-mRNA). However, when analyzing mRNA, cDNA or EST data, bear in mind that the mRNA to be translated is identical in sequence to the coding strand, apart from U replacing T (coding here always refers to translation, not to transcription). That is, the mRNA is transcribed from the strand carrying its complementary sequence. Consequently, when annotating genomes, genes are annotated in relation to their coding strand. 
 
   The copy of the template strand
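
The relationship between the template strand, the coding strand and the mRNA can be checked with a short Python sketch. The function names and the toy sequence are hypothetical, chosen only for illustration; both strands are written 5' to 3', as agreed above.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite DNA strand, written 5' to 3'."""
    return seq.upper().translate(DNA_COMPLEMENT)[::-1]


def transcribe(template_strand):
    """Copy a template strand (given 5' to 3') into RNA (also 5' to 3')."""
    # The RNA is complementary to the template, so it has the sequence of
    # the opposite (coding) strand, with U in place of T.
    return reverse_complement(template_strand).replace("T", "U")


# Toy example (hypothetical sequence): start from the annotated coding strand.
coding_strand = "ATGGCCTAA"
template_strand = reverse_complement(coding_strand)
mrna = transcribe(template_strand)

print(template_strand)  # TTAGGCCAT
print(mrna)             # AUGGCCUAA
```

Apart from the U for T substitution, `mrna` equals `coding_strand`, which is why annotating a gene on its coding strand lets us read the transcript directly off the genomic sequence.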
 | 
| Eukaryotic genes are short DNA stretches within a genome with a peculiar and discrete structure. 
 
 
   Schematic representation of a two exon eukaryotic gene on a DNA sequence
 
Gene prediction programs make use of this structure to find genes in a
genome. The main characteristics are:
 
    Coding and non-coding exons (UTRs) 
 
 Introns 
 
 Translation start site (ATG) 
 
 Splice sites (GT, donor and AG, acceptor) 
 
 Translation termination site (STOPs: TAG, TGA and TAA) 
 
Follow the Exon information link.
 
 
Questions:
     How many exons does this gene have?
     Are the exons completely coding?
     What is the difference between upstream/downstream, UTRs and intron sequences?
     Can you spot any common pattern on the start/end of introns?
     What is the supporting evidence for this gene? Would you trust it?
 | 
| Splicing is an RNA-processing step in which introns in the primary transcript are removed. The splicing signals, GT (donor) at the 5' end of the intron and AG (acceptor) at its 3' end, delimit the exon-intron boundaries, so that exons (coding and non-coding ones) are joined together. In this way, the open reading frame sequence, along with the 5' and 3' Untranslated Regions (UTRs), is ready to be processed by the ribosome. 
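
As a rough illustration of these signals, the Python sketch below splices a toy two-exon gene and locates its coding region. Everything in it is an assumption made for the example: the sequence is invented, the exon coordinates are 0-based and end-exclusive, and the coding region is taken to start at the first ATG, which real genes do not always obey. As in sequence databases, the transcript is written with T rather than U.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(pre_mrna, exons):
    """Join exon segments given as 0-based, end-exclusive (start, end) pairs."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def introns_between(exons):
    """Return the (start, end) intervals lying between consecutive exons."""
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def has_canonical_splice_sites(pre_mrna, intron):
    """True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    start, end = intron
    return pre_mrna[start:start + 2] == "GT" and pre_mrna[end - 2:end] == "AG"


def find_cds(mrna):
    """Return (start, end) of the ORF opened by the first ATG, or None."""
    start = mrna.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return start, i + 3
    return None


# Toy gene (hypothetical): exon 1, one intron, exon 2.
pre_mrna = "CCATGGC" + "GTAAGTTTTCAG" + "CTAAGG"
exons = [(0, 7), (19, 25)]

introns = introns_between(exons)
mature = splice(pre_mrna, exons)
cds_start, cds_end = find_cds(mature)

print(all(has_canonical_splice_sites(pre_mrna, i) for i in introns))  # True
print(mature[:cds_start], mature[cds_start:cds_end], mature[cds_end:])
```

The last line prints the 5' UTR, the coding sequence and the 3' UTR of the spliced transcript (`CC ATGGCCTAA GG`). Note that the coding sequence spans the exon junction, so it cannot be read as a single contiguous stretch of the genomic sequence.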
 
   The spliceosome complex removes intron sequences (the exons are spliced together)
 
Follow the Transcript information link.
 
Questions:
     Find out how to highlight the coding and non-coding regions (UTR) in the transcript
     Can you think of a biological role for UTRs?
     Check that start and end of the coding region have the right signals.
     Which of the three stop codons does this mRNA have?
     How many SNPs are annotated? How many in the UTRs? How many in the CDS? For those in the CDS, is the change synonymous?
 |