Gene annotation of genomes

ABSTRACT

We take a look at current genome browsers, which show potential gene structures together with the evidence supporting them in a single browsable display. Then, some gene locator resources that cross-link databases are introduced.

Genome Browsers

Gene Locators

Browsing a gene

Think of your favourite gene! Some suggestions for human:

Try to get information about some of these genes from the resources above. Can you tell:

Genome annotation pipeline

As an example, the Ensembl Genome annotation pipeline is briefly described:
  1. Genome assembly: modern DNA sequencing technology can only determine accurate sequences of short stretches of DNA (less than 1000 base pairs). Since the human genome is more than 3 billion base pairs long, it has had to be sequenced in many small pieces that must be reassembled afterwards. The pieces are reassembled by comparing the sequences of their ends to find overlaps, which are then used to join them.

  2. Gene prediction:
    1. Genscan is used to predict the location of genes along the genome.
    2. These candidate genes are then compared to all known genes in public databases. Matches provide supporting evidence.
    3. Predicted genes are stored in a database for easy retrieval.

  3. Prediction of other features:
  4. Display annotated genome: a genomic view showing all predicted features is provided at several levels of resolution.
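
The overlap idea behind the assembly step can be illustrated with a toy example. The following is only a minimal sketch, assuming error-free fragments that all come from the same strand; the function names (overlap_length, greedy_assemble) and the min_overlap parameter are made up for illustration and are not part of the Ensembl pipeline, and real assemblers are considerably more elaborate.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_overlap=3):
    """Repeatedly join the pair of fragments with the largest end overlap.

    Toy greedy strategy (an assumption of this sketch): it stops when no
    remaining pair overlaps by at least `min_overlap` bases.
    """
    contigs = list(fragments)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to join
        # Join the two fragments, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    # Three invented fragments whose ends overlap by four bases each.
    print(greedy_assemble(["ATGCGTAC", "GTACCTGA", "CTGATTAG"]))
```

With the three invented fragments above, the pieces are joined end to end into a single contig because each consecutive pair shares four bases.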

In the practical, we follow the scheme above to annotate an already assembled genomic sequence. First we search for repeats and mask the sequence, then we do gene prediction, validate the predicted genes and finally plot all the predicted data.
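
To make the masking step concrete, here is a minimal sketch of what masking a sequence means once the repeats have been located. It assumes the repeat positions are already available as 0-based, half-open (start, end) intervals; the function name mask_sequence and the coordinates in the example are hypothetical, and the actual repeat-finding program used in the practical reports its results in its own format.

```python
def mask_sequence(sequence, repeat_intervals, mask_char="N"):
    """Return `sequence` with every repeat interval replaced by `mask_char`.

    `repeat_intervals` holds (start, end) pairs, 0-based and half-open
    (an assumption of this sketch). Overlapping intervals are allowed.
    """
    masked = list(sequence)
    for start, end in repeat_intervals:
        # Clamp to the sequence bounds so stray coordinates cannot fail.
        start = max(0, start)
        end = min(len(masked), end)
        for position in range(start, end):
            masked[position] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # Invented sequence with one invented repeat at positions 2..4.
    print(mask_sequence("ACGTACGTAC", [(2, 5)]))
```

Masked positions are then ignored by the gene prediction step, so that repeats are not mistaken for parts of genes.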