Computational Analysis of DNA Sequences: Gene Prediction Techniques

This short course on the analysis of DNA sequences through internet resources is aimed at those who wish to characterize protein-coding genes in eukaryotic genomes. First, we examine basic concepts of genomes and gene structure in eukaryotes and learn how to extract genomic information from widely used online databases. Then, we generate our own annotation of protein-coding genes on a real genomic sequence and examine the current limitations of gene prediction programs. Finally, we use a state-of-the-art comparative genomics approach to refine our predictions.

We review the concepts needed to understand eukaryotic genomes and gene structure: protein-coding and non-protein-coding genes, coding and non-coding exons, forward- and reverse-strand annotations, and so on.

The mRNA processing steps that help to explain eukaryotic gene structure are also outlined. In addition, the analysis of protein sequences at a functional level is briefly introduced.

The sequences of multiple genomes and their annotations are now available. For the biologist it is crucial to be able to access this information without having to rely on programming skills. Additionally, the researcher must be able to query the databases with biologically relevant questions.

In this short practical we introduce the EnsMart system as a way to access genomic data through high-level biological queries. We emphasize the importance of how the biological data is structured.

Many diseases are caused by mutations in the DNA. In some cases the disease is hereditary. These diseases are usually caused by mutations in a single gene that prevent the protein it encodes from functioning properly, or from functioning at all. They are called Mendelian or hereditary diseases, and can follow different types of inheritance (dominant, recessive or X-linked).

In this short practical we introduce several databases that help us find information related to diseases, mutations and polymorphisms, as well as some web servers that predict whether a given gene could be involved in disease, based on the existing set of known disease genes.

Finding protein-coding genes in a genome sequence is a complex task. Very short stretches of DNA that actually code for a protein (coding exons) lie scattered among millions of non-coding nucleotides. This tiny coding fraction can be uncovered by exploiting the biological properties and the particular statistical composition of these regions. Gene prediction programs are computational tools that find these dispersed coding exons in a sequence and then assemble them into the best tentative gene models.

As we will see, this ab initio gene prediction approach is useful but of limited accuracy.
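To make the idea of scanning a sequence for candidate coding stretches more concrete, here is a minimal sketch in Python. It is a toy illustration, not the method used by `geneid`, `genscan` or any other program listed in this course: it only looks for open reading frames (an ATG followed in-frame by a stop codon) on both strands, and ignores introns, splice sites and statistical composition entirely. The function names and the `min_codons` parameter are assumptions made for this example.

```python
# Toy ORF scanner: an illustrative assumption, not a real gene finder.
# Real programs also model splice sites, introns and coding statistics.

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Find ATG...stop open reading frames on both strands.

    Returns a list of (strand, start, end) tuples. Coordinates are
    0-based, end-exclusive, and relative to the strand that was scanned
    (for "-" this is the reverse-complemented sequence). The length
    threshold counts the stop codon as one codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

In eukaryotic genomes most coding exons are neither bounded by a start codon nor by a stop codon, which is one reason why such a naive scan falls well short of a gene model.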

Comparative genomics is the analysis and comparison of genomes from different species. The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and noncoding regions of the genome. Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse. Genome researchers look at many different features when comparing genomes: sequence similarity, gene location, the length and number of coding regions (called exons) within genes, the amount of noncoding DNA in each genome, and highly conserved regions maintained in organisms as simple as bacteria and as complex as humans.

Modern gene prediction programs can integrate these comparative data to improve predicted genes.
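As a rough illustration of how sequence similarity between two genomes can highlight conserved regions, the following Python sketch computes percent identity in sliding windows over a pairwise alignment. It assumes the two sequences have already been aligned (same length, gaps written as `-`); the function name, the window size and the step are hypothetical choices for this example and do not describe how `sgp2`, `twinscan` or `slam` use comparative data.

```python
# Sliding-window identity over a pairwise alignment: a minimal sketch,
# assuming both sequences are pre-aligned and gaps are written as "-".


def window_identity(aln_a, aln_b, window=10, step=5):
    """Return a list of (start, identity) pairs, one per window.

    identity is the fraction of window columns where both sequences
    carry the same non-gap character.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        a = aln_a[start:start + window]
        b = aln_b[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        results.append((start, matches / window))
    return results


if __name__ == "__main__":
    human = "AAAAAAAAAACCCCCCCCCC"
    mouse = "AAAAAAAAAAGGGGGGGGGG"
    for start, identity in window_identity(human, mouse, window=10, step=10):
        print(start, identity)
```

Windows with high identity are candidates for conserved, possibly functional, regions; whether they are coding still has to be decided by other evidence.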

ENSEMBL		`http://www.ensembl.org`
NCBI		`http://www.ncbi.nlm.nih.gov`
Golden Path (UCSC)		`http://genome.ucsc.edu`
`geneid`		`/software/geneid/geneid.html`
`genscan`		`http://genes.mit.edu/GENSCAN.html`
`fgenesh`		`http://www.softberry.com/berry.phtml?topic=gfind`
`sgp2`		`/software/sgp2/sgp2.html`
`twinscan`		`http://genes.cs.wustl.edu`
`slam`		`http://baboon.math.berkeley.edu/~syntenic/slam.html`
This course as a single PDF document		`/courses/laCaixa05/laCaixa05.pdf`

Materials for this course have contributions from:

Josep F. Abril Enrique Blanco Charles Chapple

Sergi Castellano Robert Castelo Eduardo Eyras

Roderic Guigó Núria López Genís Parra

Introduction

Overview

Genes and Genomes

Mining the Genome

Genes and Disease

Gene Prediction

Comparative Genomics

Internet Resources

Acknowledgements
