Computational Analysis of DNA Sequences: Gene Prediction Techniques
This short course on the analysis of DNA sequences through
internet resources is aimed at those who want to characterize
protein-coding genes in eukaryotic genomes. First, we examine basic concepts
of genomes and gene structure in eukaryotes and learn how to extract
genomic information from widely used online databases. Then, we
generate our own annotation of protein-coding genes on a real genomic
sequence and see the current limitations of gene prediction
programs. Finally, we use a state-of-the-art comparative
genomics approach to refine our predictions.
The concepts needed to understand eukaryotic genomes and gene
structure are reviewed: protein-coding and non-protein-coding genes, coding
and non-coding exons, forward and reverse annotations...
The basic mRNA processing steps needed to understand eukaryotic
gene structure are also outlined. In addition, the analysis of protein
sequences at a functional level is briefly introduced.
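The relation between exons, splicing and forward or reverse annotations can be made concrete with a small sketch. It is only an illustration and not part of the course material: the toy `genome` string, the exon coordinates and the helper names `reverse_complement` and `spliced_sequence` are invented for this example. It assumes that exon coordinates are 1-based, inclusive and always given on the forward strand, so a reverse-strand gene is recovered by reverse-complementing the joined exons.

```python
# Toy example: all sequences and coordinates below are made up.
# Assumption: exon coordinates are 1-based, inclusive, on the forward strand.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_sequence(genome, exons, strand):
    """Join the exons of one gene, removing the introns between them.

    exons  : list of (start, end) pairs on the forward strand
    strand : "+" for a forward annotation, "-" for a reverse annotation
    """
    pieces = [genome[start - 1:end] for start, end in sorted(exons)]
    joined = "".join(pieces)
    return joined if strand == "+" else reverse_complement(joined)


# Two exons (positions 3-8 and 20-25) separated by a GT...AG intron.
genome = "AAATGGCCGTAAGTTTCAGGATTGATT"
exons = [(3, 8), (20, 25)]

forward_gene = spliced_sequence(genome, exons, "+")
print(forward_gene)  # ATGGCCGATTGA

# The same gene seen from the other strand: identical coordinates here
# only because this toy sequence has symmetric flanks.
reverse_gene = spliced_sequence(reverse_complement(genome), exons, "-")
print(reverse_gene)  # ATGGCCGATTGA
```

Both calls return the same spliced sequence, which shows that the strand label only tells us which strand to read once the exons have been joined.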
The sequences of multiple genomes and their annotations are now available.
For the biologist it is crucial to be able to access this information
without having to rely on programming skills. Additionally, the researcher must be able
to query the databases with biologically relevant questions.
In this short practical we introduce the EnsMart system as a way to
access genomic data through high-level biological queries.
We emphasize the importance of how the biological data is structured.
Many diseases are caused by mutations in the DNA. In some cases the disease is hereditary. These diseases are usually caused by mutations in a single gene that prevent the protein it encodes from functioning properly, or from functioning at all. They are called Mendelian or hereditary diseases, and can follow different types of inheritance (dominant, recessive or X-linked).
In this short practical we introduce several databases that help us find information on diseases, mutations and polymorphisms, as well as some web servers that predict whether a given gene could be involved in disease, based on the existing set of known disease genes.
Finding protein-coding genes in a genome sequence is a
complex task. Very short stretches of DNA that actually code for a
protein (coding exons) lie scattered among millions of non-coding
nucleotides. This tiny coding fraction can be unveiled by making use of
the biological properties and the particular statistical composition
found in these regions. Gene prediction programs are computational
tools able to find these dispersed coding exons in a sequence and
then to provide the best tentative gene models.
As we will see, this ab initio gene prediction approach is
useful but of limited accuracy.
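To give a feel for what a sequence-only search looks like, here is a minimal sketch that scans both strands for open reading frames (an ATG followed in frame by a stop codon). This is a deliberately naive stand-in and not how geneid, genscan or any other program listed on this page works; real predictors combine splice-site signals with statistical models of coding composition. The function name `find_orfs`, the `min_codons` parameter and the test sequence are assumptions made for this example.

```python
# Naive ORF scan on both strands; a toy, not a real gene predictor.

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """List ORFs as (strand, start, end, sequence).

    start and end are 1-based, inclusive, on the forward strand.
    min_codons counts the stop codon as part of the ORF.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3  # exclusive end, stop codon included
                    if (end - start) // 3 >= min_codons:
                        if strand == "+":
                            coords = (start + 1, end)
                        else:
                            # map back to forward-strand coordinates
                            coords = (n - end + 1, n - start)
                        orfs.append((strand, coords[0], coords[1], s[start:end]))
                    start = None
    return orfs


print(find_orfs("CCATGGCCGATTGACC"))
```

Even on this tiny input the limits are visible: the scan knows nothing about introns, so a gene split into several exons would be missed or truncated, which is one reason ab initio accuracy is limited.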
Comparative genomics is the analysis and comparison of genomes
from different species. The purpose is to gain a better understanding
of how species have evolved and to determine the function of genes and
noncoding regions of the genome. Researchers have learned a great deal
about the function of human genes by examining their counterparts in
simpler model organisms such as the mouse. Genome researchers look at
many different features when comparing genomes: sequence similarity,
gene location, the length and number of coding regions (called exons)
within genes, the amount of noncoding DNA in each genome, and highly
conserved regions maintained in organisms as simple as bacteria and as
complex as humans.
Modern gene prediction programs can integrate these comparative
data to improve their predicted gene models.
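Sequence similarity, the first feature in the list above, can be illustrated with a short sketch that reports percent identity in windows along two sequences that are already aligned. This is an assumption-laden toy: the function name `window_identity`, the window settings and the two sequences labelled `human` and `mouse` are invented, and comparative programs such as sgp2, twinscan or slam use far richer models than a simple identity count.

```python
# Windowed percent identity over a pairwise alignment (toy example).


def window_identity(aln_a, aln_b, window=10, step=5):
    """Return (start, end, percent_identity) for each window.

    aln_a, aln_b : two rows of an existing alignment, gaps written as "-"
    start, end   : 1-based, inclusive alignment columns
    A column counts as a match only if both residues are equal and not a gap.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        columns = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for a, b in columns if a == b and a != "-")
        results.append((start + 1, start + window, 100.0 * matches / window))
    return results


# Made-up sequences: a conserved stretch followed by a diverged one.
human = "ATGGCCGATT" + "ACGTACGTAC"
mouse = "ATGGCCGATT" + "TGCATGCATG"

for start, end, identity in window_identity(human, mouse, window=10, step=10):
    print(start, end, identity)
```

Windows with high identity are candidates for conserved, possibly coding, regions, while low-identity windows are more likely to be unconstrained sequence.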
ENSEMBL | http://www.ensembl.org
NCBI | http://www.ncbi.nlm.nih.gov
Golden Path (UCSC) | http://genome.ucsc.edu
geneid | /software/geneid/geneid.html
genscan | http://genes.mit.edu/GENSCAN.html
fgenesh | http://www.softberry.com/berry.phtml?topic=gfind
sgp2 | /software/sgp2/sgp2.html
twinscan | http://genes.cs.wustl.edu
slam | http://baboon.math.berkeley.edu/~syntenic/slam.html
This course as a single PDF document | /courses/laCaixa05/laCaixa05.pdf |
Josep F. Abril | Enrique Blanco | Charles Chapple
Sergi Castellano | Robert Castelo | Eduardo Eyras
Roderic Guigó | Núria López | Genís Parra