Genome BioInformatics Research Lab

Gene Prediction based on Comparative Genomics

Recently, the importance of comparing genome sequences from different species to locate functional domains conserved through evolution has been underscored. New bioinformatics methods have been developed to infer protein-coding genes from sequence comparisons between the genomes of two species (Batzoglou et al., 2000; Bafna and Huson, 2000; Wiehe et al., 2001; Korf et al., 2001; Novichkov et al., 2001), and these methods appear to yield highly accurate predictions. The rationale is that functional regions (protein-coding regions among them) are more conserved than non-functional ones across the DNA sequences of genomes from different species (see figure below).

We are developing a method to predict genes in the human genome that combines three sources of information: sequence signals potentially involved in gene specification (essentially splice sites and start codons), the bias that protein coding induces in the nucleotide composition of the DNA sequence, and sequence similarity to the mouse genome. Unlike previously described methods, this method does not require fully assembled syntenic mouse genomic regions, and it can be used with fragmentary mouse data at any level of coverage. A preliminary version of this program is being used by the Mouse Genome Sequencing Consortium.
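As a rough illustration of how the three sources of information described above might be combined, the following Python sketch scores a candidate human exon by adding a similarity term to its signal and coding-bias scores. It is not the SGP2 implementation: the class names, the additive form, the overlap-proportional use of alignment scores and the `weight` parameter are all assumptions made for this example. Because the alignments are simply projected onto human coordinates, they may come from any mouse fragments, with no assembly or synteny information required.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hsp:
    """A conserved alignment projected onto human coordinates (hypothetical type)."""
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float


@dataclass
class Exon:
    """A candidate exon with its ab initio scores (hypothetical type)."""
    start: int
    end: int
    signal_score: float   # e.g. splice site and start codon scores
    coding_score: float   # e.g. coding-bias score of the exon sequence


def similarity_score(exon: Exon, hsps: List[Hsp]) -> float:
    """Sum alignment scores, each scaled by the fraction of the alignment
    that overlaps the exon (an assumed scheme, chosen for simplicity)."""
    total = 0.0
    for hsp in hsps:
        overlap = min(exon.end, hsp.end) - max(exon.start, hsp.start) + 1
        if overlap > 0:
            total += hsp.score * overlap / (hsp.end - hsp.start + 1)
    return total


def combined_score(exon: Exon, hsps: List[Hsp], weight: float = 0.1) -> float:
    """Add the weighted similarity term to the signal and coding scores."""
    return (exon.signal_score + exon.coding_score
            + weight * similarity_score(exon, hsps))


# Illustrative values only, not real data.
exon = Exon(start=100, end=199, signal_score=2.0, coding_score=3.0)
hsps = [Hsp(150, 249, 100.0), Hsp(300, 400, 80.0)]
score = combined_score(exon, hsps)
```

In this sketch an exon with no overlapping alignment keeps its signal and coding scores unchanged, so prediction degrades gracefully where mouse coverage is missing.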

Pairwise comparison, using the program tblastx, of the human and mouse genomic sequences coding for the HLA class II alpha chain. Red boxes indicate the coding exons, while black diagonals indicate the conserved alignments. The score of the conserved alignments (divided by 10) is given in the lower panels. While the conserved regions between the human and mouse genomic sequences coding for this gene fully include the coding exons, a substantial fraction of the intronic regions is also conserved.
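The observation in the caption above can be quantified by measuring what fraction of exonic and intronic bases is covered by conserved alignments. The sketch below is a minimal example of that calculation; the function names and all coordinates are hypothetical and do not correspond to the HLA data shown in the figure.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def covered_positions(intervals: List[Interval]) -> Set[int]:
    """Return the set of positions covered by at least one interval."""
    covered: Set[int] = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def introns_between(exons: List[Interval]) -> List[Interval]:
    """Derive intron intervals as the gaps between consecutive exons."""
    ordered = sorted(exons)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
            if next_start - prev_end > 1]


def conserved_fraction(regions: List[Interval],
                       alignments: List[Interval]) -> float:
    """Fraction of bases in `regions` covered by any alignment."""
    region_positions = covered_positions(regions)
    if not region_positions:
        return 0.0
    shared = region_positions & covered_positions(alignments)
    return len(shared) / len(region_positions)


# Illustrative coordinates only.
exons = [(101, 200), (401, 500)]
alignments = [(91, 220), (390, 510)]
introns = introns_between(exons)

exon_fraction = conserved_fraction(exons, alignments)
intron_fraction = conserved_fraction(introns, alignments)
```

With these made-up intervals the exons are fully covered while the alignments also extend into the flanking intronic bases, which is the pattern the caption describes.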

Relevant publications

  • R. Guigó, E.T. Dermitzakis, P. Agarwal, C.P. Ponting, G. Parra, A. Reymond, J.F. Abril, E. Keibler, R. Lyle, C. Ucla, S.E. Antonarakis and M.R. Brent.
    "Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes."
    PNAS 100(3):1140-1145 (2003)   [Abstract]   [Datasets]

  • G. Parra, P. Agarwal, J.F. Abril, T. Wiehe, J.W. Fickett and R. Guigó.
    "Comparative gene prediction in human and mouse."
    Genome Research 13(1):108-117 (2003)   [Abstract]   [Datasets]

  • Mouse Genome Sequencing Consortium (including J.F. Abril, G. Parra and R. Guigó).
    "Initial sequencing and comparative analysis of the mouse genome."
    Nature 420(6915):520-562 (2002)   [Abstract]

  • M.J. Betts, R. Guigó, P. Agarwal and R.B. Russell.
    "Exon structure conservation despite low sequence similarity: a relic of dramatic events in evolution?"
    EMBO Journal 20(19):5354-5360 (2001)   [Abstract]

  • T. Wiehe, S. Gebauer-Jung, T. Mitchell-Olds and R. Guigó.
    "SGP-1: Prediction and Validation of Homologous Genes Based on Sequence Alignments."
    Genome Research 11(9):1574-1583 (2001)   [Abstract]

  • T. Wiehe, R. Guigó, and W. Miller.
    "Genome Sequence Comparisons: Hurdles in the Fast Lane to Functional Genomics."
    Briefings in Bioinformatics 1(4):381-388 (2000)   [PubMed Abstract]
