Genome Informatics Research Lab
Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes
R. Guigó, E. T. Dermitzakis, P. Agarwal, C. P. Ponting, G. Parra, A. Reymond, J. F. Abril, E. Keibler, R. Lyle, C. Ucla, S. E. Antonarakis, and Michael R. Brent*. PNAS, 100(3):1140-1145 (Feb 4, 2003).
* To whom correspondence should be addressed. Email: brent@cs.wustl.edu. Ph: +01 314-935-6621.
A primary motivation for sequencing the mouse genome was to accelerate the discovery of mammalian genes by using sequence conservation between mouse and human to identify coding exons. This proved challenging due to the large proportion of the mouse and human genomes that is apparently conserved but not protein-coding. We developed two programs, SGP2 (Parra et al, 2002) and Twinscan (Korf et al, 2001; Flicek et al, 2002), which can exploit sequence conservation between genomes to identify candidate genes despite the abundance of conserved non-coding sequence. We also developed an enrichment process that selects a subset of highly reliable candidates by exploiting conservation in mouse-human exonic structure. RT-PCR amplification and direct sequencing applied to an initial sample of the predictions that do not overlap previously known genes verified 139 predictions. On average, the confirmed predictions show more restricted expression patterns than the mouse orthologues of the genes on human chromosome 21, and the majority lack both aligned mouse EST sequences and homologues in the fish genomes, demonstrating the sensitivity of SGP2 and Twinscan to hard-to-find genes. We verified 68 novel homologues of known proteins, including two homeobox proteins relevant to developmental biology and an aquaporin. We estimate that 1000 gene predictions that do not overlap known genes can be verified by this method. This is likely to constitute a significant fraction of the previously unknown, multi-exon, mammalian genes.
GENE-PREDICTION BASED on TWO GENOMES COMPARISON
SGP2
SGP2 (/software/sgp2/) is a program that predicts
genes by comparing anonymous genomic sequences from two different
species. In this paper, predictions were made on the mouse genome
(MGSCv3 assembly) using comparative information from the human genome
(December 2001 GoldenPath, equivalent to NCBI Build 28); both sets of
sequences were taken from http://genome.ucsc.edu/. To make
the predictions, SGP2 combines TBlastX
(WU-BLAST version,
http://blast.wustl.edu), a sequence similarity search program,
with geneid (Guigó et al, 1992;
Parra et al, 2000), an "ab initio" gene
prediction program. The mouse sequences were cut into 100kb fragments
to build the blast database. The masked human chromosomes were also
cut into 100kb fragments, which were run against the mouse database using
TBlastX with the following parameters:
B=9000 V=9000 hspmax=500 topcomboN=100 W=5 matrix=blosum62mod E=0.01 E2=0.01 Z=3000000000 nogaps filter=xnu+seg S2=80
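The 100kb fragmentation step described above can be sketched as follows. This is only an illustration, not the code used for the paper: it assumes consecutive, non-overlapping fragments (the text does not say whether fragments overlapped), and the function name, the fragment naming scheme and the toy sequence are hypothetical.

```python
from typing import Iterator, Tuple

FRAGMENT_SIZE = 100_000  # 100kb, as stated in the text


def fragment(seq_id: str, sequence: str,
             size: int = FRAGMENT_SIZE) -> Iterator[Tuple[str, int, str]]:
    """Yield (fragment_id, offset, subsequence) for consecutive pieces.

    Assumption: fragments do not overlap. The last fragment may be shorter.
    """
    for offset in range(0, len(sequence), size):
        piece = sequence[offset:offset + size]
        # Hypothetical naming scheme: 1-based coordinates are kept in the
        # identifier so that hits can be mapped back to the chromosome.
        frag_id = f"{seq_id}:{offset + 1}-{offset + len(piece)}"
        yield frag_id, offset, piece


# Toy 250,000 bp sequence standing in for a chromosome.
fragments = list(fragment("chrDemo", "ACGT" * 62_500))
for frag_id, offset, piece in fragments:
    print(frag_id, len(piece))
```

Keeping the offset alongside each fragment makes it straightforward to convert HSP coordinates reported per fragment back into chromosome coordinates.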
Although the TBlastX parameters above increase the speed of the comparison, the
whole computation took one week of CPU time using 100 Alpha
processors. The resulting high-scoring segment pairs (HSPs) were
processed to find the maximum scoring projection. Further information
on the HSP modifications and on the general SGP2 algorithm
can be found at:
/software/sgp2/algorithm/index.html.
TWINSCAN
The Twinscan method is described in Korf et al, 2001. This paper is freely available online. The alignments were generated with the following parameters:
M=1 N=-1 Q=5 R=1 Z=3000000000 Y=3000000000 B=10000 V=100 W=8 X=20 S=15 S2=15 gapS2=30 lcmask wordmask=seg wordmask=dust topcomboN=3
Twinscan was run using these alignments. The target genome
parameters were identical to Genscan parameters
(Human.iso) and the conservation parameters were the ones we
identify as "68-set-ortholog" (available upon request).
FILTERING BEST CANDIDATES
In this section we describe the protocol used to generate a set of
SGP2 and Twinscan mouse predictions to be tested by
RT-PCR experiments. The goal of this approach is to generate a more
reliable set of predicted genes. For SGP2 and
Twinscan, the protocol, performed independently for each
predictor, consists of:
A random subset of predictions was extracted from each set (SGP2 and Twinscan), and two adjacent exons spanning an intron were chosen from each selected prediction for the RT-PCR test. The experimental setup required exons of at least 30bp and introns of at least 1000bp. Pairs of exons satisfying these requirements were sorted by the sum of the exon scores, and the top-scoring pair was selected for the RT-PCR test.
RT-PCR VALIDATION of GENE-PREDICTIONS
The expression of a subset of the mouse gene models of the HC21 genes was tested by RT-PCR. Total RNA derived from 12 different normal mouse adult tissues (brain, heart, kidney, thymus, liver, stomach, muscle, lung, testis, ovary, skin and eyes) was extracted, reverse-transcribed and normalized as previously described (Reymond et al, 2002). The quality of total RNA was tested by PCR using MLH1 primers located in intronic sequences flanking exon 12 (Forward - 5' TGG TGT CTC TAG TTC TGG 3' and Reverse - 5' CAT TGT TGT AGT AGC TCT GC 3'), as an indicator of possible genomic DNA contamination. Primers for RT-PCR were designed with the Primer3 program (http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi). Primers were designed from the sequence of distinct exons so that possible amplification of genomic DNA could be distinguished from cDNA amplification. We chose a single PCR rather than a nested-PCR approach to avoid false positive results due to illegitimate transcription (Kaplan, 1992). Similar amounts of the 12 cDNAs (final dilution 250x) were mixed with JumpStart REDTaq ReadyMix (Sigma) and 4 ng/µl of each of the primers (Sigma-Genosys) with a BioMek 2000 robot (Beckman). The first ten cycles of PCR amplification were performed with touchdown annealing temperatures decreasing from 60 to 50°C; the annealing temperature for the next 35 cycles was 50°C. Amplimers were separated on "Ready to Run" precast gels (Pharmacia). Positives were directly sequenced.
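The exon-pair selection rule described under FILTERING BEST CANDIDATES (exons of at least 30bp, intron of at least 1000bp, highest summed exon score) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Exon` class, the function name, the coordinate convention (1-based, inclusive, all exons on one strand) and the example values are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MIN_EXON_LEN = 30      # bp, from the text
MIN_INTRON_LEN = 1000  # bp, from the text


@dataclass
class Exon:
    start: int    # 1-based, inclusive (assumed convention)
    end: int      # 1-based, inclusive
    score: float  # exon score reported by the gene predictor

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def best_exon_pair(exons: List[Exon]) -> Optional[Tuple[Exon, Exon]]:
    """Return the adjacent exon pair with the highest summed score.

    Only pairs meeting the length requirements are considered;
    None is returned when no pair qualifies.
    """
    ordered = sorted(exons, key=lambda e: e.start)
    candidates = []
    for left, right in zip(ordered, ordered[1:]):
        intron_len = right.start - left.end - 1
        if (left.length >= MIN_EXON_LEN
                and right.length >= MIN_EXON_LEN
                and intron_len >= MIN_INTRON_LEN):
            candidates.append((left, right))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0].score + pair[1].score)


# Hypothetical prediction with six exons.
prediction = [
    Exon(100, 180, 5.0),
    Exon(1500, 1620, 2.0),
    Exon(1700, 1800, 9.0),   # intron before it is only 79bp
    Exon(3000, 3020, 8.0),   # exon is only 21bp
    Exon(4500, 4600, 3.5),
    Exon(6000, 6100, 4.0),
]
pair = best_exon_pair(prediction)
print(pair)
```

In this toy example two pairs satisfy the requirements, and the pair starting at positions 4500 and 6000 is chosen because its summed score (7.5) exceeds that of the first pair (7.0).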
HOMOLOGY SEARCH for PREDICTED PROTEINS
Predicted amino acid sequences were compared with a non-redundant
protein sequence database
(ftp://ftp.ncbi.nih.gov/blast/db/nr) using Blastp (Altschul et al., 1997). Homologues were assigned
where sequence pairs were aligned with Expect values less than
5x10^-3. These assignments were augmented by further
TBlastn sequence comparisons with expressed sequence tag
databases (in particular,
ftp://ftp.ncbi.nih.gov/blast/db/est_others).
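The Expect-value criterion above can be illustrated with a short sketch. The text states only the threshold; keeping a single best hit per predicted protein is an assumption made here for illustration, and the `Hit` record, the function name and the example hits are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple

E_VALUE_CUTOFF = 5e-3  # threshold stated in the text


class Hit(NamedTuple):
    query: str     # predicted protein identifier
    subject: str   # database sequence identifier
    evalue: float  # Expect value of the alignment


def assign_homologues(hits: Iterable[Hit],
                      cutoff: float = E_VALUE_CUTOFF) -> Dict[str, Hit]:
    """Keep, for each query, the lowest-E-value hit below the cutoff.

    Assumption: one best homologue per query is retained.
    """
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue >= cutoff:
            continue  # not significant under the stated threshold
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return best


# Hypothetical search results.
hits = [
    Hit("pred_1", "protA", 1e-40),
    Hit("pred_1", "protB", 2e-5),
    Hit("pred_2", "protC", 0.2),     # above the cutoff, discarded
    Hit("pred_3", "protD", 4.9e-3),  # just below the cutoff, kept
]
homologues = assign_homologues(hits)
print(sorted(homologues))
```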
WHOLE GENOME PREDICTIONS
This section summarizes all the gene predictions on the mouse genome obtained from its comparison against the human genome. We also include the reciprocal predictions on the human genome, based on its comparison against the mouse genome.
SGP2
TWINSCAN
RT-PCR POOLS
In this section you can access supplementary
data for a fraction of the genes selected for the RT-PCR experiment,
collected in different files linked from the table below. The number
of 1019 novel genes given in the paper is an extrapolation from the
success rates observed in a random sample from each pool. Similarly,
the total number of non-redundant predictions in each pool is not the
direct sum of the number of predictions by SGP2 and Twinscan, because
SGP2 and Twinscan often predict overlapping genes: these have been
counted only once in the pools given in the paper. In the following
table we have included the sequences and coordinates of all the SGP2
and Twinscan genes and a table with the corresponding pairs of
overlapping genes.
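The two counting ideas in the paragraph above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the paper: predictions are treated as overlapping when they share a chromosome and their coordinate ranges intersect (strand is ignored), chains of overlapping predictions are counted once, and all coordinates, pool sizes and success counts below are made up.

```python
from typing import Iterable, Tuple

Prediction = Tuple[str, int, int]  # (chromosome, start, end), inclusive


def count_non_redundant(predictions: Iterable[Prediction]) -> int:
    """Count predictions, counting each group of overlapping ones once."""
    count = 0
    current_chrom = None
    current_end = -1
    for chrom, start, end in sorted(predictions):
        if chrom != current_chrom or start > current_end:
            # No overlap with the running group: start a new one.
            count += 1
            current_chrom, current_end = chrom, end
        else:
            # Overlaps the running group: extend it.
            current_end = max(current_end, end)
    return count


def extrapolate(pools: Iterable[Tuple[int, int, int]]) -> float:
    """Sum pool_size * (confirmed / tested) over all pools.

    Each pool is (pool_size, tested, confirmed).
    """
    return sum(size * confirmed / tested for size, tested, confirmed in pools)


# Hypothetical predictions from the two programs.
sgp2 = [("chr1", 100, 500), ("chr1", 2000, 2600), ("chr2", 50, 400)]
twinscan = [("chr1", 450, 900), ("chr2", 1000, 1500)]
total = count_non_redundant(sgp2 + twinscan)
print(len(sgp2) + len(twinscan), total)

# Hypothetical pools: (pool size, number tested, number confirmed).
estimate = extrapolate([(200, 50, 10), (100, 20, 5)])
print(estimate)
```

In the toy data the direct sum is five predictions, but one SGP2 prediction and one Twinscan prediction overlap on chr1, so the non-redundant count is four.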
A table of the whole pool of RT-PCR positive genes is also available. It contains their Geneva Code and the corresponding Gene-Prediction Identifier, together with a link to each gene's Summary Datasheet.
PREDICTED GENES BEST CANDIDATES
The following links will open a browser window displaying a summary table with all the 476 genes that were submitted for the RT-PCR validation test, classified by mouse chromosome: