SUPPLEMENTARY MATERIALS FOR
Comparative gene prediction
in human and mouse
G. Parra, P. Agarwal, J.F. Abril,
T. Wiehe, J.W. Fickett and R. Guigó *.
Genome Research 13(1):108-117 (Jan 1, 2003)
* To whom correspondence should be addressed.
Email: rguigo@imim.es. Ph: +034 93-224-0877.
SGP2 is a program to predict genes by comparing anonymous
genomic sequences from two different species. It combines
tblastx (WU-Blast), a sequence similarity search program, with
geneid, an "ab initio" gene prediction program. In
"asymmetric" mode, genes are predicted in one sequence from one
species (the target sequence), using a set of sequences (possibly only
one) from the other species (the reference set). Essentially,
geneid is used to predict all potential exons along the
target sequence. Exon scores are computed as log-likelihood
ratios that are a function of the splice sites defining the exon, the coding
bias in the composition of the exon sequence as measured by a Markov model
of order five, and the optimal alignment at the amino acid level
between the target exon sequence and its homologous counterpart
in the reference set. From the set of predicted exons, the
gene structure (possibly multiple genes on both strands) is assembled
by maximizing the sum of the scores of the assembled exons.
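The assembly step described above can be illustrated with a much simplified sketch. It is not geneid's actual assembly routine: it assumes a single strand, treats every exon as a plain (start, end, score) triple, and ignores the frame compatibility and gene-model rules that a real gene assembler enforces. The function name best_chain is hypothetical. It only shows the idea of picking a set of non-overlapping exons whose scores sum to a maximum.

```python
from bisect import bisect_right
from typing import List, Tuple

# (start, end, score) with 1-based inclusive integer coordinates.
Exon = Tuple[int, int, float]


def best_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the maximum total score and one chain of non-overlapping
    exons that reaches it (simplified: one strand, no frame rules)."""
    ordered = sorted(exons, key=lambda e: e[1])  # sort by end coordinate
    ends = [e[1] for e in ordered]
    best = [0.0] * (len(ordered) + 1)  # best[i]: optimum over the first i exons
    take = [False] * len(ordered)      # whether exon i is used in best[i + 1]
    prev = [0] * len(ordered)          # how many exons end before exon i starts

    for i, (start, _end, score) in enumerate(ordered):
        # Count the exons (among the first i) that end strictly before `start`.
        p = bisect_right(ends, start - 1, 0, i)
        prev[i] = p
        with_exon = best[p] + score
        if with_exon > best[i]:
            best[i + 1] = with_exon
            take[i] = True
        else:
            best[i + 1] = best[i]

    # Trace back from the end to recover the chosen exons.
    chain: List[Exon] = []
    i = len(ordered)
    while i > 0:
        if take[i - 1]:
            chain.append(ordered[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chain.reverse()
    return best[-1], chain
```

Exons with negative scores are never selected in this sketch, because leaving an exon out always scores at least as well.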
CONTENTS
SGP2 Test Sets
IMOG dataset
This is a list of 15 pairs of single gene sequences, with little
overlap with the Sanger Centre data set [Jareborg et al., Genome Research 9(9):815, 1999]. The gene accession-pairs associate each gene id with the corresponding human-mouse pair (first the human sequence, then the mouse).
You can download a tarball containing all of the above from here.
BI dataset
These are three pairs of multigene sequences. Annotation is not available for all the sequences and, where available, is of unknown reliability.
You can download a tarball containing all of the above from here.
SCIMIT dataset
This set contains 129 pairs of single gene sequences and combines non-overlapping sequences from IMOG (see above), the Sanger Centre [Jareborg et al., Genome Research 9(9):815, 1999] and the MIT [Batzoglou et al., Genome Research 10(7):950, 2000] data sets. The gene accession-pairs associate each gene id with the corresponding human-mouse pair (first the human sequence, then the mouse).
You can download a tarball grouping all of the above from here.
Gene predictions: FINISHED HOMOLOGOUS SEQUENCES
Finished Orthologous
SGP2 predictions on the eight human/mouse homologous sequences retrieved from "http://pipeline.lbl.gov/TESTS/" (including MHC). Unfortunately, that URL is no longer available. We have added a column (see Sequences and Annotations) to the table below, containing the human and mouse FASTA files and the corresponding human annotations that we obtained from that site.
Each human sequence was compared against the corresponding homologous mouse sequence.
We introduced a few changes in the PostScript maps:
- We show the "real" length of annotated genes (taking into account the coordinates of the first and last UTRs), but we still display only the annotated CDSs.
- As we ran our programs on the original masked sequences, we display only the masked regions of each sequence, without labels (on the central axis of each block).
- We also included Twinscan results for the human/mouse homologous set.
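As an illustration of how the masked regions drawn on the maps could be located in a sequence, here is a minimal sketch. It assumes that masked bases are written as N (or n), which is a common convention but is not stated on this page. The function name masked_regions is hypothetical.

```python
import re
from typing import List, Tuple


def masked_regions(sequence: str, mask_chars: str = "Nn") -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) intervals of masked runs.

    Assumption: masked bases are represented by the characters in
    `mask_chars`; adjust it if the files use another symbol.
    """
    pattern = re.compile("[" + re.escape(mask_chars) + "]+")
    # m.start() is 0-based, m.end() is exclusive, so (start + 1, end)
    # gives 1-based inclusive coordinates.
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]
```

The intervals use the same 1-based inclusive convention as GFF start and end columns, so they could be written out directly as GFF features.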
FA+RS
The set of human and mouse FASTA sequences (masked and unmasked), plus the human RefSeqs mapped onto the human sequences in GFF format. There are two GFF files for each region: "*.pipeline_refseq.gff", with the original annotations produced at Berkeley, and "*.Korf_refseqs.gff", which contains the subset of hand-curated annotations for the same regions (except for the MHC region, which was too big). Those annotations were curated by Ian Korf; see further information at: http://sapiens.wustl.edu/~ikorf/annotation/
TBX
Contains the raw tblastx (WU-Blast) results of each human sequence against the corresponding homologous mouse sequence. tblastx was run with -nogap and a blosum62 matrix in which the penalty for aligning with stop codons was set to -500.
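The stop-codon penalty mentioned above could be applied to a scoring matrix along the following lines. This is only a sketch under stated assumptions: the matrix is taken to be a plain-text table with optional "#" comment lines, a header row of residue symbols, and one row per residue starting with its symbol, with "*" standing for a stop codon. The page does not say whether the stop-versus-stop cell was changed as well; this sketch changes every cell that involves "*". The function name penalize_stops is hypothetical.

```python
def penalize_stops(matrix_text: str, penalty: int = -500, stop: str = "*") -> str:
    """Return a copy of a text scoring matrix in which every score that
    involves the stop symbol is replaced by `penalty`."""
    out = []
    columns = None  # residue symbols from the header row
    for line in matrix_text.splitlines():
        # Keep blank lines and comments untouched.
        if not line.strip() or line.lstrip().startswith("#"):
            out.append(line)
            continue
        fields = line.split()
        if columns is None:
            columns = fields  # first data line is the header row
            out.append(line)
            continue
        row, scores = fields[0], fields[1:]
        new_scores = [
            str(penalty) if (row == stop or col == stop) else score
            for col, score in zip(columns, scores)
        ]
        # Right-align scores so the table stays readable.
        out.append(" ".join([row] + [s.rjust(4) for s in new_scores]))
    return "\n".join(out)
```

With such a large penalty, no high-scoring pair is extended across a stop codon, which is presumably the intent of the modification.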
HSP
Contains the resulting HSPs in GFF format (but with frames 1,2,3 as in blast).
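Because these files use blast-style frames (1, 2, 3) instead of the values 0, 1, 2 normally found in the GFF frame column, a reader of the files may want to check that field explicitly. The sketch below assumes tab-separated GFF lines with the usual column order (seqname, source, feature, start, end, score, strand, frame, optional group). The class and function names, and the example line in the usage note, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: float
    strand: str
    frame: int  # 1, 2 or 3, as reported by blast
    group: str


def parse_hsp_line(line: str) -> Hsp:
    """Parse one tab-separated GFF line describing an HSP."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    frame = int(fields[7])
    if frame not in (1, 2, 3):
        raise ValueError(f"expected a blast-style frame (1, 2 or 3), got {frame}")
    group = fields[8] if len(fields) > 8 else ""
    return Hsp(fields[0], fields[1], fields[2], int(fields[3]), int(fields[4]),
               float(fields[5]), fields[6], frame, group)
```

For example, parse_hsp_line("human_seq\ttblastx\thsp\t1200\t1349\t87.0\t+\t2\tmouse_seq") yields a record with start 1200, end 1349 and frame 2.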
GFF
`General Feature Format' (GFF) is described on the Sanger Centre gff definition page.
GTF2
`Gene Transfer Format' (GTF) borrows from GFF, but has additional structure that warrants a separate definition and format name. GTF2 is based on Ensembl GTF, and is described in detail at this link.
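The "additional structure" of GTF2 lies mainly in its last column, which holds attributes written as a tag, a quoted value and a semicolon, with gene_id and transcript_id expected on every line. A minimal sketch of reading that column follows. The function name parse_gtf_attributes is hypothetical, and the sketch does not cover every detail of the format definition.

```python
import re
from typing import Dict

# Matches attributes of the form: tag "value";
_ATTRIBUTE = re.compile(r'(\w+)\s+"([^"]*)"\s*;')


def parse_gtf_attributes(column: str) -> Dict[str, str]:
    """Parse the attribute column of a GTF2 line into a dictionary and
    check that the two mandatory tags are present."""
    attributes = dict(_ATTRIBUTE.findall(column))
    for required in ("gene_id", "transcript_id"):
        if required not in attributes:
            raise ValueError(f"missing mandatory attribute: {required}")
    return attributes
```

Grouping lines by transcript_id then recovers the exons that belong to each predicted transcript.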
A4/A3
Contains a PostScript map showing SGP2 predictions together with geneid and genscan predictions, tblastx matches, repeat locations and, when available, annotations of the known genes. Maps were obtained using gff2ps. Two sizes are provided, for printing on A4 or A3 paper; we recommend A3 for viewing with ghostview or similar programs.
Finished human vs. mouse reads
SGP2 predictions on the eight human sequences retrieved from "http://pipeline.lbl.gov/TESTS/" (including MHC) against the mouse WGS 3X (a database of about 13 million mouse reads). Unfortunately, that URL is no longer available; see the previous table for the sequences and annotations of the orthologous datasets.
The human FASTA sequences used here were masked slightly differently from those used in the human/mouse orthologous section.
TBX
Contains the raw tblastx (WU-Blast) results of each human sequence against a database of about 13 million mouse reads. tblastx was run with -nogap and a blosum62 matrix in which the penalty for aligning with stop codons was set to -500.
HSP
Contains the resulting HSPs in GFF format (but with frames 1,2,3 as in blast).
GFF
`General Feature Format' (GFF) is described on the Sanger Centre gff definition page.
GTF2
`Gene Transfer Format' (GTF) borrows from GFF, but has additional structure that warrants a separate definition and format name. GTF2 is based on Ensembl GTF, and is described in detail at this link.
A4/A3
Contains a PostScript map showing SGP2 predictions together with geneid and genscan predictions, tblastx matches, repeat locations and, when available, annotations of the known genes. Maps were obtained using gff2ps. Two sizes are provided, for printing on A4 or A3 paper; we recommend A3 for viewing with ghostview or similar programs.
Gene predictions: HUMAN CHROMOSOME 22
This section contains SGP2 predictions on human chromosome
22. Chromosome 22 annotation was compiled by Victoria Haghighi from
the Columbia Genome Center. The data was downloaded from
http://www.cs.columbia.edu/~vic/sanger2gbd.
There are two sets of SGP2 predictions. The first is a set of raw predictions along the whole Chromosome 22 sequence (Homology Only). The second is a set of predictions confined to regions void of annotated genes or pseudogenes (Homology + Evidences). The goal here is to predict novel genes while minimizing chimeric predictions. In this case, annotations are taken from the Combined Gene + CDS Set (879 genes).
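The regions "void of annotated genes or pseudogenes" are the complement of the annotated intervals along the chromosome. The sketch below shows one way to compute them. It is not the procedure actually used for these predictions, it ignores strand, and the function name unannotated_regions is hypothetical.

```python
from typing import Iterable, List, Tuple

# (start, end), 1-based inclusive coordinates.
Interval = Tuple[int, int]


def unannotated_regions(annotations: Iterable[Interval], seq_length: int) -> List[Interval]:
    """Return the intervals of a sequence of length `seq_length` that are
    not covered by any annotated interval."""
    gaps: List[Interval] = []
    cursor = 1  # first position not yet known to be covered
    for start, end in sorted(annotations):
        if start > cursor:
            gaps.append((cursor, start - 1))
        # Overlapping or nested annotations only move the cursor forward.
        cursor = max(cursor, end + 1)
    if cursor <= seq_length:
        gaps.append((cursor, seq_length))
    return gaps
```

Restricting predictions to these gaps keeps a new prediction from fusing with an annotated gene, which is how chimeric predictions are avoided.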
TBX
Contains the raw tblastx (WU-Blast) results of each human sequence against a database of about 19 million mouse reads (WGS). tblastx was run with the following parameters: -nogap , Z=3000000000 , E=0.01 , W=5 , B=10000 , V=10000 , -hspmax=4 , -topcomboN=4 , -filter=xnu , and a modified blosum62 matrix in which the penalty for aligning with stop codons was set to -500.
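For readers who want to script a comparable search, the options listed above can be collected into an argument list as sketched below. The option strings are copied verbatim from this page. The order of the positional arguments (database first, then query file) and the file names in the usage note are assumptions, and the option that selects the modified matrix is deliberately left out because the page does not give it. The function name tblastx_argv is hypothetical.

```python
import shlex
from typing import List

# Options exactly as listed on this page.
SOURCE_OPTIONS = [
    "-nogap", "Z=3000000000", "E=0.01", "W=5", "B=10000", "V=10000",
    "-hspmax=4", "-topcomboN=4", "-filter=xnu",
]


def tblastx_argv(database: str, query: str) -> List[str]:
    """Build an argument list for a tblastx run.

    Assumption: the program takes the database and the query file as
    positional arguments, in that order, followed by the options.
    """
    return ["tblastx", database, query, *SOURCE_OPTIONS]


def tblastx_command(database: str, query: str) -> str:
    """Return the same arguments as a shell-quoted command line."""
    return shlex.join(tblastx_argv(database, query))
```

Calling tblastx_command("mouse_reads", "chr22.masked.fa") returns a single string that can be inspected before anything is run; both file names are placeholders.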
HSP
Contains the resulting HSPs in GFF format (but with frames 1,2,3 as in blast).
SR
Similarity regions in GFF format (but with frames 1,2,3 as in blast), as projected from the HSPs (see how they are obtained and how they influence the exon scores in the SGP2 algorithm description page).
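The exact projection rule is given on the SGP2 algorithm description page and is not reproduced here. Purely as an illustration, the sketch below shows one plausible reading: within a single strand and frame, every position takes the highest score among the HSPs that cover it, and adjacent stretches with the same score are merged into one region. Both that rule and the function name project_hsps are assumptions; the actual SGP2 procedure may differ.

```python
from typing import List, Tuple

# (start, end, score), 1-based inclusive, all on the same strand and frame.
Region = Tuple[int, int, float]


def project_hsps(hsps: List[Region]) -> List[Region]:
    """Project overlapping HSPs onto the sequence axis (assumed rule:
    each position gets the maximum score of the HSPs covering it)."""
    # Every HSP start and every position just past an HSP end is a breakpoint.
    points = sorted({p for start, end, _ in hsps for p in (start, end + 1)})
    regions: List[Region] = []
    for left, right in zip(points, points[1:]):
        # The elementary segment is [left, right - 1].
        covering = [score for start, end, score in hsps
                    if start <= left and end >= right - 1]
        if not covering:
            continue  # gap between HSPs
        score = max(covering)
        if regions and regions[-1][1] == left - 1 and regions[-1][2] == score:
            # Extend the previous region when it is adjacent and scores the same.
            regions[-1] = (regions[-1][0], right - 1, score)
        else:
            regions.append((left, right - 1, score))
    return regions
```

With HSPs (10, 50, 30.0) and (40, 80, 60.0), the overlap 40 to 50 takes the higher score, giving regions (10, 39, 30.0) and (40, 80, 60.0).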
GFF
`General Feature Format' (GFF) is described on the Sanger Centre gff definition page.
GTF2
`Gene Transfer Format' (GTF) borrows from GFF, but has additional structure that warrants a separate definition and format name. GTF2 is based on Ensembl GTF, and is described in detail at this link.
Whole-Genome Gene-Predictions
The results of SGP2 on human and mouse genomes are available from our new Gene-Prediction section. Follow these links to download them: