Our group maintains the following datasets:
DATASET is available from this link: /datasets/ESEselection2008/
DATASET is available from this link: /datasets/racearrays2007/
Datasets and results for the human-mouse-chicken-zebrafish orthologous gene regions that were used to train and optimize the parameters of the multiple TF-map alignment. The collection includes characterized real promoters and enhancers as well as artificial non-collinear examples.
DATASET is available from this link: /datasets/mmeta2006/
Database of clusters of orthologous U12 introns from 18 animal, 1 plant and 1 fungal species.
DATASET is available from this link: /datasets/u12/
Dataset of the 40 human-mouse gene promoter pairs that was used to optimize the parameters of the TF-map alignment. Dataset of different genomic orthologous regions for these genes. Dataset and results of the TF-map alignment on the 5333 CISRED human co-expressed genes.
DATASET is available from this link: /datasets/meta2005/
650 experimentally verified orthologous transcription factor binding sites (TFBSs). Annotations have been collected from the literature. This collection also includes the promoter sequences, cross-references to EntrezGene, PubMed and RefSeq, predictions by weight matrix collections, sequence alignments and graphical dotplots.
DATASET is available from this link: /datasets/abs2005/
Different evaluation programs were used to compare the accuracy of the gene predictions submitted to the GENCODE EGASP'05 workshop, held at the Sanger Center on May 6-7, 2005. The results of those evaluations are provided here, along with some discussion of the different methods used to calculate the accuracy of each approach at three levels of gene structure (nucleotide, exon, and transcript/gene).
DATASET is available from this link: /datasets/egasp2005/
Datasets of 311 putative novel human genes found using the comparative gene predictor SGP2 and the chicken genome sequence, the subset of the 50 most promising predictions tested by RT-PCR, and the GenBank accessions of the six RT-PCR positives.
DATASET is available from this link: /datasets/ggalhsapgenes2005/
Datasets for the comparative analysis of splice site sequences on a large collection of human, mouse, rat and chicken introns. The analyses performed on those datasets focused on the conservation of orthologous splice sites, the evolution of the U2/U12 major intron classes and the subtype switching within those classes.
DATASET is available from this link: /datasets/hmrg2004/
Datasets of human splice sites from RefSeq-hg15 (ACCDON), internal exons from the Burset and Guigó and Rogic et al. human gene sets (BGROIEXONS) and splice, start and stop sites from RefSeq-hg16 not present in the Burset and Guigó and Rogic et al. human gene sets (NOBGRORS).
DATASET is available from this link: /datasets/splidlbns2004/
All the programs and data used to identify selenoproteins in the human genome. Seven novel selenoprotein genes were found by SECIS and gene prediction, together with comparative genomics approaches. We believe the human selenoproteome to consist of 17 selenoprotein families (15kDa, DI, GPX, SelH, SelI, SelK, SelM, SelN, SelO, SelP, SelR, SelS, SelT, SelV, SelW, SPS2 and TR) and, in addition, two Cys-containing homologs (MsrA and SelU), which are selenoproteins in other organisms.
DATASET is available from this link: /datasets/sphuman2003/
On this site we describe all the programs and data presented in Guigó et al., PNAS 2003. In that paper we estimated that nearly a thousand novel human genes that do not overlap known proteins can be verified experimentally. The method is based on the comparison of the human and mouse genomes to enhance the resulting gene predictions, plus a filtering step, after which a sample of mouse predictions was tested by RT-PCR amplification and direct sequencing.
DATASET is available from this link: /datasets/mouse2002/
Supplementary materials for the SGP2 paper are available from this section. SGP2 is a gene prediction program that combines ab initio gene prediction with TBLASTX searches between two genome sequences to provide both sensitive and specific gene predictions.
DATASET is available from this link: /datasets/sgp2002/
On this site we describe all the programs and data used to predict selenoproteins in the Drosophila melanogaster genome. Two novel selenoprotein families (SelK and SelH, previously named SelG and SelM) were found by coordination of gene and SECIS prediction. In addition, the fly genome is known to contain the SPS2 selenoprotein.
DATASET is available from this link: /datasets/spdroso2001/
A database (SpliceDB) of known mammalian splice site sequences has been developed. Weight matrices were built for the major splice groups, which can be incorporated into gene prediction programs.
SpliceDB is available at the computational genomics Web server of the Sanger Center and has a mirror site at SoftBerry. [Burset, Seledtsov and Solovyev, Nucleic Acids Research 29(1):255-259 (2001)]
Given the absence of experimentally verified large genomic data sets, we constructed a semi-artificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions in order to analyze the accuracy of gene-prediction programs.
DATASET is available from this link: /datasets/gpeval2000/
A set of training sequences (exons/introns) and the resulting parameters required to run geneid on the Drosophila melanogaster genome.
DATASET is available from this link: /datasets/Dro_me/
A number of computer programs for the prediction of gene structure in genomic DNA sequences are analyzed. The programs are tested on a large set of vertebrate sequences.
DATASET is available from this link: /datasets/genomics96/