DATASETS: in silico identification of selenoproteins

Genome Informatics Research Lab

IMIM

UPF

CRG

GRIB

Selenoprotein prediction

Analysis of the draft sequence of the compact Tetraodon nigroviridis genome provides new insights into vertebrate evolution

Jaillon et al. (including S. Castellano, G. Parra, C. Chapple and R. Guigó^*).

Submitted (2004) [Abstract] [Full Text] [Genome Browser]

^*To whom correspondence should be addressed in relation to selenoproteins.

Contents

On this site we describe and provide all the programs and data used to predict selenoproteins within the Tetraodon genome annotation effort. The known selenoprotein families were reannotated and, in addition, novel ones were searched for.

Selenoproteins overview
Comparative genomics
Genome sequences
Annotation protocol
Reannotation protocol
geneid
genewise
spidey
SECISearch
Reannotated selenoproteins
Novel selenoproteins
Genome Browser

Selenoproteins overview

Major points on selenoproteins are:

They incorporate the amino acid selenocysteine (U or Sec), which is the 21st amino acid. It has its own tRNA, which carries the anticodon for UGA (which we were taught was only a STOP codon!).
So why don't all UGA codons code for Sec? Because the alternative decoding of UGA is conferred by an mRNA secondary structure, termed SECIS. This structure, by means of one or more proteins, directs the ribosome to incorporate Sec.
They are everywhere: Eukarya, Bacteria and Archaea. However, the SECIS element is located in the 3' UTR in eukaryotes and archaea, but in the coding region (just after the UGA) in bacteria. Eukaryotic, bacterial and archaeal SECIS elements differ substantially.
Try standard gene prediction and the most you will get are truncated selenoprotein genes. Why not accept that UGA can code for Sec and then use a variety of computational techniques able to handle this codon ambiguity? We use gene prediction tools that can predict genes with an in-frame TGA and, in addition, protein and transcript alignments against the genomic sequence to refine gene structures. Finally, the SECIS element is also predicted with a dedicated computational algorithm. At the same time, false positive predictions can be sorted out with comparative genomics approaches. This is the work presented here.
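
To make the codon ambiguity concrete, here is a minimal Python sketch. It is not part of geneid or of any other tool used in this work; the function name, the toy sequence and the rule "read every in-frame TGA as Sec" are assumptions made purely for illustration. Real predictions also need SECIS and comparative evidence, as described on this page.

```python
# Standard genetic code, codons enumerated in TCAG x TCAG x TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds, tga_as_sec=False):
    """Translate a coding sequence codon by codon.

    With tga_as_sec=False, TGA is an ordinary stop codon.
    With tga_as_sec=True (toy rule), every in-frame TGA is read as
    selenocysteine (U) and only TAA/TAG terminate translation.
    """
    protein = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3].upper()
        if codon == "TGA" and tga_as_sec:
            protein.append("U")
            continue
        aa = CODON_TABLE[codon]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Invented toy coding sequence: ATG GCT TGA GGT AAA TAA
toy_cds = "ATGGCTTGAGGTAAATAA"
print(translate(toy_cds))                   # truncated at the TGA
print(translate(toy_cds, tga_as_sec=True))  # reads through the TGA as U
```

The first call shows the truncation that a standard gene predictor would produce; the second shows the full-length product once TGA is allowed to code for Sec.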

Comparative genomics

According to the Human Genome Project, Comparative genomics is the analysis and comparison of genomes from different species. The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and noncoding regions of the genome. Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse. Genome researchers look at many different features when comparing genomes: sequence similarity, gene location, the length and number of coding regions (called exons) within genes, the amount of noncoding DNA in each genome, and highly conserved regions maintained in organisms as simple as bacteria and as complex as humans.

In this case, sequence conservation across a TGA codon, between two or more genes from different species, strongly suggests selenocysteine coding function. In addition, SECIS elements are also conserved at the primary sequence level when compared at the right phylogenetic distance.

Genome sequences

For this work, we used the initial assembly of the Tetraodon sequence described in this very same paper. It consists of 25,773 scaffolds covering 342 Mb (including gaps). The Tetraodon genome size is estimated to be between 329 and 356 Mb (average 340).

Annotation protocol

The annotation process refers to the description and location of genes and other biologically relevant features on a genomic sequence. That is, annotation defines the particular genomic coordinates (nucleotide positions along a DNA sequence) of the biological element of interest, e.g. a gene or a promoter element.

In the Tetraodon genome project, genes were annotated in the following way:

Three vertebrate genomes (human, mouse and Takifugu) had been totally or partially sequenced prior to Tetraodon, providing a catalogue of predicted vertebrate protein-coding genes to guide annotation of those present in this pufferfish genome. This data was exploited using Exofish, a tool that identifies evolutionarily conserved regions (ecores) with high specificity. In particular, human and mouse IPI proteins were compared to the Tetraodon assembly using Exofish.
Human and mouse IPI proteins that matched with Exofish were also aligned on Tetraodon using genewise.
The Exofish tool was then used to compare the entire human and mouse genomic sequences to the Tetraodon genome. Additional ecores, not identified through proteome comparisons, were found.
The Exofish tool was also used to compare the Takifugu and Tetraodon genome assemblies. This step had a higher sensitivity to detect genes than its equivalent with mammalian sequences.
To further increase the possibility of identifying Tetraodon genes that are not conserved in any of the aforementioned genomes, and to refine the annotation of predicted homologs, the ends of about 155,000 Tetraodon cDNA clones from 7 different tissues were sequenced.
In addition, two ab initio gene prediction programs were used. genscan and geneid were trained on manually annotated Tetraodon genes and provided additional and/or complementary gene predictions.
Finally, all these predicted sequence segments were combined with the GAZE algorithm to provide the final annotation of predicted gene models on the Tetraodon genome sequence.

However, this genome annotation pipeline DID NOT correctly annotate selenoprotein genes.

Reannotation protocol

The selenoprotein reannotation protocol was the following:

Map all known human selenoprotein families onto the predicted GAZE genes (tblastn against virtual cDNAs) to find selenoprotein gene models that need reannotation.
Map all known human selenoprotein families onto the genomic sequence (tblastn against scaffolds) to search for additional selenoprotein homologs that were not predicted.
Collect the GFF annotation for each GAZE gene model likely to be a selenoprotein.
Map the collected GAZE genes corresponding to selenoproteins to experimental transcript sequences (cDNAs or ESTs).
Predict selenoprotein genes on cDNA data with geneid and genewise (human selenoproteins are used as homologs).
Predict selenoprotein genes on genomic data with spidey (using Tetraodon transcripts, so that UTRs can be defined) and genewise (the Tetraodon selenoproteins predicted in cDNAs and the human ones are used as homologs).
Predict SECIS elements on cDNA and EST data with SECISearch 2.0 and then map them back to the genome with blastn.
Modify the GFF annotation for each GAZE gene model based on the new gene structures predicted and include the SECIS information.
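
As an illustration of the last step, the following Python sketch adds a SECIS feature to a gene model and writes it out as nine tab-separated GFF columns (seqname, source, feature, start, end, score, strand, frame, group). It is a sketch under stated assumptions, not the script used in this work: the class and function names, the scaffold and gene names, the coordinates and the "SECIS" feature label are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    strand: str
    frame: str
    group: str
    score: str = "."

    def to_gff(self):
        """Format the feature as one tab-separated GFF line."""
        fields = [self.seqname, self.source, self.feature,
                  str(self.start), str(self.end), self.score,
                  self.strand, self.frame, self.group]
        return "\t".join(fields)


def add_secis(features, start, end, source="SECISearch"):
    """Return a new, start-sorted feature list with a SECIS element added.

    The element must lie 3' of all existing features: after them on the
    plus strand, before them (in coordinates) on the minus strand.
    """
    if not features:
        raise ValueError("gene model has no features")
    first = features[0]
    if first.strand == "+":
        downstream = start > max(f.end for f in features)
    else:
        downstream = end < min(f.start for f in features)
    if not downstream:
        raise ValueError("SECIS must lie 3' of the coding exons")
    secis = Feature(first.seqname, source, "SECIS", start, end,
                    first.strand, ".", first.group)
    return sorted(list(features) + [secis], key=lambda f: f.start)


# Hypothetical two-exon gene model on an invented scaffold.
gene = [
    Feature("scaffold_1", "reannotation", "CDS", 100, 198, "+", "0", "gene_1"),
    Feature("scaffold_1", "reannotation", "CDS", 300, 449, "+", "0", "gene_1"),
]
for f in add_secis(gene, 700, 780):
    print(f.to_gff())
```

Keeping the SECIS element in the same group as the exons lets a browser or downstream script treat it as part of the gene model.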

geneid

For a general introduction, please browse the geneid page. The modified geneid version able to predict selenoproteins can be found just below (source code in ANSI C and some parameter files):

geneid_SP.tar.gz

The parameter file is an external flat file read by geneid at run time. Take a look at it! It carries the statistical information, for a given organism, used to predict genes and the gene model (which states the relationships of the exons predicted along a sequence). Please read the geneid handbook for details. The following file is the parameter file for Tetraodon and Takifugu:

tetraodon.param.3.No_SECIS.1.8.param

genewise

genewise is a program that aligns a related protein to a DNA sequence and retrieves the optimal coding gene structure. Splice sites are taken into account, which makes the definition of exons quite robust.

spidey

spidey is an mRNA-to-genomic alignment program. It also takes into account the splice site signals to find exons and, because it is aligning transcript sequences, can provide non-coding exons (5' and 3' UTRs).

SECISearch

The SECIS structure, located in the 3' UTR in both eukaryotic and archaeal mRNAs, is the secondary/tertiary RNA structure which directs the UGA codon recoding. Eukaryotic and archaeal SECIS structures differ substantially.

SECISearch 2.0 identifies candidate SECIS elements in nucleotide sequence databases on the basis of their primary sequence, secondary structure and predicted free energy criteria. An online version of the program was used. The available canonical and non-canonical patterns were run on selected fugu candidates.
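
To give a flavour of a primary-sequence filter only, here is a crude Python sketch. It is NOT the SECISearch pattern: it looks for the conserved eukaryotic SECIS nucleotides (AUGA, then an unpaired AA, then GA) in that order, but the spacing ranges are illustrative assumptions, and no secondary structure or free energy is evaluated. The pattern and function names are hypothetical.

```python
import re

# Conserved eukaryotic SECIS nucleotides in 5'-to-3' order: ATGA ... AA ... GA.
# The {8,14} and {10,30} spacers are illustrative guesses, not SECISearch values.
# The lookahead lets candidates starting at overlapping positions be reported.
SECIS_CORE = re.compile(r"(?=(ATGA[ACGT]{8,14}AA[ACGT]{10,30}GA))")


def find_secis_candidates(seq):
    """Return (start, end) pairs (0-based, end exclusive) of crude candidates.

    RNA input is accepted: U is converted to T before scanning.
    """
    seq = seq.upper().replace("U", "T")
    return [(m.start(), m.start() + len(m.group(1)))
            for m in SECIS_CORE.finditer(seq)]


# Invented toy 3' UTR containing one motif-like stretch.
toy_utr = "CCC" + "ATGA" + "C" * 10 + "AA" + "C" * 12 + "GA" + "CCC"
print(find_secis_candidates(toy_utr))
```

A filter like this would produce many false positives on real sequence; the structural and free energy criteria are what make a dedicated program necessary.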

Reannotated selenoproteins

In what follows, the gene structure (in gff format), the protein sequence and the SECIS element divided into structural units are given for each selenoprotein family in the Tetraodon genome.

Novel selenoproteins

As a result of the analysis of this puffer fish genome, a novel selenoprotein family was predicted. It is currently under experimental study in Dr. Gladyshev's lab, which independently made the same prediction.

Genome browser

The Tetraodon genome browser provides a graphical overview of the current state of gene annotation, among other features, in this genome. Selenoprotein gene structures (including the SECIS elements when available) are annotated. In addition to these visualization capabilities, annotations can be downloaded in GFF, the standard format for sharing sequence feature information.
