The subject of my PhD is the computational prediction of selenoprotein genes in diverse eukaryotic genomes.
My broader research interests include systems biology, the evolution of protein and gene families, and the RNA world in general.
I am particularly fascinated by examples of RNA elements playing a regulatory role in the genome.
Specifically, my PhD focuses on the development and application of computational methods for the prediction of selenoprotein genes in eukaryotic genomes.
Selenoproteins are a diverse family of proteins characterised by the presence of selenocysteine, the 21st amino acid. Selenocysteine is
co-translationally inserted into the growing polypeptide chain in response to the opal codon UGA. UGA is normally a stop codon, and its interpretation
as a sense codon in selenoprotein genes depends on the presence of a stem-loop structure in the 3' UTR, the SECIS (SElenoCysteine
Insertion Sequence) element. This element is recognised by the SECIS Binding Protein 2 (SBP2), which then recruits the
Sec-specific elongation factor EFsec.
The presence of the in-frame UGA codon makes it impossible for selenoprotein genes to be recognised by normal gene prediction methods. At best,
these methods predict a truncated protein, treating the UGA as a stop codon; at worst, they miss the gene altogether.
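The truncation problem can be illustrated with a small Python sketch. It is a toy example, not part of any gene predictor: the function name translate, the recode_tga flag and the test sequence are made up for illustration. The sequence is translated with the standard genetic code, once treating TGA as a stop and once reading it as selenocysteine (U).

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds, recode_tga=False):
    """Translate a DNA coding sequence (frame 0) up to the first stop codon.

    With recode_tga=True every in-frame TGA is read as selenocysteine ('U').
    This is a simplification: a genuine TGA terminator would be read through too.
    """
    protein = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3].upper()
        aa = CODON_TABLE.get(codon, "X")  # 'X' for codons with ambiguous bases
        if codon == "TGA" and recode_tga:
            aa = "U"
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical mini-gene with an in-frame TGA followed by a TAA terminator.
toy_gene = "ATGGGCTGTGGATGAGGAAAACTGTAA"
print(translate(toy_gene))                   # MGCG      (truncated at the TGA)
print(translate(toy_gene, recode_tga=True))  # MGCGUGKL  (TGA read as Sec)
```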
There are two basic approaches for the prediction of selenoprotein genes:
- SECIS based
- SECIS independent
The SECIS-based approach depends on the prediction of potential SECIS elements across the genome and the subsequent search for TGA-containing
ORFs at a suitable distance upstream of each candidate, since the SECIS element lies in the 3' UTR.
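The second half of that search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SECIS start coordinate is taken as already known, only the forward strand is scanned, every in-frame TGA is treated as a candidate selenocysteine codon (so ORFs must end in TAA or TAG), and the max_dist default of 3000 nt is an arbitrary placeholder. Because the SECIS sits in the 3' UTR, the ORF is expected on the 5' side of the element. The function name tga_orfs_near_secis is hypothetical.

```python
def tga_orfs_near_secis(seq, secis_start, max_dist=3000):
    """Find forward-strand ORFs with an in-frame TGA located 5' of a SECIS.

    Returns a list of (orf_start, orf_end, tga_positions), 0-based with an
    exclusive end. An ORF runs from an ATG to the next in-frame TAA/TAG and
    must end no more than max_dist nt before secis_start.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        start, tgas = None, []
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start, tgas = i, []
            elif codon == "TGA":
                tgas.append(i)          # candidate Sec codon, keep reading
            elif codon in ("TAA", "TAG"):
                end = i + 3
                if tgas and 0 <= secis_start - end <= max_dist:
                    hits.append((start, end, tgas))
                start = None            # look for the next ORF in this frame
    return hits


# Toy contig: 2 nt of leader, a mini-ORF with an in-frame TGA, 50 nt of spacer.
contig = "CC" + "ATGGGCTGTGGATGAGGAAAACTGTAA" + "A" * 50
print(tga_orfs_near_secis(contig, secis_start=len(contig)))
```

A real pipeline would also have to handle the reverse strand, introns and genes whose true terminator is itself a TGA.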
The SECIS-independent approach, on the other hand, rests on the assumption that genes are conserved across different species. Therefore,
TGA-containing ORFs which are conserved between different species, and whose conservation extends across the TGA codon, are likely
to encode selenoproteins.
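The idea of conservation extending across the TGA can be made concrete with a short Python sketch. It assumes two already-aligned protein strings of equal length in which the translated TGA appears as '*', and simply counts identical residues in a window on each side of that position. The window of 6 and the minimum of 3 per side are borrowed from the alignthingie.pl defaults listed further down this page; whether that script computes conservation in this way is an assumption, and the function names are hypothetical.

```python
def flank_identities(query, subject, pos, window=6):
    """Count identical, non-gap residues in the windows left and right of pos."""
    def identities(a, b):
        return sum(1 for x, y in zip(a, b) if x == y and x != "-")

    lo = max(0, pos - window)
    left = identities(query[lo:pos], subject[lo:pos])
    right = identities(query[pos + 1:pos + 1 + window],
                       subject[pos + 1:pos + 1 + window])
    return left, right


def conserved_across_stop(query, subject, window=6, min_left=3, min_right=3):
    """Return (pos, left, right) for every '*' aligned to '*' whose flanks
    are conserved on both sides."""
    hits = []
    for pos, (q, s) in enumerate(zip(query, subject)):
        if q == "*" and s == "*":
            left, right = flank_identities(query, subject, pos, window)
            if left >= min_left and right >= min_right:
                hits.append((pos, left, right))
    return hits


# Toy alignments: the first is conserved on both sides of the '*',
# the second loses conservation after it (as expected for a true stop).
print(conserved_across_stop("MKLGCG*GKLVPA", "MRLGCG*GKIVPA"))  # [(6, 5, 5)]
print(conserved_across_stop("MKLGCG*GKLVPA", "MRLGCG*AAAAAA"))  # []
```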
Both of these methods are explained in greater detail in the papers cited in the publications section. For a review of selenoprotein genes,
please see:
- K. Caban and P. R. Copeland
Size matters: a view of selenocysteine incorporation from the ribosome
Cell. Mol. Life Sci. 63:73-81 (2006)
[Abstract]
[Full Text]
- S. Gromer, J. K. Eubel, B. L. Lee and J. Jacob
Human selenoproteins at a glance
Cell. Mol. Life Sci. 62(21):2414-2437 (2005)
[Abstract]
[Full Text]
The following are a few, mostly simple, Perl scripts that I have written and find useful. Please feel free to use, modify
and redistribute them under the terms of the GNU General Public License (GPL). For those of you
who are computer scientists, please remember that I am a biologist and that my
education is reflected in my programming style. The scripts are not pretty, but they work.
- SECISaln
SECISaln will predict a eukaryotic SECIS element in a nucleotide sequence, split it into its structural units and then align each unit against the SECISes in our database. SECISaln can distinguish between type I and type II SECIS elements and will align the submitted sequence against others of the same type. All sequences used by SECISaln have been collected from either GenBank or EGO.
SECISaln is not intended to replace SECISearch as a SECIS element predictor. In fact, SECISaln uses SECISearch to predict SECIS elements. The objective of this tool is to provide researchers with an easy way to compare structural features of SECIS elements. It should only be used on sequences known to contain a SECIS element. The pattern used by SECISaln to recognise SECIS elements is very permissive and would result in false positives when run on unknown sequences.
If you find this software useful, please cite Chapple C. E. et al., Bioinformatics, 2009, 25(5):674-675.
- alignthingie.pl
Description: Simple Perl script that parses a BLAST output file and returns those HSPs in which a specific
residue in the query line is aligned to another (or the same) specific residue in the subject
line, provided the HSP meets certain e-value, identity, score and conservation criteria. It can return the HSPs
themselves or GFF records of them, and offers a variety of options for selecting hits.
USAGE : alignthingie.pl [options] blast.out
COMMAND-LINE OPTIONS:
-c : Minimum number of conserved residues (or '+') allowed around the matched residue
(integer, def : 6)
-C : Also check for Cs in the subject sequence which align to a '*' in the query
-e : Maximum e-value allowed (integer, def : 10)
-i : Minimum (i)dentity percentage allowed (integer, def : 0)
-I : Maximum (I)dentity percentage allowed (integer, def : 100)
-l : Minimum number of conserved residues (or '+') allowed on the (l)eft side of the matched residue
(def : 3)
-r : Minimum number of conserved residues (or '+') allowed on the (r)ight side of the matched residue
(def : 3)
-M : (M)aximum score value allowed (integer, def : 10000)
-m : (m)inimum score value allowed (integer, def : 0)
-q : String to match in query (def : '*' )
-s : String to match in subject (def : '*')
-R : Length of (R)egion around matched residue to check for conservation (integer, def : 6)
(R-matched residue-R)
-S : Use strict e-value, identity and score cutoffs (0.01, 65 and 50 respectively) and strict conservation cutoffs.
The conservation cutoffs are too long to list here; run "alignthingie.pl -Sd" to see them.
-k : Do not check conservation
-u : How many (u)naligned residues are allowed with respect to query length.
( - <= )
OUTPUT OPTIONS:
-A : Return (A)ll HSPs which pass thresholds without looking for any specific aligned residues
-b : Print only the (b)est (lowest e-value) hit for each query. If the smallest evalue
is shared by more than one HSP, all such HSPs will be printed.
-B : Print only the (B)est (lowest e-value) hit for each query. If the smallest evalue
is shared by more than one HSP, only the first such HSP will be printed.
-d : (d)ebugging mode, very very verbose...
-f : No sel(f) : Skips subjects whose name matches (case-INsensitive) the value passed. (string)
-F : Generalised no sel(F), takes first chars (till first space, OR value passed) of query and subj names and
skips the hit if the 2 are identical.
-g : (g)ff output. Use "-g q" for query position gff and "-g s" for subject gff.
-L : Print most (L)ikely hits, i.e. those with no stop codon before the matched residue and whose
conservation on the right side of the match is no more than 2 less than that of the left side.
(this option is really only useful for selenoprotein searches)
-n : Print only the names of those queries which did NOT return any HSP.
-p : Do not return hits against (p)lant species.
-Q : (Q)uery name or list of names (text file, one name per line) to return HSPs for. Only those HSPs whose
query is specified will be printed
-T : (T)arget (subject) name or list of names (text file, one name per line) to return HSPs for.
Only those HSPs whose subject is specified will be printed
-v : (v)erbose output, prints a . for each query processed.
-V : more (V)erbose output, prints a . for each query processed and a ! for each hit found.
-x : Only return those hits with a redox box CXXU/*
-X : Read a list of species names and ignore hits between the same species.
-U : Query name or list of (U)nwanted queries. Quoted list of query names (or text file, one name per line)
for which NOT to return HSPs.
- retrieveseqs.pl
Description : This script takes one or more lists of sequence names, either as separate files or
as options on the command line, and retrieves their sequences from a multi-FASTA file.
USAGE: retrieveseqs.pl [-viofsn] <FASTA sequence file> <desired IDs, one per line>
COMMAND-LINE OPTIONS:
-v : verbose output, print a progress indicator (a "." for every 1000 sequences processed)
-V : as above but a "!" for every desired sequence found.
-f : fast, takes the first characters of the given name, up to the first space (/^([^\s]*)/), as the search string.
Make SURE that those characters are UNIQUE.
-i : use when the ids in the id file are EXACTLY identical
to those in the FASTA file
-h : Show this help and exit.
-o : will create one fasta file for each of the id files
-s : will create one fasta file per id
-n : means that the last arguments (after the sequence file)
passed are a QUOTED list of the names desired.
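For readers who prefer Python, the core task described above (pulling named sequences out of a multi-FASTA file) can be sketched as follows. This is not a port of retrieveseqs.pl: the function names read_fasta and retrieve_seqs are hypothetical, and matching on the first whitespace-delimited word of each header is an assumption loosely modelled on the -f behaviour described above.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open multi-FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def retrieve_seqs(handle, wanted_ids):
    """Return (header, sequence) pairs whose ID (first word of the header)
    is in wanted_ids, in the order they occur in the FASTA file."""
    wanted = set(wanted_ids)
    found = []
    for header, seq in read_fasta(handle):
        words = header.split(None, 1)
        seq_id = words[0] if words else ""
        if seq_id in wanted:
            found.append((header, seq))
    return found


# Toy input held in memory; a real call would pass open("seqs.fa") instead.
demo = io.StringIO(">seq1 first\nACGT\nAC\n>seq2\nGGGG\n>seq3 third one\nTTTT\n")
for header, seq in retrieve_seqs(demo, ["seq3", "seq1"]):
    print(">" + header)
    print(seq)
```

Streaming the file once and checking each ID against a set keeps memory use low for large FASTA files, at the cost of returning hits in file order rather than in the order of the ID list.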
- SECISearch
Description: SECISearch uses patscan and RNAfold, respectively, to scan nucleotide sequences for SECIS elements
and to evaluate them thermodynamically. You will need to download and install
both of these programs to use SECISearch. By default, it prints the SECISes found to STDOUT in FASTA format. If no
pattern is specified with the -p option, SECISearch uses its own default pattern. It was originally
developed as a web-based application by Gregory Kryukov, and I have since modified it to work on the
command line. For more information and for the web-based interface, please visit the
Gladyshev lab.
The download is a Linux tarball. Minimal editing of the code is needed to provide the correct
locations of the necessary executables on your system. Please contact me if you have trouble downloading or installing this script.
USAGE : SECISearch [OPTIONS] -p [patscan pattern file] [FASTA input file]
COMMAND-LINE OPTIONS:
-c : Patscan will search the complementary strand
-d : Debugging mode (very very very verbose)
-e : Free energy cut-off value of the entire structure
-t : Return <FILENAME>.<PATNAME>.hits
-g : Create <FILENAME>.<PATNAME>.gff
-G : Create gff in geneid format (score=1), use with -g or -o gff.
-h : Display this message and exit
-H : Produce HTML output
-I : Do not return images
-l : Print structures which did not pass the thermodynamic evaluation to the logfile.
-o : Choice of STDOUT output format, can be 'fs' for fasta
or 'gff' for GFF. The default output is fasta.
-n : Return <FILENAME>.en
-p : Pattern file passed to scan_for_matches. If '-p' is not given, the standard
pattern is used. Possible values are 's' for a standard pattern, 'n' for a
non-standard pattern, and 't' for a "twilight zone" pattern. Any bareword
value is assumed to be one of the patterns stored in the ~/share/SECISearch
directory (or wherever else the patterns happen to be stored on your system).
Any directory address will be assumed to be the location of a user defined
pattern file. CAUTION: SECISearch will not warn you if you give a non-existent
filename as a pattern, so check your pattern name before running.
-P : Do not run patscan. (input file must be in the format of patscan's output)
-R : Do not run RNAfold
-s : Do NOT return <FILENAME>.<PATNAME>.secis, FASTA file of SECIS elements
-T : Print total number of SECIS found
-v : Verbose output
-x : Create the log file "secislog"
SECIS Structural Feature Options (ON by default, use these options to turn the filters OFF):
-B : Discards SECISes with at least 2 more unpaired nts on the 5'
side than on the 3' side (visually bent to the left)
-O : Discards SECISes with more than 2 consecutive,
unpaired nts on any strand in the first 7nt after the quartet
-Y : Discards Y-shaped SECISes
-S : Discards structures with fewer than 8 pairs
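The simplest of these structural filters, the minimum number of pairs checked by -S, is easy to illustrate in Python. The sketch assumes the predicted structure is available in the dot-bracket notation that RNAfold prints; how SECISearch itself counts pairs is not documented here, so the functions count_pairs and passes_min_pairs are hypothetical stand-ins rather than the tool's own code.

```python
def count_pairs(dot_bracket):
    """Count base pairs in a dot-bracket structure string.

    Raises ValueError if the brackets are unbalanced.
    """
    depth = pairs = 0
    for ch in dot_bracket:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError("unbalanced structure: unexpected ')'")
            depth -= 1
            pairs += 1          # each ')' closes exactly one pair
    if depth != 0:
        raise ValueError("unbalanced structure: unclosed '('")
    return pairs


def passes_min_pairs(dot_bracket, min_pairs=8):
    """True if the structure has at least min_pairs base pairs."""
    return count_pairs(dot_bracket) >= min_pairs


print(passes_min_pairs("((((((((....))))))))"))  # True  (8 pairs)
print(passes_min_pairs("(((....)))"))            # False (3 pairs)
```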
- Articles:
- The Bovine Genome Sequencing and Analysis Consortium (including C.E. Chapple and R. Guigó),
The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution.
Science,
324(5926):522-528 (2009)
[Abstract] [PDF] [PubMed] [SUPPLEMENTARY]
- Takeuchi A, Schmitt D, C.E. Chapple, Babaylova E, Karpova G, R. Guigó, Krol A, Allmang C.
A short motif in Drosophila SECIS Binding Protein 2 provides differential binding affinity to SECIS RNA hairpins.
Nucleic Acids Res. 2009 Feb 17. [Epub ahead of print]
- C.E. Chapple, R. Guigó and A. Krol
"SECISaln, a web-based tool for the creation of structure-based alignments of eukaryotic SECIS elements."
Bioinformatics. 2009 Mar 1;25(5):674-5. Epub 2009 Jan 29. [Abstract] [PDF] [PubMed] [SUPPLEMENTARY]
- C.E. Chapple and R. Guigó
"Relaxation of selective constraints causes independent selenoprotein extinction in insect genomes."
PLoS ONE 3(8):e2968 (2008) [Abstract] [PDF] [PubMed]
- C. Manichanh, C.E. Chapple, L. Frangeul, R. Guigó and J. Dore
"A comparison of random sequence reads versus 16S rDNA sequences for estimating the biodiversity of a metagenomic library."
Nucleic Acids Res. 2008 Sep;36(16):5180-8. Epub 2008 Aug 5. [Abstract] [PDF] [PubMed]
- A.G. Clark et al.
"Evolution of genes and genomes on the Drosophila phylogeny."
Nature 450(7167):203-18 (2007) [Abstract] [PDF] [PubMed]
- S. Castellano, A.V. Lobanov, C. Chapple, S.V. Novoselov, M. Albrecht, D. Hua, A. Lescure, T. Lengauer, A. Krol, V.N. Gladyshev and
R. Guigó
"Diversity and functional plasticity of eukaryotic selenoproteins: identification and characterization of the SelJ family."
PNAS, 102(45):16188-16193 (2005) [Abstract]
[From the cover]
[PubMed]
[Commentary by R.J. Stillwell and M.J. Berry]
- K. Taskov, C. Chapple, G.V. Kryukov, S. Castellano, A.V. Lobanov, K.V. Korotkov, R. Guigó and V.N. Gladyshev
Nematode selenoproteome: the use of selenocysteine insertion system to decode one codon in an animal genome?
Nucleic Acids Research, 33:2227-2238 (2005) [Abstract]
[Full Text]
- International Tetraodon Genome Sequencing Consortium
Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype
Nature, 431:946-957 (2004)
[Abstract]
[Full Text]
[Datasets]
- Posters:
- S. Castellano, C. Chapple and R. Guigó
Annotation of Eukaryotic Selenoproteins: Finding the Needle in the Haystack
The Biology of Genomes, CSHL, New York (USA) (2004)
[Gzipped PostScript file]