Genome BioInformatics Research Lab




OK, quick CV. My name is Charles (Karolos) Chapple. I'm half Greek, half American (one parent of each nationality) and grew up in Greece. I was born in Athens in 1980 and left that sunny place for England where, in 2002, I graduated from York University with a BSc in Biology and a serious dislike of rain. In an effort to live somewhere where I would not be rained on 5 days out of every 7, I came to Barcelona and joined the Guigó lab as an intern in December 2002. Here I discovered the wonderful world of computational gene prediction in general, and of selenoproteins in particular, and began my PhD under R. Guigó's supervision in September 2003.

Bug me: charles.chapple@crg.es



 

Research interests


The subject of my PhD is the computational prediction of selenoprotein genes in diverse eukaryotic genomes. My broader research interests include systems biology, the evolution of protein and gene families, and the RNA world in general. I am particularly fascinated by examples of RNA elements playing a regulatory role in the genome.

Specifically, my PhD focuses on the development and application of computational methods for the prediction of selenoprotein genes in eukaryotic genomes. Selenoproteins are a diverse group of proteins characterised by the presence of selenocysteine (Sec), the 21st amino acid. Selenocysteine is co-translationally inserted into the growing polypeptide chain in response to the opal codon UGA. UGA is normally a stop codon, and its interpretation as a sense codon in selenoprotein genes depends on the presence of a stem-loop structure in the 3' UTR, the SECIS (SElenoCysteine Insertion Sequence) element. This element is recognised by the SECIS Binding Protein 2 (SBP2), which then recruits the Sec-specific elongation factor EFsec.

The presence of the in-frame UGA codon makes it impossible for selenoprotein genes to be recognised correctly by standard gene prediction methods. At best, such methods predict a truncated protein, reading the UGA as a stop codon; at worst, they miss the gene altogether.
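The truncation problem can be illustrated with a small Python sketch. It is not taken from any of the tools described on this page: the function name `translate` and the flag `read_tga_as_sec` are made up for the example, and the sketch naively reads every TGA as selenocysteine (U) when the flag is set, which a real predictor would not do.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds, read_tga_as_sec=False):
    """Translate a coding sequence up to its first stop codon.

    With read_tga_as_sec=True every in-frame TGA is emitted as 'U'
    (selenocysteine) instead of terminating translation. This is a
    deliberate simplification for illustration only.
    """
    protein = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3].upper()
        if codon == "TGA" and read_tga_as_sec:
            protein.append("U")
            continue
        aa = CODON_TABLE.get(codon, "X")  # 'X' for codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

Calling `translate` on a toy sequence such as `ATGTGCGGTTGAGGTTAA` with and without the flag shows the difference: the default call stops at the TGA, while the flagged call reads through it to the TAA.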

There are two basic approaches for the prediction of selenoprotein genes:

  • SECIS-based
  • SECIS-independent

The SECIS-based approach depends on predicting potential SECIS elements across the genome and then searching for TGA-containing ORFs at a suitable distance upstream of each, since the SECIS element lies in the 3' UTR.
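As a rough illustration of the second step, the Python sketch below looks for forward-strand ORFs that contain at least one in-frame TGA and end shortly before a SECIS position that is assumed to have been predicted already (for example by SECISearch). The function name and the `max_gap` and `min_codons` parameters are hypothetical, the default values are arbitrary, and the sketch treats every TGA as read-through, so ORFs that genuinely terminate with TGA are not handled.

```python
def tga_orfs_upstream(seq, secis_start, max_gap=3000, min_codons=30):
    """Return (orf_start, orf_end, n_tga) for candidate ORFs near a SECIS.

    An ORF runs from an ATG to the next in-frame TAA or TAG; TGA codons
    inside it are counted rather than treated as stops. Only ORFs with at
    least one TGA that end between 0 and max_gap nucleotides before
    secis_start are kept. Coordinates are 0-based, end-exclusive.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        orf_start, n_tga = None, 0
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None:
                if codon == "ATG":
                    orf_start, n_tga = i, 0
            elif codon == "TGA":
                n_tga += 1
            elif codon in ("TAA", "TAG"):
                end = i + 3
                gap = secis_start - end
                long_enough = (end - orf_start) // 3 >= min_codons
                if n_tga and 0 <= gap <= max_gap and long_enough:
                    hits.append((orf_start, end, n_tga))
                orf_start = None
    return hits
```

A real pipeline would also scan the reverse strand, deal with introns, and score the candidates; none of that is attempted here.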

The SECIS-independent approach, on the other hand, rests on the assumption that genes are conserved across species. Finding TGA-containing ORFs that are conserved between different species, and whose conservation extends across the TGA codon, is therefore indicative of selenoprotein function.

Both of these methods are explained in greater detail in the papers cited in the publications section. For a review of selenoprotein genes please see:

  • K. Caban and P. R. Copeland
    Size matters: a view of selenocysteine incorporation from the ribosome
    Cell. Mol. Life Sci. 63:73-81 (2006)
  • S. Gromer, J. K. Eubel, B. L. Lee and J. Jacob
    Human selenoproteins at a glance
    Cell. Mol. Life Sci. 62(21):2414-2437 (2005)

 
Software

The following are a few, mostly simple, Perl scripts that I have written and find useful. Please feel free to use, modify and redistribute them under the terms of the GNU General Public Licence (GPL). For those of you who are computer scientists, please remember that I am a biologist and that my education is reflected in my programming style. The scripts are not pretty, but they work.

  • SECISaln


    SECISaln will predict a eukaryotic SECIS element in a nucleotide sequence, split it into its structural units and then align each unit against the SECISes in our database. SECISaln can distinguish between type I and type II SECIS elements and will align the submitted sequence against others of the same type. All sequences used by SECISaln have been collected from either GenBank or EGO.

    SECISaln is not intended to replace SECISearch as a SECIS element predictor. In fact, SECISaln uses SECISearch to predict SECIS elements. The objective of this tool is to provide researchers with an easy way to compare structural features of SECIS elements. It should only be used on sequences known to contain a SECIS element. The pattern used by SECISaln to recognise SECIS elements is very permissive and would result in false positives when run on unknown sequences.

    If you find this software useful, please cite Chapple C. E. et al., Bioinformatics, 2009, 25(5):674-675.

  • alignthingie.pl


    Description: Simple Perl script that parses a BLAST output file and returns those HSPs in which a specific residue in the query line is aligned to another (or the same) specific residue in the subject line, provided the HSP meets certain e-value, identity, score and conservation criteria. It can return the HSPs themselves or GFF records of the HSPs, and sports a variety of options for selecting hits.

      USAGE : alignthingie.pl [options] blast.out 
    
      COMMAND-LINE OPTIONS:
    
        -c : Minimum number of conserved residues (or '+') allowed around the matched residue
             (integer, def : 6)
        -C : Also check for Cs in the subject sequence which align to a '*' in the query
        -e : Maximum e-value allowed (integer, def : 10)
        -i : Minimum (i)dentity percentage allowed (integer, def : 0)
        -I : Maximum (I)dentity percentage allowed (integer, def : 100)
        -l : Minimum number of conserved residues (or '+') allowed on the (l)eft side of the matched residue
             (def : 3)
        -r : Minimum number of conserved residues (or '+') allowed on the (r)ight side of the matched residue
             (def : 3)       
        -M : (M)aximum score value allowed (integer, def : 10000)
        -m : (m)inimum score value allowed (integer, def : 0)
        -q : String to match in query (def : '*' )
        -s : String to match in subject     (def : '*')
        -R : Length of (R)egion around matched residue to check for conservation (integer, def : 6)
             (R-matched residue-R)   
        -S : Use strict evalue, id, score (0.01,65,50 respectively) and conservation cutoffs.
             The conservation cutoffs are too long to list here, just run "alignthingie.pl -Sd" to see them.
        -k : Do not check conservation
        -u : How many (u)naligned residues are allowed with respect to query length.  
             ( -  <= )
    
    OUTPUT OPTIONS:
        -A :  Return (A)ll HSPs which pass thresholds without looking for any specific aligned residues
        -b :  Print only the (b)est (lowest e-value) hit for each query. If the smallest evalue
              is shared by more than one HSP, all such HSPs will be printed.
        -B :  Print only the (B)est (lowest e-value) hit for each query. If the smallest evalue
              is shared by more than one HSP, only the first such HSP will be printed.
        -d : (d)ebugging mode, very very verbose...
        -f :  No sel(f) : Skips subjects whose name matches (case-INsensitive) the value passed. (string)
        -F :  Generalised no sel(F), takes first chars (till first space, OR value passed) of query and subj names and 
              skips the hit if the 2 are identical.
        -g : (g)ff output. Use "-g q" for query position gff and "-g s" for subject gff.
        -L :  Print most (L)ikely hits. ie, those with no stop codon before the matched residue and whose
              conservation on the right side of the match is no more than 2 less than that of the left side.
              (this option is really only useful for selenoprotein searches)
        -n :  Print only the names of those queries which did NOT return any HSP.
        -p :  Do not return hits against (p)lant species.
        -Q : (Q)uery name or list of names (text file, one name per line) to return HSPs for. Only those HSPs whose 
              query is specified will be printed
        -T : (T)arget (subject) name or list of names (text file, one name per line) to return HSPs for. 
              Only those HSPs whose subject is specified will be printed
        -v : (v)erbose output, prints a . for each query processed.
        -V :  more (V)erbose output, prints a . for each query processed and a ! for each hit found.
        -x : Only return those hits with a redox box CXXU/*
        -X : Read a list of species names and ignore hits between the same species.
        -U : Query name or list of (U)nwanted queries. Quoted list of query names (or text file, one name per line)
             for which NOT to return HSPs.
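The conservation check behind the -c, -l, -r and -R options can be sketched in Python as follows. This is an interpretation of the help text above, not the script's actual Perl code: the function names are invented, and it is assumed that "conserved" means any non-blank character (a letter or '+') on the BLAST midline, and that -c applies to the left and right counts combined.

```python
def flank_conservation(query, midline, subject, q_char="*", s_char="*", region=6):
    """Find columns where q_char in the query aligns to s_char in the subject.

    For each such column, count the conserved positions (any non-blank
    midline character) within `region` columns to the left and to the
    right. Returns a list of (column, left_count, right_count) tuples.
    """
    def conserved(segment):
        return sum(1 for c in segment if c != " ")

    results = []
    for pos, (q, s) in enumerate(zip(query, subject)):
        if q == q_char and s == s_char:
            left = midline[max(0, pos - region):pos]
            right = midline[pos + 1:pos + 1 + region]
            results.append((pos, conserved(left), conserved(right)))
    return results


def passes(left, right, min_left=3, min_right=3, min_total=6):
    """Apply thresholds modelled on the -l, -r and -c defaults."""
    return left >= min_left and right >= min_right and left + right >= min_total
```

The three strings are the query, midline and subject lines of a single HSP, which must have equal length. Requiring conservation on both sides of the match is what separates a read-through UGA from an ordinary stop codon at the end of a gene.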
    
    
  • retrieveseqs.pl


    Description: This script takes one or more lists of sequence names, either as separate files or as arguments on the command line, and retrieves the corresponding sequences from a multi-FASTA file.

      USAGE: retrieveseqs.pl [-viofsn] <FASTA sequence file> <desired IDs, one per line>
      
      COMMAND-LINE OPTIONS:
        -v : verbose output, print a progress indicator (a "." for every 1000 sequences processed)
        -V : as above but a "!" for every desired sequence found.
        -f : fast, takes first characters of name "(/^([^\s]*)/)" given until the first space as the search string
             make SURE that those chars are UNIQUE.
        -i : use when the ids in the id file are EXACTLY identical
             to those in the FASTA file
        -h : Show this help and exit.
        -o : will create one fasta file for each of the id files
        -s : will create one fasta file per id
        -n : means that the last arguments (after the sequence file)
             passed are a QUOTED list of the names desired.
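For readers who prefer Python, the core of what retrieveseqs.pl does can be approximated with the sketch below. It is not a port of the script: the function names are made up, and it only covers the behaviour described for the -f option, matching each wanted name against the first whitespace-delimited word of the FASTA header.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def retrieve(fasta_lines, wanted_ids):
    """Yield the records whose ID (first word of the header) is wanted."""
    wanted = set(wanted_ids)
    for header, seq in read_fasta(fasta_lines):
        parts = header.split(None, 1)
        if parts and parts[0] in wanted:
            yield header, seq
```

Because both functions accept any iterable of lines, an open file handle can be passed in directly, so a large multi-FASTA file is streamed rather than loaded into memory.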
    

  • SECISearch


    Description: SECISearch uses patscan to scan nucleotide sequences for SECIS elements and RNAfold to evaluate them thermodynamically. You will need to download and install both of these programs to use SECISearch. By default, it prints the SECISes found to STDOUT in FASTA format. If no pattern is specified with the -p option, SECISearch uses its own default pattern. It was originally developed as a web-based application by Gregory Kryukov, and I have since modified it to work on the command line. For more information and for the web-based interface, please visit the Gladyshev lab.

    The download is a Linux tarball. Minimal editing of the code is needed to provide the correct locations of the necessary executables on your system. Please contact me if you have trouble downloading or installing this script.

      USAGE  : SECISearch [OPTIONS] -p [patscan pattern file] [FASTA input file]
        
      COMMAND-LINE OPTIONS:
        -c : Patscan will search the complementary strand
        -d : Debugging mode (very very very verbose)
        -e : Free energy cut-off value of the entire structure
        -t : Return <FILENAME>.<PATNAME>.hits
        -g : Create <FILENAME>.<PATNAME>.gff
        -G : Create gff in geneid format (score=1), use with -g or -o gff.
        -h : Display this message and exit
        -H : Produce HTML output
        -I : Do not return images
        -l : Print structures which did not pass the thermodynamic evaluation to the logfile.
        -o : Choice of STDOUT output format, can be 'fs' for fasta 
             or 'gff' for GFF. The default output is fasta. 
        -n : Return <FILENAME>.en
        -p : Pattern file passed to scan_for_matches. If '-p' is not given, the standard 
             pattern is used. Possible values are 's' for a standard pattern, 'n' for a 
             non-standard pattern, and 't' for a "twilight zone" pattern. Any bareword 
             value is assumed to be one of the patterns stored in the ~/share/SECISearch
             directory (or wherever else the patterns happen to be stored on your system).
             Any directory address will be assumed to be the location of a user defined 
             pattern file. CAUTION: SECISearch will not warn you if you give a nonexistent
             filename as a pattern, so check your pattern name before running.
        -P : Do not run patscan. (input file must be in the format of patscan's output) 
        -R : Do not run RNAfold
        -s : Do NOT return <FILENAME>.<PATNAME>.secis, FASTA file of SECIS elements
        -T : Print total number of SECIS found
        -v : Verbose output
        -x : Create the log file "secislog"
    
    SECIS Structural Feature Options (ON by default, use these options to turn the filters OFF):
        
        -B : Discards SECISes with at least 2 more unpaired nts on the 5'
             side than on the 3' side (visually bent to the left)
        -O : Discards SECISes with more than 2 consecutive,
             unpaired nts on any strand in the first 7nt after the quartet
        -Y : Discards Y-shaped SECISes 
        -S : Discards structures with fewer than 8 pairs
    
    
Publications

  • Articles:

    • The Bovine Genome Sequencing and Analysis Consortium (including C.E. Chapple and R. Guigó),
      The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution.
      Science, 324(5926):522-528 (2009)

    • A. Takeuchi, D. Schmitt, C.E. Chapple, E. Babaylova, G. Karpova, R. Guigó, A. Krol and C. Allmang
      A short motif in Drosophila SECIS Binding Protein 2 provides differential binding affinity to SECIS RNA hairpins.
      Nucleic Acids Res. 2009 Feb 17. [Epub ahead of print]

    • C.E. Chapple, R. Guigó and A. Krol
      "SECISaln, a web-based tool for the creation of structure-based alignments of eukaryotic SECIS elements."
      Bioinformatics. 2009 Mar 1;25(5):674-5. Epub 2009 Jan 29.

    • C.E. Chapple and R. Guigó
      "Relaxation of selective constraints causes independent selenoprotein extinction in insect genomes."
      PLoS ONE 3(8):e2968 (2008)

    • C. Manichanh, C.E. Chapple, L. Frangeul, R. Guigó and J. Dore
      "A comparison of random sequence reads versus 16S rDNA sequences for estimating the biodiversity of a metagenomic library."
      Nucleic Acids Res. 2008 Sep;36(16):5180-8. Epub 2008 Aug 5.

    • A.G. Clark et al. "Evolution of genes and genomes on the Drosophila phylogeny."
      Nature 450(7167):203-18 (2007)

    • S. Castellano, A.V. Lobanov, C. Chapple, S.V. Novoselov, M. Albrecht, D. Hua, A. Lescure, T. Lengauer, A. Krol, V.N. Gladyshev and R. Guigó
      "Diversity and functional plasticity of eukaryotic selenoproteins: identification and characterization of the SelJ family."
      PNAS, 102(45):16188-16193 (2005)   [From the cover]   [Commentary by R.J. Stillwell and M.J. Berry]

    • K. Taskov, C. Chapple, G.V. Kryukov, S. Castellano, A.V. Lobanov, K.V. Korotkov, R. Guigó and V.N. Gladyshev
      Nematode selenoproteome: the use of selenocysteine insertion system to decode one codon in an animal genome?
      Nucleic Acids Research, 33:2227-2238 (2005)

    • International Tetraodon Genome Sequencing Consortium
      Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype
      Nature, 431:946-957 (2004)






  • Posters:
    • S. Castellano, C. Chapple and R. Guigó
      Annotation of Eukaryotic Selenoproteins: Finding the Needle in the Haystack
      The Biology of Genomes, CSHL, New York (USA) (2004)


 