Annotation of the human sequence HS307871

Practical exercise

Enrique Blanco - eblanco@imim.es
Roderic Guigo - rguigo@imim.es


Abstract: In this exercise, a previously annotated gene will be used to measure the accuracy of different gene finding approaches. GRAIL, GENSCAN, geneid, FGENESH, GenomeScan, GrailEXP and GENEWISE will be used to annotate the sequence. Signal-based, content-based and homology-based (protein and cDNA sequences) search methods will be employed to improve on the ab initio results. The weak conservation of start codons leads to wrong predictions of the initial exon in most cases.

    A. GENE ANNOTATION

    Step 1. Accessing the EMBL database to retrieve the gene
    • Go to the EMBL database

    • Select Access and SRS

    • Type the accession number for this sequence U30787

    • Press the Search button

    • Click on the Embl:Entry link

    • Have a look at the different entry fields: find the mRNA and CDS fields.

    • Go to the EMBL Fetch box on the EMBL index page to see the plain text output

    • This is the sequence in FASTA format
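
    Once the FASTA text has been saved, it can be loaded for the later steps with a few lines of Python. The following is only a minimal sketch: the function name parse_fasta is hypothetical, and the bases in the toy record are placeholders, not the real HS307871 sequence.

```python
def parse_fasta(text):
    """Return a dict mapping each record identifier to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the identifier is the first whitespace-delimited token.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            records[name] = []
        elif name is not None:
            # Sequence line: store it in upper case, to be joined later.
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


# Toy record (placeholder bases, NOT the real HS307871 sequence).
example = ">HS307871 toy example\nacgtacgt\nttgaca\n"
sequences = parse_fasta(example)
print(len(sequences["HS307871"]))  # prints 14
```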

    B. EXPLORING AB INITIO GENE PREDICTION

    Step 2. Running geneid
    • Go to the geneid server

    • Paste the FASTA sequence

    • Choose the geneid output format

    • Run geneid with different parameters:

      1. Searching for signals: Select acceptors, donors, start and stop codons. Look for them in the real annotation of the sequence

      2. Searching for exons: Select All exons and try to find the real ones

      3. Finding genes: You do not need to select any option (default behaviour). Compare the predicted gene with the real gene

      Figure 1. Signals, exons and genes predicted by geneid in the sequence HS307871
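
    To get a feeling for how many candidate signals a sequence contains, the consensus motifs can be enumerated directly. The sketch below is a simplification, not the geneid method: it only lists every occurrence of ATG, the three stop codons, GT (donor) and AG (acceptor) on the forward strand, whereas real gene finders score each candidate site. The names SIGNALS and find_signals and the toy sequence are assumptions made for illustration.

```python
import re

# Minimal consensus motifs for each signal type (forward strand only).
SIGNALS = {
    "start": "ATG",
    "stop": "TAA|TAG|TGA",
    "donor": "GT",
    "acceptor": "AG",
}


def find_signals(sequence):
    """Return {signal type: [1-based start positions]} for the forward strand."""
    sequence = sequence.upper()
    hits = {}
    for kind, pattern in SIGNALS.items():
        # A lookahead is used so that overlapping occurrences are all reported.
        regex = re.compile(f"(?=({pattern}))")
        hits[kind] = [match.start() + 1 for match in regex.finditer(sequence)]
    return hits


# Toy sequence, not taken from HS307871.
toy = "ATGGTAAGCAGTGA"
for kind, positions in find_signals(toy).items():
    print(kind, positions)
```

    Even in this 14-base toy sequence every signal type occurs at least once, which illustrates why signal search alone produces so many false candidates.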


    Step 3. Running other genefinders

    Given that there are several alternative programs to analyze a DNA sequence, we can run a series of applications and observe the common parts of the predictions.

    1. GENSCAN:
      • Go to the GENSCAN server

      • Paste the DNA sequence

      • Press the Run Genscan button

      • Compare the annotations and predictions

    2. FGENESH:
      • Go to the Softberry homepage

      • In the left frame, select GENE FINDING in Eukaryota

      • Select the program FGENESH

      • Paste the DNA sequence

      • Press the Search button

      • Compare the annotations and predictions

    3. GRAIL:
      • Go to the GrailEXP homepage

      • Activate the Perceval Exon Candidates box

      • Paste the DNA sequence

      • Press the Go! button

      • Check the results

      • Compare the annotations and predicted exons

    NOTE: the first exon is missed in every prediction, and the donor site of exon 5 is hard to detect. Poor detection of start codons is a serious drawback of current gene finding programs (see Figure 2). However, this problem can be overcome by using homology information to complete the gene prediction.

    Figure 2. EMBL annotation and genes predicted by Grail, GENSCAN, geneid and FGENESH in the sequence HS307871
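
    The comparison between annotation and prediction can be made quantitative at the exon level. The sketch below assumes the usual convention in which a predicted exon counts as correct only if both of its boundaries match an annotated exon; sensitivity is the fraction of annotated exons that were predicted, and specificity the fraction of predicted exons that are annotated. The function name exon_accuracy and all coordinates are hypothetical, not the real HS307871 values.

```python
def exon_accuracy(annotated, predicted):
    """Compare two collections of (start, end) exons.

    Returns (sensitivity, specificity, missed_exons, wrong_exons).
    """
    annotated, predicted = set(annotated), set(predicted)
    correct = annotated & predicted  # exons with both boundaries exact
    sensitivity = len(correct) / len(annotated) if annotated else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    missed = sorted(annotated - predicted)  # annotated but not predicted
    wrong = sorted(predicted - annotated)   # predicted but not annotated
    return sensitivity, specificity, missed, wrong


# Hypothetical coordinates: the first annotated exon is missed and one
# spurious exon is predicted downstream.
annotated = [(100, 150), (300, 420), (600, 700)]
predicted = [(300, 420), (600, 700), (900, 950)]

sn, sp, missed, wrong = exon_accuracy(annotated, predicted)
print(f"Sn={sn:.2f} Sp={sp:.2f} missed={missed} wrong={wrong}")
```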


    C. USING EST/cDNA HOMOLOGY INFORMATION

    Step 4. Using GrailEXP
    • Go to the GrailEXP homepage

    • Activate the Galahad EST/mRNA/cDNA Alignments box

    • Select the GrailEXP database (RefSeq/HTDB/dbEST/EGAD/Riken)

    • Activate exon assembly: Gawain Gene Models

    • Paste the DNA sequence

    • Press the Go! button

    • Check the results: predictions and supporting information

    • Compare the annotations, the ab initio GRAIL prediction and the five predicted alternatively spliced variants
    Figure 3. Comparison of the EMBL annotation and the genes predicted ab initio by Grail with five alternative predictions supported by EST information in the sequence HS307871



    Step 5. Using other gene finding programs + alignment of transcripts

    Using blastn, we can search the est_human database for ESTs that support the predictions. Filter the output to select non-overlapping ESTs that together could form a complete cDNA sequence (see Figure 4). Moreover, reject any EST that is not split into two or more pieces when aligned to the genomic sequence, that is, any EST whose alignment does not span at least one pair of splice sites.
    • Go to the FGENESH-C server (in Gene finding with similarity menu)

    • Paste the sequence HS307871

    • Paste the cDNA sequence or EST you have selected

    • Press the search button

    • Note that every predicted gene must be supported by homology information, so the prediction will most likely map only to the genomic region overlapping your EST query.

    Figure 4. Best human ESTs in the alignment mapped on the genomic sequence HS307871
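
    The EST filtering described in this step can be sketched in code. The example assumes each blastn hit has already been reduced to a name and a list of aligned genomic blocks; it keeps only spliced ESTs (two or more blocks) and then greedily picks a set of mutually non-overlapping ones. The function name select_ests, the greedy strategy and all coordinates are assumptions made for illustration, not part of any of the programs used here.

```python
def select_ests(hits):
    """Select spliced, mutually non-overlapping ESTs.

    hits: list of (name, blocks), where blocks is a list of (start, end)
    genomic intervals covered by the EST alignment.
    """
    spliced = []
    for name, blocks in hits:
        if len(blocks) < 2:
            continue  # unspliced EST: spans no pair of splice sites
        start = min(s for s, _ in blocks)
        end = max(e for _, e in blocks)
        spliced.append((end, start, name))

    # Greedy interval selection: always take the EST that ends first.
    spliced.sort()
    chosen, last_end = [], 0
    for end, start, name in spliced:
        if start > last_end:
            chosen.append(name)
            last_end = end
    return chosen


# Hypothetical hits: est2 is unspliced, est3 overlaps est1.
hits = [
    ("est1", [(100, 200), (400, 500)]),
    ("est2", [(450, 600)]),
    ("est3", [(450, 520), (700, 800)]),
    ("est4", [(900, 1000), (1200, 1300)]),
]
print(select_ests(hits))  # prints ['est1', 'est4']
```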

    D. USING PROTEIN HOMOLOGY INFORMATION

    Step 6. Spliced alignment

    Spliced alignment is very useful when we have additional information (a putative homologous protein sequence) about the content of the sequence. Gene prediction is then guided by fitting the protein sequence into the best splice sites predicted in the genomic sequence.
    • Open the NCBI blast server

    • Choose the blastx program (genomic query versus protein database)

    • Paste the genomic sequence and press the Blast! and Format! buttons

    • Select the second protein (the first one seems to be a truncated isoform) and display its FASTA sequence. Obviously, it is the real protein annotated in the genomic sequence.

    • Open the genewise web server to use this protein to predict the best gene structure

    • Paste both the protein and genomic sequence and run the program

    • Compare the predicted gene (end of the file) with the annotations: inspect the splice sites at both ends of each intron to check that the exon boundaries are correct

      Figure 5. Best HSPs of the homologous proteins aligned to the genomic sequence HS307871, obtained using blastx
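
    Checking the exon boundaries of a predicted gene amounts to verifying that every intron implied by two consecutive exons starts with GT and ends with AG. The sketch below does this for forward-strand exons given as 1-based inclusive coordinates; the function name check_introns and the toy sequence are hypothetical, and non-canonical splice sites, which do exist, are simply flagged as False.

```python
def check_introns(sequence, exons):
    """Report whether each intron between consecutive exons is GT...AG.

    sequence: genomic DNA (forward strand).
    exons: list of (start, end) pairs, 1-based and inclusive.
    Returns a list of (intron_start, intron_end, is_canonical).
    """
    sequence = sequence.upper()
    exons = sorted(exons)
    report = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        # 1-based intron left_end+1 .. right_start-1 as a Python slice.
        intron = sequence[left_end:right_start - 1]
        canonical = (
            len(intron) >= 4
            and intron.startswith("GT")
            and intron.endswith("AG")
        )
        report.append((left_end + 1, right_start - 1, canonical))
    return report


# Toy gene: exon 1 = ATGGCC, intron = GTAAGTTTTCAG, exon 2 = GGATAA.
toy = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
print(check_introns(toy, [(1, 6), (19, 24)]))  # correct boundaries
print(check_introns(toy, [(1, 7), (19, 24)]))  # exon 1 one base too long
```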



    Step 7. Spliced alignment using homologous proteins

    From the blastx output, choose several homologous proteins and run genewise again, separately for each one. Observe how the accuracy increases as the homologue used gets closer to the original human protein.
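
    One simple way to rank the homologues by closeness to the human protein is the percentage of identical residues in a pairwise alignment. The sketch below assumes the two sequences are already aligned (equal length, gaps written as "-") and ignores gapped columns; the function name percent_identity and the toy peptides are hypothetical and unrelated to the real proteins.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither sequence has a gap.
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Toy aligned peptides (hypothetical): 5 identical residues in 6 columns.
print(round(percent_identity("MSEL-KA", "MTELQKA"), 1))  # prints 83.3
```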

    Step 8. Spliced alignment using a homologous cDNA

    As with protein homology information, we can also use a homologous cDNA to do a spliced alignment.
    • Go to the Spidey web server.
    • Select the homologous mouse cDNA.
    • Paste both the human genomic and the mouse cDNA sequences in the appropriate fields.
    • Press the Align button
    • Check the results

    Step 9. Using protein homology information: GenomeScan

    Protein homology information can also be used to reinforce ab initio predicted exons that are supported by blastx HSPs, as GenomeScan and geneid do, thereby improving the final prediction. To run GenomeScan:
    • Go to the GenomeScan web server

    • Retrieve the protein from the previous blast search

    • Paste both genomic and protein sequences

    • Press the button GenomeScan

    • Check the results. The first exon is still not detected, even with homology information. This is probably because blast programs require matches of a minimal word length, which a very short coding segment may not provide.

    Figure 6. GenomeScan output: first exon is not correctly predicted probably due to blast word length restrictions
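
    The effect of a minimal word length can be illustrated with simple arithmetic: a coding segment that encodes n complete residues contains at most n - w + 1 words of length w, so a very short exon offers few or no seeds for a protein-level search. The sketch below is only an illustration; the function name seed_words is hypothetical, and the word size of 3 is an assumed value, not necessarily the setting used by the GenomeScan server.

```python
def seed_words(exon_length_nt, word_size=3):
    """Upper bound on the number of protein words a coding segment can seed.

    exon_length_nt: length of the coding segment in nucleotides.
    word_size: minimal word length in residues (assumed value).
    """
    residues = exon_length_nt // 3  # complete codons only
    return max(0, residues - word_size + 1)


# Hypothetical exon lengths, not the real HS307871 exons.
for length in (6, 9, 30, 150):
    print(length, "nt ->", seed_words(length), "possible seed words")
```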


    E. USING A GENOME ANNOTATION BROWSER

    Step 10. Golden path archive:
    • Open the UCSC Genome Bioinformatics Site

    • Select the blat link to locate the genomic coordinates of our sequence

    • Paste the DNA sequence in FASTA format (HS307871)

    • Submit the query

    • Click on the first hit: (browser link)

    • Compare the graphical annotation with the EMBL entry of the gene

    • Analyze these different sets of output options:
      Genes and Gene Prediction Tracks,
      mRNA and EST Tracks

    Figure 7. (a) UCSC genome browser representation of the region containing the gene uroporphyrinogen decarboxylase (URO-D). (b) UCSC genome browser representation of the genomic context (100 kb) around the gene uroporphyrinogen decarboxylase (URO-D).

    F. RESULTS

    Here you can find the solutions for every exercise:

    EMBL annotation
    EMBL annotation (plain text)
    FASTA sequence
    geneid results: signals
    geneid results: exons
    geneid results: genes
    GENSCAN results
    FGENESH results
    GRAIL results
    GrailEXP results
    Blastn + human ESTs results
    Blastx + protein results
    Genewise (human protein)
    Genewise (ovis protein)
    Genewise (mouse protein)
    Genewise (rat protein)
    Genewise (Danio rerio protein)
    Genewise (Drosophila melanogaster protein)
    Genewise (Drosophila virilis protein)
    Genewise (yeast protein)
    Genewise (fission yeast protein)
    GenomeScan results


    G. BIBLIOGRAPHY
    1. J.F. Abril and R. Guigó. gff2ps: visualizing genomic annotations. Bioinformatics 16:743-744 (2000).

    2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 215:403-410 (1990).

    3. Burge, C. and Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94 (1997).

    4. E. Blanco, G. Parra and R. Guigó. Using geneid to Identify Genes. In A. D. Baxevanis and D. B. Davison, chief editors: Current Protocols in Bioinformatics. Volume 1, Unit 4.3. John Wiley & Sons Inc., New York. ISBN: 0-471-25093-7 (2002).

    5. G. Parra, E. Blanco, and R. Guigó. Geneid in Drosophila. Genome Research 10:511-515 (2000).

    6. Asaf A. Salamov and Victor V. Solovyev. Ab initio Gene Finding in Drosophila Genomic DNA. Genome Res. 10: 516-522 (2000).

    7. Yeh, R.-F., Lim, L. P. and Burge, C. B. Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816 (2001).

    8. D. Hyatt, J. Snoddy, D. Schmoyer, G. Chen, K. Fischer, M. Parang, I. Vokler, S. Petrov, P. Locascio, V. Olman, Miriam Land, M. Shah, and E. Uberbacher. Improved Analysis and Annotation Tools for Whole-Genome Computational Annotation and Analysis: GRAIL-EXP Genome Analysis Toolkit and Related Analysis Tools. Genome Sequencing & Biology Meeting (2000).

    9. Ewan Birney and Richard Durbin. Using GeneWise in the Drosophila Annotation Experiment. Genome Res. 10: 547-548 (2000).