Genefinding: annotation of genomic DNA sequences

In this exercise, we will download and install a gene-finding program on our computer and use it to annotate a genomic DNA sequence. We will compare the results with those of other gene prediction programs and with the existing annotation of the sequence.

Part 1: installing and running geneid


Follow these steps:

  1. Go to the GBL website.

  2. Select the Software link

  3. Select geneid and then, geneid homepage

  4. Have a look around the page to answer these questions:
    • If you have problems with the program, what can you do?
    • If you would like to see an example of geneid output before testing it, where should you go?
    • To download the program, what do you have to do?

  5. Let us now download the program:
    • Get the geneid full distribution (save the file geneid_v1.3.8.Oct_08_2007.tar.gz)

    • The file is compressed, try in your terminal:
      tar -zxvf geneid_v1.3.8.Oct_08_2007.tar.gz

    • Type cd geneid/include

      Since geneid's parameters are set for whole genomes, we have to change them to make it work on our computer (otherwise the memory requirements would be too large). Open the file geneid.h with a text editor (e.g. vi or emacs) and change the following parameters:

    • #define MAXEVIDENCES 1000
    • #define MAXNSEQUENCES 10

    • Now type cd .. and then make

    • Type:   bin/geneid -h

    • Take a look at the list of options
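
The header edit in step 5 can also be scripted. The following is a minimal sketch, not part of the geneid distribution: set_define is a hypothetical helper that rewrites the value of one #define line with sed and keeps a backup copy of the header. It assumes each constant is defined on a single line of the form "#define NAME VALUE"; anything after the name on that line (including a trailing comment) is replaced.

```shell
# Hypothetical helper (not shipped with geneid): set the value of a
# single-line "#define NAME VALUE" entry in a C header.
# Usage: set_define FILE NAME VALUE
set_define() {
    file=$1 name=$2 value=$3
    # -i.bak edits the file in place and keeps the original as FILE.bak
    sed -i.bak -E "s/^(#define[[:space:]]+${name})[[:space:]].*/\1 ${value}/" "$file"
}

# Example, run from geneid/include:
#   set_define geneid.h MAXEVIDENCES 1000
#   set_define geneid.h MAXNSEQUENCES 10
```

Comparing geneid.h with geneid.h.bak (for instance with diff) shows exactly which lines were changed before you run make.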

  6. Go to the EMBL database

    • Click on Access and SRS
    • Type the accession number for this sequence: U30787

    • Press the Search button
    • Click on the EmblEntry link

    • Have a look at the different entry fields: find the mRNA and CDS fields.

    • Click on Text Entry link to see the plain text output
    • Save the sequence in your directory, either from the EMBL database or from this link: HS307871.fa

  7. Now, we will run geneid with different parameters
    • The default command line for geneid is the following:

      bin/geneid -P param/human3iso.param HS307871.fa

    • Look for the options that make geneid predict the signals: donor and acceptor sites, start and stop codons.

    • Predict all signals individually and compare the output with the real annotation of the sequence.

    • Use the Unix pipe (|) and grep to search for specific coordinates of the annotation among the many predictions.

    • Now, predict all exons and search for the real ones in the predictions. Finally, predict the gene (default parameters) and compare it with the annotation.

    Figure 1. Signals, exons and genes predicted by geneid in the sequence HS307871
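
A plain grep for a coordinate such as 1107 also matches longer numbers (11070) and any other field containing those digits. As a sketch of a stricter search, the hypothetical helper find_coord below keeps only non-comment lines whose second or third whitespace-separated field equals the coordinate exactly. It assumes geneid's plain-text output with the start and end positions in fields 2 and 3, as in the sample lines in the Solutions section; check this against your own output.

```shell
# Hypothetical helper: print prediction lines whose start (field 2) or
# end (field 3) equals the given coordinate exactly. Comment lines
# (starting with "#") are skipped.
# Usage: geneid ... | find_coord COORD
find_coord() {
    awk -v c="$1" '!/^#/ && ($2 == c || $3 == c)'
}

# Example (start codons only, as in the Solutions section):
#   bin/geneid -obP param/human3iso.param HS307871.fa | find_coord 998
```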

  8. Reannotation from experimental results:
    In the first exercise, you saw that the first exon of the gene (1107..1126) is not predicted accurately.
    • Can you verify whether geneid is actually building this exon or not by running geneid to predict exons? (hint: look for the option to do this)

    • The solution for the previous step was:
      bin/geneid -xGP param/human3iso.param HS307871.fa | grep 1107

    • Let us imagine that this exon has been verified experimentally; we will use it as evidence to improve the prediction. Take a look at this exon.

    • Reannotation process. Type the following command and analyze the result:
      bin/geneid -P param/human3iso.param -R exon.gff HS307871.fa

  9. Advantages of working in a Unix-like system
    Using geneid on the command line together with some Unix programs (grep, wc, gawk, sort, ...), we can easily parse geneid output to answer the following questions:

    • A - How many putative exons does geneid predict on this sequence?

    • B - And how many putative acceptor sites?

    • C - Which is the start codon with the highest score between coordinates 500 and 1500?

    • D - How many putative exons predicted by geneid contain the GGGGG motif at the amino acid level?

    • E - Which is the longest single-exon gene predicted by geneid? And the shortest?


## Solutions:

  • A
    bin/geneid -xoP param/human3iso.param HS307871.fa | grep -v "#" | wc -l

    3616

  • B
    bin/geneid -oaP param/human3iso.param HS307871.fa | grep -v "#" | wc -l
    368

  • C
    bin/geneid -obP param/human3iso.param HS307871.fa | gawk '{if ($2>500 && $3<1500) print}' | sort -rnk 4

    Start 998 1000 3.87 - CCAAGAGCGTCGCCATGTTG

  • D
    bin/geneid -oxP param/human3iso.param HS307871.fa | grep GGGGG | wc -l

    18

  • E
    bin/geneid -osP param/human3iso.param HS307871.fa | sort -rnk 12

    Single 2104 2163 -6.26 - 0 0 0.00 -1.75 -0.51 0.00 20
    Single 827 1174 -7.24 + 0 0 -1.27 0.00 -3.70 0.00 116

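The solutions for C and E print every matching line and leave the final pick to the reader. As a sketch, the two hypothetical helpers below reduce the output to the answer itself: best_start prints the highest-scoring line within an inclusive coordinate window, and length_extremes prints the longest and the shortest prediction. Both assume the column layout of the sample lines above (start, end and score in fields 2 to 4, length in field 12 of a Single line); verify this against your own geneid output before relying on them.

```shell
# Hypothetical helper: highest-scoring line with start >= LO and end <= HI.
# Usage: geneid ... | best_start LO HI
best_start() {
    awk -v lo="$1" -v hi="$2" '!/^#/ && $2 >= lo && $3 <= hi' |
        sort -k4,4 -rn | head -n 1
}

# Hypothetical helper: longest and shortest prediction, judged by field 12.
# Usage: geneid ... | length_extremes
length_extremes() {
    awk '!/^#/ && NF >= 12 {
             len = $12 + 0                      # force a numeric comparison
             if (n == 0 || len > max) { max = len; longest = $0 }
             if (n == 0 || len < min) { min = len; shortest = $0 }
             n++
         }
         END { if (n) { print "longest:  " longest; print "shortest: " shortest } }'
}

# Examples:
#   bin/geneid -obP param/human3iso.param HS307871.fa | best_start 500 1500
#   bin/geneid -osP param/human3iso.param HS307871.fa | length_extremes
```

Skipping lines that start with "#" inside awk replaces the separate grep -v "#" step used in solutions A and B.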



Part 2: other ab initio gene prediction programs

Given that there are several alternative programs to analyze a DNA sequence, we can run every application and observe the common parts of the predictions.

  1. GENSCAN:
    • Go to the GENSCAN server

    • Paste the DNA sequence

    • Press the Run Genscan button

    • Compare the annotations and predictions

  2. FGENESH:
    • Go to the Softberry homepage

    • In the left frame, select GENE FINDING in Eukaryota

    • Select the program FGENESH

    • Paste the DNA sequence

    • Press the Search button

    • Compare the annotations and predictions

  3. GRAIL:
    • Go to the GrailEXP homepage

    • Activate the Perceval Exon Candidates box

    • Paste the DNA sequence

    • Press the Go! button

    • Check the results

    • Compare the annotations and predicted exons
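
To find the common parts of the predictions, one option is to reduce each program's output to a list of exon coordinates and intersect the lists. The sketch below assumes you have already saved one "START END" pair per line for each program; the file names in the example are placeholders, and the extraction step depends on each program's output format and is not shown. common_exons is a hypothetical helper built on the standard sort and comm utilities.

```shell
# Hypothetical helper: print the "START END" pairs present in both files.
# Each input file holds one exon per line as "START END".
# Usage: common_exons FILE_A FILE_B
common_exons() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -12 "$a" "$b"      # keep only the lines common to both inputs
    rm -f "$a" "$b"
}

# Example with placeholder file names:
#   common_exons geneid_exons.txt genscan_exons.txt
```

Only exact matches are reported: two exons whose boundaries differ by a single base count as different, so near misses still have to be compared by eye against the annotation.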