Genefinding: running a program on your computer


Follow these steps:

  1. Connect to the GBL website.

  2. Select Software

  3. Select geneid, and then geneid homepage

  4. Take a look around the page to answer these questions:
    • If you have problems with the program, what can you do?
    • If you would like to see an example of geneid output before testing the program, where should you go?
    • What do you have to do to download the program?

  5. Let's download the program:
    • Get the geneid v 1.1. Full distribution (save the file geneid_v1.1.Feb_26_2003.tar.gz)

    • The file is compressed. To unpack it, type on your terminal:
      tar -zxvf geneid_v1.1.Feb_26_2003.tar.gz

    • Type cd geneid, and then make

    • Type:   bin/geneid -h

    • Take a look at the list of options

    • Save the sequence in your working directory: HS307871.fa

    • Run the gene prediction:
      bin/geneid -P param/human3iso.param HS307871.fa

    • Add the -v option and work out what it does

    • Compare the prediction to the annotated gene

  6. Reannotation from experimental results:
    In the first practical, you observed that the first exon of the gene (1107..1126) was predicted inaccurately.
    • Can you verify whether geneid actually builds this exon by running it to predict exons only? (Hint: look for the option that does this.)

    • The solution to the previous step is:
      bin/geneid -xGP param/human3iso.param HS307871.fa | grep 1107
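
Because grep 1107 matches that number anywhere on a line (a score or a longer coordinate containing 1107 would also match), a stricter check can compare the start and end columns directly. The sketch below assumes the standard GFF column order requested by -G (start in column 4, end in column 5); find_exon is a hypothetical helper name, not part of geneid.

```shell
# Hypothetical helper: print the GFF lines on stdin whose start and end
# coordinates match exactly. Assumes standard GFF columns (4 = start, 5 = end).
find_exon() {
    awk -v s="$1" -v e="$2" '!/^#/ && $4 == s && $5 == e'
}

# Intended use with the command above:
#   bin/geneid -xGP param/human3iso.param HS307871.fa | find_exon 1107 1126
```

If nothing is printed, no predicted exon has exactly those coordinates.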

    • Now suppose this exon has been verified experimentally, and try to rebuild the prediction using it. Take a look at this exon.

    • Reannotation: run the following command and analyze the resulting prediction:
      bin/geneid -P param/human3iso.param -R exon.gff HS307871.fa
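
One way to analyze the effect of -R is to save the prediction with and without the annotation and list only the lines that differ. This is a minimal sketch using diff; show_changes and the file names before.txt and after.txt are placeholders chosen for illustration.

```shell
# Hypothetical helper: print only the lines that differ between two saved
# geneid outputs. "<" marks lines found only in the first file,
# ">" marks lines found only in the second.
show_changes() {
    diff "$1" "$2" | grep '^[<>]'
    return 0
}

# Intended use (file names are placeholders):
#   bin/geneid -P param/human3iso.param HS307871.fa > before.txt
#   bin/geneid -P param/human3iso.param -R exon.gff HS307871.fa > after.txt
#   show_changes before.txt after.txt
```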

  7. Advantages of working in a Unix-like system
    By using geneid on the command line together with standard Unix programs (grep, wc, gawk, sort, ...), we can easily parse geneid output to answer the following questions:

    • A - How many putative exons does geneid predict on this sequence?

    • B - And how many putative acceptor sites?

    • C - What is the start codon with the highest score between coordinates 500 and 1500?

    • D - How many putative exons predicted by geneid contain the GGGGG motif at the amino acid level?

    • E - What is the longest single-exon gene predicted by geneid? And the shortest?


## Solutions:

  • A
    bin/geneid -xoP param/human3iso.param HS307871.fa | grep -v "#" | wc -l

    3612

  • B
    bin/geneid -oaP param/human3iso.param HS307871.fa | grep -v "#" | wc -l
    368
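
Solutions A and B share one pattern: drop the comment lines, then count what is left. A minimal sketch of that pattern as a reusable function follows; count_records is a hypothetical name, and the tr call only strips the padding that some wc implementations add.

```shell
# Hypothetical helper: count the lines on stdin that do not contain "#".
count_records() {
    grep -v "#" | wc -l | tr -d ' '
}

# Intended use:
#   bin/geneid -xoP param/human3iso.param HS307871.fa | count_records
#   bin/geneid -oaP param/human3iso.param HS307871.fa | count_records
```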

  • C
    bin/geneid -obP param/human3iso.param HS307871.fa | gawk '{if ($2>500 && $3<1500) print}' | sort -k4,4n

    Start 998 1000 3.87 - CCAAGAGCGTCGCCATGTTG
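
As an alternative to sorting and reading off the last line, awk can keep track of the best-scoring site itself. The sketch below assumes the site layout shown above (type, start, end, score, strand, sequence) and uses plain awk rather than gawk; best_site_in_range is a hypothetical helper name.

```shell
# Hypothetical helper: print the highest-scoring site lying strictly
# between two coordinates.
# Assumes columns: type, start, end, score, strand, sequence.
best_site_in_range() {
    awk -v lo="$1" -v hi="$2" '
        /^#/ { next }                          # skip comment lines
        ($2 + 0) > (lo + 0) && ($3 + 0) < (hi + 0) {
            score = $4 + 0
            if (!found || score > max) { max = score; best = $0; found = 1 }
        }
        END { if (found) print best }
    '
}

# Intended use:
#   bin/geneid -obP param/human3iso.param HS307871.fa | best_site_in_range 500 1500
```

On ties, the first site encountered is kept.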

  • D
    bin/geneid -oxP param/human3iso.param HS307871.fa | grep GGGGG | wc -l

    18

  • E
    bin/geneid -osP param/human3iso.param HS307871.fa | sort -k12,12n

    Single 2104 2163 -6.26 - 0 0 0.00 -1.75 -0.51 0.00 20
    Single 827 1174 -7.24 + 0 0 -1.27 0.00 -3.70 0.00 116
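
In the two lines above, the twelfth column equals the exon length in nucleotides divided by three (2163 - 2104 + 1 = 60 gives 20, and 1174 - 827 + 1 = 348 gives 116). A minimal sketch that reports both extremes in one pass, selecting that column with sort's -k option, could look like this; length_extremes is a hypothetical helper name.

```shell
# Hypothetical helper: sort lines numerically on column 12 and print the
# first (shortest) and last (longest) entries.
length_extremes() {
    grep -v "#" | sort -k12,12n | awk '
        NR == 1 { first = $0 }
        { last = $0 }
        END { if (NR) { print "shortest: " first; print "longest: " last } }
    '
}

# Intended use:
#   bin/geneid -osP param/human3iso.param HS307871.fa | length_extremes
```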




Josep F. Abril, Enrique Blanco, Sergi Castellano, Genis Parra and Roderic Guigó © 2002