Genefinding: annotation of genomic DNA sequences
In this exercise, we will download and install gene-finding software on our computer and use it to annotate a genomic DNA sequence. We will compare the results with those of other gene prediction programs and with the existing annotation of the sequence.
Part 1: installing and running geneid
Try the next steps:
- Go to the GBL web.
- Select the Software link
- Select geneid and then, geneid homepage
- Have a look around the page to answer these questions:
- If you have problems with the program, what can you do?
- Imagine you would like to see an example of geneid output before testing it, you should go to...
- You are going to download the program, what do you have to do?
- Let us now download the program:
- Get the geneid v 1.2. full distribution (save the file geneid_v1.3.8.Oct_08_2007.tar.gz)
- The file is compressed, try in your terminal:
tar -zxvf geneid_v1.3.8.Oct_08_2007.tar.gz
- Type cd geneid/include
Since geneid's parameters are set for whole genomes, we have to change them so that the program runs on our computer (otherwise the memory requirements would be too large). Open the file geneid.h with a text editor (e.g. vi or emacs) and change the following parameters:
- #define MAXEVIDENCES 1000
- #define MAXNSEQUENCES 10
- Now type cd .. and make
- Type: bin/geneid -h
- Take a look at the list of options
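The two edits to geneid.h described above can also be scripted instead of made by hand in an editor. The following is a minimal sketch, assuming each parameter sits on its own line of the form `#define NAME <number>`; the helper name `set_define` and the starting values in the demo file are placeholders, not taken from the geneid distribution.

```shell
#!/bin/sh
# Sketch: set a numeric #define in a C header without opening an editor.
# set_define FILE NAME VALUE rewrites the line "#define NAME <number>".
# set_define is a hypothetical helper, not part of geneid.
set_define() {
    file=$1 name=$2 value=$3
    tmp="$file.tmp"
    sed -E "s/^(#define[[:space:]]+$name)[[:space:]]+[0-9]+/\1 $value/" "$file" > "$tmp" &&
        mv "$tmp" "$file"
}

# Demo on a throwaway header. The starting values below are placeholders,
# not the values shipped in the real geneid.h.
demo=$(mktemp)
printf '#define MAXEVIDENCES 50000\n#define MAXNSEQUENCES 200\n' > "$demo"
set_define "$demo" MAXEVIDENCES 1000
set_define "$demo" MAXNSEQUENCES 10
cat "$demo"
rm -f "$demo"
```

From inside geneid/include, the equivalent calls would be `set_define geneid.h MAXEVIDENCES 1000` and `set_define geneid.h MAXNSEQUENCES 10`. It is worth checking the file afterwards with `grep MAX geneid.h`, since the sketch silently leaves the file unchanged if the line has a different layout.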
- Go to the EMBL database
- Click on Access and SRS
- Type the accession number for this sequence U30787
- Press the Search button
- Click on the EmblEntry link
- Have a look at the different entry fields: find the mRNA and CDS fields.
- Click on Text Entry link to see the plain text output
- Save the sequence in your directory, either from the EMBL data base or from this link: HS307871.fa
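Before running geneid it can be useful to confirm that the saved file really is a FASTA file with a single sequence in it. The sketch below counts header lines and residues with awk; the function name `fasta_summary` and the tiny demo sequence are made up for illustration.

```shell
#!/bin/sh
# Sketch: print "<number of records> <total residues>" for a FASTA file.
# fasta_summary is a hypothetical helper name.
fasta_summary() {
    awk '
        /^>/ { records++; next }     # header line: one more record
        { gsub(/[ \t\r]/, ""); residues += length($0) }
        END { printf "%d %d\n", records, residues }
    ' "$1"
}

# Demo on a made-up two-line sequence (not real data).
demo=$(mktemp)
printf '>demo\nACGTAC\nGT\n' > "$demo"
fasta_summary "$demo"    # prints: 1 8
rm -f "$demo"
```

Running `fasta_summary HS307871.fa` should report one record. A count of zero records usually means the file was saved in EMBL flat-file format rather than FASTA.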
- Now, we will run geneid with different parameters
- The default command line for geneid is the following:
bin/geneid -P param/human3iso.param HS307871.fa
- Look for the parameters to make geneid predict the signals: donor and acceptor sites, start and stop codons.
- Predict all signals individually and compare the output with the real annotation of the sequence.
- Use the Unix pipe (|) and grep to search for specific annotated coordinates among the many predictions.
- Now, predict all exons and search for the real ones in the predictions. Finally, predict the gene (default parameters) and compare it with the annotation.
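One caveat when searching for coordinates with plain grep is that it matches substrings, so a search for 100 also hits 1000. The sketch below filters on exact coordinate values instead. It assumes the coordinates are in the second and third columns, as in geneid's default output; with GFF output (-G) they are in columns 4 and 5. The helper name `at_coord` is hypothetical, and the sample lines are copied from output shown elsewhere in this practical.

```shell
#!/bin/sh
# Sketch: keep only the lines whose 2nd or 3rd field equals the coordinate.
# at_coord is a hypothetical helper; it reads predictions from stdin.
at_coord() {
    awk -v c="$1" '$2 == c || $3 == c'
}

sample='Start 998 1000 3.87 - CCAAGAGCGTCGCCATGTTG
Single 2104 2163 -6.26 - 0 0 0.00 -1.75 -0.51 0.00 20
Single 827 1174 -7.24 + 0 0 -1.27 0.00 -3.70 0.00 116'

printf '%s\n' "$sample" | grep -c 100      # prints 1: "1000" contains "100"
printf '%s\n' "$sample" | at_coord 100     # prints nothing
printf '%s\n' "$sample" | at_coord 1174    # prints the line ending in 116
```

With geneid itself, the filter goes at the end of the pipe in place of grep. `grep -w 1107` is a lighter alternative that matches whole words only, although it still looks in every column.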
Figure 1. Signals, exons and genes predicted by geneid in the sequence HS307871
- Reannotation from experimental results:
In the first part of this practical, you observed that the first exon of the gene (1107..1126) is not predicted accurately.
- Can you verify whether geneid is actually building this exon or not by running geneid to predict exons? (hint: look for the option to do this)
- The solution for the previous step was:
bin/geneid -xGP param/human3iso.param HS307871.fa | grep 1107
- Let's imagine this exon has been experimentally verified; we will use it to try to improve the prediction. Take a look at this exon.
- Reannotation process. Type and analyze the current solution:
bin/geneid -P param/human3iso.param -R exon.gff HS307871.fa
- Advantages of working in a Unix-like system
Using geneid in command line together with some Unix programs (grep, wc, gawk, sort, ...), we can easily parse geneid output to answer the following questions:
- A - How many putative exons does geneid predict on this sequence?
- B - And how many putative acceptor sites?
- C - Which is the start codon with the highest score between the coordinates 500 and 1500?
- D - How many putative exons predicted by geneid contain the GGGGG motif at the amino acid level?
- E - Which is the longest single-exon gene predicted by geneid? And the shortest?
## Solutions:
- A
bin/geneid -xoP param/human3iso.param HS307871.fa | grep -v "#" | wc
3616
- B
bin/geneid -oaP param/human3iso.param HS307871.fa | grep -v "#" | wc
368
- C
bin/geneid -obP param/human3iso.param HS307871.fa | gawk '{if ($2>500 && $3<1500) print}' | sort -rnk 4
Start 998 1000 3.87 - CCAAGAGCGTCGCCATGTTG
- D
bin/geneid -oxP param/human3iso.param HS307871.fa | grep GGGGG | wc
18
- E
bin/geneid -osP param/human3iso.param HS307871.fa | sort -rnk 12
Single 2104 2163 -6.26 - 0 0 0.00 -1.75 -0.51 0.00 20
Single 827 1174 -7.24 + 0 0 -1.27 0.00 -3.70 0.00 116
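The counting and ranking steps in these solutions can be wrapped in small reusable filters. This is a sketch under a few assumptions: comment lines start with `#`, and the value to rank by is a whole numeric column (column 12, the last one, in the single-exon lines above). The helper names are hypothetical, and the sample records are the two lines shown in solution E.

```shell
#!/bin/sh
# Sketch: reusable filters for counting and ranking prediction lines.
# All three helper names are hypothetical; each reads from stdin.

# Number of non-comment lines (wc -l prints only the line count).
count_records() { grep -v '^#' | wc -l | tr -d ' '; }

# Line with the largest / smallest numeric value in field K.
top_by_field()    { grep -v '^#' | sort -n -r -k "$1,$1" | head -n 1; }
bottom_by_field() { grep -v '^#' | sort -n -k "$1,$1" | head -n 1; }

sample='# comment line
Single 2104 2163 -6.26 - 0 0 0.00 -1.75 -0.51 0.00 20
Single 827 1174 -7.24 + 0 0 -1.27 0.00 -3.70 0.00 116'

printf '%s\n' "$sample" | count_records         # prints: 2
printf '%s\n' "$sample" | top_by_field 12       # prints the line ending in 116
printf '%s\n' "$sample" | bottom_by_field 12    # prints the line ending in 20
```

For example, `bin/geneid -osP param/human3iso.param HS307871.fa | top_by_field 12` would return only the single-exon gene with the largest value in the last column, rather than the full sorted list.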
Part 2: other ab initio gene prediction programs
Given that there are several alternative programs to analyze a DNA sequence, we can run every application and observe the common parts of the predictions.
- GENSCAN:
- Go to the GENSCAN server
- Paste the DNA sequence
- Press the Run Genscan button
- Compare the annotations and predictions
- FGENESH:
- Go to the Softberry homepage
- In the left frame, select GENE FINDING in Eukaryota
- Select the program FGENESH
- Paste the DNA sequence
- Press the Search button
- Compare the annotations and predictions
- GRAIL:
- Go to the GrailEXP homepage
- Activate the Perceval Exon Candidates box
- Paste the DNA sequence
- Press the Go! button
- Check the results
- Compare the annotations and predicted exons
- NOTE: the first exon is missed in all the predictions, and the programs have difficulty detecting the donor site of exon 5. Poor detection of start codons is a serious drawback of current gene-finding programs (see Figure 2). However, this problem can be overcome by using homology information to complete the gene prediction.
Figure 2. EMBL annotation and genes predicted by Grail, GENSCAN, geneid and FGENESH in the sequence HS307871
- Because of time constraints, it is impossible to practise with all the available gene-finding programs in this practical. However, we would like to point out the option of using protein or nucleic acid sequences (e.g. ESTs) to improve predictions. Protein or nucleic acid sequences can be aligned to genomic sequences using exonerate. Exonerate produces a so-called spliced alignment, allowing us to identify the exons and introns in the genomic DNA sequence.
- Another (fast) way to align proteins or nucleic acids to genomic DNA sequences is to use blat at the UCSC genome browser. The results are displayed in the genome browser, together with the information it holds for that genomic location (known genes, ESTs, etc.). As a simple exercise, paste the protein predicted by geneid into the blat form. Check the results.
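The exonerate spliced alignment mentioned above could be run from the command line roughly as follows. This is a sketch, assuming exonerate is installed locally; the wrapper name `spliced_align` and the query file name in the usage note are hypothetical. `protein2genome` and `est2genome` are exonerate's models for aligning a protein or an EST/cDNA to genomic DNA.

```shell
#!/bin/sh
# Sketch: thin wrapper around exonerate for spliced alignments.
# spliced_align is a hypothetical helper name.
# Usage: spliced_align protein|est QUERY.fa GENOME.fa
spliced_align() {
    case $1 in
        protein) model=protein2genome ;;   # protein query against genomic DNA
        est)     model=est2genome ;;       # EST/cDNA query against genomic DNA
        *)
            echo "usage: spliced_align protein|est QUERY.fa GENOME.fa" >&2
            return 2
            ;;
    esac
    if ! command -v exonerate >/dev/null 2>&1; then
        echo "exonerate is not installed or not on PATH" >&2
        return 127
    fi
    # --showtargetgff adds GFF lines for the aligned exons and introns.
    exonerate --model "$model" --showtargetgff yes --query "$2" --target "$3"
}
```

For instance, after saving the protein predicted by geneid to a file (here called predicted_protein.fa, a made-up name), `spliced_align protein predicted_protein.fa HS307871.fa` would align it back to the genomic sequence, and the resulting exon coordinates can be compared with the annotation.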
Josep F. Abril, Enrique Blanco, Sergi Castellano, Genis Parra and Roderic Guigó © 2002