Genome BioInformatics Research Lab |
Help | News | People | Research Software Publications | Links |
Resources & Datasets | Gene Predictions | Seminars & Courses |
|
geneid Training Tutorial |
||||||
IMPORTANT: The user interested in training geneid should have access to a unix/linux working environment and have at least basic knowledge of linux command line instructions, BASH shell scripting and the UNIX-based computer languages gawk and perl . Furthermore, some programs are C-based and require compilation. Instructions on how to compile should be contained within the program packages. Finally it is likely that the user will have to make gawk and perl scripts executable after downloading them. This can be done by running the following linux command: chmod 755 filename.awk[pl]. The user should also check that the different "shebangs" for gawk, perl and bash indicate the proper path of the different executables and are at the top of every script. For example "#!/usr/bin/perl" is known as the perl shebang. It is put at the top of a perl script, and it tells the web server (like Apache) where to find the perl executable. .
NOTE: For Perkinsus marinus we used 765 sequences as part of the training set and 250 gene models to test the newly developed parameter file. 1)In the first step of training process the CDS and intronic regions (in fasta format) of the each of the annotated gene models are extracted from the fasta files given their annotation coordinates (gff file). The CDS and intonic fastas can be obtained by using the in-house program SSgff (tar-gzipped version). The following wrapper BASH script gives an example of how the CDS and intron sequences can be extracted from the genomic (training) fastas above: #!/bin/bash mkdir -p ./intron mkdir -p ./cds mkdir -p ./fastas cd ./fastas/ #obtain multi -fastas TblToFastaFile pmarinus4training.tbl while read genomic_id gene_id do echo $genomic_id echo $gene_id egrep -w "$gene_id\$" pmarinus4training.gff > /tmp/$$gff SSgff -cE ../fastas/$genomic_id /tmp/$$gff \ | sed -e 's/:/_/' -e 's/ CDS//' \ > ../cds/$gene_idNOTE: "locus_id_training" contains the list of all sequences and genes (fields $1 and $9 of the gff file). In the example above the genomic sequences are placed in a directory "fastas" and we the output CDS and intron fastas are output into directories "CDS" and "Intron". A series of 765 single or multi-fasta files will be created in each of these 2 directories which should be concatenated into single multi-fasta files. It will also be necessary to convert these files to tabular format which can be done using the following awk script: FastaToTbl. Finally the CDS sequences should be converted to proteins and the user should look for any sequences with in-frame stops and remove the from the training/evaluation set. Nucleotide CDS sequences can be converted to protein sequences using the Translate script (To run Translate the user will also need genetic.code). The user will also use TblToFastaFile to convert the multi-fasta file into separate fasta files. . The following wrapper BASH script lists the steps described above: #!/bin/bash # Convert fasta files to tabular fasta FastaToTbl all.cds.fa > all.cds.tbl FastaToTbl all.intron.fa > all.intron.tbl # Translate the CDS mkdir./protein ; cd ./protein; FastaToTbl ./cds/* | Translate | TblToFastaFile # Checking for stop codons in the coding region FastaToTbl./protein/* | grep "[A-Z]\*[A-Z]" |\ gawk '{print $1}' | uniqIn a second step start and stop codons and all splice sites of each of the annotated genes given the fasta files and gff coordinates again using SSgff (tar-gzipped version). The following wrapper BASH script gives an example of how the splice sites and start/stop codons can be extracted from the genomic (training) fastas above: #!/bin/bash mkdir -p ./sites while read genomic_id gene_id do echo $genomic_id echo $gene_id egrep -w $gene_id pmarinus4training.gff > /tmp/$$gff SSgff -dabeE ../fastas/$genomic_id /tmp/$$gff \ > /tmp/$$all_sites for site in Acceptor Donor Stop Start do echo $site grep -A 1 $site /tmp/$$all_sites | sed '/--/d' \ >> ../sites/${site}_sites done done < locus_id_trainingNOTE: "locus_id_training" contains the list of all sequences and genes (fields $1 and $9 of the gff file). In the example above the genomic sequences are placed in a directory "fastas" and we the output CDS and intron fastas are output into directories "sites". Four multi-fasta files will be created for each type of site (i.e. "Donor_sites"). These files should be converted to tabular format as described above using FastaToTbl. It is also suggested that the user filter out those splice sites and start codons that are not canonical from the training set. This can be done by employing the following BASH script (example for Donor sites): #!/bin/bash #####Donors gawk '{print $2}' ./Donor_sites.tbl |\ egrep -v '^[atcgNATCGn]{31}GT' > seqs_Don #11 out of 4277 non-standard donors egrep -f seqs_Don Donor_sites.tbl | gawk '{print $1}' - \ | sort | uniq > seqs_wrong_Don egrep -vf seqs_wrong_Don Donor_sites.tbl > Donor_sites_filter1.tbl #4266 canonical donor sitesIn the following step of the training process the user will derive frequency matrices for the donors, acceptors and start codons. However to do this the user will first have to parse the sequences (normally the sequences in which the gene models are embedded) for background dinucleotides ("GT","AG" or "ATG") flanked by approximately 30 nucleotides of sequence on each side. The awk program "FindRegexp" and the BASH script "BoundarySites.sh" can be used to obtain these background sequences in an appropriate format. Furthermore in order to exclude the real donors, acceptors and starts from the background the user can use a combination of awk and linux commands (as shown below). It should be pointed out that this program can be very slow if we are dealing with large fasta files. The following BASH script demonstrates how the user can obtain the background data for "GT", "AG" and "ATG": #!/bin/bash ############# Obtaining all the GT sites FindRegexp GT < pmarinus4training.tbl \ | join pmarinus4training.tbl - | BoundarySites.sh -2 31 \ | gawk '{for (i=2;i<=NF;i+=2) print $1"_"$i,$(i+1)}' | sort \ > all.GTs #981,326 # Obtaining the false donor sites gawk '{print $1"_"$5+1}' pmarinus4training.gff | sort \ | join -v 1 all.GTs - \ > all.false_GTs.tbl #979,004 ############# Obtaining all the AG sites FindRegexp AG < pmarinus4training.tbl\ | join pmarinus4training.tbl - | BoundarySites.sh +4 28 \ | gawk '{for (i=2;i<=NF;i+=2) print $1"_"$i,$(i+1)}' | sort \ > all.AGs #1,089,374 # Obtaining the false acceptor sites gawk '{print $1"_"$4-2}' pmarinus4training.gff | sort \ | join -v 1 all.AGs - \ > all.false_AGs.tbl #1,087,071 ############# Obtaining all the ATG sites FindRegexp ATG < pmarinus4training.tbl \ | join pmarinus4training.tbl - | BoundarySites.sh +2 30 \ | gawk '{for (i=2;i<=NF;i+=2) print $1"_"$i,$(i+1)}' | sort \ > all.ATGs #304,920 # Obtaining the false Start sites gawk '{print $1"_"$4}' pmarinus4training.gff | sort \ | join -v 1 all.ATGs - \ > all.false_ATGs.tbl #304,456 If the user goes through the training protocol described in the BASH scripts of the previous section she should now have the following files required in subsequent steps:
The files above should be all the user needs can to compute the markov models or PWMs for the splice sites and for deriving a model for coding DNA. Once the user has obtained a large set of background ATG, GT and AGs ("false" start codon and splice sites) and extracted the real splice sites and start codon profiles from the annotated sequences as described in the previous section she can proceed to calculate the frequencies of nucleotides at the positions surrounding the different sites (false and real) by using the awk script frequency.awk. Then the user should determine the range around the sites which is informative (i.e the positional frequency of nucleotides flanking the sites is significantly different from the same positional frequencies in "false" background sites) by running another awk script: information.awk. This will give the user an indication of the range of nucleotides around the splice sites and start codon which are usefule to use in building the position weight matrices (PWMs). Furthermore the information content within the real splice sites and start codon profiles can be displayed as a graph (using the pictogram (tar-gzipped version) tool), in which the nucleotide height at any given position is proportional to the frequency of the base at that position with the information content being displayed as bit scores. The following example shows the Donor site profile diagram the user would obtain for P. marinus. This example shows that the Donor profile includes nucleotides carrying significant information from positions -2 to +3 (with reference to the GT dinucleotide). Having calculated the frequencies of all real and background sites the user can proceed to obtain PWMs for the splice sites and start codon. PWMs are calculated as log-likelihood ratios of the probability of observing a nucleotide at a given position in a real site over the probability of observing the same nucleotide at the same position in a "false" site (as described in more detail elsewhere). PWMs are computed using the awk script aux4.awk. Furthermore, using another awk script (files.awk) the user can restrict the PWM to the range of nucleotides found to be informative (see above) and save it in a format that can be directly introduced into the newly developed geneid parameter file for this species. The following BASH script exemplifies what has been described in the above paragraphs on how to compute a PWM for a donor site. PWMs for acceptor sites and start codons were obtained in a similar way. #!/bin/bash ############ Computing frequencies and the amount of information #### background GTs frequency.awk 1 all.false_GTs.tbl > freq.GTs.1 #### real donor sites frequency.awk 1 Donor_sites_filter2.tbl > freq.donors.1 #### checking for significant positions to be subsequently used in PWM information.awk freq.GTs.1 freq.donors.1 | gawk 'NF==2' | more #positions 30-36 seem to be the most useful (informative) to build the PWM #### Plot graph showing the the proportion of nucleotides around the donor site gawk '{print substr($2,25,15)}' Donor_sites_filter2.tbl > Donor_sites.sub pictogram Donor_sites.sub Donor -bits -land #### COMPUTE THE PWM for the DONOR SITE gawk -f aux4.awk freq.GTs.1 freq.donors.1 > log_mat.donors files.awk 30 36 log_mat.donors > donor.geneid If the user follows the the steps above and calculates PWMs for each of the two splice sites and start codon she will generate the following files that can be introduced directly into the newly developed parameter file:
Once the user has computed PWMs for all splice sites and start codon she can proceed to derive a model for the coding potential. The model for coding DNA will be a Markov of order 4 or 5 depending on the amount of coding/non-coding data available. In order to estimate the number of coding and non-coding nucleotides de user can run the following BASH script: #!/bin/bash gawk '{ l=length($2); L+=l; print l} END{ print "TOTAL: "L;}' all.cds.tbl #800,322 (markov5) gawk '{ l=length($2); L+=l; print l} END{ print "TOTAL: "L;}' all.intron.tbl #364,591 (markov5) For this species we were provided with enough data to compute a more accurate Markov model of order 5 which was estimated on both exonic (all.cds.tbl) and intronic (all.intron.tbl) sequences. For the exon sequences we estimate the transition probability distribution of each nucleotide given the pentanucleotide preceding it for each of the three possible frames and an initial probability matrix from the pentamers observed at each codon position using the awk program MarkovMatrices.awk. For the intron sequences a single transition matrix was computed as well as a single initial probabilty matrix using the awk script MarkovMatrices-noframe.awk. Once the Initial and Transition probability matrices are calculated the user proceed to compute the coding potential. The coding potential is computed as log-likelihood ratio (as described in more detail elsewhere) using the the awk programs pro2log_ini.awk (ratios of Initial probability matrices between exonic and intronic sequences) and pro2log_tran.awk (ratios of Transition probability matrices between exonic and intronic sequences). The user should also open the actual programs to check for running options. The following BASH script employs a combination of the scripts indicated above, awk and linux commands that will allow the user to compute the coding potential in format which can be introduced into the geneid parameter file for this species. #!/bin/bash ########################## # Building coding statistic (Markov model order 5) #Generating the initial and transition matrices # without stop codon ################# #markov5 ################## gawk '{print $1,substr($2,1,length($2)-3)}' all.cds_filter1.tbl | gawk -f MarkovMatrices.awk 5 set1.cds sort +1 -2 -o set1.cds.5.initial set1.cds.5.initial sort +1 -2 -o set1.cds.5.transition set1.cds.5.transition gawk -f MarkovMatrices-noframe.awk 5 set1.intron all.intron_filter1.tbl sort -o set1.intron.5.initial set1.intron.5.initial sort -o set1.intron.5.transition set1.intron.5.transition ## Compute log-likelihood exon matrices, assuming intron ## matrices describing background probabilities gawk -f pro2log_ini.awk set1.intron.5.initial set1.cds.5.initial \ > set1.cds-intron.5.initial gawk -f pro2log_tran.awk set1.intron.5.transition set1.cds.5.transition \ > set1.cds-intron.5.transition gawk 'BEGIN {p=-1}{if (((NR+2) % 3)==0) p+=1; print $2,p,$1,$3}' \ set1.cds-intron.5.initial > set1.cds-intron.5.initial.geneid gawk 'BEGIN {p=-1}{if (((NR+2) % 3)==0) p+=1; print $2,p,$1,$4}' \ set1.cds-intron.5.transition > set1.cds-intron.5.transition.geneid If the user goes through the training protocol described above of she should up with two coding potential probability matrix files which are ready to be inserted into the parameter file for this species:
At this stage the user can start building an initial (un-optimized) version of the parameter file for Perkinsus marinus using the previously computed PWMs and the coding potential data. The user can obtain a "template" of a geneid parameter file here. At this point the user can start introducing the start codon and splice site profiles into the appropriate places in the template file. It should be noted that at present the syntax has to be strictly followed. The start site profile (starts.geneid) should be inserted after the comment "# These are the start transition probabilities". Furthermore each of the profiles has four parameters that should be considered when inserting them; for example for the start profile # Parameters to Predict Sites: Lenght, Offset, Cutoff and order (Markov chain) Start_profile 13 8 -5 0 "10" corresponds to the length of the profile. "5" corresponds to the offset position. "-5" is the score cutoff threshold meaning that scores below this value are not considered. "0" is the order. If the user had computed a Markov chain or order 1 or 2 to model the splice sites she would place a "1" or "2" in this position. For PWMs the order is always "0". NOTE: In geneid the offset is calculated in relation to the exon (not intron). This can be more easily explained by looking at the following examples of each one of the profiles: HOW TO CALCULATE THE OFFSET FOR THE DIFFERENT SITES: #start profile: 1 2 3 4 5 6 7 8 9 10 11 12 13 C T A C A G A T G A T A G In the example above the hypothetical start profile has 13 positions, the exon starts at position 7 (ATG) and the previous intron ends at position 6. Therefore the offset in this example would be "6" (before first position of exon). #donor profile: 1 2 3 4 5 6 7 8 9 10 11 A G C G T G G T A G C In the example above the hypothetical donor profile has 11 positions, the intron starts at position 7 (GT) and the exon ends at position 6. Therefore the offset in this example would be "5" (before last position of exon). #acceptor profile: 1 2 3 4 5 6 7 8 9 10 11 12 C T A C A G A G A T A G In the example above the hypothetical acceptor profile has 12 positions, the exon starts at position 9 (AG) and the intron ends at position 8. Therefore the offset in this example would be "8" (before first position of exon) Following the rules above, the different sites profile parameters for Perkinsus marinus should be: # Parameters to Predict Sites: Lenght, Offset, Cutoff and order (Markov chain) Start_profile 13 8 -5 0 # These are the start transition probabilities Acceptor_profile 15 13 -5 0 # These are the acceptor transition probabilities Donor_Profile 7 1 -5 0 # These are the donor transition probabilities Once the user has inserted the different PWM profiles following the instructions shown above she can proceed to insert the markov chain matrices intot the appropriate places in the parameter file. But first the user must indicate the Markov order number below "Markov_oligo_logs_file" in the parameter file (in this case "5" for a Markov model of this order). The user should then insert the Markov Initial and Transition probability matrices (set1.cds-intron.5.initial.geneid and set1.cds-intron.5.transition.geneid) below the text "Markov_Initial_probability_matrix" and "Markov_Transition_probability_matrix" respectively. # Parameters to Predict Exons: Markov chains(initial - transition) Markov_oligo_logs_file 5 Markov_Initial_probability_matrix Markov_Transition_probability_matrix The parameter file has another section at its very end ("Gene model") which when optimized can improve the gene prediction by increasing the stringency of the gene assemblies.For example for Perkinsus marinus when know from information extracted from the training set that 99% of 5076 introns range between 21 and 1600. Therefore the user can force geneid to only assemble genes within these limits by setting low and high boundaries for intron length below "INTRAgenic connections" that are similar to the range of values observed in the training set. It could also prove useful to set a minimum (and maximum) intergenic distance (below "INTERgeneic connections") especially in gene-dense organisms. The values the user can insert for this species are shown below (in bold): # Global Parameters # GENE MODEL: Rules about gene assembling (GenAmic) General_Gene_Model # INTRAgenic connections First+:Internal+ Internal+:Terminal+ 15:2000 block Terminal-:Internal- First-:Internal- 15:2000 blockr # External features Promoter+ First+:Single+ 50:4000 Terminal+:Single+ aataaa+ 50:4000 First-:Single- Promoter- 50:4000 aataaa- Single-:Terminal- 50:4000 # INTERgeneic connections aataaa+:Terminal+:Single+ Single+:First+:Promoter+ 150:Infinity aataaa+:Terminal+:Single+ Single-:Terminal-:aataaa- 150:Infinity Promoter-:First-:Single- Single+:First+:Promoter+ 150:Infinity Promoter-:First-:Single- Single-:Terminal-:aataaa- 150:Infinity # BEGINNING and END of prediction Begin+ First+:Internal+:Terminal+:Single+ 0:Infinity Begin- First-:Internal-:Terminal-:Single- 0:Infinity First+:Internal+:Terminal+:Single+ End+ 0:Infinity First-:Internal-:Terminal-:Single- End- 0:Infinity At this point the user will have a working Perkinsus marinus parameter file albeit not yet optimized. There are two parameters internal to the parameter file of geneid (elaborated in more detail elsewhere) which can be modified to optimize gene predictions. These parameters (shown here) basically set cutoffs to the the scores of the predicted exons to minimize overprediction (eWF) and set the ratio of information from signal and coding statistics used (oWF).A more detailed explanation can be found elsewhere in our group's page. "Optimization" basically consists of evaluating the geneid predictions on the training set against the training set annotations themselves and picking the combination of oWF and eWF that optimize the prediction accuracy as measured by accuracy at nucleotide, exon and gene level. Very briefly, "evaluate" gives the user a measure of the "sensitivity" or the proportion of annotated nucleotides, exons or genes that are predicted correctly and the "specificity" or the proportion of the predicted nucleotides/exons/genes which are true-positive (i.e.) fall within the annotations; a more detailed explanation can be found elsewhere). The evaluation tool we use to can be downloaded from here (tar-gzipped version). "Optimization" requires some processing of the input files using a combination of linux and awk command line instructions as well as the awk tool gff2cds. The steps involved in setting up the optimization process are shown in the BASH script below: #!/bin/bash ###obtain list of training set fastas containing annotated genes with their respective length gawk '{print $1,length($2)}' pmarinus4training.tbl | sort | uniq \ > cds_tmp4training ###convert gff-format gene annotations into "cds" format using gff2cds tool gff2cds source="P.marinus" pmarinus4training.gff \ | sort | join cds_tmp4training - > pmarinus4training.cds ### set up gff file for compatibility with evaluation tool gawk 'BEGIN{while (getline< ARGV[1]>0){len[$1]=$2;};ARGV[1]="";OFS="\t";} \ {if (NR==1) {ant=$1;print $1,$2,"Sequence",1,len[ant],".",".",".","."}; \ if ($1!=ant) {print "#$";ant=$1;print $1,$2,"Sequence",1,len[ant],".",".",\ ".","."}; print }' pmarinus4training.cds pmarinus4training.gff > pmarinus4training.eval.gff If the user follows the setup steps shown above she should end up with the following file: pmarinus4training.eval.gff, which will be necessary in the optimization process. The optimization process itself requires an AWK BASH-embedded script. This script (Optimization_new.sh) basically allows the user to go through an user-selected range of oWF and eWF. Geneid is run and the resulting predictions evaluated for each combination of oWF and eWF (against the training set annotations). Before running this script the user should edit the "Optimization_new.sh" file to make sure that the $PATH of both geneid ($GENEID) and the evaluation ($EVAL) program are correctly indicated (in bold): #section of Optimization_new.sh: . . . while [ $of -ne 0 ] do IeWF=$IeWFini ef=`gawk "BEGIN{print ($IeWF <= $FeWF) ? 1 : 0}"` while [ $ef -ne 0 ] do # update exon score in geneid.param.multi gawk "{if (NR==27) {print 1-$IoWF+0.2, 1-$IoWF-0.1, 1-$IoWF-0.1, 1-$IoWF+0.2;} else if (NR==30) {print $IoWF, $IoWF, $IoWF, $IoWF;} \ else if (NR==36) {print $IeWF, $IeWF, $IeWF, $IeWF;} else print}" $PARAM > /tmp/tmp.$$ # make gene predictions $BIN/geneid/bin/geneid $GENEID/geneid -GP /tmp/tmp.$$ $SEQFILE | gawk 'NR>5 {if ($2=="Sequence") print "#$"; if (substr($1,1,1)!="#") print }' - | egrep -v 'exon' - > /tmp/Predictions.$$.gff # evaluate gene predictions $EVAL/evaluation -sta /tmp/Predictions.$$.gff $CDSFILE | tail -2 | head -1 | \ gawk "{printf \"%6.2f %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f\\n\", $IoWF, $IeWF, \$1, \$2, \$3, \$4, \$5, \$6, \$7, \$8, \$12, \$13}" ## remove files rm /tmp/Predictions.$$.gff IeWF=`echo $IeWF $deWF | gawk '{print $1+$2}'` ef=`gawk "BEGIN{print ($IeWF <= $FeWF) ? 1 : 0}"` done IoWF=`echo $IoWF $doWF | gawk '{print $1+$2}'` of=`gawk "BEGIN{print ($IoWF <= $FoWF) ? 1 : 0}"` done . . . The optimization script requires as input the newly built Perkinsus marinus parameter file (Pmarinus.param.gz (gzipped)), the multi-fasta sequence containing the annotations (pmarinus4training.fa) and the evaluation tool-compatible gff annotations contained in the fastas (properly sorted to match the order of the sequences in the fasta file): pmarinus4training.eval.gff. The actual command line instruction and output is as follows: (the accuracy performance was sorted so that the bottom values should indicate oWF/ eWF values which optimize performance). It should be mentioned that this optimization process can be quite slow when dealing with large fastas as is the case here. #!/bin/bash Optimization_new.sh Pmarinus.param pmarinus4training.fa pmarinus4training.eval.gff > pmarinus4training.param.optimization ###sorting of accuracy performance sort +4n -5 +7n -8 +8nr -9 pmarinus4training.param.optimization oWF eWF SN SP CC SNe SPe SNSP raME raWE raMG raWG 0.40 -3.00 0.95 0.12 0.26 0.77 0.10 0.44 0.07 0.87 0.00 0.87 0.40 -2.50 0.96 0.12 0.26 0.79 0.09 0.44 0.05 0.89 0.00 0.88 0.30 -2.50 0.96 0.12 0.26 0.78 0.09 0.44 0.04 0.89 0.00 0.87 0.35 -2.50 0.96 0.12 0.26 0.79 0.09 0.44 0.04 0.88 0.00 0.88 0.35 -3.00 0.95 0.12 0.27 0.77 0.10 0.44 0.07 0.87 0.00 0.86 oWF = exon factor eWF = exon weight SN = sensitivity nucleotide level SP = specificity nucleotide level CC = correlation SN/SP SNe = sensitivity exon level SPe = specificity exon level SNSP = correlation SNe/Spe raME = ratio missing exons raWE = ratio wrong exons raMG = ratio missing genes raWG = ratio wrong genes NOTE: From the output of the "Optimization_new.sh" it will be obvious to the user that all accuracy estimates related to specificity are very low. The explanation for this is that the fasta files correponding to Perkinsus marinus sequences are not fully annotated and therefore geneid is predicting potentially real genes in regions that remain annotated hence lowering the specificity of the predictions. Regardless based on this optimization we select the combination of oWF/ eWF values that seem to maximize performance. These values ( 0.35 and -3.00 respectively) are then inserted into the parameter file. In a final step of the training process the fully optimized parameter file (pmarinus_geneid.param (gzip compressed)) is used to predict sequences on the evaluation (test) set fastas (pmarinus4evaluation.fa). However, in order to minimize the low specificity caused by the lack of a complete annotation of the multi-fasta sequence, prior to evaluation we extract the annotations with 500 nt of flanking sequence from the original fasta file. In order to do so the the use must run the awk script gff2gp.awk (setting the value of the line "context=" to 500) using pmarinus4evaluation.gff as the input file. The command line that should be used to obtain a golden path (gp)-compliant file is: #!/bin/bash gff2gp.awk pmarinus4evaluation.gff | sort > pmarinus4evaluation.gp The user can then use the output of "gff2gp.awk", pmarinus4evaluation.gp as the input of the perl script getgenes.pl that can be used to both obtain a new set of fasta files with the number of flanking nucleotides previously selected (i.e. 500), and to convert "pmarinus4evaluation.gp" file to gff format. It should be noted that "getgenes.pl" requires that another program (chromosomechunk ) be in the same directory as the input file (pmarinus4evaluation.gp). It is also necessary that the user convert the multi-fasta pmarinus4evaluation.tbl into 250 separate fastas using the awk script TblToFastaFile, and place them into a separate directory (i.e. ./fastas4evaluation/). Finally "getgenes.pl" outputs the sequences in tabular format that can be converted to fasta with TblToFasta. The following wrapper BASH script lists the steps described above: #!/bin/bash ## Convert the gff to gp (golden path) ## 500 flanking nucleotidesk gff2gp.awk pmarinus4evaluation.gff | sort > pmarinus4evaluation.gp ##fix to ensure that getgenes.pl does not crash when flanking sequence extends past end of sequence gawk '{print $1,$9}' pmarinus4evaluation.gff | sort | uniq > locus_id_evaluation gawk '{print $1,length($2)}' pmarinus4evaluation.tbl | sort > all_fastas_length join all_fastas_length locus_id_evaluation | gawk '{print $3,$1,$2}' > evaluation_contigs_length join evaluation_contigs_length pmarinus4evaluation.gp > pmarinus4evaluation_mod.gp gawk 'BEGIN{OFS=" ";FS=" ";}{if ($3<=$7){coord=1;tot=split($12,array,",");array[tot-1]=$3;s="";\ for(coord=1;coord<=tot-1;coord++){s=s""array[coord]","};print $1,$2,$5,$6,$3,$8,$9,$10,$11,s;}else{print $1,$2,$5,$6,$7,$8,$9,$10,$11,$12 }}' \ pmarinus4evaluation_mod.gp > pmarinus4evaluation.gp ### extract fastas from multi-fasta file and place them in a directory - mkdir -p ./fastas4evaluation/; cd ./fastas4evaluation/ TblToFastaFile ../pmarinus4evaluation.tbl cd .. ###extract sequence fastas + flanking sequence (make sure "chromosomechunk" is in the pmarinus4evaluation.gp directory) getgenes.pl ./fastas4evaluation/ pmarinus4evaluation.gp | \ gawk 'BEGIN{b="x"}{if ($1!=b){printf "\n%s ",$1}printf "%s",$6;b=$1}END{printf "\n"}' \ | sort | sed '/^$/d' - | gawk '{print $1, toupper($2)}' - > pmarinus4evaluation.gp.tbl TblToFasta pmarinus4evaluation.gp.tbl > pmarinus4evaluation.gp.fa mkdir -p ./fastas_gp4evaluation; cd ./fastas_gp4evaluation; TblToFastaFile pmarinus4evaluation.gp.tbl cd .. ###obtain gff annotation corresponding to extracted fastas getgenes.pl ./fastas4evaluation/ pmarinus4evaluation.gp | \ gawk 'BEGIN{OFS="\t";pos=1;b="x"}{if ($1!=b){pos=1}; print $1,"P.marinus",$3,pos,pos+$5-1,".","+",".",$1$2; pos+=$5;b=$1 }' \ | egrep -v "(Intron|Utr)" - > pmarinus4evaluation.gp.gffOnce the user has obtained the gff (pmarinus4evaluation.gp.gff) and fasta (pmarinus4evaluation.gp.fa) output files she should proceed to set up the gff file for evaluation (pmarinus4evaluation.gp.eval.gff; shown below and in the "optimization" section above), and then evaluate the accuracy of the geneid predictions on the test sequences using the newly developed and optimized parameter file (Pmarinus.param.gz (gzipped)) and the evaluation tool (Evaluation.tgz (tar-gzipped version)). Setting up the gff for the evaluation process requires some processing of the input files using a combination of linux and awk command line instructions as well as the awk script gff2cds. The command line instructions necessary in setting up and running the evaluation process are shown in the BASH script below: #!/bin/bash ###obtain list of training set fastas containing annotated genes with their respective length gawk '{print $1,length($2)}' pmarinus4evaluation.gp.tbl | sort | uniq \ > cds_tmp_gp4evaluation ###convert gff-format gene annotations into "cds" format using gff2cds tool gff2cds source="P.marinus" pmarinus4evaluation.gp.gff \ | sort | join cds_tmp_gp4evaluation - > pmarinus4evaluation.gp.cds ### set up gff file for compatibility with evaluation tool gawk 'BEGIN{while (getline< ARGV[1]>0){len[$1]=$2;};ARGV[1]="";OFS="\t";} \ {if (NR==1) {ant=$1;print $1,$2,"Sequence",1,len[ant],".",".",".","."}; \ if ($1!=ant) {print "#$";ant=$1;print $1,$2,"Sequence",1,len[ant],".",".",\ ".","."}; print }' pmarinus4evaluation.gp.cds pmarinus4evaluation.gp.gff > pmarinus4evaluation.gp.eval.gff ##run geneid on the 250 test fastas and evaluate based on previous optimization on the training set (oWF=0.35 eWF=-3.00) geneid -vGP Pmarinus.param pmarinus4evaluation.gp.fa \ | gawk 'NR>5 {OFS="\t";if ($3=="Gene") print "#$"; $2="geneid_Pmarinus"; \ if (substr($1,1,1)!="#") print }' - | egrep -v 'exon' - \ > pmarinus.geneid.eval.gff evaluation -sta pmarinus.geneid.eval.gff pmarinus4evaluation.gp.eval.gff | gawk '{OFS=" & ";$1=$1;print}' \ > evaluation.gp.geneid.pmarinus SN SP CC SNe SPe SNSP raME raWE SNg SPg SNSPg raMG raWG raJG raSG 0.92 0.90 0.84 0.72 0.71 0.72 0.10 0.12 0.38 0.28 0.33 0.00 0.27 1.00 1.01 SN = sensitivity nucleotide level SP = specificity nucleotide level CC = correlation SN/SP SNe = sensitivity exon level SPe = specificity exon level SNSP = correlation SNe/Spe raME = ratio missing exons raWE = ratio wrong exons SNg = sensitivity gene level SPg = specificity gene level SNSPg = correlation SNg/SPg raMG = ratio missing genes raWG = ratio wrong genes raJG = ratio joined genes raSG = ratio split genes As the end of the training/evaluation protocol the user should have a geneid parameter file which is rather optimized for the species of interest. The average prediction performance observed for P. marinus is fairly typical for a geneid parameter file (80-90% nucleotides, 70-80% exons and 30-40% genes accurately predicted) given a species of this complexity. (evaluation.gp.geneid.pmarinus) NOTE: There will also be instances in which the user will only have access to a limited amount of gene models and will not therefore be able to set aside sequences for testing the newly developed parameter file. In these cases to reduce the bias that would arise from testing accuracy on the same sequences as those on the training set it is recommended that the user apply a "leave-one-out" testing strategy. In this strategy one sequence at a time will be excluded from the training process and the accuracy subsequently determined on the excluded sequence. This process is then repeated for every sequence in the set. If the user requires help with any aspect of the geneid training process or in testing the accuracy of any newly developed geneid parameter file, for example by using this bootstrapping leave-one-out strategy she should contact us for assistance.
This training tutorial document was prepared by: Francisco Camara. geneid is under GNU General Public License. |
Disclaimer | webmaster |