AcE homepage
Contents

  1. What's AcE?
  2. Main features
  3. Download
  4. If you encounter problems ...
  5. References
  6. Authors and acknowledgements

What's AcE?

AcE is a program that helps evaluate the accuracy of gene predictions. It reads GFF format, which makes it easy to convert gene prediction results into an analyzable form. Novel features include isoform accuracy evaluation from the perspective of the annotated genes, the predicted genes, or both at the same time. Genomic sequence whose features are unknown can be masked, so that gene predictions in annotated regions can be analyzed in a genomic context. Test sets, such as an artificial sequence test set or a genomic context test set, can be generated by selecting specified annotated sequences from a master set.

Main features

  • Usage:
    /your_bin_path/ace [ options ] <PredictionSets>

  • Available options:
    • -help
      This message

    • -tseqset <SeqSET>
      Test set of sequences to evaluate

    • -annoset <AnnotatedSet>
      Set of Annotations to compare predictions against

    • -genesdir <Directory>
      e.g. /home/jsmith/genes (the directory where the *.seq or *.anno or *.result files are kept)

    • -testsetdir <Directory>
      e.g. /home/jsmith/testsets (files containing subsets of gene filenames for targeted accuracy evaluations)

    • -restrict <AnnotatedSet|PredictionSet>
      See below for description (Default is <AnnotatedSet>)

    • -firstisoform
      Select first Isoform for accuracy evaluation

    • -nogeneacc
      Don't compute gene (protein) level accuracy

    • -printsets
      Used alone to print out what test sets are available

    • -verbose
      Prints messages to STDOUT to keep one abreast of what's going on

  • This program compares GFF files, annotation versus prediction, to determine the accuracy of a gene prediction method. It works on a test set of sequences, <SeqSET>. Each sequence has a name, <NAME>, and its sequence and <AnnotatedSet|PredictionSet> files are named after it: <NAME>.seq, <NAME>.anno.gff, <NAME>.genscan.gff. The <NAME>.seq file is a FASTA file.

  • The <CompareSets> GFF files must follow the standard GFF format. Blank lines and lines starting with a # are ignored. All fields are separated by the TAB character. A line becomes part of a predicted gene only if its feature field contains CDS. All CDS exons from the same gene must share the same group field. Isoforms (alternative splice forms) must have a suffix ".AltSpl0" .. ".AltSpl<N>" appended to the group identifier of each CDS. One can also mark "Annotated" sequence by putting Annotated in the feature field with the appropriate LEnd and REnd.

  • The GFF Annotated feature lets one evaluate gene predictions on a genomic sequence where the annotation is known for only a fraction of the sequence. All predictions and annotations that fall outside the GFF Annotated ranges are ignored. GFF Annotated features are taken from only one set, either the <AnnotatedSet> or a <PredictionSet>. For example, if one has generated a new set of prediction GFFs called "perfect", but these predictions apply to only a small portion of the sequence, one can put GFF Annotated features in the <SeqSet>.perfect.gff file to indicate the range of sequence to which the predictions apply, and also set '-restrict perfect'. This way one is not penalized for missed annotations that fall outside the sequence in which genes were predicted.

  • If isoforms exist among the annotation or prediction CDSs, only one <AnnotationCompareSet> and one <PredictionCompareSet> can be given.

  • Results of the comparisons will be given in the form of:
    • NT Results (see Burset, 1996)
    • Exon Results (see Burset, 1996)
    • Gene Results
    • FSA: (Similar residues/annotated gene length) using 'align' program (Myers & Miller, 1988)
    • FSL: (Similar residues/longest {predicted or annotated} gene length) using 'align' program (Myers & Miller, 1988)
    • P/allA: Number of predicted genes that overlap annotated genes divided by total number of annotated genes
    • A/allP: Number of annotated genes that overlap predicted genes divided by total number of predicted genes
    • P/A: Number of predicted genes that overlap annotated genes divided by number of annotated genes that overlap predicted genes
    • MG: Number of annotated genes with no predicted gene overlap divided by the total number of annotated genes
    • WG: Number of wrong gene predictions with no annotated gene overlap divided by the total number of gene predictions

  • BTW, you can abbreviate option names!
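
The GFF conventions described above (TAB-separated fields, ignored blank and # lines, CDS features grouped by the group field, the ".AltSpl<N>" isoform suffix, and Annotated ranges) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic of ace.pl: the function names are hypothetical, the column order is that of standard GFF, and a gene is assumed to be kept only when every one of its exons lies inside some Annotated range, which may differ from what AcE actually does.

```python
import re
from collections import defaultdict

# Matches the ".AltSpl<N>" suffix that marks an isoform in the group field.
ISOFORM_SUFFIX = re.compile(r"^(?P<gene>.*)\.AltSpl(?P<n>\d+)$")


def parse_gff(text):
    """Return (genes, annotated) from GFF text.

    genes maps (gene_id, isoform_number_or_None) to a sorted list of
    (start, end) CDS exons; annotated is a sorted list of (start, end)
    ranges taken from Annotated features.
    """
    genes = defaultdict(list)
    annotated = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comment lines are ignored
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # assumption: silently skip malformed lines
        feature = fields[2]
        start, end = int(fields[3]), int(fields[4])
        if "Annotated" in feature:
            annotated.append((start, end))
        elif "CDS" in feature and len(fields) > 8:
            group = fields[8].strip()
            match = ISOFORM_SUFFIX.match(group)
            if match:
                key = (match.group("gene"), int(match.group("n")))
            else:
                key = (group, None)  # gene without isoform suffix
            genes[key].append((start, end))
    return {k: sorted(v) for k, v in genes.items()}, sorted(annotated)


def restrict_to_annotated(genes, annotated):
    """Drop genes with any exon outside every Annotated range.

    Assumption: if there are no Annotated ranges, nothing is masked.
    """
    if not annotated:
        return dict(genes)
    return {
        key: exons
        for key, exons in genes.items()
        if all(any(a <= s and e <= b for a, b in annotated) for s, e in exons)
    }
```

Keying genes by (gene_id, isoform number) keeps the isoforms of one gene together while still letting each be evaluated separately.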

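The gene-level overlap ratios listed above (P/allA, A/allP, P/A, MG, WG) follow directly from their definitions. The sketch below is a hypothetical illustration, not code from ace.pl: it assumes each gene is reduced to a single (start, end) span with inclusive coordinates, that two genes overlap when their spans share at least one position, and that a ratio with a zero denominator is reported as None. AcE itself may define overlap differently.

```python
def overlaps(a, b):
    """True if inclusive spans a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def gene_overlap_stats(annotated_genes, predicted_genes):
    """Compute the gene-level overlap ratios from lists of (start, end) spans."""
    # Predicted genes that overlap at least one annotated gene.
    p_hit = sum(
        any(overlaps(p, a) for a in annotated_genes) for p in predicted_genes
    )
    # Annotated genes that overlap at least one predicted gene.
    a_hit = sum(
        any(overlaps(a, p) for p in predicted_genes) for a in annotated_genes
    )
    n_a, n_p = len(annotated_genes), len(predicted_genes)

    def ratio(num, den):
        return num / den if den else None  # assumption: undefined -> None

    return {
        "P/allA": ratio(p_hit, n_a),
        "A/allP": ratio(a_hit, n_p),
        "P/A": ratio(p_hit, a_hit),
        "MG": ratio(n_a - a_hit, n_a),  # missed genes
        "WG": ratio(n_p - p_hit, n_p),  # wrong genes
    }
```

A P/A value above 1 suggests that annotated genes are being split into several predictions, while a value below 1 suggests that several annotated genes are being joined into one prediction.
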
Download

The AcE distribution contains the perl script ace.pl and a set of test sequences. The files can be obtained from the ftp server (genome.imim.es). To compute the gene level accuracy statistics you also need the 'align' program (Myers & Miller, 1988).

The available files include:

  1. AcE perl script
  2. Test Datasets

To do (if bugged :)

  1. add documentation
  2. speed up, clean up, modularize (basically the program does what it is supposed to, but it doesn't have much error checking and is ugly code)
  3. add user submitted test datasets

If you encounter problems...

If you encounter problems using AcE, or have suggestions on how to improve it, send an e-mail to William_S_Hayes@sbphrd.com.

References

  1. GFF, http://www.sanger.ac.uk/Software/formats/GFF/
  2. M. Burset and R. Guigó, "Evaluation of Gene Structure Prediction Programs", Genomics, 34:353-367 (1996).

Authors and acknowledgements

The current version of AcE was written by William Hayes.
The test datasets were constructed by Vineet Bafna and William Hayes.



Copyright © 2000

AcE is under GNU General Public License.
