geneid docs

geneid documentation: 8. geneid parameter file

Table of contents:

Description
The GENE MODEL
Full description

Description:

geneid depends on a parameter file to build its predictions. Most of the parameter file describes the probabilistic model on which the predictions are based. The end of the file contains the so-called gene model: the set of rules describing how to chain gene elements (such as exons) into gene predictions. Through the gene model and the O/R options, geneid supports the integration of predictions from multiple sources.

The GENE MODEL

The gene model is the list of rules describing the constraints under which predicted gene elements must be joined together in the final output. These constraints refer to the succession of elements in the gene structure and to the range of allowed distances between them.

#Intragenic connections
First+:Internal+ Internal+:Terminal+ 40:11000

For instance, the rule above indicates that elements (exons) of type Internal or Terminal must be chained immediately after elements of type First or Internal on the forward strand. The third column gives the range of distances at which they can be chained: here, the two elements must be at least 40 bp and at most 11000 bp apart. The equivalent rule for the reverse strand is:

#Intragenic connections (reverse)
Terminal-:Internal- Internal-:First- 40:11000

The following rules specify the constraints governing connections between genes:

#Intergenic connections
Promoter-:First-:Terminal+ Promoter+:Terminal-:First+ 2000:Infinity
Promoter+ First+ 50:4000
First- Promoter- 50:4000

The first line describes how the end of one gene connects to the beginning of the next, both on the forward strand (Terminal+ followed by First+) and on the reverse strand (First- followed by Terminal-). Hybrid connections are also defined (e.g. the end of a reverse gene with the beginning of a forward gene). The second and third lines define where a promoter may be placed relative to its gene on the same strand. To impose no maximum distance constraint, the keyword Infinity must be used.
The present version of geneid predicts elements of types First, Internal, Terminal and Single. Other elements supplied in additional external files (O/R options) are ignored unless they are defined in some rule of the gene model.
The gene model does not specify which elements terminate genes (genes are assumed to be identified by the group field of the GFF file). Gene boundaries are coded within the program: the features First+ and Terminal- start a gene, the features First- and Terminal+ end one, and Single+ and Single- both start and end a gene.
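
The rule format described above can be illustrated with a minimal Python sketch. It is not geneid's own code: the names Rule, parse_rule and can_chain are hypothetical, the distance is passed in directly rather than computed from coordinates, and both distance bounds are assumed to be inclusive.

```python
import math
from typing import NamedTuple


class Rule(NamedTuple):
    upstream: frozenset    # feature types allowed on the left of the connection
    downstream: frozenset  # feature types allowed on the right of the connection
    min_dist: float        # smallest allowed distance (assumed inclusive)
    max_dist: float        # largest allowed distance, math.inf for "Infinity"


def parse_rule(line: str) -> Rule:
    """Parse one rule line: upstream features, downstream features, min:max."""
    upstream, downstream, distances = line.split()[:3]
    low, high = distances.split(":")
    return Rule(
        frozenset(upstream.split(":")),
        frozenset(downstream.split(":")),
        float(low),
        math.inf if high == "Infinity" else float(high),
    )


def can_chain(rule: Rule, left: str, right: str, distance: float) -> bool:
    """True if feature `right` may follow feature `left` at this distance."""
    return (
        left in rule.upstream
        and right in rule.downstream
        and rule.min_dist <= distance <= rule.max_dist
    )


forward = parse_rule("First+:Internal+ Internal+:Terminal+ 40:11000")
intergenic = parse_rule(
    "Promoter-:First-:Terminal+ Promoter+:Terminal-:First+ 2000:Infinity"
)

print(can_chain(forward, "First+", "Internal+", 350))          # True
print(can_chain(forward, "First+", "Internal+", 20))           # False, closer than 40
print(can_chain(forward, "Terminal+", "First+", 350))          # False, not in this rule
print(can_chain(intergenic, "Terminal+", "First+", 5_000_000)) # True, no maximum
```

Representing Infinity as math.inf keeps the distance check to a single comparison, with no special case for rules that have no maximum.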

Full description:

(1) This is the general structure of a geneid parameter file: an information model used to predict genomic elements according to the G+C content of each sequence fragment, followed by the gene model used to assemble the predictions according to a series of rules.

#1. ISOCHORE DEPENDENT INFORMATION
number_of_isochores
3
# SET OF PARAMETERS FOR ISOCHORE 1
boundaries_of_isochore
0 40
...
# SET OF PARAMETERS FOR ISOCHORE 2
boundaries_of_isochore
40 70
...
# SET OF PARAMETERS FOR ISOCHORE 3
boundaries_of_isochore
70 100
...
#2. GENE ASSEMBLING RULES (gene model)
...
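
As an illustration of how a sequence fragment could be matched to one of the three isochores listed above, here is a minimal Python sketch. The function names are hypothetical, and the handling of boundary values is an assumption: the source does not say whether a fragment with exactly 40% G+C belongs to isochore 1 or 2, so the sketch treats each range as half-open, with the last range also including its upper bound.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C over all characters of the fragment."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# (low, high) values taken from the boundaries_of_isochore entries above.
ISOCHORES = [(0, 40), (40, 70), (70, 100)]


def select_isochore(seq: str, isochores=ISOCHORES) -> int:
    """Return the 1-based number of the isochore matching the G+C content."""
    gc = gc_percent(seq)
    last = len(isochores)
    for number, (low, high) in enumerate(isochores, start=1):
        # Assumption: ranges are [low, high), except the last one, which
        # also accepts gc == high so that 100% G+C is covered.
        if low <= gc < high or (number == last and gc == high):
            return number
    raise ValueError(f"G+C content {gc:.1f}% is outside every isochore")


print(select_isochore("ATATATATGC"))  # 20% G+C  -> 1
print(select_isochore("GCATT"))       # 40% G+C  -> 2
print(select_isochore("GGGCCCGCGA"))  # 90% G+C  -> 3
```

Ambiguous characters such as N count toward the fragment length in this sketch, which lowers the computed G+C percentage.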

(2) This is the general structure of each isochore model: cutoff values to filter exons after scoring, profiles to detect genomic signals, Markov model matrices to measure coding potential, and the maximum number of exons that may start at the same signal.

# SET OF PARAMETERS FOR ISOCHORE 1
boundaries_of_isochore
0 40
#1. SET OF PARAMETERS to filter exons
...
#2. PROFILES TO PREDICT SIGNALS
...
#3. MARKOV MODEL TO ESTIMATE CODING POTENTIAL
...
#4. Maximum amount of exons with every left signal
maximum_number_of_donors_per_acceptor_site
5

(3) Parameters for filtering exons according to their scores: a cutoff on the final exon score, a cutoff on the protein coding potential score, the weight of the coding potential score relative to the signal scores, and the exon weight parameter (a value added to every exon that has passed all of the previous filters; it acts as a penalty that prevents long genes with too many exons from being predicted, a side effect of the additive scheme used by the gene assembly model). Each parameter always takes 4 values, one for each exon type: First, Internal, Terminal and Single.

# SET OF PARAMETERS to filter exons
Total_score_cutoff
-15 -15 -15 -15
Coding_potential_score_cutoff
-10 -15 -15 -15
Weight_of_coding_potential_score_in_exons
0.4 0.4 0.4 0.4
Exon_weight_values
-7 -7 -7 -7
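
The four-values-per-parameter layout above can be read into a lookup keyed by exon type, as in the following Python sketch. Two things are assumptions, not statements from the source: that the values appear in the order First, Internal, Terminal, Single (the order in which the text names the exon types), and that an exon passes a filter when its score is greater than or equal to the cutoff. The helper names are hypothetical, and the sketch does not attempt to reproduce how geneid combines signal and coding scores.

```python
# Assumption: values are listed in this order in the parameter file.
EXON_TYPES = ("First", "Internal", "Terminal", "Single")


def per_exon_type(values: str) -> dict:
    """Map a line of 4 numbers to the 4 exon types."""
    numbers = [float(v) for v in values.split()]
    if len(numbers) != len(EXON_TYPES):
        raise ValueError(f"expected {len(EXON_TYPES)} values, got {len(numbers)}")
    return dict(zip(EXON_TYPES, numbers))


# Values copied from the listing above.
params = {
    "Total_score_cutoff": per_exon_type("-15 -15 -15 -15"),
    "Coding_potential_score_cutoff": per_exon_type("-10 -15 -15 -15"),
    "Weight_of_coding_potential_score_in_exons": per_exon_type("0.4 0.4 0.4 0.4"),
    "Exon_weight_values": per_exon_type("-7 -7 -7 -7"),
}


def passes_cutoffs(exon_type: str, total_score: float, coding_score: float) -> bool:
    """Assumption: an exon is kept when both scores reach their cutoffs."""
    return (
        total_score >= params["Total_score_cutoff"][exon_type]
        and coding_score >= params["Coding_potential_score_cutoff"][exon_type]
    )


# The same scores pass for an Internal exon but fail for a First exon,
# because the coding potential cutoff for First exons is stricter (-10).
print(passes_cutoffs("Internal", -3.0, -12.0))  # True
print(passes_cutoffs("First", -3.0, -12.0))     # False
```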

(4) This is a generic profile to predict genomic signals: start and stop codons, or acceptor and donor splice sites. This type of profile is called a Position Weight Array (PWA): a matrix addressed by (nucleotide, position) in which every cell contains a log-likelihood ratio between a Markov model (order k) recognizing true sites and another one matching false sites. For every nucleotide in a candidate region, the score therefore compares the probability of finding the oligonucleotide (length k) that precedes that nucleotide when the region contains a true site with the same probability when it does not. A PWA is characterized by 4 parameters:

length of profile (region to be scanned)
offset: distance from the beginning of the profile to the characteristic or core element of that signal (e.g. ATG in start codons)
cutoff: score to filter false signals
order (Markov chains): length of oligonucleotides used to score every element in the candidate region

Start profile
20 14 -6 2
# Position Weight Array
1 AAA -0.230297
1 AAC 0.519562
1 AAG 0.301505
1 AAT -0.519705
1 ACA -0.688891
1 ACC 0.483538
...
20 TGT -0.483268
20 TTA -0.545436
20 TTC 0.557414
20 TTG -0.0335678
20 TTT -0.172498
# e.g. Given S=ATGAGC then
# score(GAGC) = score(1,ATG) + score(2,TGA) + score(3,GAG) + score(4,AGC)
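
The worked example at the end of the listing (S=ATGAGC scored as a sum of four trinucleotide entries) can be written as a short Python sketch. The function names are hypothetical and the four scores in toy_rows are made-up values chosen for easy arithmetic, since the listing elides the real entries for these trinucleotides.

```python
def parse_pwa(rows):
    """Read 'position oligonucleotide score' rows into a lookup table."""
    table = {}
    for row in rows:
        position, oligo, score = row.split()
        table[(int(position), oligo)] = float(score)
    return table


def score_site(seq: str, table: dict, order: int) -> float:
    """Sum the table entries of every window of order+1 nucleotides.

    Window i (1-based) covers seq[i-1 : i+order] and is looked up at
    profile position i, as in the worked example of the listing.
    """
    width = order + 1
    total = 0.0
    for i in range(len(seq) - width + 1):
        total += table[(i + 1, seq[i:i + width])]
    return total


# Made-up scores for illustration only (not taken from a real profile).
toy_rows = ["1 ATG 0.5", "2 TGA -0.25", "3 GAG 1.0", "4 AGC 0.125"]
table = parse_pwa(toy_rows)

# score(1,ATG) + score(2,TGA) + score(3,GAG) + score(4,AGC)
print(score_site("ATGAGC", table, order=2))  # 1.375
```

A window that has no entry in the table raises KeyError in this sketch; a real profile of order 2 would list all 64 trinucleotides at every position.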

(5) Model to measure the protein coding potential of genomic regions. A Markov model of order 5 is usually used, with 2 types of matrices: initial (to score the first pentanucleotide of an exon) and transition (to score the rest of the exon, taking 5+1 nucleotides at a time until the end of the exon is reached). The initial matrix is computed as the ratio between the frequency of each pentanucleotide in real exons and its frequency in intronic regions. The transition matrix is computed as the ratio between the probability of finding a pentanucleotide X at codon position CP (translation) before a given nucleotide in a real exon and the same probability in a false one (intronic region).

# Markov model (log likelihood ratio)
Markov_model_order
5
Markov_initial_probability_matrix
AAAAA 0 0 -0.85727
AAAAA 0 1 -1.48328
AAAAA 0 2 -1.72858
AAAAC 1 0 -0.377093
AAAAC 1 1 -0.202228
AAAAC 1 2 -0.961698
...
Markov_transition_probability_matrix
AAAAAA 0 0 -1.10797
AAAAAA 0 1 -0.736771
AAAAAA 0 2 -0.570196
AAAAAC 1 0 0.961644
AAAAAC 1 1 0.516564
AAAAAC 1 2 0.646909
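
To make the roles of the two matrices concrete, the following Python sketch scores a short sequence with the entries shown above. It rests on several assumptions that the source does not confirm: the row columns are read as oligonucleotide, oligonucleotide index, codon position and log-likelihood ratio; the codon position is taken to advance by one for each nucleotide, modulo 3; and the function names are hypothetical.

```python
def parse_matrix(rows):
    """Read 'oligo index codon_position value' rows into a lookup table."""
    matrix = {}
    for row in rows:
        oligo, _index, position, value = row.split()
        matrix[(oligo, int(position))] = float(value)
    return matrix


def coding_score(seq, initial, transition, order=5, frame=0):
    """Initial score of the first `order` nucleotides plus one transition
    score for every following window of order+1 nucleotides."""
    score = initial[(seq[:order], frame)]
    for i in range(len(seq) - order):
        # Assumption: the codon position of window i is (frame + i) mod 3.
        score += transition[(seq[i:i + order + 1], (frame + i) % 3)]
    return score


# Entries copied from the listing above.
initial = parse_matrix(["AAAAA 0 0 -0.85727"])
transition = parse_matrix([
    "AAAAAA 0 0 -1.10797",
    "AAAAAA 0 1 -0.736771",
    "AAAAAC 1 2 0.646909",
])

# AAAAA (initial) + AAAAAA at position 0 + AAAAAA at position 1 + AAAAAC at position 2
print(round(coding_score("AAAAAAAC", initial, transition), 6))
```

Because the values are log-likelihood ratios, the scores are added rather than multiplied, and a negative total indicates that the sequence looks more like the intronic model than the exonic one.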

(6) Gene model rules. These rules define which exon types may be joined to which others, subject to minimum and maximum distance requirements. The features in the first column of a rule are assembled before the features in the second column, within the distances given in the third column. The word "block" is used to preserve connections between exons sharing the same identifier (group), such as annotated genes (evidence), without mixing them with ab initio predictions. The minimum distance is the smallest allowed distance between the end of one exon and the start of the next connected exon, while the maximum distance is the largest distance at which two exons can be connected (the reserved word "Infinity" disables the maximum restriction, which saves computing time). Depending on the exon types involved, a connection between exons is sometimes actually a connection between genes (e.g. joining a Terminal exon with a First exon means joining the end of one gene with the beginning of another).

#Intronic connections
First+:Internal+ Internal+:Terminal+ 20:25000 block
Terminal-:Internal- First-:Internal- 20:25000 block