(1)
This is the general structure of geneid parameter files: an
information model to predict genomic elements according to the G+C content
of every sequence fragment, and a gene model to assemble the predictions
following a series of rules. |
#1. ISOCHORE DEPENDENT INFORMATION
number_of_isochores
3
# SET OF PARAMETERS FOR ISOCHORE 1
boundaries_of_isochore
0 40
...
# SET OF PARAMETERS FOR ISOCHORE 2
boundaries_of_isochore
40 70
...
# SET OF PARAMETERS FOR ISOCHORE 3
boundaries_of_isochore
70 100
...
#2. GENE ASSEMBLING RULES (gene model)
...
|
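As an illustration of how the isochore boundaries above could be used, the
following is a minimal Python sketch that picks the parameter set matching
the G+C content of a fragment. The function names are hypothetical, and the
treatment of a value falling exactly on a boundary (assigned to the higher
isochore) is an assumption, not something stated in the parameter file.

```python
def gc_percent(seq):
    """Percentage of G and C among the A/C/G/T letters of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def select_isochore(seq, boundaries):
    """Return the 1-based isochore whose boundaries contain the G+C content.

    Assumption: the lower bound is inclusive and the upper bound exclusive,
    except for the last isochore, whose upper bound is inclusive.
    """
    gc = gc_percent(seq)
    last = len(boundaries)
    for index, (low, high) in enumerate(boundaries, start=1):
        if low <= gc < high or (index == last and gc == high):
            return index
    raise ValueError("G+C content %.1f is outside every isochore" % gc)


# Boundaries taken from the example parameter file above.
BOUNDARIES = [(0, 40), (40, 70), (70, 100)]
print(select_isochore("ATATATATGC", BOUNDARIES))
```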
(2)
This is the general structure of every isochore model: cutoff values to
filter exons after scoring, profiles to detect genomic signals, Markov
model matrices to measure coding potential, and the maximum number
of exons starting with the same signal.
|
# SET OF PARAMETERS FOR ISOCHORE 1
boundaries_of_isochore
0 40
#1. SET OF PARAMETERS to filter exons
...
#2. PROFILES TO PREDICT SIGNALS
...
#3. MARKOV MODEL TO ESTIMATE CODING POTENTIAL
...
#4. Maximum amount of exons with every left signal
maximum_number_of_donors_per_acceptor_site
5
|
(3)
Parameters for filtering exons according to their scores: a cutoff on the
final exon score, a cutoff on the protein coding potential score, the weight
of the coding potential score relative to the signal scores, and the
exon weight parameter (a value added to every exon that has passed all of
the previous filters; it acts as a penalty to avoid predicting long genes
with too many exons, a side effect of the additive scoring scheme used in
the gene assembling model). There are always 4 values, one for each exon
type: First, Internal, Terminal and Single.
|
# SET OF PARAMETERS to filter exons
Total_score_cutoff
-15 -15 -15 -15
Coding_potential_score_cutoff
-10 -15 -15 -15
Weight_of_coding_potential_score_in_exons
0.4 0.4 0.4 0.4
Exon_weight_values
-7 -7 -7 -7
|
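The block above can be read into a lookup table with a few lines of Python.
This is only a sketch: parse_exon_filters is a hypothetical helper, and it
assumes that the four values on each line follow the order in which the
exon types are listed in the description (First, Internal, Terminal, Single).

```python
EXON_TYPES = ("First", "Internal", "Terminal", "Single")

SAMPLE = """\
# SET OF PARAMETERS to filter exons
Total_score_cutoff
-15 -15 -15 -15
Coding_potential_score_cutoff
-10 -15 -15 -15
Weight_of_coding_potential_score_in_exons
0.4 0.4 0.4 0.4
Exon_weight_values
-7 -7 -7 -7
"""


def parse_exon_filters(text):
    """Map each parameter name to a {exon type: value} dictionary."""
    lines = [line.strip() for line in text.splitlines()]
    # Drop blank lines and comment lines (those starting with '#').
    lines = [line for line in lines if line and not line.startswith("#")]
    params = {}
    # Names and values alternate: name on one line, 4 numbers on the next.
    for name, values in zip(lines[0::2], lines[1::2]):
        numbers = [float(value) for value in values.split()]
        if len(numbers) != len(EXON_TYPES):
            raise ValueError("expected 4 values for %s" % name)
        params[name] = dict(zip(EXON_TYPES, numbers))
    return params


filters = parse_exon_filters(SAMPLE)
print(filters["Coding_potential_score_cutoff"]["First"])
```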
(4) This is a generic profile to predict genomic signals: start / stop
codons or acceptor / donor splice sites. This type of profile is called a
Position Weight Array: a matrix addressed by (nucleotide, position) in which
every cell contains a log-likelihood ratio between a Markov model (order k)
recognizing true sites and another one matching false ones. Therefore,
for every nucleotide in a candidate region, the profile scores how likely it
is to find the oligonucleotide (length k) preceding that nucleotide when the
region contains a true site rather than a false one. A PWA is characterized
by 4 parameters:
- length of profile (region to be scanned)
- offset: distance from the beginning of the profile to the characteristic
or core element for that signal (e.g. ATG in start codons)
- cutoff: score to filter false signals
- order (Markov chains): length of oligonucleotides used to score every
element in the candidate region
|
Start profile
20 14 -6 2
# Position Weight Array
1 AAA -0.230297
1 AAC 0.519562
1 AAG 0.301505
1 AAT -0.519705
1 ACA -0.688891
1 ACC 0.483538
...
20 TGT -0.483268
20 TTA -0.545436
20 TTC 0.557414
20 TTG -0.0335678
20 TTT -0.172498
# e.g. Given S=ATGAGC then
# score(GAGC) = score(1,ATG) + score(2,TGA) + score(3,GAG) + score(4,AGC)
|
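The scoring rule in the closing comment of the example can be sketched in
Python as a sum over overlapping windows of order+1 letters. The function
name and the toy profile values below are made up for illustration; only
the (position, oligonucleotide) addressing comes from the example above.

```python
def score_site(seq, profile, order):
    """Sum the profile entries of every window of order+1 letters in seq.

    profile maps (position, oligonucleotide) to a log-likelihood ratio,
    with positions numbered from 1 as in the parameter file.
    """
    width = order + 1
    total = 0.0
    for start in range(len(seq) - width + 1):
        position = start + 1
        total += profile[(position, seq[start:start + width])]
    return total


# Toy values (NOT taken from a real parameter file).
toy_profile = {
    (1, "ATG"): 0.5,
    (2, "TGA"): -0.25,
    (3, "GAG"): 1.0,
    (4, "AGC"): 0.25,
}

# Order 2: S=ATGAGC is scored as (1,ATG) + (2,TGA) + (3,GAG) + (4,AGC).
print(score_site("ATGAGC", toy_profile, order=2))
```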
(5) Model to measure the protein coding potential of genomic regions.
A Markov model (chains of order 5) is usually used, with 2 types of
matrices: initial (for scoring the first pentanucleotide of an exon) and
transition (for scoring the rest of the exon, taking 5+1 nucleotides at a
time until reaching the end of the exon). The initial matrix is computed as
the ratio between the frequency of each pentanucleotide in real exons and
its frequency in intronic regions. The transition matrix is computed as
the ratio between the probability of finding a pentanucleotide X in
codon position CP (translation) before a given nucleotide in a real exon and
the same probability in a false one (intronic region).
|
# Markov model (log likelihood ratio)
Markov_model_order
5
Markov_initial_probability_matrix
AAAAA 0 0 -0.85727
AAAAA 0 1 -1.48328
AAAAA 0 2 -1.72858
AAAAC 1 0 -0.377093
AAAAC 1 1 -0.202228
AAAAC 1 2 -0.961698
...
Markov_transition_probability_matrix
AAAAAA 0 0 -1.10797
AAAAAA 0 1 -0.736771
AAAAAA 0 2 -0.570196
AAAAAC 1 0 0.961644
AAAAAC 1 1 0.516564
AAAAAC 1 2 0.646909
|
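A rough Python sketch of how the two matrices could be combined into one
coding potential score follows. It assumes that each row holds the
oligonucleotide, an index, the codon position and the log-likelihood ratio,
and that the codon position advances by one (modulo 3) with every
nucleotide. Both points, and the function names, are assumptions made for
illustration and may not match the convention geneid actually uses.

```python
def parse_markov_rows(rows):
    """Map (oligonucleotide, codon position) to its log-likelihood ratio."""
    table = {}
    for row in rows:
        oligo, _index, position, score = row.split()
        table[(oligo, int(position))] = float(score)
    return table


def coding_potential(seq, initial, transition, order=5, frame=0):
    """Score seq: initial oligonucleotide plus one transition per nucleotide."""
    score = initial[(seq[:order], frame)]
    for i in range(len(seq) - order):
        window = seq[i:i + order + 1]  # order letters of context + 1 new
        score += transition[(window, (frame + i) % 3)]
    return score


# Rows copied from the example above.
initial = parse_markov_rows([
    "AAAAA 0 0 -0.85727",
    "AAAAA 0 1 -1.48328",
    "AAAAA 0 2 -1.72858",
])
transition = parse_markov_rows([
    "AAAAAA 0 0 -1.10797",
    "AAAAAA 0 1 -0.736771",
    "AAAAAA 0 2 -0.570196",
    "AAAAAC 1 0 0.961644",
    "AAAAAC 1 1 0.516564",
    "AAAAAC 1 2 0.646909",
])

# AAAAA (position 0) + AAAAAA (position 0) + AAAAAC (position 1)
print(coding_potential("AAAAAAC", initial, transition))
```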
(6) Gene model rules. Rules defining which exon types are allowed
to join to others, subject to minimum and maximum distance requirements.
The features in the first column of a rule are assembled before the features
in the second one, respecting the distances (third column). The word
"block" is used to preserve connections between exons sharing the same
identifier (group), such as annotated genes (evidence), without mixing them
with ab initio predictions. The minimum distance is the smallest allowed
distance between the end of one exon and the start of the next, while the
maximum distance is the largest distance allowed to connect two exons (the
reserved word "Infinity" disables the maximum restriction, which saves
computing time). Connections between exons, depending on their type,
sometimes link two different genes (e.g. connecting a Terminal exon with a
First exon means joining the end of a gene with the beginning of another one).
|
#Intronic connections
First+:Internal+ Internal+:Terminal+ 20:25000 block
Terminal-:Internal- First-:Internal- 20:25000 block
|
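To make the layout of a rule concrete, here is a small Python sketch that
splits one rule line into its columns. parse_rule and the dictionary keys
are hypothetical names, and the second sample line (with "Infinity" and
without "block") is invented for illustration rather than copied from a
real parameter file.

```python
import math


def parse_rule(line):
    """Split a gene model rule into features, distances and block flag."""
    fields = line.split()
    first, second, distances = fields[:3]
    low, high = distances.split(":")
    return {
        "first_column": first.split(":"),    # features assembled first
        "second_column": second.split(":"),  # features assembled after them
        "min_distance": int(low),
        # "Infinity" disables the maximum distance restriction.
        "max_distance": math.inf if high == "Infinity" else int(high),
        "block": "block" in fields[3:],
    }


rule = parse_rule("First+:Internal+ Internal+:Terminal+ 20:25000 block")
print(rule["first_column"], rule["min_distance"], rule["max_distance"])

# Hypothetical rule joining the end of one gene to the start of the next.
intergenic = parse_rule("Terminal+ First+ 500:Infinity")
print(intergenic["max_distance"], intergenic["block"])
```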