Summary
Misspecification of the evolutionary model, which
describes the substitution processes along each edge of a
phylogenetic tree, has important implications for the analysis of
phylogenetic data. Conventionally, however, the selection of a
suitable evolutionary model is based on heuristics or relies on the
choice of an approximate input tree. Moreover, there are no
established methods that accommodate phylogenetic mixture models,
which are appropriate in settings where the data consist of regions
with different patterns of evolution (e.g., concatenated genes or
codon-position-specific inference). We propose an approach that
circumvents these issues by using recent insights on linear invariants
that characterize a model of evolution in phylogenetic mixture
models with any number of mixture components.
These invariants are linear constraints
among the joint probabilities for the bases in the contemporary
species that hold irrespective of the tree topologies appearing in
the mixtures.
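As a minimal illustration of such constraints (a sketch under stated assumptions, not SPIn's implementation), consider the Jukes-Cantor model with a uniform root distribution. Relabelling the four bases by any permutation then leaves every site-pattern probability unchanged, so each equality p(pattern) = p(permuted pattern) is a linear constraint. The sketch below builds a mixture of two quartet trees with different topologies and checks that the constraints still hold. The function names, edge parameters and mixture weights are hypothetical and chosen only for illustration.

```python
from itertools import permutations, product

BASES = "ACGT"

def jc_matrix(a):
    """Jukes-Cantor transition matrix with off-diagonal probability a."""
    return {x: {y: (a if x != y else 1 - 3 * a) for y in BASES} for x in BASES}

def quartet_distribution(split, edge_params):
    """Joint distribution of the bases at 4 leaves of a quartet tree.

    split: ((i, j), (k, l)), the leaf indices on each side of the internal edge.
    edge_params: 5 off-diagonal probabilities, one per pendant edge
                 (ordered by leaf index) followed by the internal edge.
    """
    (i, j), (k, l) = split
    mats = [jc_matrix(a) for a in edge_params]
    internal = mats[4]
    dist = {}
    for pattern in product(BASES, repeat=4):
        total = 0.0
        # Sum over the states (u, v) of the two internal nodes; uniform root at u.
        for u, v in product(BASES, repeat=2):
            total += (0.25 * internal[u][v]
                      * mats[i][u][pattern[i]] * mats[j][u][pattern[j]]
                      * mats[k][v][pattern[k]] * mats[l][v][pattern[l]])
        dist[pattern] = total
    return dist

def mixture(components, weights):
    """Convex combination of site-pattern distributions."""
    return {p: sum(w * d[p] for w, d in zip(weights, components))
            for p in components[0]}

def max_invariant_violation(dist):
    """Largest |p(pattern) - p(g(pattern))| over all base permutations g."""
    worst = 0.0
    for perm in permutations(BASES):
        g = dict(zip(BASES, perm))
        for pattern, prob in dist.items():
            image = tuple(g[x] for x in pattern)
            worst = max(worst, abs(prob - dist[image]))
    return worst

# Two components on different topologies: 01|23 and 02|13 (hypothetical values).
tree_a = quartet_distribution(((0, 1), (2, 3)), [0.05, 0.10, 0.02, 0.20, 0.08])
tree_b = quartet_distribution(((0, 2), (1, 3)), [0.12, 0.03, 0.07, 0.01, 0.15])
mix = mixture([tree_a, tree_b], [0.3, 0.7])

violation = max_invariant_violation(mix)
print("largest violation of the linear constraints:", violation)
```

Because the constraints are linear, they are preserved under any convex combination of distributions that satisfy them, which is why the check does not depend on the topologies or on the number of mixture components.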
Currently supported evolutionary models are the non-homogeneous Kimura 2-parameter (K80*),
Kimura 3-parameter (K81*), Jukes-Cantor (JC69*)
and Strand Symmetric (SSM) models.
References:
A. M. Kedzierska, M. Drton, R. Guigo and M. Casanellas, "SPIn: model selection for phylogenetic mixtures via linear invariants", Mol. Biol. Evol., 29(3): 929-937, 2012.
M. Casanellas, J. Fernandez-Sanchez and A. M. Kedzierska, "The space of phylogenetic mixtures of equivariant models", submitted to the special issue of Algorithms for Molecular Biology in Phylogenetics.
Users are encouraged to refer to the accompanying paper for the discussion on the advantages as well as current limitations of the method.
Using SPIn
SPIn takes a FASTA file as input. The current limits are 21 operational taxonomic units and a sequence length of 1 million bases. This release of the software uses the corrected Akaike Information Criterion (AICc) to score the candidate non-homogeneous classes of models; the best-fit model is the one that minimizes the AICc score. In addition, the output reports the weight of support for each model and an upper bound on the number of mixture components, above which the parameters (both continuous and discrete) are non-identifiable.
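The scoring step can be sketched with the standard AICc formula, AICc = -2 ln L + 2k + 2k(k + 1) / (n - k - 1), where ln L is the maximized log-likelihood, k the number of free parameters and n the sample size (taken here to be the number of sites, which is an assumption). The weights below are standard Akaike weights; whether SPIn's reported weights of support are computed exactly this way is also an assumption. The log-likelihoods and parameter counts are made-up values, not SPIn output.

```python
import math

def aicc(log_likelihood, n_params, n_sites):
    """Corrected Akaike Information Criterion for one fitted model."""
    if n_sites - n_params - 1 <= 0:
        raise ValueError("AICc requires n_sites > n_params + 1")
    aic = 2 * n_params - 2 * log_likelihood
    correction = (2 * n_params * (n_params + 1)) / (n_sites - n_params - 1)
    return aic + correction

def akaike_weights(scores):
    """Turn AICc scores into normalized weights (a lower score gives a higher weight)."""
    best = min(scores.values())
    relative = {m: math.exp(-0.5 * (s - best)) for m, s in scores.items()}
    total = sum(relative.values())
    return {m: r / total for m, r in relative.items()}

# Hypothetical fits: model -> (log-likelihood, number of free parameters).
fits = {
    "JC69*": (-5230.4, 5),
    "K80*": (-5201.7, 10),
    "K81*": (-5199.9, 15),
}
n_sites = 1000  # hypothetical alignment length

scores = {model: aicc(ll, k, n_sites) for model, (ll, k) in fits.items()}
weights = akaike_weights(scores)
best_model = min(scores, key=scores.get)

for model in fits:
    print(f"{model}: AICc = {scores[model]:.2f}, weight = {weights[model]:.4f}")
print("best-fit model:", best_model)
```

Subtracting the best score before exponentiating keeps the weights numerically stable when the AICc values are large.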
genNon-h
Matlab code.
The algorithms used to generate the data for this work were further elaborated and implemented as an efficient and user-friendly C++ package, GenNon-H.
MSA used for performance tests
Universitat Politecnica de Catalunya
Center for Genomic Regulation
Support and feedback: