SPIn: model Selection in Phylogenetics based on algebraic INvariants

Summary

Misspecification of the evolutionary model, which describes the substitution processes along each edge of a phylogenetic tree, has important implications for the analysis of phylogenetic data. Conventionally, however, the selection of a suitable evolutionary model is based on heuristics or relies on the choice of an approximate input tree. Moreover, there are no established methods that accommodate phylogenetic mixture models, which are appropriate in settings where the data consist of regions with different patterns of evolution (e.g., concatenated genes or codon-position-specific inference). We propose an approach that circumvents these issues by using recent insights on the linear invariants that characterize a model of evolution in phylogenetic mixture models with any number of mixture components.

These invariants are linear constraints among the joint probabilities for the bases in the contemporary species that hold irrespective of the tree topologies appearing in the mixtures.
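To make the idea of a tree-independent linear constraint concrete, the following is a minimal sketch and not the SPIn implementation. It assumes a Jukes-Cantor process with a uniform root distribution on three-leaf star trees, and it assumes the relevant constraints are equalities between pattern probabilities that differ by relabelling the bases at all leaves simultaneously. The function names and the numerical substitution probabilities are hypothetical. The sketch checks that a mixture of two Jukes-Cantor distributions satisfies these equalities, while a distribution generated with distinct transition and transversion probabilities does not.

```python
import itertools

# Base order A, C, G, T; transitions are A<->G and C<->T.
TRANSITION_PAIRS = {(0, 2), (2, 0), (1, 3), (3, 1)}

def substitution_matrix(transition, transversion):
    """4x4 matrix with one probability for transitions and one for transversions.
    Setting both equal gives a Jukes-Cantor matrix."""
    diag = 1.0 - transition - 2.0 * transversion
    return [[diag if i == j else
             (transition if (i, j) in TRANSITION_PAIRS else transversion)
             for j in range(4)] for i in range(4)]

def star_tree_joint(matrices):
    """Joint leaf distribution of a star tree with a uniform root distribution."""
    joint = {}
    for pattern in itertools.product(range(4), repeat=len(matrices)):
        total = 0.0
        for root in range(4):
            term = 0.25
            for matrix, leaf in zip(matrices, pattern):
                term *= matrix[root][leaf]
            total += term
        joint[pattern] = total
    return joint

def mix(weights, joints):
    """Convex combination of joint distributions (a phylogenetic mixture)."""
    return {pat: sum(w * j[pat] for w, j in zip(weights, joints))
            for pat in joints[0]}

def max_violation(joint):
    """Largest |p(pattern) - p(relabelled pattern)| over all base permutations."""
    worst = 0.0
    for perm in itertools.permutations(range(4)):
        for pattern, value in joint.items():
            image = tuple(perm[base] for base in pattern)
            worst = max(worst, abs(value - joint[image]))
    return worst

# Two Jukes-Cantor components with different (hypothetical) edge parameters.
jc_1 = star_tree_joint([substitution_matrix(a, a) for a in (0.05, 0.10, 0.02)])
jc_2 = star_tree_joint([substitution_matrix(a, a) for a in (0.20, 0.01, 0.15)])
jc_mixture = mix([0.3, 0.7], [jc_1, jc_2])

# A component with distinct transition and transversion probabilities.
k80_like = star_tree_joint([substitution_matrix(0.10, 0.02)] * 3)

print("JC mixture violation:", max_violation(jc_mixture))
print("K80-like violation:  ", max_violation(k80_like))
```

The first value is zero up to floating-point rounding, whatever the mixture weights and edge parameters, whereas the second is clearly positive. Checking which family of such constraints the observed pattern frequencies satisfy is the intuition behind selecting a model without fixing a tree.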

Currently supported evolutionary models are the non-homogeneous Kimura 2-parameter (K80*), Kimura 3-parameter (K81*), Jukes-Cantor (JC69*) and the Strand Symmetric Model (SSM).

References:

A. M. Kedzierska, M. Drton, R. Guigo and M. Casanellas, "SPIn: model selection for phylogenetic mixtures via linear invariants", Mol. Biol. Evol., 29(3): 929-937, 2012.

M. Casanellas, J. Fernandez-Sanchez and A. M. Kedzierska, "The space of phylogenetic mixtures of equivariant models", submitted to the special issue of Algorithms for Molecular Biology in Phylogenetics.


Users are encouraged to refer to the accompanying paper for the discussion on the advantages as well as current limitations of the method.

Using SPIn

SPIn takes a FASTA file as input. The current limits are 21 operational taxonomic units and a sequence length of 1 million bases. This release of the software uses the corrected Akaike Information Criterion (AICc) to score the candidate classes of non-homogeneous models; the best-fit model is the one that minimizes the AICc score. In addition, the output reports the weight of support for each model and an upper bound on the number of mixture components, above which the parameters (both continuous and discrete) are not identifiable.
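As an illustration of the scoring step, the sketch below computes AICc scores and the corresponding Akaike weights using the standard textbook formulas. It is not taken from the SPIn source code: how SPIn obtains the log-likelihood, the parameter count and the sample size is described in the accompanying paper, and here the sample size is assumed to be the number of alignment sites. The log-likelihoods and parameter counts in the example are invented purely for illustration.

```python
import math

def aicc(log_likelihood, num_params, sample_size):
    """Corrected Akaike Information Criterion (standard small-sample formula)."""
    if sample_size - num_params - 1 <= 0:
        raise ValueError("AICc requires sample_size > num_params + 1")
    correction = 2.0 * num_params * (num_params + 1) / (sample_size - num_params - 1)
    return 2.0 * num_params - 2.0 * log_likelihood + correction

def akaike_weights(scores):
    """Turn a {model: score} mapping into weights of support that sum to 1."""
    best = min(scores.values())
    relative = {model: math.exp(-(score - best) / 2.0)
                for model, score in scores.items()}
    total = sum(relative.values())
    return {model: value / total for model, value in relative.items()}

# Hypothetical inputs: model -> (log-likelihood, number of free parameters).
# These numbers are made up and are not SPIn results.
candidates = {
    "JC69*": (-5230.4, 3),
    "K80*": (-5201.7, 6),
    "K81*": (-5199.9, 9),
    "SSM": (-5198.8, 19),
}
num_sites = 1000  # assumed sample size: number of alignment columns

scores = {model: aicc(loglik, k, num_sites)
          for model, (loglik, k) in candidates.items()}
weights = akaike_weights(scores)
best_model = min(scores, key=scores.get)

for model in candidates:
    print(f"{model:6s} AICc = {scores[model]:10.2f}  weight = {weights[model]:.4f}")
print("best-fit model:", best_model)
```

Subtracting the minimum score before exponentiating keeps the computation numerically stable when the scores are large, and it does not change the resulting weights.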





genNon-h

Matlab code.

The algorithms used to generate the data for this work were further elaborated and implemented as an efficient and user-friendly C++ package, GenNon-H.

MSA used for performance tests

Universitat Politecnica de Catalunya
Center for Genomic Regulation
Support and feedback: