Phylogenetics
•Study of how species relate to each other
•“Nothing in biology makes sense, except
in the light of evolution”, Theodosius
Dobzhansky, Am. Biol. Teacher (1973)
•Rich in computational problems
•Fundamental tool in comparative
bioinformatics
Why phylogenetics?
•Study of evolution
–Origin and migration of humans
–Origin and spead of disease
•Many applications in comparative bioinformatics
–Sequence alignment
–Motif detection (phylogenetic motifs, evolutionary trace,
phylogenetic footprinting)
–Correlated mutation (useful for structural contact prediction)
–Protein interaction
–Gene networks
–Vaccine devlopment
–And many more…
Phylogeny Problem
TAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCAT
U V W X Y
U
V W
X
Y
Bipartitions
•Phylogenies are equivalent to bipartitions
Topological differences
Phylogeny Problem
•Two main methodologies:
–Alignment first and phylogeny second
•Construct alignment using one of the MANY alignment programs in
the literature
•Do manual (eye) adjustments if necessary
•Apply a phylogeny reconstruction method
•Fast but biologically not realistic
•Phylogeny is highly dependent on accuracy of alignment (but so is
the alignment on the phylogeny!)
–Simultaneously alignment and phylogeny reconstruction
•Output both an alignment and phylogeny
•Computationally much harder
•Biologically more realistic as insertions, deletions, and mutations
occur during the evolutionary process
First methodology
•Compute alignment (for now we assume we are given an
alignment)
•Construct a phylogeny (two approaches)
•Distance-based methods
–Input: Distance matrix containing pairwise statistical estimation
of aligned sequences
–Output: Phylogenetic tree
–Fast but less accurate
•Character-based methods
–Input: Sequence alignment
–Output: Phylogenetic tree
–Accurate but computationally very hard
Distance-based methods
Evolution on a single edge
•Poisson process
–Number of changes in a fixed time interval t is
independent of changes in any other non-overlapping
time interval u
–Number of changes in time interval t is proportional to
the length of the interval
–No changes in time interval of length 0
•Let X be the number of nucleotide changes on a
single edge. We assume X is a Poisson process
•Probability dictates that
Evolution on a single edge
•We want to compute (the probability of a
nucleotide change on edge e)
•The probability of observing a change is just the
sum of probabilities of observing k changes over
all possible values of k (excluding even ones
because those changes cannot be seen)
•Expected number of nucleotide changes on a
given edge is given by
•Key: is additive
Evolution on a single edge
Additivity
•Assume we have a path of k edges and that p1,
p2,…, pk are the probabilities of change on each
edge of the path
•Using induction we can show that
•Multiplicative term is hard to deal with and does
not easily decompose into a product or sum of p
i’s
Additivity
•But the expected number of nucleotide
changes on the path p is elegant
Evolutionary models
•Simple 0,1 alphabet
evolutionary model
–i.i.d. model
–uniformly random root
sequence
•Jukes-Cantor:
–Uniformly random root
sequence
–i.i.d. model
•General Markov
Model
–Uniformly random root
sequence
–i.i.d. model
–For time reversible
models
Evolutionary models
Variation across sites
•Standard assumption of how sites can
vary is that each site has a multiplicative
scaling factor
•Typically these scaling factors are drawn
from a Gamma distribution (or Gamma
plus invariant)
Special issues
•Molecular clock: the expected number of
changes for a site is proportional to time
•No-common-mechanism model: there is a
random variable for every combination of
edge and site
Evolutionary distance estimation
Estimating evolutionary distances
•For sequences A and B what is the evolutionary
distance under the Jukes-Cantor model?
–ACCTGTGGGTAACCACCC
–ACCTGAGGGATAGGTCCG
•But we don’t know what is
Estimating evolutionary distances
•Assume nucleotide changes are Bernoulli trials
(i.i.d. trials of success or failure)
• is probability of head in n Bernoulli trials (n is
sequence length)
•Compute a maximum likelihood estimate for
ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG
0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1
Estimating evolutionary distance
•We want to find the value of p that maximizes
the probability:
•Set dP/dp to 0 and solve for p to get
Estimating evolutionary distances
• = 5/18
•Continuing in this manner we estimate for
all pairs of sequences in the alignment
•We now have a distance matrix under a
biologically sound evolutionary model
ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG
0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1
Distance methods
Distance methods
•UPGMA: similar to hierarchical clustering
but not additive
•Neighbor-joining: more sophisticated and
additive
•What is additivity?
Additivity
UPGMA
UPGMA is not additive but works for
ultrametric trees. Takes O(n^3) time
A DCB
3
33
1010
A
B
C
D
ABCD
62626
2626
6
3
1.Initialize n clusters where each cluster i contains the
sequence i
2.Find closest pair of clusters i, j, using distances in
matrix D
3.Make them neighbors in the tree by adding new node
(ij), and set distance from (ij) to i and j as Dij/2
4.Update distance matrix D: for all clusters k do the
following (ni and nj are size of clusters i and j
respectively)
5.Delete columns and rows for i and j in D and add new
ones corresponding to cluster (ij) with distances as
computed above
6.Goto step 2 until only one cluster is left
UPGMA
UPGMA
A
B
C
D
ABCD
62626
2626
6
A DCB
3
33
1010
3
UPGMA
Doesn’t work (in general) for non-ultrametric
trees
A D
CB10
3
3 3
3
10
A
B
C
D
ABCD
131926
1219
13
UPGMA
UPGMA constructs incorrect tree here
A
B
C
D
ABCD
131926
1219
13
B DAC
UPGMA
Bipartition (BC,AD) is not in true tree
B DAC
A D
CB10
3
3 3
3
10
True tree UPGMA tree
Neighbor joining
1.Additive and O(n^3) time
2.Initialization: same as UPGMA
3.For each species compute
4.Select i and j for which is
minimum
5.Make them neighbors in the tree by adding
new node (ij), and set distance from (ij) to i
and j as
Neighbor joining
6.Update distance matrix D: for all clusters k do
the following
7.Delete columns and rows for i and j in D and
add new ones corresponding to cluster (ij) with
distances as computed above
8.Go to 3 until two nodes/clusters are left
NJ
NJ constructs the correct tree for additive
matrices
A D
CB10
3
3 3
3
10
A
B
C
D
ABCD
131926
1219
13
Simulation studies
Simulation studies
•The true evolutionary tree is never known in
practice. Simulation allows us to study accuracy
of methods under biologically realistic scenarios
•Mathematics behind the phylogenetics is often
complex and challenging. Simulation allows us
to study algorithms when not possible
theoretically and also examine algorithm
performance under various conditions such as
different evolutionary rates, sequence lengths,
or numbers of taxa
Statistical consistency
•As sequence lengths tend to infinity the distance
estimation improves and eventually leads to the true
additive matrix
•If a method like NJ is then applied we get the true tree.
•In practice, however, we have limited sequence length.
Therefore we want to know how much sequence length a
method requires to achieve low error
Convergence rates
•Can be studied
experimentally or
theoretically
•Theoretical results
offer loose bounds
•Experiments (under
simulation) provide
more realistic bounds
on sequence lengths
Sequence length requirements
Sequence length requirements
Typical performance study
Sequence lengths for NJ
Sequence lengths required to obtain 90% accuracy