Plant Molecular Systematics Phylogenetics.ppt

Distance based
phylogenetics
Usman Roshan

Phylogenetics
•Study of how species relate to each other
•“Nothing in biology makes sense, except
in the light of evolution”, Theodosius
Dobzhansky, Am. Biol. Teacher (1973)
•Rich in computational problems
•Fundamental tool in comparative
bioinformatics

Why phylogenetics?
•Study of evolution
–Origin and migration of humans
–Origin and spead of disease
•Many applications in comparative bioinformatics
–Sequence alignment
–Motif detection (phylogenetic motifs, evolutionary trace,
phylogenetic footprinting)
–Correlated mutation (useful for structural contact prediction)
–Protein interaction
–Gene networks
–Vaccine devlopment
–And many more…

Phylogeny Problem
TAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCAT
U V W X Y
U
V W
X
Y

Bipartitions
•Phylogenies are equivalent to bipartitions

Topological differences

Phylogeny Problem
•Two main methodologies:
–Alignment first and phylogeny second
•Construct alignment using one of the MANY alignment programs in
the literature
•Do manual (eye) adjustments if necessary
•Apply a phylogeny reconstruction method
•Fast but biologically not realistic
•Phylogeny is highly dependent on accuracy of alignment (but so is
the alignment on the phylogeny!)
–Simultaneously alignment and phylogeny reconstruction
•Output both an alignment and phylogeny
•Computationally much harder
•Biologically more realistic as insertions, deletions, and mutations
occur during the evolutionary process

First methodology
•Compute alignment (for now we assume we are given an
alignment)
•Construct a phylogeny (two approaches)
•Distance-based methods
–Input: Distance matrix containing pairwise statistical estimation
of aligned sequences
–Output: Phylogenetic tree
–Fast but less accurate
•Character-based methods
–Input: Sequence alignment
–Output: Phylogenetic tree
–Accurate but computationally very hard

Distance-based methods

Evolution on a single edge
•Poisson process
–Number of changes in a fixed time interval t is
independent of changes in any other non-overlapping
time interval u
–Number of changes in time interval t is proportional to
the length of the interval
–No changes in time interval of length 0
•Let X be the number of nucleotide changes on a
single edge. We assume X is a Poisson process
•Probability dictates that

Evolution on a single edge
•We want to compute (the probability of a
nucleotide change on edge e)
•The probability of observing a change is just the
sum of probabilities of observing k changes over
all possible values of k (excluding even ones
because those changes cannot be seen)

•Expected number of nucleotide changes on a
given edge is given by
•Key: is additive
Evolution on a single edge

Additivity
•Assume we have a path of k edges and that p1,
p2,…, pk are the probabilities of change on each
edge of the path
•Using induction we can show that
•Multiplicative term is hard to deal with and does
not easily decompose into a product or sum of p
i’s

Additivity
•But the expected number of nucleotide
changes on the path p is elegant

Evolutionary models
•Simple 0,1 alphabet
evolutionary model
–i.i.d. model
–uniformly random root
sequence
•Jukes-Cantor:
–Uniformly random root
sequence
–i.i.d. model

•General Markov
Model
–Uniformly random root
sequence
–i.i.d. model
–For time reversible
models
Evolutionary models

Variation across sites
•Standard assumption of how sites can
vary is that each site has a multiplicative
scaling factor
•Typically these scaling factors are drawn
from a Gamma distribution (or Gamma
plus invariant)

Special issues
•Molecular clock: the expected number of
changes for a site is proportional to time
•No-common-mechanism model: there is a
random variable for every combination of
edge and site

Evolutionary distance estimation

Estimating evolutionary distances
•For sequences A and B what is the evolutionary
distance under the Jukes-Cantor model?
–ACCTGTGGGTAACCACCC
–ACCTGAGGGATAGGTCCG
•But we don’t know what is

Estimating evolutionary distances
•Assume nucleotide changes are Bernoulli trials
(i.i.d. trials of success or failure)
• is probability of head in n Bernoulli trials (n is
sequence length)
•Compute a maximum likelihood estimate for
ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG
0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1

Estimating evolutionary distance
•We want to find the value of p that maximizes
the probability:
•Set dP/dp to 0 and solve for p to get

Estimating evolutionary distances
• = 5/18
•Continuing in this manner we estimate for
all pairs of sequences in the alignment
•We now have a distance matrix under a
biologically sound evolutionary model
ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG

0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1

Distance methods

Distance methods
•UPGMA: similar to hierarchical clustering
but not additive
•Neighbor-joining: more sophisticated and
additive
•What is additivity?

Additivity

UPGMA
UPGMA is not additive but works for
ultrametric trees. Takes O(n^3) time
A DCB
3
33
1010
A
B
C
D
ABCD
62626
2626
6
3

1.Initialize n clusters where each cluster i contains the
sequence i
2.Find closest pair of clusters i, j, using distances in
matrix D
3.Make them neighbors in the tree by adding new node
(ij), and set distance from (ij) to i and j as Dij/2
4.Update distance matrix D: for all clusters k do the
following (ni and nj are size of clusters i and j
respectively)
5.Delete columns and rows for i and j in D and add new
ones corresponding to cluster (ij) with distances as
computed above
6.Goto step 2 until only one cluster is left
UPGMA

UPGMA
A
B
C
D
ABCD
62626
2626
6
A DCB
3
33
1010
3

UPGMA
Doesn’t work (in general) for non-ultrametric
trees
A D
CB10
3
3 3
3
10
A
B
C
D
ABCD
131926
1219
13

UPGMA
UPGMA constructs incorrect tree here
A
B
C
D
ABCD
131926
1219
13
B DAC

UPGMA
Bipartition (BC,AD) is not in true tree
B DAC
A D
CB10
3
3 3
3
10
True tree UPGMA tree

Neighbor joining
1.Additive and O(n^3) time
2.Initialization: same as UPGMA
3.For each species compute
4.Select i and j for which is
minimum
5.Make them neighbors in the tree by adding
new node (ij), and set distance from (ij) to i
and j as

Neighbor joining
6.Update distance matrix D: for all clusters k do
the following
7.Delete columns and rows for i and j in D and
add new ones corresponding to cluster (ij) with
distances as computed above
8.Go to 3 until two nodes/clusters are left

NJ
NJ constructs the correct tree for additive
matrices
A D
CB10
3
3 3
3
10
A
B
C
D
ABCD
131926
1219
13

Simulation studies

Simulation studies
•The true evolutionary tree is never known in
practice. Simulation allows us to study accuracy
of methods under biologically realistic scenarios
•Mathematics behind the phylogenetics is often
complex and challenging. Simulation allows us
to study algorithms when not possible
theoretically and also examine algorithm
performance under various conditions such as
different evolutionary rates, sequence lengths,
or numbers of taxa

Statistical consistency
•As sequence lengths tend to infinity the distance
estimation improves and eventually leads to the true
additive matrix
•If a method like NJ is then applied we get the true tree.
•In practice, however, we have limited sequence length.
Therefore we want to know how much sequence length a
method requires to achieve low error

Convergence rates
•Can be studied
experimentally or
theoretically
•Theoretical results
offer loose bounds
•Experiments (under
simulation) provide
more realistic bounds
on sequence lengths

Sequence length requirements

Typical performance study

Sequence lengths for NJ
Sequence lengths required to obtain 90% accuracy

Error rate of NJ

Plant Molecular Systematics Phylogenetics.ppt

About This Presentation

Slide Content

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

Plant Molecular Systematics Phylogenetics.ppt

About This Presentation

Slide Content

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

Slide 24

Slide 25

Slide 26

Slide 27

Slide 28

Slide 29

Slide 30

Slide 31

Slide 32

Slide 33

Slide 34

Slide 35

Slide 36

Slide 37

Slide 38

Slide 39

Slide 40

Slide 41

Slide 42

Slide 43

Slide 44

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

8-top-ai-courses-for-customer-support-representatives-in-2025.pptx

7-essential-ai-courses-for-call-center-supervisors-in-2025.pptx

25-essential-ai-courses-for-user-support-specialists-in-2025.pptx

8-essential-ai-courses-for-insurance-customer-service-representatives-in-2025.pptx

Know for Certain

PPT OPD LES 3ertt4t4tqqqe23e3e3rq2qq232.pptx