Plant Molecular Systematics Phylogenetics.ppt

bah292827 44 views 44 slides Aug 12, 2024
Slide 1
Slide 1 of 44
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44

About This Presentation

Plant Molecular Systematics Phylogenetics.ppt


Slide Content

Distance based
phylogenetics
Usman Roshan

Phylogenetics
•Study of how species relate to each other
•“Nothing in biology makes sense, except
in the light of evolution”, Theodosius
Dobzhansky, Am. Biol. Teacher (1973)
•Rich in computational problems
•Fundamental tool in comparative
bioinformatics

Why phylogenetics?
•Study of evolution
–Origin and migration of humans
–Origin and spead of disease
•Many applications in comparative bioinformatics
–Sequence alignment
–Motif detection (phylogenetic motifs, evolutionary trace,
phylogenetic footprinting)
–Correlated mutation (useful for structural contact prediction)
–Protein interaction
–Gene networks
–Vaccine devlopment
–And many more…

Phylogeny Problem
TAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCAT
U V W X Y
U
V W
X
Y

Bipartitions
•Phylogenies are equivalent to bipartitions

Topological differences

Phylogeny Problem
•Two main methodologies:
–Alignment first and phylogeny second
•Construct alignment using one of the MANY alignment programs in
the literature
•Do manual (eye) adjustments if necessary
•Apply a phylogeny reconstruction method
•Fast but biologically not realistic
•Phylogeny is highly dependent on accuracy of alignment (but so is
the alignment on the phylogeny!)
–Simultaneously alignment and phylogeny reconstruction
•Output both an alignment and phylogeny
•Computationally much harder
•Biologically more realistic as insertions, deletions, and mutations
occur during the evolutionary process

First methodology
•Compute alignment (for now we assume we are given an
alignment)
•Construct a phylogeny (two approaches)
•Distance-based methods
–Input: Distance matrix containing pairwise statistical estimation
of aligned sequences
–Output: Phylogenetic tree
–Fast but less accurate
•Character-based methods
–Input: Sequence alignment
–Output: Phylogenetic tree
–Accurate but computationally very hard

Distance-based methods

Evolution on a single edge
•Poisson process
–Number of changes in a fixed time interval t is
independent of changes in any other non-overlapping
time interval u
–Number of changes in time interval t is proportional to
the length of the interval
–No changes in time interval of length 0
•Let X be the number of nucleotide changes on a
single edge. We assume X is a Poisson process
•Probability dictates that

Evolution on a single edge
•We want to compute (the probability of a
nucleotide change on edge e)
•The probability of observing a change is just the
sum of probabilities of observing k changes over
all possible values of k (excluding even ones
because those changes cannot be seen)

•Expected number of nucleotide changes on a
given edge is given by
•Key: is additive
Evolution on a single edge

Additivity
•Assume we have a path of k edges and that p1,
p2,…, pk are the probabilities of change on each
edge of the path
•Using induction we can show that
•Multiplicative term is hard to deal with and does
not easily decompose into a product or sum of p
i’s

Additivity
•But the expected number of nucleotide
changes on the path p is elegant

Evolutionary models
•Simple 0,1 alphabet
evolutionary model
–i.i.d. model
–uniformly random root
sequence
•Jukes-Cantor:
–Uniformly random root
sequence
–i.i.d. model

•General Markov
Model
–Uniformly random root
sequence
–i.i.d. model
–For time reversible
models
Evolutionary models

Variation across sites
•Standard assumption of how sites can
vary is that each site has a multiplicative
scaling factor
•Typically these scaling factors are drawn
from a Gamma distribution (or Gamma
plus invariant)

Special issues
•Molecular clock: the expected number of
changes for a site is proportional to time
•No-common-mechanism model: there is a
random variable for every combination of
edge and site

Evolutionary distance estimation

Estimating evolutionary distances
•For sequences A and B what is the evolutionary
distance under the Jukes-Cantor model?
–ACCTGTGGGTAACCACCC
–ACCTGAGGGATAGGTCCG
•But we don’t know what is

Estimating evolutionary distances
•Assume nucleotide changes are Bernoulli trials
(i.i.d. trials of success or failure)
• is probability of head in n Bernoulli trials (n is
sequence length)
•Compute a maximum likelihood estimate for
ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG
0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1

Estimating evolutionary distance
•We want to find the value of p that maximizes
the probability:
•Set dP/dp to 0 and solve for p to get

Estimating evolutionary distances
• = 5/18
•Continuing in this manner we estimate for
all pairs of sequences in the alignment
•We now have a distance matrix under a
biologically sound evolutionary model
ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG

0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1

Distance methods

Distance methods
•UPGMA: similar to hierarchical clustering
but not additive
•Neighbor-joining: more sophisticated and
additive
•What is additivity?

Additivity

UPGMA
UPGMA is not additive but works for
ultrametric trees. Takes O(n^3) time
A DCB
3
33
1010
A
B
C
D
ABCD
62626
2626
6
3

1.Initialize n clusters where each cluster i contains the
sequence i
2.Find closest pair of clusters i, j, using distances in
matrix D
3.Make them neighbors in the tree by adding new node
(ij), and set distance from (ij) to i and j as Dij/2
4.Update distance matrix D: for all clusters k do the
following (ni and nj are size of clusters i and j
respectively)
5.Delete columns and rows for i and j in D and add new
ones corresponding to cluster (ij) with distances as
computed above
6.Goto step 2 until only one cluster is left
UPGMA

UPGMA
A
B
C
D
ABCD
62626
2626
6
A DCB
3
33
1010
3

UPGMA
Doesn’t work (in general) for non-ultrametric
trees
A D
CB10
3
3 3
3
10
A
B
C
D
ABCD
131926
1219
13

UPGMA
UPGMA constructs incorrect tree here
A
B
C
D
ABCD
131926
1219
13
B DAC

UPGMA
Bipartition (BC,AD) is not in true tree
B DAC
A D
CB10
3
3 3
3
10
True tree UPGMA tree

Neighbor joining
1.Additive and O(n^3) time
2.Initialization: same as UPGMA
3.For each species compute
4.Select i and j for which is
minimum
5.Make them neighbors in the tree by adding
new node (ij), and set distance from (ij) to i
and j as

Neighbor joining
6.Update distance matrix D: for all clusters k do
the following
7.Delete columns and rows for i and j in D and
add new ones corresponding to cluster (ij) with
distances as computed above
8.Go to 3 until two nodes/clusters are left

NJ
NJ constructs the correct tree for additive
matrices
A D
CB10
3
3 3
3
10
A
B
C
D
ABCD
131926
1219
13

Simulation studies

Simulation studies
•The true evolutionary tree is never known in
practice. Simulation allows us to study accuracy
of methods under biologically realistic scenarios
•Mathematics behind the phylogenetics is often
complex and challenging. Simulation allows us
to study algorithms when not possible
theoretically and also examine algorithm
performance under various conditions such as
different evolutionary rates, sequence lengths,
or numbers of taxa

Statistical consistency
•As sequence lengths tend to infinity the distance
estimation improves and eventually leads to the true
additive matrix
•If a method like NJ is then applied we get the true tree.
•In practice, however, we have limited sequence length.
Therefore we want to know how much sequence length a
method requires to achieve low error

Convergence rates
•Can be studied
experimentally or
theoretically
•Theoretical results
offer loose bounds
•Experiments (under
simulation) provide
more realistic bounds
on sequence lengths

Sequence length requirements

Sequence length requirements

Typical performance study

Sequence lengths for NJ
Sequence lengths required to obtain 90% accuracy

Error rate of NJ
Tags