combinatorial optimization in biology .ppt

SuperBugs1 8 views 83 slides Aug 09, 2024
Slide 1
Slide 1 of 83
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49
Slide 50
50
Slide 51
51
Slide 52
52
Slide 53
53
Slide 54
54
Slide 55
55
Slide 56
56
Slide 57
57
Slide 58
58
Slide 59
59
Slide 60
60
Slide 61
61
Slide 62
62
Slide 63
63
Slide 64
64
Slide 65
65
Slide 66
66
Slide 67
67
Slide 68
68
Slide 69
69
Slide 70
70
Slide 71
71
Slide 72
72
Slide 73
73
Slide 74
74
Slide 75
75
Slide 76
76
Slide 77
77
Slide 78
78
Slide 79
79
Slide 80
80
Slide 81
81
Slide 82
82
Slide 83
83

About This Presentation

CO in Biology.


Slide Content

Combinatorial Optimization in
Computational Biology
Dan Gusfield
Computer Science, UC Davis

Some Applications of
optimization methods
•Dynamic Programming in sequence alignment
•TSP and Euler paths in DNA sequencing
•Integer Programming in Haplotype inference
•Graphic Matroid Recognition in Haplotype
inference
•Integer Programming for Virus Identification
•Weighted matching in genetic partition analysis

Apology
This talk focuses (mostly) on my own work, for simplicity of
preparation. But, many other wonderful examples exist
in the literature.
At present, there is no coherent story about the ways that
combinatorial optimization arises in Computational
Biology. Must be alert for meaningful applications.

Topic 1: Sequence Alignment
Intended to show the similarity of two sequences (strings).
Definition: An alignment of two strings S and S’ is obtained
by inserting spaces into, or at the ends of, the two strings, so
that the resulting strings (including the spaces) have the same
lengths.
A -- T C -- A
-- C T C A A

Matches, Mismatches and Indels
Two aligned, identical characters in an alignment are a match.
Two aligned, unequal characters are a mismatch.
A character aligned with a space, represents an indel
(insertion/deletion).
A A C T A C T -- C C T A A C A C T -- --
-- -- C T C C T A C C T -- -- T A C T T T
10 matches, 2 mismatches, 7 indels
the bars show mismatches

Basic Algorithmic Problem
Biologically, under the point mutation model, a mismatch
represents a mutation, while an indel represents a historical
insertion or deletion of a single character.
Find an alignment of the two strings that
Maximizes the # matches - # mismatches - # spaces
in the alignment of strings S and S’.
That maximum number defines the similarity of the two
strings.
Also called Optimal Global Alignment

Solution By Dynamic
Programming
Def: Let V(I, J) be the Maximimum Value obtainable in
any alignment of the first I characters of S and the first
J characters of S’.
If S has N characters and S’ has M characters, then
we want to compute V(N, M), and the associated
alignment.
Let M(I, J) = -1 if S(I) mismatches S’(J) and
M(I, J) = 1 if S(I) matches S’(J)

DP Solution

V(I,0)=−I,
V(0,J)=−J,
V(I,J)=Max
V(I−1,J−1)+M(I,J)
V(I−1,J)−1
V(I,J−1)−1










The DP recurrences are:
They can be evaluated in O(NM) operations.
I and J aligned
I opposite a space
J opposite a space
General Cases
Base Cases

Many Variations on the Theme
The most important alignment variant is Optimal Local Alignment:
Given S and S’ find a substring A of S, and a substring B of S’
such that the value of the Global Alignment of A and B is
maximized over all pairs of substrings of S and S’.
The point is to find a pair of substrings of S and S’ which are
highly similar to each other, even if S and S’ are not. Many
biological reasons for this: DNA problems, Protein Domains etc.
Naively, we would need to do O((NM)**2) global alignments,
for a time bound of O((NM)**3) operations overall.

DP solution to Local Alignment

V(I,0)=V(0,J)=0
V(I,J)=MAX
V(I−1,J−1)+M(I,J)
V(I−1,J)−1
V(I,J−1)−1
0












But the optimal value is no longer V(N, M) as in global alignment.
Instead the optimal Local Alignment Value is the largest V
value computed over all the NM values. The time is
again O(NM). This is the Smith-Waterman Local Alignment
Algorithm.

Adding Operation Weights
Find an alignment of the two strings that
Maximizes
Vm x (# matches) - Cm x (# mismatches) - Cs x (# spaces)
in the alignment of strings S and S’.
The modification to the DP is immediate, but now the
modeling question is what Vm, Cm and Cs should be.
Those values greatly determine the biological utility of
the alignment.
This leads to Parametric Alignment: Find the optimal alignment
as a function of unknown values of Vm, Cm, and Cs

Parametric Sequence Alignment
As a function of two selected weight parameters Vm, Cm, Cs, Cg,
decompose the two dimensional parameter space into convex
polygons, so that in each polygon, an alignment inside the polygon
is optimal for any point in the polygon, and nowhere else.
The time for this decomposition is O(NM) per polygon.
In the case of global alignment, the number of polygons is bounded
by 0.88(N^(2/3)), where N < M.
package XPARAL to find the polygons is available on the web.
Gusfield, Naor, Stelling in several papers (see the web)

Adding Gaps to the Model
A Gap in an alignment is a contiguous run of spaces in one sequence.
Example: A T T -- -- -- C C --
-- -- T C T G C C C
has six spaces in three gaps
Now the objective function for alignment is Maximize
Vm x (#matches) - Cm x (#mis) - Cs x (#spaces) - Cg x (#gaps)
This is the affine gap model, the dominant model for studying
protein sequence similarity. The choice of Cs and Cg are critical for the
biological utility. The DP in this case can also be solved in O(NM) time.

Topic 2: Sequencing By
Hybridization
The technology allows you to determine all the K-tuples,
for a fixed K, that are contained in a DNA string S.
The goal is to try to determine the string S from the given
K-tuples.
One can cast this problem as a Hamilton Path problem on
a graph, but it is more productive to view it as an Euler
Path Problem.
This approach was introduced by Pavel Pavzner.
A proposed approach to sequencing DNA

I will focus on the results in
Graph Traversals, Genes and Matroids: An Efficient Special Case
of the Traveling Salesman Problem
D. Gusfield, R. Karp, L. Wang, P. Stelling
Discrete Applied Math 88 special issue on
Computational Molecular Biology, 1998, p. 167-180.

Sequencing by Hybridization
Example: S =
ACTTAAACGCGCACAA
K = 3
So we learn that ACT, CTT etc. are in S, but we learn these
in no predictable order.

The Hamilton Path View
Each node represents a K-tuple, and a directed edge from one
node to another if the second K-tuple can be obtained from the
first by deleting the left character, and adding a right character:
i.e, a shift and add.
ACG
CGC
add a C on the
left
Then a Hamilton Path in this graph generates a sequence S’
that has the same K-tuples as S

The Euler Path View
The Euler Path view is more productive, leading to more
efficient algorithms both for the base problem, and for
variants of the problem.
Due to P. Pevzner

AA AC CA
TA
TT
CT CG
GC
A
A
A
T
T
C
A
A
C
A
G
C
Observed triples from S: AAA, AAC, ACA, CAA, CAC, ACG, CGC,
GCA, GCC, ACT, CTT, TTA, TAA
Graph G: One node for every k-1 tuple, and one
edge for every observed k-tuple

Any Euler path in G produces a string which has the same
K-tuples as the original string S. So if there is only one
Euler path in G, then you have reconstructed S.
The problem is that typically there is more than one Euler
path. So how to choose one Euler path as a most promising
one? That is, one that is most likely to correspond to S?
Frequently there is additional information about the K-tuples,
for example the rough distances between K-tuples in S. So,
we can give a value to each pair of successive edges in G,
which roughly reflects the probability that the two edges
would appear successively in the Euler Path for S. Another
way to state this is that we evaluate the goodness of S’
according to the (k+1)-tuples that it contains.

Back to the Hamilton View
Now we are back to the Hamilton Path view, where each
node is a K-tuple of S, but now each
edge has a distance reflecting the goodness of any (K+1)-tuple
that would be generated by a Hamilton path, and we want the
Maximum distance Hamilton Path in the graph.
This sounds hard, but there is structure in this problem: the
graphs are not arbitrary.

The structure of the Hamilton
Graph
The Hamilton Graph for the data, where nodes represent
K-tuples in S, and edges represent (K+1)-tuples in a
solution string S’, is the line di-graph of the Euler Graph
for the data, where nodes represent (K-1) tuples, and
edges represent K-tuples.

How Euler Relates to Hamilton

The Hamilton graph is the
line di-graph L(G) of the Euler Graph G.
Each red edge has the value of the successive pair of black edges.
Each edge of G becomes a node in L(G), and each edge in L(G)
represents a successive pair of edges in G.
So the problem now is to find a Maximum Value TSPath.

In the case that each node has only two in and two out edges
(which happens when using a simplified DNA alphabet),
this TSP problem has a polynomial time solution.
In fact, if L(G) is any line digraph where every node has
in and out degrees at most 2, then even for arbitrary edge
distances, the TSP problem can be solved in O(n log n) time.
Moreover, if every edge value is distinct, then the TSP solution
is unique.
The Hamilton paths for such 2-in, 2-out-degree graphs have
a matroid structure that reveals many features of the TSP solutions.
When degrees are greater than 2, we branch-and-bound down to
cases of degree 2, which we then can solve optimally.

Approximations
When the in and out-degrees are bounded by X, a greedy
algorithm is guaranteed to find a Hamilton path whose
value is within 1/X that of the optimal. So 1/3 or 1/4
for the case of DNA.
Use of this approximation can speed up the branch-and-
bound until the in and out-degree bound of 2 is reached.

Topic 3: Combinatorial
Algorithms for Haplotype
Inference
True Parsimony and Perfect
Phylogeny
Dan Gusfield

Genotypes and Haplotypes
Each individual has two “copies” of each
chromosome.
At each site, each chromosome has one of two
alleles (states) denoted by 0 and 1 (motivated by
SNPs)
0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0
2 1 2 1 0 0 1 2 0
Two haplotypes per individual
Genotype for the individual
Merge the haplotypes

Haplotyping Problem
•Biological Problem: For disease association studies,
haplotype data is more valuable than genotype data, but
haplotype data is hard to collect. Genotype data is easy to
collect.
•Computational Problem: Given a set of n genotypes,
determine the original set of n haplotype pairs that
generated the n genotypes. This is hopeless without a
genetic model.

We consider two models in this
talk
•Pure (True) Parsimony Model, Gusfield
unpublished (Note added 3/18/03 technical
report is available on the web, and will
appear in the 2003 CPM conference)
•Perfect Phylogeny Model, Gusfield
RECOMB Proceedings April 2002

Topic 3a: The Pure Parsimony
Objective
For a set of genotypes, find a Smallest set H of
haplotypes, such that each genotype can be
explained by a pair of haplotypes in H.

Example of Parsimony
02120
00100
01110
22110 01110
10110
20120 00100
10110
3 distinct haplotypes
set S has size 3

Pure Parsimony is NP-hard
Earl Hubbel (Affymetrix) showed that Pure Parsimony
is NP-hard.
However, for a range of parameters of current interest
(number of sites and genotypes) a Pure Parsimony
solution can be computed efficiently, using Integer
Linear Programming, and one speed-up trick.
For larger parameters (100 sites and 50 genotypes)
A near-parsimony solution can be found efficiently.

The Conceptual Integer
Programming Formulation
For each genotype (individual) j, create one integer
programming variable Yij for each pair of haplotypes
whose merge creates genotype j. If j has k 2’s, then
This creates 2^(k-1) Y variables.
Create one integer programming variable Xq for
Each distinct haplotype q that appears in one of the
pairs for a Y variable.

Conceptual IP
For each genotype, create an equality that says that
exactly one of its Y variables must be set to 1.
For each variable Yij, whose two haplotypes are
given variables Xq and Xq’, include an inequality
that says that if variable Yij is set to 1, then both
variables Xq and Xq’ must be set to 1.
Then the objective function is to Minimize the
sum of the X variables.

Example
02120 Creates a Y variable Y1 for pair 00100 X1
01110 X2
and a Y variable Y2 for pair 01100 X3
00110 X4
Y1 + Y2 = 1
Y1 - X1 <= 0
Y1 - X2 <= 0
Y2 - X3 <= 0
Y2 - X4 <= 0
Include the following (in)equalities into the IP
The objective function will
include the subexpression
X1 + X2 + X3 + X4
But any X variable is included
exactly once no matter how many
Y variables it is associated with.

Efficiency Trick
Ignore any Y variable and its two X variables if those X
variables are associated with no other Y variable. The
Resulting IP is much smaller, and can be used to find
the optimal to the conceptual IP.
Also, we need not first enumerate all X pairs for a given
genotype, but can efficiently recognize the pairs we
need. Another simple trick.

How Fast? How Good?
Depends on the level of recombination in the underlying
data. Pure Parsimony can be computed in seconds to
minutes for most cases with 50 genotypes and up to 60
sites, faster as the level of recombination increases.
As the level of recombination increases, the accuracy
of the Pure Parsimony Solution falls, but remains within
10% of the quality of PHASE (for comparison).

Topic 3b: Haplotyping via
Perfect Phylogeny
Conceptual Framework and Efficient
(almost linear-time) Solutions
Dan Gusfield
U.C. Davis

The Perfect Phylogeny Model
We assume that the evolution of extant haplotypes can be
displayed on a rooted, directed tree, with the all-0
haplotype at the root, where each site
changes from 0 to 1 on exactly one edge, and each extant
haplotype is created by accumulating the changes on a
path from the root to a leaf, where that haplotype is
displayed.
In other words, the extant haplotypes evolved along a
perfect phylogeny with all-0 root.

The Perfect Phylogeny Model
00000
1
2
4
3
5
10100
10000
01011
00010
01010
12345sites
Ancestral haplotype
Extant haplotypes at the leaves
Site mutations on edges

Justification for Perfect
Phylogeny Model
•In the absence of recombination each haplotype of
any individual has a single parent, so tracing back
the history of the haplotypes in a population gives
a tree.
•Recent strong evidence for long regions of DNA
with no recombination. Key to the NIH haplotype
mapping project. (See NYT October 30, 2002)
•Mutations are rare at selected sites, so are assumed
non-recurrent.
•Connection with coalescent models.

The Haplotype Phylogeny Problem
Given a set of genotypes S, find an explaining set
of haplotypes that fits a perfect phylogeny.
12
a22
b02
c10
sites
A haplotype pair explains a
genotype if the merge of the
haplotypes creates the
genotype. Example: The
merge of 0 1 and 1 0 explains
2 2.

Genotype matrix
S

The Haplotype Phylogeny Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
12
a22
b02
c10
12
a10
a01
b00
b01
c10
c10

The Haplotype Phylogeny Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
12
a22
b02
c10
12
a10
a01
b00
b01
c10
c10
1
c
caa
b
b
2

101010
0101
00
00

The Alternative Explanation
12
a22
b02
c10
12
a11
a00
b00
b01
c10
c10
No tree
possible
for this
explanation

When does a set of haplotypes to
fit a perfect phylogeny?
Classic NASC: Arrange the haplotypes in a
matrix, two haplotypes for each individual.
Then (with no duplicate columns), the
haplotypes fit a unique perfect phylogeny if
and only if no two columns contain all three
pairs:
0,1 and 1,0 and 1,1
This is the 3-Gamete Test

The Alternative Explanation
12
a22
b02
c10
12
a11
a00
b00
b01
c10
c10
No tree
possible
for this
explanation

12
a22
b02
c10
12
a10
a01
b00
b01
c10
c10
1
c
caa
b
b
2
0 0
0 10 1
0 0
The Tree Explanation Again

Solving the Haplotype Phylogeny
Problem (PPH)
•Simple Tools based on classical Perfect
Phylogeny Problem.
•Complex Tools based on Graph Realization
Problem (graphic matroid realization).

Simple Tools – treeing the 1’s
1)For any row i in S, the set of 1 entries in row
i specify the exact set of mutations on the
path from the root to the least common
ancestor of the two leaves labeled i, in every
perfect phylogeny for S.
2)The order of those 1 entries on the path is
also the same in every perfect phylogeny for
S, and is easy to determine by “leaf
counting”.

Leaf Counting
1234567
a1010000
b0101000
c2200202
d2200020
In any column c, count two for each 1, and
count one for each 2. The total is the number
of leaves below mutation c, in every perfect
phylogeny for S. So if we know the set of
mutations on a path from the root, we know
their order as well.
S
Count 4 4 2 2 1 1 1

The Subtree of 1’s is Forced
The columns of S with at least one 1 entry
specify a unique upper subtree that must be
in every perfect phylogeny for S.
1234567
a1010000
b0101000
c2200202
d2200020
S
1
4
3
2
aa
bb
4 4 2 2 1 1 1

Corollary: A Useful Sufficient
Condition
If every column of S has at least one 1, then
there is a unique perfect phylogeny for S, if
there is any.
Build the forced tree based on the 1-entries,
then every mutation (site) is in the tree, and
any 2-entries dictate where the remaining
leaves must be attached to the tree.

What to do when the Corollary
does not apply?
•Build the forced tree of 1-entries.
•2-entries in columns with 1-entries are
done.
•Remaining problem: 2-entries in columns
without any 1-entries

Uniqueness without 1’s
A 0 2 2
B 2 2 0
C 2 0 2
There are 8 haplotype solutions
that can explain the data, but
only one that fits a perfect
phylogeny, i.e. only one solution
to the PPH problem.

More Simple Tools
3)For any row i in S, and any column c, if
S(i,c) is 2, then in every perfect phylogeny
for S, the path between the two leaves
labeled i, must contain the edge with
mutation c.
Further, every mutation c on the path
between the two i leaves must be from
such a column c.

From Row Data to Tree
Constraints
Root
The order is
known for the red
mutations
1 2 3 4 5 6 7
i:0 1 0 2 2 1 2
i i
Subtree for row i data
4,5,7
order
unknown
2
6
sites

The Graph Theoretic Problem
Given a genotype matrix S with n sites, and a
red-blue fork for each row i,
create a directed tree T where each
integer from 1 to n labels exactly one
edge, so that each fork is
contained in T.ii

Complex Tool: Graph
Realization
•Let Rn be the integers 1 to n, and let P be an unordered
subset of Rn. P is called a path set.
•A tree T with n edges, where each is labeled with a unique
integer of Rn, realizes P if there is a contiguous path in T
labeled with the integers of P and no others.
•Given a family P1, P2, P3…Pk of path sets, tree T realizes
the family if it realizes each Pi.
•The graph realization problem generalizes the consecutive
ones problem, where T is a path.

Graph Realization Example
1
2
4
5
6
3
8
7
P1: 1, 5, 8
P2: 2, 4
P3: 1, 2, 5, 6
P4: 3, 6, 8
P5: 1, 5, 6, 7
Realizing Tree T

Graph Realization
Polynomial time (almost linear-time) algorithms
exist for the graph realization problem – Whitney,
Tutte, Cunningham, Edmonds, Bixby, Wagner,
Gavril, Tamari, Fushishige.

The algorithms are not simple; no known
implementations of the near linear-time methods.
But an implementation from UCD CS of a slightly
slower method is available on the web.

Recognizing graphic Matroids
•The graph realization problem is the same
problem as determining if a binary matroid is
graphic, and the algorithms come from that
literature.
•The fastest algorithm is due to Bixby and
Wagner. For n rows and m columns the time is
O(nm x alpha(nm).
•Representation methods due to Cunningham et al.

Reducing PPH to graph
realization
We solve any instance of the PPH problem
by creating appropriate path sets, so that a
solution to the resulting graph realization
problem leads to a solution to the PPH
problem instance.
The key issue: How to encode a fork by
path sets.

From Row Data to Tree
Constraints
Root
The order is
known for the red
mutations
1 2 3 4 5 6 7
i:0 1 0 2 2 1 2
i i
Subtree for row i data
4,5,7
order
unknown
2
6
sites

Encoding a Red path, named g
2
5
7
P1: U, 2
P2: U, 2, 5
P3: 2, 5
P4: 2, 5, 7
P5: 5, 7
P6: 5, 7, g
P7: 7, g
2
5
7
g
U
U is a universal glue edge, used for all red paths.
g is used for all red paths identical to this one – i.e.
different forks with the identical red path.
forced
In T

Encoding and gluing a blue path
For a specific blue path, simply specify a single
path set containing
the integers in the blue path. However, we need to
force that path to be glued to the end of a specific
red path g, or to the root of T.
In the first case, just add g to the path set. In the
second case, add U to the path set.

Encoding and gluing a blue path
to the end of red path g
2
5
7
4, 9, 18, 20
g
P: g, 4, 9, 18, 20

After the Reduction
After all the paths are given to the graph
realization algorithm, and a realizing tree T is
obtained (assuming one is), contract all the glue
edges to obtain a perfect phylogeny for the
original haplotyping problem (PPH) instance.
Conversely, any solution can be obtained in
this way. The reduction takes time O(nm) for n
rows and m columns.

Uniqueness
In 1932, Whitney established the NASC for the uniqueness
of a solution to a graph realization problem:
Let T be the tree realizing the family of required path sets.
For each path set Pi in the family, add an edge in T
between the ends of the path realizing Pi. Call the result
graph G(T).
T is the unique realizing tree, if and only if
G(T) is 3-vertex connected. e.g., G(T) remains connected
after the removal of any two vertices.

Uniqueness of PPH Solution
•Apply Whitney’s theorem, using all the
paths implied by the reduction of the PPH
problem to graph realization.
•(Minor point) Do not allow the removal of
the endpoints of the universal glue edge U.

Multiple Solutions
•In 1933, Whitney showed how multiple solutions to the
graph realization problem are related.
•Partition the edges of G into two, connected graphs, each
with at least two edges, such that the two graphs have
exactly two nodes in common. Then twist one graph
around those nodes.
•Any solution can be transformed to another via a series of
such twists.

Twisting Example
x y
x y
1
2
3
4
5
6
7
8
1
2
3
4
5
6
7
8
All the cycle sets are preserved by twisting

Representing all the solutions
All the solutions to the PPH problem can be
implicitly represented in a data structure which
can be built in linear time, once one solution is
known. Then each solution can be generated in
linear time per solution. Method is a small
modification of ones developed by Cunningham
and Edwards, and by Hopcroft and Tarjan.
See Gusfield’s web site for errata.

The biological meaning
Each subgraph in the representation identifies a set of
mutations (columns) whose 0/1 settings are invariant
over all PPH solutions. However, by switching all
0/1 values that had been 2, in all of those mutations,
the relative 0/1
settings between columns in different subgraphs can
be set arbitrarilly, and all PPH solutions can be
generated in this way.

1 2 2 2 0 0 0
1 2 2 2 0 0 0
2


0 2 0 0 0 2
2 0 2 0 0 0 2
1 2 2 2 0 2 0
1 2 2 2 0 2 0
1 2 2 0 2 0 0
1 2 2 0 2 0 0
2 2 0 0 0 2 0
2 2 0 0 0 2 0
a
a
b
b
c
c
d
e
d
e
1 2 3 4 5 6 7
An example.
Each row starts
duplicated for
simplicity.

1 1 0 0 0 0 0
1 0 1 1 0 0 0
1


0 1 0 0 0 0
0 0 0 0 0 0 1
1 1 0 0 1 0 0
1 0 1 1 0 0 0
1 1 0 0 0 1 0
1 0 1 0 0 0 0
1 1 0 0 1 0 0
0 0 0 0 0 0 0
a
a
b
b
c
c
d
e
d
e
1 2 3 4 6 5 7
Starting from a PPH
Solution, if all
shaded cells in a
block switch value,
then the result is
also a PPH solution,
and any PPH
solution can be
obtained in this
way, i.e. by
choosing in each
block whether to
switch or not.

Secondary information and
optimization
•The partition shows explicitly what added information is
useful and what is redundant. Information about the
relationship of a pair of columns is redundant if and only if
they are in the same block of the column partition. Apply
this successively as additional information is obtained.
•Problem: Minimize the number of haplotype pairs
(individuals) that need be laboratory determined in order to
find the correct tree.
•Minimize the number of (individual, site1, site2) triples
whose phase relationship needs to be determined, in order
to find the correct tree.

The implicit representation of all solutions provides
a framework for solving these secondary problems,
as well as other problems involving the use of
additional information, and specific tree-selection
criteria.

A Phase-Transition
Problem, as the ratio of sites to genotypes changes,
how does the probability that the PPH solution is
unique change?
For greatest utility, we want genotype data where the
PPH solution is unique.
Intuitively, as the ratio of genotypes to sites increases,
the probability of uniqueness increases.

Frequency of a unique solution with 50
and 100 sites, 5% rule and 2500 datasets
per entry
10 0.0018
20 0.0032
22 0.7646
40 0.7488
42 0.9611
70 0.994
130 0.999
140 1
# geno. Frequency of
unique solution
10 0
20 0
22 0.78
40 0.725
42 0.971
60 0.983
100 0.999
110 1

An alternative solution
Recently, we developed an algorithm that is not based
on graph realization, and which is much easier to understand.
However, it runs in O(nm^2) time. See Gusfield’s website.

Implementations
See wwwcsif.cs.ucdavis.edu/~gusfield/
for papers and software.
Tags