UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Bioinformatics Algorithms and
Data Structures
Chapter 2: KMP Algorithm
Lecturer: Dr. Rose
Slides by: Dr. Rose
January 28, 2003
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
KMP Algorithm
•Preliminaries:
–KMP can be easily explained in terms of finite
state machines.
–KMP has a easily proved linear bound
–KMP is usually not the method of choice
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
KMP Algorithm
•Recall that the naïve approach to string
matching is Q(mn).
•How can we reduce this complexity?
–Avoid redundant comparisons
–Use larger shifts
•Boyer-Moore good suffix rule
•Boyer-Moore extended bad character rule
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
KMP Algorithm
•KMP finds larger shifts by recognizing
patterns in P.
–Let sp
i(P) denote the length of the longest
proper suffix of P[1..i] that matches a prefix of
P.
–By definition sp
1= 0 for any string.
–Q: Why does this make sense?
–A: The proper suffix must be the empty string
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
KMP Algorithm
•Example: P= abcaeabcabd
–P[1..2] = ab hence sp
2= ?
–sp
2= 0
–P[1..3] = abc hence sp
3= ?
–sp
3= 0
–P[1..4] = abca hence sp
4= ?
–sp
4= 1
–P[1..5] = abcae hence sp
5= ?
–sp
5= 0
–P[1..6] = abcaea hence sp
6= ?
–sp
6= 1
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
KMP Algorithm
•Example Continued
–P[1..7] = abcaeab hence sp
7= ?
–sp
7= 2
–P[1..8] = abcaeabc hence sp
8= ?
–sp
8= 3
–P[1..9] = abcaeabca hence sp
9= ?
–sp
9= 4
–P[1..10] = abcaeabcab hence sp
10= ?
–sp
10= 2
–P[1..11] = abcaeabcabd hence sp
11= ?
–sp
11= 0
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
KMP Algorithm
•Like the a/aconcept for Boyer-Moore, there is
an analogous sp
i/sp´
iconcept.
•Let sp´
i(P) denote the length of the longest proper
suffix of P[1..i] that matches a prefix of P, with
the added condition that characters P(i + 1) and
P(sp´
i+ 1) are unequal.
•Example: P = abcdabce sp´
7= 3
Obviously sp´
i(P) <= sp
i(P), since the later is less
restrictive.
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
KMP Algorithm
•KMP Shift Rule:
1.Mismatch case:
•Let position i+1 in Pand position kin T be the first mismatch
in a left-to-right scan.
•Shift Pto the right, aligning P[1..sp´
i] with T[k-sp´
i..k-1]
2.Match case:
•If no mismatch is found, an occurrence of Phas been found.
•Shift Pby n–sp´
nspaces to continue searching for other
occurrences.
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
KMP Algorithm
•Observations:
–The prefix P[1..sp´
i] of the shifted Pis shifted to match
the corresponding substring in T.
–Subsequent character matching proceeds from position
sp´
i+ 1
–Unlike Boyer-Moore, the matched substring is not
compared again.
–The shift rule based on sp´
iguarantees that the exact
same mismatch won’t occur at sp´
i+ 1 but doesn’t
guarantee that P(sp´
i+1) = T(k)
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
KMP Algorithm
•Example: P= abcxabcde
–If a mismatch occurs at position 8, Pwill be shifted 4
positions to the right.
–Q: Where did the 4 position shift come from?
–A: The number of position is given by i-sp´
i, in this
example i= 7, sp´
7= 3, 7 –3 = 4
–Notice that we know the amount of shift without
knowing anything about Tother than there was a
mismatch at position 8..
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
KMP Algorithm
•Example Continued: P= abcxabcde
–After the shift, P[1..3] lines up with T[k-4..k-1]
–Since it known that P[1..3] must match T[k-4..k-1], no
comparison is needed.
–The scan continues from P(4) & T(k)
•Advantages of KMP Shift Rule
1.Pis often shifted by more than 1 character, (i-sp´
i)
2.The left-most sp´
icharacters in the shifted Pare known
to match the corresponding characters in T.
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
KMP Algorithm
Full Example: T= xyabcxabcxadcdqfeg P = abcxabcde
Assume that we have already shifted past the first two
positions in T.
xyabcxabcxadcdqfeg
abcxabcde
^
1
^
2
^
3
^
4
^
5
^
6
^
7
^
8 d!=x, shift 4 places
abcxabcde
^
1 start again from position 4
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Preprocessing for KMP
Approach:show how to derive sp´values from Zvalues.
Definition:Position j> 1 maps toiif i= j + Z
j(P) –1
–Recall that Z
j(P) denotes the length of the Z-box starting at
position j.
–This says that jmaps to iif iis the right end of a Z-box starting at
j.
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Preprocessing for KMP
Theorem. For any i> 1, sp´
i(P) = Z
j=i –j + 1
Where j > 1 is the smallest position that maps toi.
If jthen sp´
i(P) = 0
Similarly for sp:
For any i> 1, sp
i(P) = i –j + 1
Where j, ij > 1, is the smallest position that maps toi
or beyond.
If jthen sp
i(P) = 0
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Preprocessing for KMP
Given the theorem from the preceding slide, the sp´
iand sp
i
values can be computed in linear time using Z
ivalues:
For i= 1 to n{ sp´
i= 0;}
For j= ndownto 2 {
i= j+ Z
i(P) –1;
sp´
i= Z
i;
}
sp
n(P) = sp´
n(P);
For i= n-1 downto 2 {
sp
i(P) = max[sp
i+1(P) -1, sp´
i(P)];}
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Preprocessing for KMP
Defn. Failure function F´(i) = sp´
i-1+ 1 , 1 in+ 1,
sp´
0= 0
(similarly F(i) = sp
i-1+ 1 , 1 in+ 1, sp
0= 0)
•Idea:
–We maintain a pointer iin Pand cin T.
–After a mismatch at P(i+1) with T(c), shift Pto align
P(sp´
i+ 1) with T(c), i.e., i= sp´
i+ 1.
–Special case 1: i= 1 set i= F´(1) = 1 & c= c+ 1
–Special case 2: we find Pin T, shift n-sp´
nspaces,
i.e., i= F´(n+ 1) = sp´
n+ 1.
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Full KMP Algorithm
Preprocess Pto find F´(k) = sp´
k-1+1 for kfrom 1 to n+ 1
c= 1; p= 1;
While c+ (n–p) m{
While P(p) = T( c)and pn{
p= p+ 1;
c= c+ 1;}
If (p= n+ 1) then
report an occurrence of Pat position c–nof T.
if (p= 1) then c = c + 1;
p= F´(p) ;}
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Full KMP Algorithm
xyabcxabcxabcdefeg
abcxabcde
^
1 a!=x
p!= n+1
p = 1! c = 2
p= F’(1) = 1
c= 1; p= 1;
While c+ (n–p) m{
While P(p) = T( c)and pn{
p= p+ 1;
c= c+ 1;}
If (p= n+ 1) then
report an occurrence of Pat position c–nof T.
if (p= 1) then c = c + 1;
p= F´(p) ;
}
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Full KMP Algorithm
xyabcxabcxabcdefeg
abcxabcde
^
1 a!=y
p!= n+1
p = 1! c = 3
p= F’(1) = 1
c= 1; p= 1;
While c+ (n–p) m{
While P(p) = T( c)and pn{
p= p+ 1;
c= c+ 1;}
If (p= n+ 1) then
report an occurrence of Pat position c–nof T.
if (p= 1) then c = c + 1;
p= F´(p) ;
}
abcxabcde
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Full KMP Algorithm
xyabcxabcxabcdefeg
p!= n+1
p = 8! don’t change c
p= F´(8) = 4
abcxabcdeabcxabcde
^
1
^
2
^
3
^
4
^
5
^
6
^
7
^
8 d!=x
c= 1; p= 1;
While c+ (n–p) m{
While P(p) = T( c)and pn{
p= p+ 1;
c= c+ 1;}
If (p= n+ 1) then
report an occurrence of Pat position c–nof T.
if (p= 1) then c = c + 1;
p= F´(p) ;
}
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
p= 4, c= 10
^
4
Full KMP Algorithm
xyabcxabcxabcdefeg
p= n+1 !
abcxabcde
^
5
^
6
^
7
^
8
abcxabcdeabcxabcdeabcxabcde
c= 1; p= 1;
While c+ (n–p) m{
While P(p) = T( c)and pn{
p= p+ 1;
c= c+ 1;}
If (p= n+ 1) then
report an occurrence of Pat position c–nof T.
if (p= 1) then c = c + 1;
p= F´(p) ;
}
^
9
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Real-Time KMP
•Q: What is meant by real-time algorithms?
•A: Typically these are algorithms that are meant
to interact synchronously in the real world.
–This implies a known fixed turn-around time for
processing a task
–Many embedded scheduling systems are examples
involving real-time algorithms.
–For KMP this means that we require a constant time
for processing all strings of length n.
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Real-Time KMP
•Q: Why is KMP not real-time?
•A: For any mismatched character in T, we may
try matching it several times.
–Recall that sp´
ionly guarantees that P(i+ 1) and P(sp´
i+ 1) differ
–There is NO guarantee that P(i+ 1) and T(k) match
•We need to ensure that a mismatch at T(k) does
NOT entail additional matches at T(k).
•This means that we have to compute sp´
ivalues
with respect to all characters in Ssince any could
appear in T.
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Real-Time KMP
•Define:sp´
(i,x)(P) to be the length of the longest
proper suffix of P[1..i] that matches a prefix of
P, with the added condition that character P(sp´
i
+ 1) is x.
•This is will tell us exactly what shift to use for
each possible mismatch.
•A mismatched character T(k) will never be
involved in subsequent comparisons.
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Real-Time KMP
•Q: How do we know that the mismatched
character T(k) will never be involved in
subsequent comparisons?
•A: Because the shift will shift Pso that either the
matching character aligns with T(k) or Pwill be
shifted past T(k).
•This results in a real-time version of KMP.
•Let’s consider how we can find the sp´
(i,x)(P)
values in linear time.
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Real-Time KMP
Thm.For P[i+ 1] x, sp´
(i,x)(P) = i-j+ 1
–Here jis the smallest position such that jmaps to iand
P(Z
j+ 1) = x.
–If there is no such jthen where sp´
(i,x)(P) = 0
For i= 1 to n{ sp´
(i,x)= 0 for every character x;}
For j= ndownto 2 {
i= j+ Z
i(P) –1;
x= P(Z
j+ 1);
sp´
(i,x)= Z
i;
}
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Real-Time KMP
•Notice how this works:
–Starting from the right
•Find ithe right end of the Z box associated with j
•Find xthe character immediately following the prefix
corresponding to this Zbox.
•Set sp´
(i,x)= Z
i, the length of this Zbox.
For i= 1 to n{ sp´
(i,x)= 0 for every character x;}
For j= ndownto 2 {
i= j+ Z
i(P) –1;
x= P(Z
j+ 1);
sp´
(i,x)= Z
i;}