Horspool Algorithm in Design and Analysis of Algorithms in VTU

2,339 views 37 slides Feb 27, 2024
Slide 1
Slide 1 of 37
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37

About This Presentation

Algorithm


Slide Content

Advanced Data Structure: Bioinformatics
•First week: Algorithms for exact string matching.
•Second week: Approximate search and alignment
of short sequences.
•Third week: Dealing with long sequences.

Advanced Data Structure:bibliography
•Bioinformatics, Sequence and Genome Analysis
David W. Mount
•Flexible Pattern Matching in Strings (2002)
Gonzalo Navarro and Mathieu Raffinot
•http://www-igm.univ-mlv.fr/~lecroq/string/index.html
•http://www.ncbi.nlm.nih.gov/

First week
•First week: algorithms for exact string matching:
One pattern: The algorithm depends on |p| and |
k patterns: The algorithm depends on k, |p| and ||
•Second week: approximate search and alignment
of short sequences.
•Third week: dealing with long sequences.

Exact stringmatching for one pattern
For instance, given the sequence
CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
search for the pattern ACTGA.
How does the string algorithms made the search?
and for the pattern TACTACGGTATGACTAA

Exact stringmatching: Brute force algorithm
Given the pattern ATGTA, the search is
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
A T G T A
A T G T A
A T G T A
A T G T A
A T G T A
Example:

Exact stringmatching: Brute force algorithm
Text :
Pattern :
From left to right: prefix
•Which is the next position of the window?
•How the comparison is made?
Pattern :
Text :
The window is shifted only one cell

Exact stringmatching: one pattern
There is a sliding window along the text
against which the pattern is compared:
How does the matching algorithms made the search?
Pattern :
Text :
Which are the facts that differentiate the algorithms?
1.How the comparison is made.
2.The length of the shift.
At each step the comparison is made and
the window is shifted to the right.

Exact stringmatching for one pattern
Experimental efficiency (Navarro & Raffinot)
2 4 8 16 32 64 128 256e
64
32
16
8
4
2
| |
Long. pattern
Horspool
BNDM
BOM
BNDM : Backward Nondeterministic Dawg Matching
BOM : Backward Oracle Matching
w

Horspool algorithm
Text :
Pattern :
Sufix search
•Which is the next position of the window?
•How the comparison is made?
Pattern :
Text :
a
Shift until the next ocurrence of “a” in the pattern:
aaa
aa
a
We need a preprocessing phase to construct the shift table.

Horspool algorithm : example
Given the pattern ATGTA
•The shift table is:
A
C
G
T

Horspool algorithm : example
Given the pattern ATGTA
•The shift table is:
A 4
C
G
T

Horspool algorithm : example
Given the pattern ATGTA
•The shift table is:
A 4
C 5
G
T

Horspool algorithm : example
Given the pattern ATGTA
•The shift table is:
A 4
C 5
G 2
T

Horspool algorithm : example
Given the pattern ATGTA
•The shift table is:
A 4
C 5
G 2
T 1

Horspool algorithm : example
Given the pattern ATGTA
•The shift table is:
A 4
C 5
G 2
T 1
•The searching phase:G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
A T G T A
A T G T A
A T G T A
A T G T A
A T G T A

Horspool algorithm: example
Given the pattern ATGTA
•The shift table is:
A 4
C 5
G 2
T 1
•The searching phase:G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
A T G T A
A T G T A
A T G T A
A T G T A
A T G T A
A T G T A

Some questions about Horspool algorithm
A 4
C 5
G 2
T 1
Given a random text over an equally likely
probability distribution (EPD):
Given the pattern ATGTA, the shift table is
1.-Determine the expected shift of the window. And,
if the PD is not equally likely?
2.-Determine the expected number of shifts
assuming a text of length n.
3.-Determine the expected number of comparisons
in the suffix search phase

Exact stringmatching for one pattern
Experimental efficiency (Navarro & Raffinot)
2 4 8 16 32 64 128 256
64
32
16
8
4
2
| |
Long. pattern
Horspool
BNDM
BOM
BNDM : Backward Nondeterministic Dawg Matching
BOM : Backward Oracle Matching
w

Text :
Pattern :
Search for suffixes of T that are factors of
BNDM algorithm
•Which is the next position of the window?
•How the comparison is made?
That is denoted as
D
2= 1 0 0 0 1 0 0
Depends on the value of the leftmost bit of D
Once the next character x is read
D
3= D
2<<1 & B(x)
B(x): mask of x in the pattern P.
For instance, if B(x) = ( 0 0 1 1 0 0 0)
D = (0 0 0 1 0 0 0) & (0 0 1 1 0 0 0 ) = (0 0 0 1 0 0 0 )
x

BNDM algorithm: example
Given the pattern ATGTA
•The searching phase:G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
A T G T A
A T G T A
A T G T A
•The mask of characters is:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
D
1= ( 0 1 0 1 0 )
D
2= ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
D
1= ( 0 0 1 0 0 )
D
2= ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )
D
1= ( 1 0 0 0 1 )
D
2= ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D
3= ( 0 0 1 0 0 ) & ( 0 0 1 0 0) = ( 0 0 1 0 0 )
D
4= ( 0 1 0 0 0 ) & ( 0 0 0 0 0) = ( 0 0 0 0 0 )

BNDM algorithm: example of window shift
A T G T A
•Given the pattern ATGTA
•The mask of characters is :
•The searching phase: G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
D
1= ( 1 0 0 0 1 )
D
2= ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D
3= ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D
4= ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D
5= ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D
6= ( 0 0 0 0 0 ) & ( * * * * * ) = ( 0 0 0 0 0 )Found

BNDM algorithm: example
Given the pattern ATGTA
•The searching phase:G T A C T A G A A T A C G T A T G T A C T G ...
A T G T A
A T G T A
A T G T A
•The mask of characters is :
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
D
1= ( 0 1 0 1 0 )
D
2= ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
D
1= ( 0 1 0 1 0 )
D
2= ( 1 0 1 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D
3= ( 0 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 )
How the shif is determined?

Extended string matching
•Classes of characters: when in some DNA files or patterns there are
new characters as N or R that means N={A,C,G,T} and R={G,A}.
•Bounded length gaps: we find pattern as ATx(2,3)TA where x(2,3)
means any 2 or 3 characters.
•Optional characters: we find pattern as AC?ACT?T?A where C?
means that C may or may not appear in the text.
•Wild cards: we find pattern as AT*TA where * means an arbitrary long
string.
•Repeatable characters: we find pattern as AT[TA]*AT where [TA]*
means that TA can appear zero or more times..

Exact string matching for one pattern
Algorismes més eficients (Navarro & Raffinot)
2 4 8 16 32 64 128 256
64
32
16
8
4
2
| |
Long. pattern
Horspool
BNDM
BOM
BNDM : Backward Nondeterministic Dawg Matching
BOM : Backward Oracle Matching
w

Autòmata Factor Oracle: propietats
Factor Oracle of word G T A T G T A
GG AT T AT
T
A
G
All states are accepting states.
Recognizes all factors … but more, which?
If a word is rejected, it isn't a factor, then

BOM algorithm (Backward Oracle Matching)
•How many cells are shifted?
•How the comparison is made?
Text :
Pattern : Automata: Factor Oracle
Checks from right to left
a
•If the a isn't into the automaton
•If we reach the last stat of the automaton with the a
a

BOM algorithm: example
•The automaton of the inverse patterns is built: given the pattern ATGTATG
•And the search is :G T A C T A G A A T G T G T A G A C A T G T A T G G G A...
A T G T A T G
How the comparison is made?
GG AT T AT
T
A
G

BOM algorithm: example
A T G T A T G
How the comparison is made?
GG AT T AT
T
A
G
A T G T A T G
•The automaton of the inverse patterns is built: given the pattern ATGTATG
•And the search is :G T A C T A G A A T G T G T A G A C A T G T A T G G G A...

BOM algorithm: example
A T G T A T G
How the comparison is made?
GG AT T AT
T
A
G
A T G T A T G
A T G T A T G
•The automaton of the inverse patterns is built: given the pattern ATGTATG
•And the search is :G T A C T A G A A T G T G T A G A C A T G T A T G G G A...

BOM algorithm: example
A T G T A T G
How the comparison is made?
GG AT T AT
T
A
G
A T G T A T G
A T G T A T G
A T G T A T G
•The automaton of the inverse patterns is built: given the pattern ATGTATG
•And the search is :G T A C T A G A A T G T G T A G A C A T G T A T G G G A...

BOM algorithm: example
A T G T A T G
GG AT T AT
T
A
G
A T G T A T G
A T G T A T G
A T G T A T G
A T G T A T G
How the comparison is
made?•The automaton of the inverse patterns is built: given the pattern ATGTATG
•And the search is :G T A C T A G A A T G T G T A G A C A T G T A T G G G A...

BOM algorithm: example
A T G T A T G
GG AT T AT
T
A
G
A T G T A T G
A T G T A T G
A T G T A T G
A T G T A T G
A T G T A T G
How the comparison is
made?•The automaton of the inverse patterns is built: given the pattern ATGTATG
•And the search is :G T A C T A G A A T G T G T A G A C A T G T A T G G G A...

Automata Factor Oracle
Given the pattern GTATA, in which state the factors are accepted?
G AT
T
A
GGT
T
GTA
TA
A
When the new A is read, 5 factors
should be accepted GTATA
TATA
ATA
TA
A, how it can be
reached?
GTAT
TAT
AT
T
TG AT
T
A
GGT
T
GTA
TA
A
When the new T is read, 4 factors should be
accepted GTAT
TAT
AT
T, how it can be reached?

Automata Factor Oracle
When the new
G is read, 6
factors should
be accepted
GTATAG
TATAG
ATAG
TAG
AG
G
GTATA
TATA
ATA
TA
A
GTAT
TAT
AT
T
TG AT
T
A
GGT
T
GTA
TA
A
A G
GTATAG
TATAG
ATAG
TAG
AG
G

Automaton Factor Oracle: linear algorithm
?

Autòmata Factor Oracle: algorisme
If there is a T transition ...
TT

Autòmata Factor Oracle: algorisme
… and recursively continue ...
T
T
But if there isn't a T transition ...
Tags