What is BLAST?
Basic Local Alignment Search Tool
It allows rapid sequence comparison of a query
sequence against a database.
The BLAST algorithm is fast, accurate, and web-
accessible.
Developed in 1990, with gapped BLAST following in 1997 (S. Altschul et al.)
Why use BLAST?
BLAST searching is fundamental to understanding the
relatedness of any query sequence of interest to other
known proteins or DNA sequences.
Applications include:
Identifying orthologs and paralogs
Discovering new genes or proteins
Discovering variants of genes or proteins
Investigating expressed sequence tags (ESTs)
Exploring protein structure and function
Four Essential Components of BLAST
(1) Choose the sequence (query)
(2) Select the BLAST program
(3) Choose the database to search
(4) Choose optional parameters
Then click “BLAST”
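The same four choices carry over to a command-line search. The sketch below assumes the NCBI BLAST+ `blastn` program is installed and that a local database named `nt` exists. The function name, file names and default values are hypothetical, and the command is only assembled here, not run.

```python
def build_blast_command(query_path, program="blastn", database="nt",
                        evalue=1e-5, out_path="hits.tsv"):
    """Assemble a BLAST+ argument list from the four components.

    All default values and file names are illustrative assumptions.
    """
    return [
        program,                  # (2) the BLAST program
        "-query", query_path,     # (1) the query sequence (FASTA file)
        "-db", database,          # (3) the database to search
        "-evalue", str(evalue),   # (4) optional parameter: E-value cutoff
        "-outfmt", "6",           # (4) optional parameter: tabular output
        "-out", out_path,
    ]


command = build_blast_command("query.fasta")
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run(command, check=True)` on a machine where BLAST+ and the database are actually available.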
MEGA BLAST
Comparison of large sets of long DNA sequences.
It's much faster than the standard BLASTN
It uses the greedy algorithm of Webb Miller et al. for
nucleotide sequence alignment search
It is up to ten times faster than more common sequence
similarity programs and therefore can be used to quickly
compare two large sets of sequences against each other.
Suffix tree
A suffix tree, as the name suggests, is a
tree in which every suffix of a string S is
represented.
More formally, a suffix tree is an
automaton that accepts every suffix
of a string.
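The "automaton that accepts every suffix" view can be made concrete with a small sketch. It builds an uncompressed suffix trie with nested dictionaries in quadratic time, which is a simplification: a real suffix tree compresses single-child paths and is built in linear time. The function names and the `$` end marker are assumptions made for illustration, and the input is assumed not to contain `$`.

```python
def build_suffix_trie(s):
    """Naive O(n^2) suffix trie; the key '$' marks where a suffix ends."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["$"] = start + 1  # 1-based start position of this suffix
    return root


def accepts_suffix(trie, candidate):
    """Walk the trie like an automaton; accept only at a suffix end."""
    node = trie
    for ch in candidate:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node


trie = build_suffix_trie("ABC")
print(accepts_suffix(trie, "BC"))  # a suffix of "ABC"
print(accepts_suffix(trie, "AB"))  # a substring, but not a suffix
```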
Suffix tree
Example of a suffix tree for the string
“ABC”
1, 2 and 3 represent the ends of
suffixes starting at positions 1, 2
and 3 respectively. These are the
leaf nodes.
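The "ABC" example can be reproduced with a naive compressed suffix tree, where each edge carries a substring and each leaf stores the 1-based start position of its suffix. This is a quadratic-time illustration under stated assumptions, not the linear-time construction used in practice. The dictionary layout, the function names and the appended `$` terminator are choices made for this sketch.

```python
def _insert(node, suffix, leaf):
    """Insert one suffix, splitting an edge where it diverges."""
    for label in list(node):
        k = 0  # length of the common prefix of edge label and suffix
        while k < min(len(label), len(suffix)) and label[k] == suffix[k]:
            k += 1
        if k == 0:
            continue
        child = node[label]
        if k == len(label) and isinstance(child, dict):
            _insert(child, suffix[k:], leaf)  # follow the whole edge
            return
        # Split the edge: shared part on top, two branches below.
        del node[label]
        node[label[:k]] = {label[k:]: child, suffix[k:]: leaf}
        return
    node[suffix] = leaf  # no edge shares a first character: new leaf


def build_suffix_tree(s):
    """Edges map label -> subtree (dict) or leaf number (int)."""
    text = s + "$"  # terminator: no suffix is a prefix of another
    root = {}
    for start in range(len(s)):
        _insert(root, text[start:], start + 1)
    return root


print(build_suffix_tree("ABC"))
print(build_suffix_tree("ABAB"))
```

For "ABC" every suffix starts with a different character, so the root has three edges that lead straight to leaves 1, 2 and 3. A string with repeats such as "ABAB" produces internal nodes instead.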
MUMmer – Genome alignment
algorithm
Developed by
Dr. Steven Salzberg’s group at TIGR
NAR (1999) 27:2369-2376
NAR (2002) 30:2478-2483
Availability
Free
TIGR (The Institute for Genomic Research) site
Features
The algorithm assumes that sequences are closely related
Can quickly compare millions of bases
Outputs:
Base to base alignment
Highlights the exact matches and differences in the
genomes
Locates
SNPs
Large inserts
Significant repeats
Tandem repeats and reversals
Techniques used in the MUMmer algorithm
Compute suffix trees for every genome
Longest Increasing Subsequence (LIS)
Alignment using the Smith-Waterman algorithm
Integration of
these techniques
for genome alignment
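Of the three techniques, Smith-Waterman is the one applied to short stretches of sequence. A minimal sketch of its scoring recurrence follows, assuming a linear gap penalty and illustrative scores (match +2, mismatch -1, gap -1). These values and the function name are assumptions, not MUMmer's actual parameters, and only the best local score is returned, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Keeping only two rows gives memory linear in the length of `b`, which is enough when just the score is needed.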
Steps
Locating MUMs
Sorting MUMs
Closure with gaps
G1: ACTGATTACGTGAACTGGATCCA
G2: ACTCTAGGTGAAGTGATCCA
What is a MUM?
A MUM (maximal unique match) is a subsequence that occurs exactly once in
both genomes and is NOT part of any longer
such sequence
The two characters that bound a MUM are always
mismatches
GenA:tcgatcGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAA cgactta
GenB:gcattaGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAA tccagag
Similar to
BLAST & FASTA!!
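The MUM definition can be checked directly with a brute-force sketch. It tests every pair of start positions, extends the match as far as possible, and keeps it only if the matched string occurs exactly once in each sequence. This is quadratic and meant only for short strings. MUMmer itself finds MUMs with suffix trees. The function names and the `min_len` parameter are assumptions for illustration, and the example strings are the GenA and GenB lines above, uppercased and with the internal space removed.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for each maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # could be extended to the left: not maximal
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1  # extend to the right until a mismatch
            word = a[i:i + length]
            if (length >= min_len
                    and count_occurrences(a, word) == 1
                    and count_occurrences(b, word) == 1):
                mums.append((i, j, length))
    return mums


gen_a = "tcgatcGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAcgactta".upper()
gen_b = "gcattaGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAtccagag".upper()
for pos_a, pos_b, length in find_mums(gen_a, gen_b, min_len=20):
    print(pos_a, pos_b, gen_a[pos_a:pos_a + length])
```

Positions are 0-based. Because the match is extended until the flanking characters differ or a sequence ends, the bounding characters of every reported MUM are mismatches, as the slide states.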
Sorting & ordering MUMs
MUMs are sorted according to their position in
Genome A
The order of matching MUMs in Genome B is
considered
The LIS algorithm locates the longest set of MUMs that
occur in ascending order in both genomes
[Figure: MUMs ordered along both genomes. MUM5: transposition. MUM3: random match. Inexact repeat.]
Leads to Global MUM-alignment
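The ordering step can be sketched as a longest increasing subsequence over the Genome B positions, taken after sorting the MUMs by their Genome A positions. The tuple layout `(pos_a, pos_b, length)`, the function names and the example coordinates are hypothetical. The sketch also ignores MUM lengths and overlaps, which a real aligner would have to handle.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Indices of one longest strictly increasing subsequence."""
    tails = []        # tails[k]: index of the smallest tail of a run of length k+1
    tail_values = []  # values at those indices, kept sorted for bisect
    prev = [-1] * len(values)
    for idx, v in enumerate(values):
        k = bisect_left(tail_values, v)
        if k == len(tails):
            tails.append(idx)
            tail_values.append(v)
        else:
            tails[k] = idx
            tail_values[k] = v
        prev[idx] = tails[k - 1] if k > 0 else -1
    result = []
    idx = tails[-1] if tails else -1
    while idx != -1:  # walk the predecessor links backwards
        result.append(idx)
        idx = prev[idx]
    return result[::-1]


def order_mums(mums):
    """Keep the largest set of MUMs that ascend in both genomes."""
    by_a = sorted(mums)  # sort by position in Genome A
    keep = longest_increasing_subsequence([m[1] for m in by_a])
    return [by_a[i] for i in keep]


# Seven hypothetical MUMs: the third and fifth are out of order in Genome B.
mums = [(10, 5, 20), (50, 30, 20), (90, 300, 20), (130, 60, 20),
        (170, 200, 20), (210, 90, 20), (250, 120, 20)]
print(order_mums(mums))
```

In this made-up input the third and fifth MUMs are dropped, mirroring the random match and the transposition in the figure.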
Results: Alignment of M. tuberculosis strains
CDC1551 (top) & H37Rv (bottom)
Single green lines indicate SNPs
Blue lines indicate insertions
Comparison of 2 Mycoplasma genomes
cousins that are distantly related
M. genitalium: 580 074 nt
M. pneumoniae: 816 394 nt (+236 000)
Analysis of the proteins tells us that all M. genitalium proteins are
present in M. pneumoniae.
Alignment was carried out using
FASTA (dividing each genome into 1000 bp segments)
All-against-all searches
Fixed length of pattern (25)
Using MUMmer (length = 25)
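The fixed-length pattern comparison can be sketched as an exact k-mer lookup: index every k-mer of one sequence, then scan the other. Each hit is one point on a dot plot. The function name is an assumption, and the toy call uses k = 4 only to keep the example readable; the slide's setting is k = 25.

```python
from collections import defaultdict


def shared_kmers(a, b, k=25):
    """Return (pos_a, pos_b) for every exact k-mer shared by a and b."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)  # all start positions of this k-mer in a
    hits = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            hits.append((i, j))
    return hits


print(sorted(shared_kmers("ACGTACGT", "TTACGTAA", k=4)))
```

Unlike MUMs, these hits are neither extended to their maximal length nor required to be unique, so repeats produce many points.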
Comparison of 2 Mycoplasma genomes
Using FASTA
Fixed-length
patterns: 25-mers
MUMmer