BLAST and FASTA are two bioinformatics programs; BLAST is usually used for sequence similarity searching.
Added: Oct 06, 2018
BLAST & FASTA
By Allie N U, MSc Biotechnology
Introduction
BLAST and FASTA are used to find regions of local similarity shared by two sequences. The method used to find the similarity is called alignment. It can be of two types:
Global alignment – aligns the entire length of both sequences, using as many characters as possible.
Local alignment – focuses on regions of similarity in parts of the sequences only.
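The difference between the two alignment types can be illustrated with a small dynamic programming sketch. It computes only the best score (no traceback), and the match, mismatch and gap values are arbitrary assumptions chosen for illustration, not defaults of any real program.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming (score only, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: end gaps are penalised, so the borders accumulate gap costs.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,   # align two characters
                       score[i - 1][j] + gap,        # gap in b
                       score[i][j - 1] + gap)        # gap in a
            if local:
                # Local: a negative running score is reset, so an alignment
                # may start anywhere.
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)
    # Local: best cell anywhere. Global: the bottom-right cell.
    return best if local else score[-1][-1]


a, b = "TTTTACGTACGT", "ACGTACGTCCCC"
print("local :", alignment_score(a, b, local=True))
print("global:", alignment_score(a, b, local=False))
```

The two sequences share the stretch ACGTACGT but differ elsewhere, so the local score rewards the shared region while the global score is pulled down by the unrelated ends.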
Alignment of two sequences is performed by the following methods:
Dot matrix analysis
Dynamic programming
Word or k-tuple method (FASTA and BLAST programs)
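As a rough illustration of the first method, a dot matrix marks every position where a character of one sequence equals a character of the other; shared regions then show up as diagonal runs. The function names here are hypothetical, and no window or filtering is applied, which real dot plot tools usually add.

```python
def dot_matrix(a, b):
    """Return a grid where cell [i][j] is True when a[i] == b[j]."""
    return [[a[i] == b[j] for j in range(len(b))] for i in range(len(a))]


def render(matrix):
    """Draw the grid as text: '*' for a match, '.' otherwise."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)


print(render(dot_matrix("ACGTAC", "TTACGT")))
```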
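Word or k-tuple method
This method aligns two sequences very quickly, first by searching for identical short stretches of sequence called words or k-tuples, then by joining these words into an alignment using dynamic programming. Because they examine only a subset of the possible alignments, BLAST and FASTA are heuristic methods: fast, but not guaranteed to find the optimal alignment.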
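The first step, finding identical words shared by two sequences, might be sketched as follows. The function names are hypothetical and this is a minimal illustration, not the indexing scheme of either program.

```python
def words(seq, k):
    """Map each word of length k to the list of positions where it starts."""
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index


def shared_words(query, subject, k):
    """Return sorted (query_pos, subject_pos) pairs of identical words."""
    q_index = words(query, k)
    hits = []
    for j in range(len(subject) - k + 1):
        # Look up the subject word in the query index instead of
        # comparing every pair of positions.
        for i in q_index.get(subject[j:j + k], []):
            hits.append((i, j))
    return sorted(hits)


print(shared_words("ACGT", "TTACGA", 3))
```

Indexing the query once and looking up each subject word is what makes the word method fast compared with filling a full dynamic programming table.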
BLAST - introduction
The Basic Local Alignment Search Tool (BLAST) is a popular, user-friendly tool for searching all the major sequence databases. It is used to find homologs of a query sequence in order to predict its identity, function, or 3D structure. It generally gives better results for protein sequences than for nucleotide sequences.
Salient features
Local alignment: BLAST tries to find patches of regional similarity rather than a global fit between the query and the database sequence.
BLAST works under the assumption that high-scoring alignments are likely to contain short stretches of identical or near-identical letters, called words.
Overview
BLAST is extremely fast. The program can be run locally, or queries can be submitted to the NCBI server. It is not guaranteed to find the best alignment between the query and a database sequence, and it may miss matches. This is because its strategy is designed to find most matches: it sacrifices some sensitivity in order to gain speed.
Working (brief)
BLAST searches in two phases. First, it looks for short subsequences (words) that are likely to be part of significant matches. Then it tries to extend these matched regions in both directions to obtain the maximum-scoring region of similarity.
General working of BLAST
Substitution matrix
A substitution matrix is a scoring scheme that assigns a score to the alignment of one residue against another. Margaret Dayhoff and her co-workers developed the first substitution matrices, used to compare protein sequences in evolutionary terms. These are commonly called PAM matrices. As an alternative to PAM, Steven Henikoff and his co-workers developed the BLOSUM matrices.
Substitution matrix: PAM and BLOSUM
PAM (point accepted mutation, also expanded as percent accepted mutation)
PAM matrices are based on global alignments of closely related proteins.
The number accompanying PAM refers to evolutionary distance; a larger number represents a greater evolutionary distance.
PAM250 is widely used.
BLOSUM (blocks substitution matrix)
BLOSUM matrices are based on local alignments.
A smaller number corresponds to more evolutionarily distant sequences.
BLOSUM62 is widely used.
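How a substitution matrix is applied can be shown with a toy example. The scores below are invented for illustration and are NOT real PAM or BLOSUM values; the function names and the default score for unlisted pairs are also assumptions.

```python
# Toy scores for illustration only; these are NOT real PAM or BLOSUM values.
TOY_MATRIX = {
    ("L", "L"): 4, ("I", "I"): 4, ("K", "K"): 5, ("D", "D"): 6,
    ("I", "L"): 2,    # similar residues: a positive substitution score
    ("D", "K"): -1,   # dissimilar residues: a negative score
}


def pair_score(x, y, matrix, default=-2):
    """Score one residue pair; the matrix is symmetric, so try both orders."""
    key = (x, y) if (x, y) in matrix else (y, x)
    return matrix.get(key, default)


def ungapped_score(a, b, matrix):
    """Sum the pair scores of two equal-length, ungapped aligned segments."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    return sum(pair_score(x, y, matrix) for x, y in zip(a, b))


print(ungapped_score("LKD", "LKD", TOY_MATRIX))  # identical segments
print(ungapped_score("LKD", "IKD", TOY_MATRIX))  # one conservative change
```

The point of the matrix is visible in the second call: replacing L with I lowers the score only slightly, whereas a simple identity count would treat it like any other mismatch.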
Steps involved
Preprocessing of the query: The aim is to quickly locate ungapped similarity between the query sequence and sequences from the database. All words of length W in the query are compared with the database sequences.
Generation of hits: A hit consists of one or several successive pairs of similar words and is characterised by its position in each of the two sequences. All possible hits between the query and the database are calculated.
Extension of the hits: Every hit is now extended, without gaps, in order to determine whether it is part of a larger segment of similarity. Every extended segment pair that scores at least S (a threshold set as a program parameter) is kept and called an HSP (high-scoring segment pair).
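The ungapped extension step can be sketched as below. This is a simplified illustration, not BLAST's actual implementation: it scores with a plain match/mismatch scheme instead of a substitution matrix, and it stops extending once the running score falls more than `x_drop` below the best score seen, a stopping rule assumed here. The function name and all parameter values are hypothetical.

```python
def extend_hit(query, subject, q_pos, s_pos, word_len,
               match=1, mismatch=-1, x_drop=3):
    """Extend a word hit without gaps; return (score, q_start, q_end)."""

    def walk(qi, si, step):
        # Move in one direction, remembering the best running score and
        # how many positions were needed to reach it.
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += match if query[qi] == subject[si] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # score has dropped too far; give up this direction
            qi += step
            si += step
        return best, best_len

    seed = sum(match if query[q_pos + i] == subject[s_pos + i] else mismatch
               for i in range(word_len))
    right, right_len = walk(q_pos + word_len, s_pos + word_len, 1)
    left, left_len = walk(q_pos - 1, s_pos - 1, -1)
    return seed + left + right, q_pos - left_len, q_pos + word_len + right_len


query, subject = "GGACGTACGTTT", "CCACGTACGTAA"
score, start, end = extend_hit(query, subject, q_pos=4, s_pos=4, word_len=3)
S = 6  # hypothetical threshold
print(score, query[start:end], "HSP" if score >= S else "discarded")
```

Here the three-letter hit GTA grows in both directions into the segment ACGTACGT, which is kept as an HSP only because its score reaches the threshold S.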
Types of BLAST
Standard BLAST programs are of five types: BLASTp, BLASTn, BLASTx, tBLASTn, tBLASTx.
Other programs include: MegaBLAST, PSI-BLAST, PHI-BLAST.
BLASTp – compares an amino acid query sequence against a protein sequence database.
BLASTn – compares a nucleotide query sequence against a nucleotide sequence database.
BLASTx – compares the six-frame translation products of a nucleotide query sequence against a protein sequence database.
tBLASTn – compares a protein query sequence against the six-frame translations of a nucleotide sequence database.
tBLASTx – compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
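The six-frame translation used by BLASTx, tBLASTn and tBLASTx can be sketched as follows: three reading frames on the given strand and three on its reverse complement. The sketch assumes the standard genetic code and DNA containing only A, C, G and T (any other codon is shown as X); the function names and frame labels are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```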
MegaBLAST – a program optimized for aligning long, highly similar sequences. It works only with DNA sequences.
PSI-BLAST – position-specific iterated BLAST. It is useful for protein similarity searches.
PHI-BLAST – pattern-hit initiated BLAST. It can be used to search for a specific pattern or motif.
FASTA
FASTA is a sequence analysis tool similar to BLAST. It was developed by W.R. Pearson and D.J. Lipman, and it can be accessed from the EBI site. FASTA gives better results for nucleotide sequences than for protein sequences. FASTP is for protein sequences.
Working (brief)
FASTA finds regions of similarity by first breaking the sequence into short subsequences (words), then searching for the diagonals with the highest density of matching words. The alignment along these diagonals is then refined. It is fast but not guaranteed to find the best alignment, and it may miss matches.
Steps involved
First, FASTA prepares a list of words from the pair of sequences to be matched. Words can be 3-6 nucleotides or 1 or 2 amino acids long.
It uses non-overlapping words, matches the words, and counts the matches.
It plots the word matches on diagonals and finds the highest-scoring match. This score is labeled init1.
Only if the score is sizable does it proceed to the second level.
In the second level, for every best hit of words, it looks for neighboring approximate hits. If the score is good, it joins them and prepares a larger dot matrix diagonal.
The best score from this second level of scoring is called initn. The initn scores are saved for each comparison of the query sequence with a database sequence.
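The idea of finding the diagonal with the highest density of word matches can be sketched as below. This is a simplified illustration rather than FASTA's actual code: it counts matches at every word position for simplicity, identifies a diagonal by the offset subject position minus query position, and does not compute the init1 or initn scores. The function names are hypothetical.

```python
from collections import Counter


def diagonal_counts(query, subject, ktup):
    """Count word matches per diagonal (subject_pos - query_pos)."""
    lookup = {}
    for i in range(len(query) - ktup + 1):
        lookup.setdefault(query[i:i + ktup], []).append(i)
    counts = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in lookup.get(subject[j:j + ktup], []):
            # Word matches that belong to the same ungapped alignment
            # share the same offset, so they fall on one diagonal.
            counts[j - i] += 1
    return counts


def best_diagonal(query, subject, ktup):
    """Return (diagonal, match_count) for the densest diagonal, or None."""
    counts = diagonal_counts(query, subject, ktup)
    return counts.most_common(1)[0] if counts else None


print(best_diagonal("ACGTACGT", "TTACGTACGT", ktup=4))
```

In this example the query matches the subject best when shifted by two positions, so most word matches pile up on diagonal 2; FASTA then rescores and refines only such promising diagonals.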
Types
Different programs in the FASTA package include:
FASTP – for protein sequences.
TFASTA – compares a query protein sequence to a DNA sequence database.
FASTF – compares a set of ordered peptide fragments, obtained by cleavage and sequencing of protein bands resolved by electrophoresis, against a protein database.
TFASTF – compares a set of ordered peptide fragments against a DNA database.