Gaps in an Alignment
– Gap opening penalty
– Gap extension penalty
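The two penalties are usually combined into an affine gap cost. The sketch below is a minimal illustration, not a fixed standard: the function name `gap_penalty`, the default values 10 and 1, and the convention that the opening penalty covers the first gapped position are all assumptions made for the example (some tools charge opening plus extension for every position instead).

```python
def gap_penalty(length, gap_open=10, gap_extend=1):
    """Affine gap cost for a gap spanning `length` positions.

    Assumed convention: the first gapped position costs `gap_open`,
    and every further position costs `gap_extend`.
    The default values are illustrative only.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0  # no gap, no penalty
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    # One long gap is cheaper than several short ones of the same total length.
    print("one gap of length 4:", gap_penalty(4))
    print("four gaps of length 1:", 4 * gap_penalty(1))
```

With an opening penalty much larger than the extension penalty, an alignment with a single long gap scores better than one with many scattered short gaps.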
Scoring Matrices
Scoring matrices assign a score to each comparison of a pair of characters. The scores are integers: positive for identical or similar character pairs, negative for dissimilar pairs. The matrices were constructed by analysing known families of proteins.
Scoring Matrices: BLOSUM versus PAM
The PAM family
– PAM matrices are based on global alignments of closely related proteins.
– PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence; other PAM matrices are extrapolated from PAM1.
– Developed by Margaret Dayhoff and co-workers.
The BLOSUM family
– BLOSUM matrices are based on local alignments (blocks).
– All BLOSUM matrices are based on observed alignments rather than extrapolation (BLOSUM62 is calculated from blocks in which sequences sharing 62% identity or more are clustered and counted as one).
Higher numbers in the PAM naming scheme denote larger evolutionary distance; in the BLOSUM scheme, higher numbers denote smaller distance.
– For alignment of distant proteins, use PAM150 instead of PAM100, or BLOSUM50 instead of BLOSUM62.
Scoring Matrices
For global alignments, use PAM matrices.
– Lower PAM matrices find short alignments of highly similar regions.
– Higher PAM matrices find weaker, longer alignments.
For local alignments, use BLOSUM matrices.
– BLOSUM matrices with high numbers are better for similar sequences.
– BLOSUM matrices with low numbers are better for distant sequences.
Assignment 2: Introduction to BLAST
Basic Local Alignment Search Tool
BLAST Results
– Max score: the score of the highest-scoring HSP from that database sequence.
– Total score: the total score of all HSPs from that database sequence.
– Query coverage: the percentage of the query length that is covered.
– Max identity: the maximal percent identity of the HSPs.
HSP = High-scoring Segment Pair: a local alignment that achieves one of the highest alignment scores in a given search.
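As a rough illustration of how these four columns relate to the individual HSPs of one database sequence, here is a minimal sketch. The `HSP` record, its field names, and the `summarize_hits` function are hypothetical and do not correspond to any BLAST output format; query coverage is assumed to count each query position once even when HSPs overlap.

```python
from dataclasses import dataclass


@dataclass
class HSP:
    """Hypothetical record for one high-scoring segment pair."""
    score: float
    query_start: int   # 1-based, inclusive
    query_end: int     # 1-based, inclusive
    identities: int    # number of identical aligned positions
    align_length: int  # total aligned positions in this HSP


def summarize_hits(hsps, query_length):
    """Derive the summary columns for one database sequence from its HSPs."""
    max_score = max(h.score for h in hsps)
    total_score = sum(h.score for h in hsps)

    # Count covered query positions, merging overlapping HSP intervals.
    covered = 0
    last_end = 0
    for start, end in sorted((h.query_start, h.query_end) for h in hsps):
        start = max(start, last_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end

    return {
        "max_score": max_score,
        "total_score": total_score,
        "query_coverage": 100.0 * covered / query_length,
        "max_identity": max(100.0 * h.identities / h.align_length for h in hsps),
    }


if __name__ == "__main__":
    example = [HSP(100, 1, 50, 45, 50), HSP(40, 41, 80, 30, 40)]
    print(summarize_hits(example, query_length=100))
```

The example data are made up; the two HSPs overlap on query positions 41 to 50, so those positions are counted only once in the coverage.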
Query: the sequence used for the search
Subject: a sequence that was found to match the similarity criteria
Steps for searching a protein sequence database with a query protein sequence include the following.
E.g., searching with the word PQG:
– The likelihood of a match to itself is found in the BLOSUM62 matrix as the log-odds score of a P-P match + a Q-Q match + a G-G match = 7 + 5 + 6 = 18.
– Similarly, matches of PQG to PEG would score 15, PRG 14, PSG 13, and PQA 12.
– If the cutoff score T is 13, possible matches to PQG would include PEG (15) but not PQA (12).
The above procedure is repeated for each three-letter word in the query sequence.
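The word-scoring example above can be reproduced with a short sketch. Only the BLOSUM62 entries needed for this example are included, so the table is deliberately incomplete; the names `BLOSUM62_SUBSET`, `word_score`, and `neighborhood` are hypothetical helpers for illustration, not part of BLAST.

```python
# Only the BLOSUM62 entries needed for the PQG example (not the full matrix).
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "S"): 0,
    ("G", "A"): 0,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]  # KeyError if the pair is not in the subset


def word_score(query_word, candidate):
    """Sum the position-by-position substitution scores of two words."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep the candidate words whose score reaches the cutoff T."""
    kept = []
    for word in candidates:
        score = word_score(query_word, word)
        if score >= threshold:
            kept.append((word, score))
    return kept


if __name__ == "__main__":
    words = ["PQG", "PEG", "PRG", "PSG", "PQA"]
    for word, score in neighborhood("PQG", words, threshold=13):
        print(word, score)
```

With T = 13 the sketch keeps PQG, PEG, PRG, and PSG and drops PQA, matching the scores listed above.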
Is the similarity significant or could it have arisen by chance? If the score of the alignment observed is no better than might be expected from a random permutation of the sequence, then it is likely to have arisen by chance. The alignment is unlikely to be significant, if the randomized sequences score as well as the original one.
Significance of BLAST results: Z-score and P-value
Z-score: reflects the extent to which the original result is an outlier from the randomized sequences.
– Z-score = 0: the observed similarity is no better than the average of random permutations of the sequence, and might well have arisen by chance.
P-value: P is another measure of significance. It is the probability that the observed match could have happened by chance.
– P <= 10^-100: exact match
– P in range 10^-100 to 10^-50: sequences very nearly identical
– P in range 10^-50 to 10^-10: closely related sequences, homology certain
– P in range 10^-5 to 10^-1: usually distant relatives
– P > 10^-1: match probably insignificant
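The randomization idea behind the Z-score can be sketched as follows. This is a minimal illustration under stated assumptions: the Z-score is taken as (observed score minus mean of shuffled scores) divided by their sample standard deviation, and `identity_score` is a toy ungapped scoring function standing in for a real alignment score. All function names here are hypothetical.

```python
import random
import statistics


def identity_score(a, b):
    """Toy score (an assumption): count identical positions, no gaps."""
    return sum(x == y for x, y in zip(a, b))


def z_score(observed, random_scores):
    """How many standard deviations the observed score lies above the random mean."""
    mean = statistics.mean(random_scores)
    spread = statistics.stdev(random_scores)
    if spread == 0:
        raise ValueError("randomized scores show no variation")
    return (observed - mean) / spread


def shuffled_scores(query, subject, score_fn, n=200, seed=0):
    """Score the query against n random permutations of the subject."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    letters = list(subject)
    scores = []
    for _ in range(n):
        rng.shuffle(letters)
        scores.append(score_fn(query, "".join(letters)))
    return scores


if __name__ == "__main__":
    query = "MKVLAAGIVGLTPQGSWE"    # made-up sequences for illustration
    subject = "MKVLSAGIVGLTPEGSWD"
    observed = identity_score(query, subject)
    background = shuffled_scores(query, subject, identity_score)
    print("Z-score:", z_score(observed, background))
```

A Z-score near 0 means the original sequence scores about as well as its shuffled versions; a large positive value marks the observed alignment as an outlier.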
Significance of BLAST results: E-value
The E-value of an alignment is the expected number of sequences that give the same Z-score or better if the database is probed with a random sequence. E is found by multiplying the value of P by the size of the database probed. E-values range between 0 and the number of sequences in the database searched.
– E <= 0.02: sequences probably homologous
– E between 0.02 and 1: homology cannot be ruled out
– E > 1: a match this good is expected by chance
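The relationship between P and E, together with the rule-of-thumb thresholds above, can be written as a short sketch. The function names `e_value` and `interpret_e_value` are hypothetical, and the database size is taken to be the number of sequences, as in the text.

```python
def e_value(p_value, database_size):
    """E = P multiplied by the number of sequences in the database probed."""
    return p_value * database_size


def interpret_e_value(e):
    """Apply the rule-of-thumb thresholds quoted in the text."""
    if e <= 0.02:
        return "sequences probably homologous"
    if e <= 1:
        return "homology cannot be ruled out"
    return "a match this good is expected by chance"


if __name__ == "__main__":
    # Illustrative numbers only: the same P looks weaker in a larger database.
    for size in (1_000, 1_000_000):
        e = e_value(1e-5, size)
        print(size, e, interpret_e_value(e))
```

The same P-value yields a larger E-value as the database grows, which is why a hit that looks convincing in a small database can be unremarkable in a large one.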
Sequence types used in BLAST may differ.
Algorithms may also differ:
– PSI-BLAST
– PHI-BLAST