DNA is less conserved than protein sequences (codon degeneracy; synonymous mutations)
– Comparing coding regions at the nucleotide level is therefore less effective
SEQUENCE COMPARISON
Motifs/domains: similarity over short stretches
Sequence families: similarity over longer sequences
Comparison can help us with structure, function and evolution
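The codon-degeneracy point above can be illustrated with a small sketch. It assumes a partial codon table (only the codons used here, taken from the standard genetic code) and two made-up DNA sequences; the helper names `translate` and `percent_identity` are hypothetical.

```python
# Partial codon table: only the codons used below (standard genetic code).
CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "AAA": "K", "AAG": "K",                          # lysine
    "CTG": "L", "TTA": "L",                          # leucine
}


def translate(dna):
    """Translate a DNA string codon by codon using the partial table."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def percent_identity(a, b):
    """Percentage of positions with identical characters (equal lengths)."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Two made-up coding sequences that differ only by synonymous changes.
dna1 = "GCTAAACTG"
dna2 = "GCGAAGTTA"

print(percent_identity(dna1, dna2))                        # well below 100
print(translate(dna1), translate(dna2))                    # AKL AKL
print(percent_identity(translate(dna1), translate(dna2)))  # 100.0
```

The nucleotide sequences differ at four of nine positions, yet both encode the same peptide, which is why protein-level comparison is usually more sensitive for coding regions.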
Alignment of E. coli promoter sequences
An alignment is an arrangement of two sequences opposite one another that shows where they are similar and where they differ.
Aligning a pair of nucleic acid or protein sequences can also indicate whether there is an evolutionary relationship between them.
Global & Local Alignment
Local alignment: finds the best-matching subsequences of the two sequences
Global alignment: finds the overall match or similarity between two sequences over their entire length
Local alignments: why?
Two genes in different species may be similar over short conserved regions and dissimilar over the remaining regions.
Example:
– Homeobox genes contain a short region (the homeobox, which encodes the homeodomain) that is highly conserved between species.
– A global alignment may fail to highlight the homeodomain because it tries to align the ENTIRE sequence
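The difference between the two modes can be sketched with dynamic programming. The slides do not name an algorithm; the sketch below uses a Needleman-Wunsch style recurrence for the global case and a Smith-Waterman style recurrence for the local case. The function name `align_score`, the scores (+1 match, -1 mismatch, -2 per gap position) and the test sequences are all assumptions chosen for illustration.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style), end to end.
    local=True:  local alignment (Smith-Waterman style), best substrings.
    Scores are illustrative assumptions, with a simple per-position gap cost.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are penalised.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score = max(table[i - 1][j - 1] + pair,  # align two characters
                        table[i - 1][j] + gap,       # gap in b
                        table[i][j - 1] + gap)       # gap in a
            if local:
                score = max(score, 0)  # local mode may restart anywhere
                best = max(best, score)
            table[i][j] = score
    return best if local else table[-1][-1]


# Made-up sequences sharing only the short region GATTACA.
seq1 = "AAAAGATTACA"
seq2 = "GATTACATTTT"
print(align_score(seq1, seq2, local=True))   # 7: the shared GATTACA
print(align_score(seq1, seq2, local=False))  # lower: flanks must be aligned too
```

The local score reflects only the conserved stretch, while the global score is dragged down by the unrelated flanking regions.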
How is an Alignment Done?
When we compare sequences, we take two strings of letters (nucleotides or amino acids) and align them.
Where the characters are identical we assign a positive score; where they differ, a negative score.
We count the identical and non-identical characters and then calculate the total score of the alignment.
When aligning sequences, we want to find the optimal alignment: the one with the most similarities and the fewest differences.
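The counting step described above can be sketched as follows. The scores (+1 for identical, -1 for different), the function name `score_pair` and the example strings are assumptions for illustration; the two strings are taken as already aligned and ungapped.

```python
def score_pair(seq1, seq2, match=1, mismatch=-1):
    """Count identical and non-identical positions and total the score.

    Assumes the two strings are already aligned, ungapped and equal in length.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(x == y for x, y in zip(seq1, seq2))
    different = len(seq1) - identical
    total = identical * match + different * mismatch
    return identical, different, total


print(score_pair("GATTACA", "GACTATA"))  # (5, 2, 3)
```

Finding the optimal alignment then amounts to searching over possible arrangements for the one with the highest total.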
What is the logical basis for similarity/dissimilarity?
Evolutionary considerations are very important.
Scoring Matrices
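Gaps in an Alignment
Gap opening penalty: the cost of starting a new gap
Gap extension penalty: the cost of each further position by which an existing gap is lengthened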
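A minimal sketch of how the two penalties combine, assuming one common convention: a gap of length k costs the opening penalty once plus the extension penalty for each of the remaining k - 1 positions. Other tools may define it differently. The values -10 and -1 and both function names are arbitrary choices for illustration.

```python
def gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Affine cost of one gap: open once, extend for the remaining positions."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def total_gap_penalty(row1, row2, gap_open=-10, gap_extend=-1):
    """Sum the affine penalties over every run of '-' in an aligned pair."""
    total = 0
    previous = None  # which row was gapped in the previous column, if any
    for x, y in zip(row1, row2):
        current = 1 if x == "-" else 2 if y == "-" else None
        if current is not None:
            # Continuing a gap in the same row extends it; otherwise it opens.
            total += gap_extend if current == previous else gap_open
        previous = current
    return total


print(gap_penalty(3))                                     # -12
print(total_gap_penalty("ACGT---ACGT", "ACGTTTTAC-T"))    # -22
```

With these values one long gap is cheaper than several short ones of the same total length, which favours alignments with few, contiguous insertions or deletions.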
Scoring matrices are used to assign a score to each comparison of a pair of characters.
The scores in the matrix are integers: positive for identical or similar character pairs and negative for dissimilar pairs.
The matrices were constructed by analysing known families of proteins.
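Looking up pair scores in such a matrix can be sketched as below. `TOY_MATRIX` covers only three amino acids and its values are toy numbers chosen for illustration, not a published matrix; the function names are hypothetical.

```python
# Toy symmetric substitution matrix over I, L and D (illustrative values only).
TOY_MATRIX = {
    ("I", "I"): 4, ("L", "L"): 4, ("D", "D"): 6,
    ("I", "L"): 2,    # similar residues: positive score
    ("I", "D"): -3,   # dissimilar residues: negative score
    ("L", "D"): -4,
}


def pair_score(x, y, matrix=TOY_MATRIX):
    """Look up the score for a pair; the matrix is symmetric."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]


def score_ungapped(seq1, seq2, matrix=TOY_MATRIX):
    """Sum the pair scores over an ungapped alignment of equal-length strings."""
    return sum(pair_score(x, y, matrix) for x, y in zip(seq1, seq2))


print(score_ungapped("ILD", "LLD"))  # 2 + 4 + 6 = 12
print(score_ungapped("ILD", "DLI"))  # -3 + 4 - 3 = -2
```

Unlike a plain match/mismatch scheme, the matrix rewards a conservative substitution such as I for L while still penalising an unlikely one such as I for D.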
BLOSUM versus PAM
The PAM family
– PAM matrices are based on global alignments of closely related proteins.
– PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence; the other PAM matrices are extrapolated from PAM1.
– Developed by Margaret Dayhoff and co-workers.
The BLOSUM family
– BLOSUM matrices are based on local alignments (blocks).
– All BLOSUM matrices are based on observed alignments. BLOSUM62, for example, is calculated from blocks in which sequences more than 62% identical are clustered together, so it reflects comparisons between sequences that are no more than 62% identical.
Higher numbers in the PAM naming scheme denote larger evolutionary distance; in the BLOSUM scheme, higher numbers denote more closely related sequences.
– For alignment of distant proteins, use PAM150 instead of PAM100, or BLOSUM50 instead of BLOSUM62.
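The idea of extrapolating from PAM1 can be sketched as repeated matrix multiplication: the mutation probabilities for n PAM units are taken as the PAM1 probability matrix multiplied by itself n times. The sketch assumes a toy two-letter alphabet with a 1% chance of change per unit (`TOY_PAM1` is made up, not Dayhoff's data), and it stops at probabilities; the published PAM scoring matrices are further converted to log-odds scores.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply the matrix m by itself n times, starting from the identity."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy mutation probabilities for a two-letter alphabet: 1% change per unit.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

for n in (1, 100, 250):
    pam_n = mat_power(TOY_PAM1, n)
    # Probability that the first letter has become the second after n units.
    print(n, round(pam_n[0][1], 3))
```

The off-diagonal probability grows with n but levels off, which matches the intuition that higher PAM numbers model more diverged sequences.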
For global alignments use PAM matrices
– Lower PAM matrices find short alignments of highly similar regions
– Higher PAM matrices find weaker, longer alignments
For local alignments use BLOSUM matrices
– BLOSUM matrices with high numbers are better for similar sequences
– BLOSUM matrices with low numbers are better for distant sequences
Assignment 2: Introduction to BLAST
Basic Local Alignment Search Tool