SCORING MATRIX
A scoring system is a set of values for quantifying the likelihood of one residue being substituted by another in an alignment. It is also known as a substitution matrix. Scoring matrices for nucleotides are relatively simple: a positive value or high score is given for a match, and a negative value or low score is given for a mismatch. Scoring matrices for amino acids are more complicated because the scoring has to reflect the physicochemical properties of amino acid residues.
Scoring Matrices for Aligning DNA Sequences
Transitions: substitutions in which a purine (A/G) is replaced by another purine (A/G), or a pyrimidine (C/T) is replaced by another pyrimidine (C/T).
Transversions: substitutions in which a purine (A/G) is replaced by a pyrimidine (C/T), or vice versa.

Identity matrix
       G    C    T    A
  G    1    0    0    0
  C    0    1    0    0
  T    0    0    1    0
  A    0    0    0    1

Transition-transversion matrix
       G    C    T    A
  G    1   -5   -5   -1
  C   -5    1   -1   -5
  T   -5   -1    1   -5
  A   -1   -5   -5    1
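The transition-transversion scheme can be generated from the purine/pyrimidine classes rather than typed in by hand. The following is a minimal sketch, assuming the scores shown on this slide (+1 match, -1 transition, -5 transversion); the names `dna_score` and `matrix` are illustrative and not from any library.

```python
# Base classes used to tell transitions from transversions.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(x, y, match=1, transition=-1, transversion=-5):
    """Score one pair of bases under a transition-transversion scheme."""
    if x == y:
        return match
    # A transition stays within one class (purine or pyrimidine).
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


# Build the full 4 x 4 matrix as a dictionary keyed by base pair.
bases = "GCTA"
matrix = {(x, y): dna_score(x, y) for x in bases for y in bases}

# Print the matrix row by row in the same G, C, T, A order as the slide.
for x in bases:
    print(x, [matrix[(x, y)] for y in bases])
```

Setting `transition` and `transversion` both to 0 gives the identity matrix instead.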
Scoring a sequence alignment
Match score: +1
Mismatch score: 0
Gap penalty: -1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (-1)
Score = 18 + 0 - 7 = +11
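The arithmetic on this slide can be checked with a few lines of code. This is a minimal sketch, assuming the two aligned rows are read as continuous strings of equal length with `-` marking a gap; the function name `score_alignment` is illustrative.

```python
def score_alignment(seq_a, seq_b, match=1, mismatch=0, gap=-1):
    """Sum column scores for two already-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score += gap        # a gap in either row
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # different residues
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
total = score_alignment(top, bottom)
print(total)  # 18 matches, 2 mismatches, 7 gaps -> 11
```

Each gap position is penalised separately here (a linear gap penalty), which is what the slide's count of 7 gaps implies.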
Amino Acid Substitution Matrices
PAM (point accepted mutation): based on global alignments [evolutionary model]
BLOSUM (blocks substitution matrix): based on local alignments [similarity among conserved sequences]
PAM Matrix
PAM stands for point accepted mutation. The PAM matrices were first constructed by Dayhoff, who compiled alignments of 71 groups of very closely related protein sequences. They were derived from the evolutionary divergence between sequences within the same group. Construction of the PAM1 matrix involves alignment of full-length sequences and subsequent construction of phylogenetic trees using the parsimony principle.
Ancestral sequence information is used to count the number of substitutions along each branch of the tree. A positive score in the matrix denotes a substitution occurring more frequently than expected among evolutionarily conserved replacements; a negative score corresponds to a substitution that occurs less frequently than expected. One PAM unit is defined as 1% amino acid change, or one accepted mutation per 100 residues. Increasing PAM numbers correspond to increasing PAM units and thus to greater evolutionary distances between protein sequences.
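In the standard Dayhoff model (stated here as background, since the slide does not spell it out), higher-numbered PAM matrices are extrapolated by multiplying the PAM1 mutation probability matrix by itself, so PAM250 corresponds to PAM1 raised to the 250th power. The sketch below illustrates only that extrapolation step. The three-letter alphabet and the probabilities in `TOY_PAM1` are hypothetical values chosen for illustration, not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical mutation probabilities for a toy 3-letter alphabet.
# Each row sums to 1; the diagonal is the chance of no change per PAM unit.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

As the exponent grows, the diagonal entries shrink and the off-diagonal entries grow, which is how a larger PAM number comes to represent a greater evolutionary distance.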
Limitations of PAM Matrices
Constructed on the basis of phylogenetic relationships that must be established before mutations are scored.
Ancestral relationships among sequences are difficult to determine.
Based on a small set of closely related proteins.
BLOSUM Matrices
BLOSUM is a series of blocks amino acid substitution matrices. They are derived from direct observation of every possible amino acid substitution in multiple sequence alignments. The conserved sequence patterns used are called blocks: ungapped alignments of fewer than 60 amino acid residues in length. The number in a BLOSUM matrix name is the actual percentage identity of the sequences selected for constructing that matrix.
BLOSUM62 indicates that the sequences selected for constructing the matrix share an average identity of 62%. The BLOSUM score for a particular residue pair is derived from the log ratio of the observed substitution frequency for that pair versus the frequency expected by chance. The lower the BLOSUM number, the more divergent the sequences the matrix represents.
Part of BLOSUM62 Matrix

       C    S    T    P    A    G
  C    9
  S   -1    4
  T   -1    1    5
  P   -3   -1   -1    7
  A    0    1    0   -1    4
  G   -3    0   -2   -2    0    6

BLOSUM62 was measured on pairs of sequences with an average of 62% identical amino acids.

$$\text{log-odds} = \log\left(\frac{\text{chance to see the pair in homologous proteins}}{\text{chance to see the pair in unrelated proteins by chance}}\right)$$
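The log-odds ratio can be turned into an integer score with a short function. This is a minimal sketch under two assumptions not stated on the slide: the logarithm is base 2, and the result is scaled by 2 and rounded (half-bit units, the convention used for the published BLOSUM matrices). The frequencies in the example calls are hypothetical numbers chosen for easy arithmetic, not real BLOSUM counts.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Rounded, scaled log2 ratio of observed to expected pair frequency.

    observed: frequency of the residue pair in aligned homologous proteins
    expected: frequency of the pair expected by chance
    scale:    2.0 gives half-bit units (an assumption, see lead-in)
    """
    return round(scale * math.log2(observed / expected))


# Pair seen 4x more often than chance -> positive score.
print(log_odds_score(0.04, 0.01))   # 4
# Pair seen 4x less often than chance -> negative score.
print(log_odds_score(0.01, 0.04))   # -4
# Pair seen exactly as often as chance -> zero.
print(log_odds_score(0.02, 0.02))   # 0
```

A positive score therefore marks a substitution that is favoured among related proteins, and a negative score marks one that is avoided.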
PAM vs. BLOSUM

PAM
Based on a mutational model of evolution (Markov process).
PAM1 is based on sequences of 85% similarity.
Designed to track evolutionary origins.

BLOSUM
Based on multiple alignments of blocks.
Well suited to comparing distant sequences.
Designed to find the conserved domains of proteins.