TABLE OF CONTENTS
01 Introduction (Sequence Alignment & Its Types)
02 BLOSUM (Algorithm, BLOSUM Score & BLOSUM62)
03 PAM (Sources Of Error, PAM250 & Comparison)
INTRODUCTION
SEQUENCE
ALIGNMENT
Sequence alignment is a fundamental
bioinformatics technique used to compare
and identify similarities and differences
between two or more biological
sequences, such as DNA, RNA, or protein
sequences. The primary goal of sequence
alignment is to find regions of similarity or
homology between sequences, which can
provide insights into their evolutionary
relationships, functional similarities, or
structural features.
TYPES OF
SEQUENCE ALIGNMENT
Global
Aligns two sequences over their entire length.
The Needleman-Wunsch algorithm is the classic example.
Useful for comparing entire genes or genomes.
Semi-Global
Aligns sequences globally but does not penalize gaps at the ends.
Usually implemented as a modification of a global alignment algorithm such as Needleman-Wunsch.
Useful for aligning partial sequences of differing lengths.
Local
Finds the best alignment between subsequences within the sequences.
The Smith-Waterman algorithm is the well-known example.
Useful for identifying specific conserved regions or motifs within sequences.
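The difference between the three alignment types can be sketched with one dynamic-programming routine. This is a minimal illustration, not a reference implementation: the function name `alignment_score`, the `mode` argument, and the match/mismatch/gap values are assumptions chosen for the example, and the gap penalty is linear. Here "semiglobal" is taken to mean that leading and trailing gaps in either sequence are free.

```python
def alignment_score(s, t, mode="global", match=1, mismatch=-1, gap=-2):
    """Best alignment score of s and t under a simple linear gap penalty."""
    if mode not in ("global", "semiglobal", "local"):
        raise ValueError(f"unknown mode: {mode}")
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    if mode == "global":
        # Only global alignment charges for gaps at the start.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if mode == "local":
                # Smith-Waterman: a negative prefix is dropped and the alignment restarts.
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)
    if mode == "global":
        # Needleman-Wunsch: both sequences must be consumed completely.
        return score[-1][-1]
    if mode == "semiglobal":
        # Trailing gaps are free: take the best cell in the last row or last column.
        return max(max(row[-1] for row in score), max(score[-1]))
    return best


print(alignment_score("ACGT", "TTACGTTT", "global"))      # -4
print(alignment_score("ACGT", "TTACGTTT", "semiglobal"))  # 4
print(alignment_score("ACGT", "TTACGTTT", "local"))       # 4
```

The short sequence sits inside the longer one, so the global score is dragged down by four penalized end gaps, while the semi-global and local modes both recover the four matching positions.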
DIFFERENT STANDARD
SCORING MATRICES
Standard scoring matrices are used in bioinformatics and computational biology to assess the similarity or dissimilarity between sequences, particularly in the context of sequence alignment (e.g., protein or DNA sequence alignment). Different scoring matrices have been developed to account for the variability in sequence evolution rates and mutation probabilities.
Some of the most commonly used standard scoring matrices:
BLOSUM (BLOcks SUbstitution Matrix)
PAM (Percent Accepted Mutation)
GONNET
JTT (Jones-Taylor-Thornton)
NUC (Nucleotide) Matrices
BLOSUM
1. BLOSUM (BLOcks SUbstitution Matrix)
I. BLOSUM matrices are widely used for protein sequence alignments.
II. The BLOSUM matrices were developed by Dr. Steven Henikoff and Dr. Jorja Henikoff.
III. The matrices are called BLOSUM matrices, with an index denoting the level of clustering: for example, BLOSUM62 is derived from sequence blocks clustered at the 62% identity level.
IV. High-index matrices (e.g., BLOSUM80, BLOSUM90): Designed for sequences with high similarity (few substitutions).
V. Intermediate matrices (e.g., BLOSUM50, BLOSUM62): Suitable for moderate similarity.
VI. BLOSUM62: The most widely used matrix for general protein sequence alignments.
VII. Low-index matrices (e.g., BLOSUM30, BLOSUM45): Designed for distantly related sequences with low similarity. Because the index is a percent identity threshold, it cannot exceed 100.
The BLOSUM62 Matrix
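ALGORITHM
(Steps Of Building
BLOSUM Matrix)
1. Data Collection:
Collect a dataset of related protein sequences, grouped into families.
2. Create Sequence Blocks:
Extract the conserved, ungapped aligned regions (blocks) from each family.
3. Calculate Frequencies:
Calculate how often each amino acid occurs in the blocks (the background frequencies).
4. Pairwise Comparison:
For every column of every block, count the amino acid pairs observed between sequences.
5. Calculate Substitution Probabilities:
Convert the pair counts into observed pair frequencies, and compute the pair frequencies expected by chance from the background frequencies.
6. Log-Odds Ratios:
Compute the log-odds ratio of observed to expected frequency for each pair.
7. Normalize:
Scale the log-odds scores (half-bit units for BLOSUM62) and round them to integers.
8. Matrix Generation:
Assemble the scores into the symmetric 20 x 20 BLOSUM matrix.
9. Matrix Parameters:
The similarity threshold parameterizes the matrix: sequences in a block that are at least that identical (62% for BLOSUM62) are clustered and weighted as a single sequence before pairs are counted.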
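The counting and log-odds steps above can be sketched on a toy block. This is a simplified illustration under stated assumptions: the function name `log_odds_scores` and the three-column block are invented for the example, the clustering step is skipped (every sequence counts fully), and pairs that never occur in the block are simply left out because their log-odds ratio is undefined.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(block):
    """Half-bit log-odds scores from a block of equal-length, ungapped sequences."""
    # Pairwise comparison: count every residue pair within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    # Observed pair frequencies q_ij.
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency p_i of each residue, derived from the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # Log-odds of observed versus expected, scaled to half-bits and rounded.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Hypothetical block: column 1 is all A, column 2 is all S, column 3 is A, A, A, S.
toy_block = ["ASA", "ASA", "ASA", "ASS"]
print(log_odds_scores(toy_block))
# {('A', 'A'): 1, ('S', 'S'): 2, ('A', 'S'): -3}
```

In this toy block A and S rarely share a column, so the A/S pair is seen less often than chance predicts and receives a negative score, while the conserved pairs score positive.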
THE BLOSUM SCORE
Understanding BLOSUM Scores in Sequence Alignments:
A numerical value indicating how readily one amino acid substitutes for another in alignments.
Derived from BLOSUM matrices based on real substitution data.
Positive: substitution observed more often than expected by chance (favoured); Negative: observed less often than expected (avoided).
Magnitude shows strength of preference.
Used in sequence alignment algorithms and tools (e.g., BLAST).
Matrices for different similarity levels (e.g., BLOSUM45, BLOSUM62).
Scores matches and mismatches; gap penalties are set separately and are not part of the matrix.
Reflects biological data, aids in evolutionary analysis.
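As a small illustration of how these scores are applied, the sketch below sums substitution scores over an ungapped alignment. The dictionary holds only a handful of BLOSUM62 entries (the full matrix is 20 x 20); the names `BLOSUM62_SUBSET`, `pair_score`, and `ungapped_score` are assumptions made for the example, and gaps are not handled.

```python
# A few entries from BLOSUM62; keys are alphabetically sorted residue pairs.
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("L", "L"): 4,
    ("W", "W"): 11,
    ("D", "E"): 2,
    ("I", "L"): 2,
    ("K", "R"): 2,
    ("G", "W"): -2,
}


def pair_score(a, b):
    """Look up the score for a residue pair; the matrix is symmetric."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def ungapped_score(s, t):
    """Sum the substitution scores over two aligned, equal-length sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(s, t))


print(ungapped_score("WALK", "WAIR"))  # 11 + 4 + 2 + 2 = 19
```

The conserved tryptophan contributes most of the total, which reflects how rarely W is replaced in real blocks; the conservative L/I and K/R substitutions still score positive.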
PAM
2. PAM (Percent Accepted Mutation):
Definition: PAM stands for "Percent Accepted Mutation" (also expanded as "Point Accepted Mutation"), a model and set of substitution matrices used in bioinformatics.
Introduction: Developed by Margaret Dayhoff and colleagues in the 1970s, PAM matrices quantify evolutionary distances between amino acid sequences.
Matrix Set: PAM matrices are indexed by evolutionary distance, commonly from PAM1 to PAM250; one PAM unit corresponds to an average of one accepted mutation per 100 residues.
Method: Substitutions are counted in alignments of closely related sequences to build PAM1; matrices for larger distances are obtained by multiplying PAM1 by itself (PAM250 is PAM1 raised to the 250th power).
Usage: PAM matrices are employed to score protein sequence alignments, with higher scores indicating greater sequence similarity.
Precision: Most reliable for closely related sequences, since PAM1 is estimated from them and larger distances are extrapolated under a constant substitution rate.
Limitations: Less accurate for highly divergent sequences, as the constant rate assumption may not hold.
Historical Significance: PAM was among the earliest models for quantifying protein sequence evolution, contributing significantly to bioinformatics.
PAM250
A specific amino acid substitution matrix in the PAM model.
Represents an evolutionary distance of 250 PAM units: an average of 250 accepted mutations per 100 residues, with repeated changes at the same site each counted.
Developed by Margaret Dayhoff and colleagues in the 1970s.
Used for scoring protein sequence alignments.
Higher scores indicate greater sequence similarity.
Available in bioinformatics tools like BLAST.
Suited for distantly related protein sequences.
Values extrapolated from observed substitution data.
Matrices at other distances (e.g., PAM30, PAM70) can be chosen for specific analyses.
Assumes a constant substitution rate, so it may be less accurate for highly divergent sequences.
The PAM250 Matrix
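The way a PAM matrix at a larger distance follows from PAM1 can be sketched with repeated matrix multiplication. This is a toy illustration only: the two-letter alphabet, the 1% change probability in `PAM1_TOY`, and the helper names `mat_mul` and `mat_pow` are assumptions for the example, not Dayhoff's data. Rows are taken to be the original residue and columns the replacement.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, exponent):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    size = len(matrix)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    base = matrix
    while exponent > 0:
        if exponent & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        exponent >>= 1
    return result


# Hypothetical 1-PAM transition matrix: each residue changes with probability 0.01.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

pam250_toy = mat_pow(PAM1_TOY, 250)
print(round(pam250_toy[0][0], 4))  # probability a residue is unchanged after 250 PAM units
```

Even after 250 PAM units the probability of seeing the original residue does not fall to zero, because later mutations can revert earlier ones; this is why a distance of 250 PAM does not mean every position has changed.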
SOURCES OF ERROR IN PAM
MODEL:
1. Constant Rate Assumption: Assumes a constant rate of amino acid substitution, which may not hold in all cases.
2. Simplified Model: Simplifies complex evolutionary processes, potentially missing important nuances.
3. Data Dependence: Accuracy relies on the quality and representativeness of input data.
4. Limited Applicability: Most accurate for closely related sequences, less so for highly divergent ones.
COMPARISON
Comparison Between PAM and BLOSUM:
Origin:
PAM (Percent Accepted Mutation): Based on an evolutionary model fitted to alignments of closely related proteins and extrapolated to larger distances.
BLOSUM (BLOcks SUbstitution Matrix): Derived directly from conserved blocks of aligned sequences, without extrapolation.
Nature:
PAM: Provides matrices with specific evolutionary distances (e.g., PAM250).
BLOSUM: Offers matrices optimized for varying sequence similarity (e.g., BLOSUM45, BLOSUM62).
Usage:
PAM: Suitable for comparing sequences at specific evolutionary depths.
BLOSUM: Adaptable for a wide range of sequence similarities.
Interpretation:
PAM: Reflects modelled evolutionary relationships.
BLOSUM: Grounded in empirical substitution patterns.
Matrix Variation:
PAM: Matrices differ by evolutionary distance (e.g., PAM1, PAM250); a higher number means more divergent sequences.
BLOSUM: Matrices differ by sequence similarity (e.g., BLOSUM45, BLOSUM80); a higher number means more similar sequences.
Advantages:
PAM: Ties each matrix to an explicit evolutionary distance.
BLOSUM: Versatile for diverse sequence alignments.
Choice:
PAM: Chosen based on evolutionary analysis needs.
BLOSUM: Chosen based on sequence similarity characteristics.
Practical Use:
PAM: Common in research that needs an explicit evolutionary distance.
BLOSUM: Common in sequence alignment tools like BLAST.
Significance Of Scoring Matrices
1. Alignment Basis: Scoring matrices underpin sequence alignment algorithms, aiding in the comparison of biological sequences.
2. Biological Validity: Derived from real biological data, they reflect substitution patterns during evolution.
3. Homology Identification: Crucial for detecting homologous sequences, which share common ancestry.
4. Phylogenetics: Used in constructing evolutionary trees and understanding species relationships.
5. Database Searches: Enables efficient searching of biological databases for similar sequences.
6. Research Applications: Widely used in protein structure prediction.
7. Sequence Annotation: Identifies conserved regions, domains, and post-translational modification sites.
11. Quality Control: Assists in assessing alignment quality and ensuring research reliability.