unit3.pptx

KishoreSubramaniyan 2 views 33 slides Oct 30, 2025

About This Presentation

Bioinformatics


Slide Content

Unit 3

Multiple sequence analysis
Multiple Sequence Alignment (MSA) is a fundamental technique in bioinformatics used to align three or more biological sequences (typically DNA, RNA, or protein) in order to identify similarities, differences, and conserved regions. MSA provides valuable insights into the evolutionary relationships, functional annotation, and structural characteristics of sequences.
Two approaches are covered here:
Progressive alignment
Iterative alignment

The primary goals of MSA include:
Identification of homologous sequences: MSA helps identify sequences that share a common ancestor, indicating evolutionary relationships.
Detection of conserved regions: By aligning sequences, MSA reveals regions that are conserved across species or within a protein family, highlighting functional importance.
Structural prediction: MSA can aid in predicting the three-dimensional structure of proteins, as conserved regions often correspond to structural motifs.
Functional inference: MSA can help predict the function of uncharacterized sequences based on the conservation of specific residues or motifs.

Progressive Alignment
Progressive alignment is a widely used method in bioinformatics for aligning multiple sequences. It builds the alignment in a hierarchical or "progressive" manner, starting with the most similar sequences and gradually incorporating more divergent sequences. The process begins as follows:
Pairwise Alignment: All sequences are first aligned pairwise. Each pair of sequences is aligned to identify regions of similarity and dissimilarity.
Similarity Score: A similarity score is assigned to each pair of sequences based on the quality of their alignment. This score reflects the degree of similarity between the sequences.
Guide Tree Construction: The similarity scores are used to construct a guide tree, also known as a dendrogram. The guide tree approximates the evolutionary relationships between the sequences, with closely related sequences clustered together, and it determines the order in which sequences are added to the alignment.
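The first two steps above (pairwise alignment and similarity scoring) can be sketched in a few lines of Python. This is only an illustration: the scoring values (match +1, mismatch -1, gap -2), the function names, and the toy sequences are assumptions chosen for the example, not the parameters used by any particular MSA program.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty.

    Only the score is returned; the scoring values are illustrative assumptions.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def pairwise_scores(seqs):
    """Score every pair of sequences; keys are (name_i, name_j)."""
    return {(x, y): nw_score(seqs[x], seqs[y]) for x, y in combinations(seqs, 2)}

def most_similar_pair(seqs):
    """The pair a progressive aligner would typically align first."""
    scores = pairwise_scores(seqs)
    return max(scores, key=scores.get)

# Hypothetical toy input.
seqs = {"s1": "GATTACA", "s2": "GATTACC", "s3": "GCATGCT"}
print(pairwise_scores(seqs))
print(most_similar_pair(seqs))
```

A guide tree would then be built from these scores (for example by converting them to distances and clustering), and the alignment would proceed from the most similar pair outward.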

MUSCLE (Multiple Sequence Comparison by Log-Expectation): MUSCLE is a progressive alignment algorithm known for its speed and accuracy. It combines progressive alignment with a refinement stage that improves alignment quality.
T-Coffee (Tree-based Consistency Objective Function for Alignment Evaluation): T-Coffee is a progressive alignment algorithm that uses a consistency-based approach. It first builds a library of pairwise alignments and then uses a guide tree to construct a final multiple sequence alignment that is as consistent as possible with that library.

Iterative Method
Iterative methods are a class of algorithms used in multiple sequence alignment (MSA) to improve accuracy by repeatedly refining an initial alignment. They are particularly useful for aligning distantly related sequences, where traditional progressive alignment may not be accurate. An overview of iterative methods in MSA:
Initialization: The iterative process begins with an initial alignment, which can be generated using a progressive alignment algorithm or any other alignment method.
Profile Construction: A profile is constructed from the initial alignment, representing the frequency of each residue at each position in the alignment. Profiles capture the conservation patterns in the alignment and are used to guide the alignment process.
Sequence Realignment: Sequences are aligned to the current alignment using the profile constructed in the previous step. This step may involve aligning sequences that were not included in the initial alignment or realigning sequences that were poorly aligned in it.

Scoring and Evaluation: After realignment, the quality of the alignment is evaluated using a scoring function. Common scoring functions include sum-of-pairs scores, which add up the scores of all residue pairs in each column of the alignment, and column scores, which measure the conservation of columns in the alignment.
Iterative Refinement: The realignment and evaluation steps are repeated until a stopping criterion is met. This criterion may be a maximum number of iterations, a convergence threshold for alignment scores, or a limit on how much the alignment changes from one iteration to the next.
Consensus Alignment: Finally, a consensus alignment is generated from the iterations, typically by taking the most common residue at each position in the alignment or by using a probabilistic model to combine the alignments.
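Three of the building blocks described above (profile construction, sum-of-pairs scoring, and a majority-vote consensus) can be sketched as follows. This is a minimal illustration under stated assumptions: the toy alignment, the function names, and the scoring values (match +1, mismatch -1, residue-versus-gap -2, gap-versus-gap 0) are made up for the example, and real programs use substitution matrices and affine gap penalties instead.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy alignment: equal-length rows, '-' marks a gap.
alignment = ["ACG-T", "ACGAT", "ATG-T"]

def build_profile(aln):
    """Per-column residue frequencies (the gap symbol is counted as well)."""
    n = len(aln)
    return [{res: c / n for res, c in Counter(col).items()} for col in zip(*aln)]

def sum_of_pairs(aln, match=1, mismatch=-1, gap=-2):
    """Sum the scores of all residue pairs in every column.

    The scoring values are illustrative assumptions, not program defaults.
    """
    total = 0
    for col in zip(*aln):
        for x, y in combinations(col, 2):
            if x == "-" and y == "-":
                continue  # gap aligned with gap contributes nothing
            if x == "-" or y == "-":
                total += gap
            else:
                total += match if x == y else mismatch
    return total

def consensus(aln):
    """Most common symbol in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aln))

print(build_profile(alignment)[1])  # frequencies in the second column
print(sum_of_pairs(alignment))
print(consensus(alignment))
```

In an iterative scheme, a candidate realignment would be kept only if it raises the chosen score, and the loop would stop once the score no longer improves.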

MAFFT (Multiple Alignment using Fast Fourier Transform): MAFFT is a popular MSA algorithm that uses an iterative refinement approach. It is known for its speed and accuracy, especially when aligning large datasets.
Clustal Omega: Clustal Omega is another widely used MSA algorithm that offers improved speed and scalability compared to earlier versions such as ClustalW. It uses a progressive alignment approach with heuristics to improve alignment quality.

Flexible sequence similarity searching with the FASTA3 Program Package
The FASTA program package has evolved significantly since its introduction in 1988. The original package offered four programs: fasta, tfasta, lfasta, and rdf (rdf was introduced with the first fastp program in 1985). Today, programs are available for rigorous Smith-Waterman searches (ssearch3) and for searches with mixed peptide sequences (fastf3 and tfastf3). The programs for translated DNA:protein sequence comparison have been improved substantially with the introduction of fastx3, fasty3, tfastx3, and tfasty3, and the program for estimating statistical significance from shuffled-sequence similarity scores (prss3) produces accurate statistical estimates.

Programs in the FASTA3 package are preferred over the older FASTA2 programs whenever FASTA3 has the function you need. Programs in the FASTA3 package have more robust statistical estimates and error handling, a larger variety of scoring matrices (FASTA3 has MDM10, MDM20, PAM120, and BLOSUM80 in addition to the PAM250, BLOSUM50, and BLOSUM62 matrices in FASTA2), and a broader array of comparison functions (fasty3, fastf3, tfasty3, and tfastf3).

Key Features of FASTA3
Flexible scoring: Supports different substitution matrices (PAM, BLOSUM, nucleotide scores).
Heuristic + refinement: Detects short exact matches (k-tuples), extends and joins the matches, then refines them using dynamic programming (local alignment).
Statistical estimates: Uses E-values and Z-scores to judge biological significance.
Handles frameshifts: FASTX/FASTY are useful for ESTs and pseudogenes.
Multiple programs in the package:
FASTA: protein vs. protein, DNA vs. DNA
TFASTA: protein vs. translated DNA
FASTX/FASTY: DNA vs. protein, with frameshifts
SSEARCH: rigorous Smith-Waterman
LFASTA: local similarities between two sequences

The fasta3 program compares a protein sequence to a protein sequence database, or a DNA sequence to a DNA sequence database, using the FASTA algorithm. Search speed and selectivity are controlled with the ktup (word size) parameter. For protein comparisons, ktup = 2 by default; ktup = 1 is more sensitive but slower. For DNA comparisons, ktup = 6 by default; ktup = 3 or ktup = 4 provides higher sensitivity; ktup = 1 should be used for oligonucleotides (DNA query lengths < 20).

How FASTA3 Works (Conceptual Flow)
Step 1: Identify short k-tuple matches (e.g., k = 2 for proteins, k = 6 for DNA).
Step 2: Score diagonals with high-density matches.
Step 3: Extend promising alignments with local dynamic programming.
Step 4: Select the best-scoring regions.
Step 5: Apply rigorous Smith-Waterman refinement (optional).
Step 6: Calculate the E-value and Z-score.
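Steps 1 and 2 of the flow above can be illustrated with a short Python sketch that indexes the k-tuples of a query and counts how many shared k-tuples fall on each diagonal of the query/subject comparison. It is a simplified teaching example, not the FASTA source code: the function names and toy sequences are assumptions, and the real program scores and rescans diagonals with a substitution matrix rather than just counting hits.

```python
from collections import defaultdict

def ktuple_index(seq, k):
    """Map every k-tuple (word of length k) to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def diagonal_hits(query, subject, k):
    """Count shared k-tuples per diagonal (diagonal = subject_pos - query_pos)."""
    index = ktuple_index(query, k)
    hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            hits[j - i] += 1
    return dict(hits)

def best_diagonal(query, subject, k):
    """Return (diagonal, hit_count) for the densest diagonal, or None if no hits."""
    hits = diagonal_hits(query, subject, k)
    return max(hits.items(), key=lambda kv: kv[1]) if hits else None

# Hypothetical toy sequences: the query occurs in the subject at offset 2.
print(best_diagonal("GATTACA", "CCGATTACAGG", k=3))
```

A smaller k finds more (and shorter) word matches, which is why lowering ktup increases sensitivity at the cost of speed.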

FASTA3 vs BLAST
Feature             FASTA3               BLAST
Speed               Slower               Faster
Sensitivity         Higher               Lower
Statistics          E-value + Z-score    E-value only
Frameshift search   Yes (FASTX/FASTY)    No

How to Run FASTA3 Programs
1. Prepare your input: sequences must be in FASTA format.
2. Run the sequence similarity search with ssearch36.
3. Parameters used: BLOSUM62 matrix, gap opening and gap extension penalties.
4. Interpret the output: alignment, score, and gap information.
5. Adjust the parameters: substitution matrix (BLOSUM62 or PAM250 for proteins, NUC4.4 for DNA).
6. Visualize and analyze: Jalview or AliView to identify conserved regions; MEGA or PhyML to build a phylogenetic tree.

Use of CLUSTALW for multiple sequence alignment
Input File Requirements
All sequences must be in a single file.
Supported formats: EMBL/SwissProt, NBRF/PIR, FASTA, GCG/MSF, GDE, GCG/RSF, CLUSTAL.

CLUSTALX
CLUSTAL X has all the features of CLUSTAL W, with a graphical user interface (GUI).
Additional features include easier interaction and profile alignment capabilities.
UNIX/Linux versions require an X terminal; Windows and MacOS versions do not.
Modes of alignment:
Multiple alignment mode: the default mode for aligning multiple sequences.
Profile alignment mode: allows new sequences to be aligned to a previous alignment, or two sets of aligned sequences to be aligned to each other.

Loading Sequences:
Sequences are loaded via the Load sequences dialog (File menu).
The input file format is the same as in CLUSTAL W.
The main display includes the sequence order (editable with cut/paste), a residue position ruler, and an alignment quality graph.
Alignment Parameters:
Multiple alignment can be performed using default parameters or customized via interactive dialogs.
Parameters include gap penalties (for gap opening and extension), the minimal distance between gaps, and the treatment of end gaps.
Similar menus exist in CLUSTAL W.
Visualization & Coloring:
After alignment, conserved and aligned residues are color-coded.
There are two coloring schemes: residue-specific color (independent of alignment position) and consensus-based color (depends on alignment conservation).
Users can use default or custom color parameter files for highlighting conserved positions.

Realignment of Divergent Regions
CLUSTAL X allows refinement of misaligned regions in highly divergent sequences. There are two realignment options:
Sequence-specific realignment: Select specific sequences by clicking their names. The selected sequences are removed from the alignment and realigned to the remaining sequences.
Residue-range realignment: Select a specific residue range (highlighted). The selected portion is removed, realigned using progressive alignment, and then inserted back into the full alignment.
Enhanced Output Features
Color PostScript output: Allows publication-quality or presentation-ready figures. Options include paper size, orientation, residue colors, and layout.
All other output formats available in CLUSTAL W (FASTA, Phylip, etc.) are also supported.

Submitting DNA and protein sequences to databases
Submitting to a public biological database
If you want your sequence to be publicly available, submit it to GenBank, ENA, or DDBJ. These are the three main sequence databases (all part of the INSDC consortium):
GenBank (NCBI, USA): BankIt submission tool
ENA (EMBL-EBI, Europe): ENA submission portal
DDBJ (Japan): DDBJ submission system
Steps:
Create an account (NCBI/ENA/DDBJ).
Prepare your DNA sequence in FASTA format.
Provide annotation details (organism, source, gene name, coding region, etc.).
Upload via the web submission portal.
After review, your sequence gets an accession number (like OR123456).

Submitting aligned sets of sequences, Updating the submitted sequences
1. Submitting to Public Databases (NCBI/ENA/DDBJ)
If you already have a multiple sequence alignment (MSA) and want to submit it, GenBank (NCBI) accepts aligned sets of sequences (for example, from population studies, haplotypes, or phylogenetic projects). The tool is called BankIt (WebSub) or Sequin.

Steps for NCBI GenBank:
Align your sequences (e.g., with Clustal Omega, MUSCLE, or MAFFT).
Save the alignment in aligned FASTA format (sometimes NEXUS or PHYLIP for tree submissions).
Prepare the annotation (organism name, gene region, features).
Use NCBI WebSub / BankIt and select Aligned Set Submission. There is a specific option for "PopSet" (population sequence set submissions).
After review, you will get an accession series (like OR123456-OR123499).
Example: If you aligned 20 COI gene sequences from different species, you can submit them as a PopSet in NCBI.

Updating in Public Databases (GenBank, ENA, DDBJ)
If you have already submitted sequences and later find corrections (typos, new annotation, a better assembly), you cannot directly edit a GenBank entry yourself. Instead:
Email the database curators (NCBI/ENA/DDBJ) with the accession number of your sequence.
Provide the corrected sequence or metadata.
They will update the entry while keeping the original accession number (but the version number increases).

Methods of Phylogenetic Analysis
A phylogenetic tree (evolutionary tree) is a graphical representation of the evolutionary history of biological sequences that allows us to visualize the evolutionary relationships between them.
Rooted trees have a specified root node, which represents the common ancestor of all the organisms in the tree.
Unrooted trees do not have a specified root node and show only the branching pattern of the evolutionary relationships among taxa or OTUs, without any information about their common ancestor.

A cladogram is a type of phylogenetic tree that displays only the branching pattern of evolutionary relationships among organisms. Cladograms are unscaled, which means that the branch lengths do not reflect the amount of evolutionary divergence between taxa or operational taxonomic units (OTUs).
A phylogram is a type of phylogenetic tree that represents the evolutionary relationships among organisms by showing both the branching pattern and the amount of evolutionary divergence. Phylograms are scaled, which means that the branch lengths are proportional to the amount of evolutionary divergence.

Phylogenetic Tree Construction Steps
1. Selection of a molecular marker
2. Multiple sequence alignment
3. Selection of a model of evolution
4. Construction of the phylogenetic tree
5. Assessment of the reliability of the tree

Phylogenetic Tree Construction Methods
1. Distance-based methods
Distance-based tree construction methods calculate evolutionary distances between sequences using substitution models and collect them in a distance matrix. A phylogenetic tree is then constructed from the distance matrix. The two popular distance-based methods are UPGMA and NJ.
2. Character-based methods
Character-based methods analyze sequence data by directly examining the sequence characters, rather than relying on pairwise distance comparisons. These methods evaluate all sequences at once by analyzing one character or site at a time. Maximum parsimony (MP) and maximum likelihood (ML) are the two most commonly used character-based tree construction methods.
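As an illustration of how a substitution model turns aligned sequences into a distance matrix, the sketch below uses the Jukes-Cantor (JC69) model, chosen here only as the simplest example. Under JC69 the distance is d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differing sites. The function names and the toy alignment are assumptions made for the example.

```python
import math

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences (gapped sites skipped)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def jc69_distance(a, b):
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - (4/3) * p)."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)

def distance_matrix(aln):
    """Symmetric matrix of JC69 distances; aln maps names to aligned sequences."""
    names = list(aln)
    return {x: {y: (0.0 if x == y else jc69_distance(aln[x], aln[y])) for y in names}
            for x in names}

# Hypothetical toy alignment.
aln = {"t1": "ACGTACGT", "t2": "ACGTACGA", "t3": "ACGAACTA"}
for name, row in distance_matrix(aln).items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

The corrected distance is always at least as large as p, because the model accounts for multiple substitutions at the same site that the raw count cannot see.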

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
UPGMA is the simplest distance-based method; it constructs a rooted phylogenetic tree using sequential clustering. First, all sequences are compared using pairwise alignment to calculate the distance matrix. Using this matrix, the two sequences with the smallest pairwise distance are clustered as a single pair, and a node is placed at the midpoint between them. Next, the distance between this pair and all other sequences is recalculated to form a new matrix. This new matrix is used to identify and cluster the sequence that is closest to the first pair. The process is repeated until all sequences have been placed on the tree. The UPGMA method assumes that the evolutionary rate is constant across all taxa (a molecular clock), so that all taxa are equidistant from the root.
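The clustering procedure described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name, the nested-tuple tree representation, the internal labels node0, node1, ... (which are assumed not to clash with leaf names), and the toy distances are all choices made for the example.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a rooted tree by UPGMA.

    names: leaf labels; dist: {(a, b): distance} for every pair of leaves.
    Returns (tree, root_height), where tree is a nested tuple of labels.
    """
    # Each active cluster: label -> (subtree, number of leaves, node height)
    clusters = {n: (n, 1, 0.0) for n in names}
    d = {frozenset(k): v for k, v in dist.items()}
    next_id = 0
    while len(clusters) > 1:
        # 1. Pick the closest pair of active clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (ta, na, _), (tb, nb, _) = clusters.pop(a), clusters.pop(b)
        height = d[frozenset((a, b))] / 2  # new node sits at the midpoint
        new = f"node{next_id}"
        next_id += 1
        # 2. Distance from the merged cluster to every other cluster:
        #    arithmetic mean weighted by the number of leaves on each side.
        for c in clusters:
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        clusters[new] = ((ta, tb), na + nb, height)
    (tree, _, height), = clusters.values()
    return tree, height

# Hypothetical clock-like toy distances.
names = ["A", "B", "C", "D"]
dist = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
        ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
print(upgma(names, dist))
```

Because every node is placed at half the distance between the clusters it joins, all leaves end up at the same distance from the root, which is exactly the molecular clock assumption.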

Neighbor-Joining (NJ)
The neighbor-joining method is one of the most widely used distance-based methods. Like UPGMA, it builds the tree from a distance matrix; however, it does not assume a molecular clock and produces an unrooted tree. The algorithm starts with a completely unresolved star tree, in which all sequences are connected to a single central node. At each step it uses the pairwise distances to pick the pair of neighbors whose joining gives the smallest total branch length. The distances are corrected for each sequence's average distance to all the others, so the chosen pair is not always the pair with the smallest raw distance. The two neighbors are consolidated into a new node, the distances from that node to the remaining sequences are recalculated, and the reduced star tree is processed in the same way. This is repeated until all sequences are connected in a fully resolved tree.
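A single joining step of the algorithm can be sketched as below. It uses the standard neighbor-joining criterion Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of distances from i to all other taxa, and joins the pair with the smallest Q. The function name and the five-taxon toy matrix are assumptions for the example; a complete implementation would repeat this step on the reduced matrix until the tree is resolved.

```python
from itertools import combinations

def nj_join_step(names, d):
    """One neighbor-joining step (requires at least three taxa).

    names: taxon labels; d: {frozenset({a, b}): distance}.
    Returns ((i, j), Q value, (branch length from i, branch length from j)).
    """
    n = len(names)

    def dist(x, y):
        return d[frozenset((x, y))]

    # r[x]: sum of distances from x to every other taxon
    r = {x: sum(dist(x, y) for y in names if y != x) for x in names}
    # Q criterion for every pair; the smallest value is joined
    q = {(x, y): (n - 2) * dist(x, y) - r[x] - r[y]
         for x, y in combinations(names, 2)}
    i, j = min(q, key=q.get)
    # Branch lengths from i and j to the new internal node
    li = dist(i, j) / 2 + (r[i] - r[j]) / (2 * (n - 2))
    lj = dist(i, j) - li
    return (i, j), q[(i, j)], (li, lj)

# Hypothetical toy distances for five taxa.
names = ["a", "b", "c", "d", "e"]
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
d = {frozenset(k): v for k, v in raw.items()}
print(nj_join_step(names, d))
```

In this toy matrix the smallest raw distance is between d and e, yet the Q criterion joins a and b first, which shows how NJ differs from simply clustering the closest pair. The two branch lengths are also unequal, reflecting the absence of a molecular clock assumption.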

Maximum parsimony (MP)
The maximum parsimony method is a character-based method that selects the tree requiring the fewest evolutionary changes, that is, the tree with the shortest total length. Initially, multiple sequence alignment is performed to identify the positions in the sequences that correspond to each other. Each aligned position is analyzed to identify the trees that require the smallest number of evolutionary changes to produce the observed sequence differences. This process is repeated for all positions in the alignment, and the trees that give the lowest overall number of changes across all positions are selected. This method works best for relatively similar sequences and for small numbers of sequences.
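Counting the minimum number of changes a given tree requires can be sketched with Fitch's algorithm, one common way to score a fixed binary tree under parsimony. The tree encoding (nested 2-tuples of leaf names), the function names, and the toy alignment are assumptions for the example; searching over candidate trees is reduced here to comparing the three possible groupings of four taxa.

```python
def fitch(tree, column):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    tree: leaf name (str) or a 2-tuple of subtrees; column: leaf name -> residue.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    (ls, lc), (rs, rc) = fitch(tree[0], column), fitch(tree[1], column)
    common = ls & rs
    if common:
        return common, lc + rc       # children agree: no extra change
    return ls | rs, lc + rc + 1      # children disagree: one change needed

def parsimony_score(tree, alignment):
    """Total minimum number of changes over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

# Hypothetical toy alignment and the three groupings of four taxa.
alignment = {"A": "ACGT", "B": "ACGA", "C": "ATGA", "D": "ATCA"}
trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
for label, tree in trees.items():
    print(label, parsimony_score(tree, alignment))
```

Only the second column distinguishes the trees in this example; the other variable columns cost one change on any tree, which is why such sites are called parsimony-uninformative.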

Maximum likelihood (ML)
Maximum likelihood is a statistical method that uses probabilistic models to identify the tree that has the maximum probability of generating the observed data. Like maximum parsimony, this approach evaluates each column of a multiple sequence alignment during the analysis. However, unlike maximum parsimony, ML uses an explicit model of sequence evolution, and in principle it considers all possible trees that could explain the observed data. The likelihood of each candidate tree is calculated, and the tree with the highest likelihood is selected as the most likely evolutionary history of the sequences. In practice, heuristic searches are used, because the number of possible trees grows very quickly with the number of sequences.
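The idea of choosing the parameter value that makes the observed data most probable can be shown on the smallest possible case: two aligned sequences and one branch length, under the Jukes-Cantor (JC69) model. This is a deliberately reduced sketch; the model choice, the function names, the grid search, and the counts (100 sites, 10 differences) are assumptions for the example. A real ML tree search also sums over unknown ancestral states and compares many tree topologies.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by distance d under JC69.

    n_sites: number of compared sites; n_diff: number of sites that differ.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # 0.25 is the equilibrium frequency of the base in the first sequence
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))

def ml_distance(n_sites, n_diff, step=0.001, max_d=3.0):
    """Grid search for the distance that maximizes the likelihood."""
    grid = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

# Hypothetical data: 100 aligned sites, 10 of which differ.
d_hat = ml_distance(100, 10)
analytic = -0.75 * math.log(1 - 4 * 0.10 / 3)
print(round(d_hat, 3), round(analytic, 3))
```

The grid maximum agrees with the closed-form Jukes-Cantor distance, which is the maximum likelihood estimate for this two-sequence case.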

Applications of the phylogenetic tree
Phylogenetic trees have various practical applications. They can be used to:
Study the evolutionary relationships between different species and understand evolutionary processes over time.
Study the diversity and distribution of species and develop conservation strategies to protect endangered species and ecosystems.
Identify the origins of pathogens and track the spread of diseases.
Identify, in forensics, the origins of biological samples found at crime scenes and link suspects to crimes.
Organize and classify organisms and species according to their DNA sequences and their morphological similarities and differences.