Genomics and proteomics Sequence Alignment ppt 26 4 24.pptx
lokiashok007
23 slides
Oct 07, 2025
About This Presentation
Ppt done and presented by Professor Madanapriya ma'am
Size: 1.63 MB
Language: en
Slide Content
GENOMICS AND PROTEOMICS
UNIT 1
III B.Sc. Biotechnology
Batch: 2022 - 2025
Regulation: 2020
Date: 26.04.2024

Sequence Analysis
- Sequence analysis subjects a DNA, RNA, or peptide sequence to analytical methods in order to understand its features, function, structure, or evolution.
- It can be performed on the entire genome, transcriptome, or proteome of an organism, or involve only selected segments or regions, such as tandem repeats and transposable elements.
- Comparing new sequences to those with known functions is a key way of understanding the biology of the organism from which the new sequence comes.
- The aim is to assign function to coding and non-coding regions in a biological sequence, usually by comparing sequences and studying their similarities and differences.

Sequence analysis in molecular biology includes a very wide range of processes (tools and techniques to understand the biology of a sequence):
- Comparison of sequences to find similarity, often to infer whether they are related (homologous)
- Identification of intrinsic features of the sequence, such as:
  - Active sites
  - Post-translational modification sites
  - Gene structures
  - Reading frames
  - Distributions of introns and exons, and regulatory elements
- Identification of sequence differences and variations, such as point mutations and single nucleotide polymorphisms (SNPs), in order to obtain genetic markers
- Revealing the evolution and genetic diversity of sequences and organisms
- Identification of molecular structure from sequence alone

HISTORY
- The insulin protein was characterized by Fred Sanger in 1951.
- This work provided knowledge for understanding the function of molecules.
- These discoveries contributed to the successful sequencing of the first DNA-based genome.
- The “Sanger method”, or Sanger sequencing, was a milestone in sequencing long-strand molecules such as DNA.

Nucleotide sequence analysis (DNA & RNA)
- Identify functional elements such as protein binding sites
- Uncover genetic variations such as SNPs
- Study gene expression patterns
- Understand the genetic basis of traits
- Understand mechanisms that contribute to processes such as replication and transcription
Quality control
- Quality control is carried out to limit wrong conclusions due to poor-quality data.
- The tools used at this stage depend on the sequencing platform.
- FastQC checks the quality of short reads (including RNA sequences).
- NanoPlot or pycoQC are used for long-read sequences (e.g. Nanopore sequence reads).
- MultiQC aggregates the results of FastQC into a single web page (HTML) report.
- Quality control provides information such as read lengths, GC content, presence of adapter sequences (for short reads), and a quality score.
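
To make the quality-control metrics above concrete, here is a minimal Python sketch that computes read length, GC content, and mean quality score for each read in a FASTQ stream. It is not how FastQC works internally. It assumes the simple four-lines-per-record FASTQ layout and Phred+33 quality encoding, and the function names and the two example reads are made up for illustration.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            break
        seq = handle.readline().strip()
        handle.readline()              # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header[1:], seq, qual    # drop the leading '@' of the header

def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def mean_phred(qual, offset=33):
    """Mean Phred score, assuming Phred+33 encoding (an assumption of this sketch)."""
    return sum(ord(c) - offset for c in qual) / len(qual) if qual else 0.0

# Two invented reads: one high-quality, one low-quality.
example = StringIO("@read1\nGATTACAGGC\n+\nIIIIIIIIII\n@read2\nATAT\n+\n!!!!\n")
stats = [(name, len(seq), gc_content(seq), mean_phred(qual))
         for name, seq, qual in read_fastq(example)]

for name, length, gc, q in stats:
    print(f"{name}: length={length} GC={gc:.2f} mean_quality={q:.1f}")
```

A read such as read2, with a mean quality near zero, is the kind of data that quality control is meant to flag before downstream analysis.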

Analysis steps specific to DNA sequences
- Variant calling
- Variant annotation
- Visualization and interpretation
Gene expression analysis
- Identifies differentially expressed genes (DEGs) between experimental conditions using statistical tools such as DESeq2
- Functional enrichment analysis identifies the biological processes, pathways, and functional impacts associated with the differentially expressed genes obtained
Analyses of protein sequences
- Proteome sequence analysis studies the complete set of proteins expressed by an organism or a cell under specific conditions.
- It describes protein structure, function, post-translational modifications, and interactions within biological systems.
- It often starts with raw mass spectrometry (MS) data from proteomics experiments, typically in mzML, mzXML, or RAW file formats.

Genome browsers in sequence analysis
- Offer a no-code, user-friendly interface to visualize genomes and genomic segments, identify genomic features, and analyze the relationships between genomic elements
- The three primary genome browsers are the Ensembl genome browser, the UCSC genome browser, and the genome browser of the National Center for Biotechnology Information (NCBI)
- They support different sequence analysis procedures, including genome assembly, genome annotation, and comparative genomics, such as exploring differential expression patterns and identifying conserved regions
- All browsers support multiple data formats for upload and download and provide links to external tools and resources for sequence analyses, which contributes to their versatility

Sequence alignment
- Millions of protein and nucleotide sequences are known.
- These sequences fall into many groups of related sequences known as protein families or gene families.
- Relationships between these sequences are usually discovered by aligning them together and assigning the alignment a score.
Types of sequence alignment
- Pairwise sequence alignment compares only two sequences at a time
- Multiple sequence alignment compares many sequences
Important algorithms for aligning pairs of sequences
- Needleman-Wunsch algorithm
- Smith-Waterman algorithm

Tools for sequence alignment (common)
- BLAST
- Dot plots
- ClustalW
- PROBCONS
- MUSCLE
- MAFFT
- T-Coffee
A common use for pairwise sequence alignment is to take a sequence of interest and compare it to all known sequences in a database to identify HOMOLOGOUS SEQUENCES.
The matches in the database are ordered to show the most closely related sequences first, followed by sequences with diminishing similarity.
These matches are usually reported with a measure of statistical significance such as an EXPECTATION VALUE.
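
As a rough illustration of what an expectation value expresses, the sketch below uses the commonly quoted relationship E = m × n × 2^(−S'), where S' is the bit score of a match, m is the query length, and n is the total database length. This is a simplification: real search tools apply further corrections (for example, effective lengths), and the function name and the numbers used here are illustrative assumptions.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate number of matches with at least this bit score expected by chance.

    Simplified form E = m * n * 2**(-S'); real tools apply extra corrections.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

# Hypothetical search: a 300-residue query against a database of 1e8 residues.
weak_hit = expect_value(bit_score=30, query_len=300, db_len=1e8)
strong_hit = expect_value(bit_score=100, query_len=300, db_len=1e8)
print(f"weak hit   E = {weak_hit:.3g}")
print(f"strong hit E = {strong_hit:.3g}")
```

The smaller the expectation value, the less likely it is that a match of that score arose by chance, which is why the most significant matches are listed first.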

Sequence alignment is considered the most essential step in comparing biological sequences.
- Sequence alignment arranges two or more nucleotide or amino acid sequences to identify regions of similarity between them.
- These regions of similarity help in understanding the functional, structural, and evolutionary relationships between the sequences.
- Two commonly used approaches to sequence alignment are global alignment and local alignment.

Global alignment
- A method of comparing two sequences that aligns their entire length by maximizing the overall similarity.
- This method is used when comparing sequences of similar length.
Local alignment
- Instead of attempting to align the entire length of the sequences, only the regions with the highest density of matches are aligned.
- This is useful for identifying short conserved regions in protein or nucleotide sequences.

Global Sequence Alignment
- An attempt is made to align the entire sequence (end-to-end alignment).
- A global alignment contains all letters from both the query and target sequences.
- Two sequences are suitable for global alignment if they have approximately the same length and are quite similar.
- Suitable for aligning two closely related sequences.
- Usually done for comparing homologous genes, such as two genes with the same function (in human vs. mouse), or for comparing two proteins with similar function.
- A general global alignment technique is the Needleman-Wunsch algorithm.
- Examples of global alignment tools: EMBOSS Needle, Needleman-Wunsch Global Align Nucleotide Sequences (Specialized BLAST).
Local Sequence Alignment
- Finds local regions with the highest level of similarity between the two sequences.
- A local alignment aligns a substring of the query sequence to a substring of the target sequence.
- Any two sequences can be locally aligned, because local alignment finds stretches of sequence with a high level of matches without considering the alignment of the rest of the sequences.
- Suitable for aligning more divergent or distantly related sequences.
- Used for finding conserved patterns in DNA sequences, or conserved domains or motifs in two proteins.
- A general local alignment method is the Smith-Waterman algorithm.
- Examples of local alignment tools: BLAST, EMBOSS Water, LALIGN.
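
The idea that a local alignment pairs a substring of one sequence with a substring of the other can be sketched with a small Smith-Waterman implementation. This is a teaching sketch, not the code used by BLAST or EMBOSS Water. It assumes a simple scoring scheme (match +2, mismatch -1, linear gap penalty -2), and these values and the example sequences are illustrative choices.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # scores are never allowed below zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Two otherwise unrelated sequences that share the short region ACGT.
score, top, bottom = smith_waterman("TTTACGTAAA", "GGGACGTCCC")
print(score, top, bottom)   # 8 ACGT ACGT
```

Only the shared region is reported. The flanking residues, which do not match, are left out of the alignment, whereas a global alignment would have to include them.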

PAIRWISE ALIGNMENT
- Pairwise sequence alignment involves aligning two sequences to identify their optimal pairing.
- It is based on a scoring system that assigns positive scores to matching characters and negative scores to mismatching characters or gaps.
- The main objective is to find the alignment with the highest possible score, which indicates the degree of similarity between the two sequences.
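
The scoring idea can be illustrated with a compact global (Needleman-Wunsch style) alignment written in Python. It is a sketch under assumed parameters: match +1, mismatch -1, and a linear gap penalty of -1. Production tools typically use substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap                  # leading gaps in b
    for j in range(1, cols):
        F[0][j] = j * gap                  # leading gaps in a

    def sub(x, y):
        return match if x == y else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)    # 2  (three matches at +1 each, one gap at -1)
print(top)      # ACGT
print(bottom)   # A-GT
```

When several alignments share the top score, this sketch returns just one of them.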

Multiple Sequence Alignment
- Multiple sequence alignment involves aligning multiple (three or more) biological sequences to achieve optimal sequence matching.
- Multiple sequence alignments are used to identify conserved sequence regions and to construct phylogenetic trees.
- They help us understand the functional and evolutionary relationships between different species or groups of organisms.

Methods of pairwise sequence alignment
- Dot matrix
- Dynamic programming
- Word or k-tuple method
The dot matrix method, also known as the dot plot method, is a graphical method of sequence alignment that involves comparing two sequences by plotting them in a two-dimensional matrix.
Dynamic programming is used to find the optimal alignment between two protein or nucleic acid sequences by comparing all possible pairs of characters in the sequences.
Word or k-tuple methods are heuristic methods best known for their use in the database search tools FASTA and BLAST.
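
The core of a word (k-tuple) method can be sketched as follows: index every word of length k in one sequence, then look up the words of the other sequence to find exact shared words, and group the hits by diagonal. This is only the seeding idea. It is not the FASTA or BLAST implementation, and the function names and the choice of k = 3 are assumptions made for the example.

```python
from collections import Counter, defaultdict

def index_words(seq, k):
    """Map each word of length k to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def shared_words(query, target, k=3):
    """Return (query_pos, target_pos) for every exact k-letter word match."""
    index = index_words(target, k)
    hits = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            hits.append((q, t))
    return hits

def best_diagonal(hits):
    """Diagonal (query_pos - target_pos) holding the most word hits, with its count."""
    counts = Counter(q - t for q, t in hits)
    return counts.most_common(1)[0] if counts else None

hits = shared_words("ACGTAC", "TTACGT", k=3)
print(hits)                  # [(0, 2), (1, 3), (3, 1)]
print(best_diagonal(hits))   # (-2, 2)
```

Hits that fall on the same diagonal are candidates for extension into a longer alignment, which is much cheaper than filling a full dynamic programming matrix for every database sequence.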

DOT Matrix
- In a dot matrix, the two sequences to be compared are plotted along the horizontal and vertical axes of a matrix.
- The method then scans each residue of one sequence to identify similarities with all residues in the other sequence.
- If a residue in one sequence matches a residue in the other sequence, a dot is placed in the corresponding position in the matrix. Otherwise, the matrix position is left blank.
- If the two sequences being compared are highly similar, the dot plot displays a single line along the main diagonal of the matrix.
- When the sequences are less similar, the dot plot shows more scattered dots with fewer diagonal lines.
- Dot plots can also find repeat elements in a single sequence. Short parallel lines above and below the main diagonal indicate the presence of repeats.
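
A dot matrix is simple enough to build in a few lines of Python. The sketch below marks matches with `*` and non-matches with `.` so that blank cells stay visible in plain text. Real dot plot tools usually compare windows of residues with a threshold to reduce noise, which this minimal version leaves out.

```python
def dot_matrix(seq_x, seq_y):
    """Return one string per residue of seq_y; '*' marks a match with seq_x."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]

def show(seq_x, seq_y):
    print("  " + seq_x)
    for residue, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        print(residue + " " + row)

show("ACGT", "ACGT")   # identical sequences: a single main diagonal
show("ATAT", "ATAT")   # self-comparison of a repeat: extra parallel diagonals
```

In the second plot the repeated unit AT produces short diagonals above and below the main diagonal, which is the repeat signature described above.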

DOT Matrix plot
Figure: Example of comparing two sequences using dot plots (Xiong, J., 2006).

Multiple sequence alignment
Multiple sequence alignment can be performed using either exhaustive or heuristic approaches:
- Exhaustive algorithms
- Heuristic algorithms
Exhaustive algorithms
- Exhaustive alignment involves examining all possible alignments at once.
- It requires a multidimensional search matrix, similar to the two-dimensional matrix used in dynamic programming for pairwise alignment. To align N sequences, an N-dimensional matrix is required.
- Dynamic programming is a powerful method for aligning sequences, but the computational time and memory it needs grow rapidly as the number of sequences increases, so it becomes impractical for large data sets.
- As a result, dynamic programming is typically used only for small data sets with fewer than ten short sequences.
- Heuristic approaches are typically used for larger data sets to achieve a more efficient alignment.
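
A quick back-of-the-envelope calculation shows why the exhaustive approach does not scale. Assuming N sequences of equal length L, the N-dimensional matrix has (L + 1)^N cells, and each cell is filled by looking at up to 2^N - 1 neighbouring cells. The function names below are illustrative.

```python
def dp_cells(num_sequences, length):
    """Number of cells in the N-dimensional dynamic programming matrix."""
    return (length + 1) ** num_sequences

def predecessors_per_cell(num_sequences):
    """Neighbouring cells examined per cell (every non-empty subset of sequences advances)."""
    return 2 ** num_sequences - 1

for n in (2, 3, 5, 10):
    print(f"{n:>2} sequences of length 100: "
          f"{dp_cells(n, 100):.3e} cells, "
          f"{predecessors_per_cell(n)} predecessors per cell")
```

Going from two to three sequences of length 100 already raises the matrix from about ten thousand cells to about a million, which is why heuristic methods take over for larger data sets.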

Heuristic algorithm
Progressive method
- The progressive method, also known as the tree-based algorithm, assembles a multiple alignment step by step on the basis of pairwise similarity.
- It is called progressive because the sequences are aligned in a step-wise manner.

Iterative method
- The iterative method improves an initial suboptimal solution by repeatedly modifying it until no further improvement is found.
- PRRN is a web-based program that uses the iterative method of alignment.
- An initial pairwise alignment is conducted to create a tree that provides weights for creating alignments.
- Aligned regions with gaps are identified and iteratively adjusted to enhance the alignment score.
- The highest-scoring alignment is used in a new set of calculations to predict a new tree, new weights, and new alignments.
- The procedure is repeated until there is no more improvement in the alignment score.

Block-based method
- The progressive and iterative alignment methods are based on global alignment and may not be effective in identifying conserved domains and motifs in highly divergent sequences of different lengths.
- To align such divergent sequences, a local alignment-based approach is needed.
- The block-based method is one such approach: it identifies a block of ungapped alignment that is shared by all sequences.
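
The notion of an ungapped block shared by all sequences can be illustrated with a deliberately simplified sketch that looks for the longest stretch present, letter for letter, in every sequence. Actual block-based programs tolerate mismatches and score blocks statistically. The exact-match rule, the function name, and the example peptides are assumptions of this illustration.

```python
def longest_shared_block(seqs, min_len=3):
    """Return (block, start_positions) for the longest exact block common to all
    sequences, or None if no block of at least min_len residues exists."""
    shortest = min(seqs, key=len)          # a shared block cannot be longer than this
    for size in range(len(shortest), min_len - 1, -1):
        for start in range(len(shortest) - size + 1):
            block = shortest[start:start + size]
            if all(block in s for s in seqs):
                return block, [s.find(block) for s in seqs]
    return None

# Invented sequences of different lengths sharing the block TAYIAK.
seqs = ["MKTAYIAKQR", "GGTAYIAKLL", "TAYIAKPPPPPP"]
result = longest_shared_block(seqs)
print(result)   # ('TAYIAK', [2, 2, 0])
```

The block sits at different positions in each sequence, and the residues outside it are not forced into alignment.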

Applications of Sequence Alignment
- Sequence alignment can identify unknown sequences by comparing them with known sequences in databases.
- It is used to identify conserved sequence patterns and motifs, which helps to characterize the functions of the sequences.
- It can be used to produce phylogenetic trees and obtain information about the evolutionary relationships between the aligned sequences.
- It can help predict the secondary and tertiary structures of proteins.
- It can help predict gene locations and new members of gene families.
- It can be used to develop degenerate PCR primers by analyzing multiple related sequences.
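
As an illustration of the last application, the sketch below derives a degenerate consensus from a set of already aligned DNA sequences using the standard IUPAC nucleotide ambiguity codes. It assumes the sequences are the same length and contain no gaps; columns it cannot interpret are written as N. The function name and the example sequences are made up, and real primer design also has to consider melting temperature and degeneracy limits.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def degenerate_consensus(aligned):
    """Collapse each alignment column into a single IUPAC code."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned):
        bases = frozenset(base.upper() for base in column)
        consensus.append(IUPAC.get(bases, "N"))   # unknown symbols or gaps become N
    return "".join(consensus)

aligned = ["ATGCA", "ATGTA", "ACGCA"]
print(degenerate_consensus(aligned))   # AYGYA
```

Positions where all sequences agree keep their base, while variable positions receive a code such as Y (C or T) that covers every observed base.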