BLAST AND FASTA

635 views 20 slides Feb 28, 2024


BLAST & FASTA
Ali Zain, Lecturer, Department of Biotechnology

BLAST

As DNA and protein sequence databases grow, there is an increasing need for fast, efficient methods to analyze this large amount of data. One of the most commonly used bioinformatics tools for studying DNA and protein sequences is BLAST, the Basic Local Alignment Search Tool. It was introduced by Stephen Altschul et al. in 1990 and has since become one of the most popular tools for sequence similarity searching. BLAST has been updated continuously since that first release to improve its speed and accuracy. It has played a vital role in numerous research studies and has paved the way for the development of other sequence comparison tools.

5 Types of BLAST

There are five main variants of BLAST, distinguished by the type of sequence (nucleotide or protein) used for the query and for the database.
BLASTN compares a nucleotide query sequence to a nucleotide sequence database.
BLASTP compares a protein query sequence to a protein sequence database.
BLASTX compares a nucleotide query sequence to a protein sequence database by translating the query in all six reading frames and aligning each translation with the protein sequences.
TBLASTN compares a protein query sequence to a nucleotide sequence database by translating the database sequences in all six reading frames and aligning them with the protein query.
TBLASTX compares a nucleotide query sequence to a nucleotide sequence database by translating both the query and the database sequences in all six reading frames and comparing the translations at the protein level.
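The six-frame translation used by the translated BLAST variants can be sketched as follows. This is a minimal illustration that assumes the standard genetic code; the function names are hypothetical and are not part of any BLAST distribution.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Each of the six translations would then be compared with protein sequences in the same way as an ordinary protein query.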

How BLAST Works

BLAST compares a query sequence to a database of sequences to find regions of similarity. It uses a heuristic approach rather than an exhaustive alignment, which makes the search much faster at the cost of possibly missing some matches. BLAST performs sequence alignment through the following steps.

Step 1: The first step, also called seeding, is to create a lookup table (list) of words from the query sequence. BLAST breaks the query sequence into short overlapping segments called words. For protein sequences, each word is usually three amino acids long; for DNA sequences, each word is usually eleven nucleotides long.
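As a rough sketch of the seeding step, the lookup table can be modelled as a dictionary from each word to the positions where it occurs in the query. The function name is hypothetical, and the sketch stores exact words only; it does not add similar "neighborhood" words, which protein searches in BLAST also consider.

```python
from collections import defaultdict


def build_word_table(query, word_size):
    """Map every overlapping word of length word_size to its query positions."""
    table = defaultdict(list)
    for pos in range(len(query) - word_size + 1):
        table[query[pos:pos + word_size]].append(pos)
    return dict(table)


if __name__ == "__main__":
    # Word size 3 is the usual choice for proteins, as stated above.
    print(build_word_table("MKVLMKV", 3))
```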

Step 2: The second step is to scan the database of known sequences for the words in the lookup table. This identifies the database sequences that contain words matching the query, together with the positions of those matches.
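The database scan can be sketched as below, assuming the database is a small in-memory dictionary of named sequences (real databases are indexed files, and the names used here are made up for illustration). Each hit is reported as a pair of query and subject positions.

```python
def scan_database(query, database, word_size):
    """Return {name: [(query_pos, subject_pos), ...]} for exact word matches."""
    # Lookup table of query words, rebuilt here so the sketch is self-contained.
    table = {}
    for q in range(len(query) - word_size + 1):
        table.setdefault(query[q:q + word_size], []).append(q)

    hits = {}
    for name, subject in database.items():
        found = []
        for s in range(len(subject) - word_size + 1):
            for q in table.get(subject[s:s + word_size], []):
                found.append((q, s))
        if found:
            hits[name] = found
    return hits


if __name__ == "__main__":
    db = {"seq1": "TTACGT", "seq2": "GGGGGG"}  # hypothetical database entries
    print(scan_database("ACGTAC", db, 3))
```

Only sequences with at least one word hit are carried forward, which is what lets the later, more expensive steps skip most of the database.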

Step 3: BLAST then scores the matching words using a substitution matrix. A word pair whose score is above a certain threshold is considered a match (a hit). Two commonly used families of substitution matrices for protein sequences are PAM (Point Accepted Mutation) and BLOSUM (Blocks Substitution Matrix). For nucleotide sequences, scoring is based on simple match and mismatch scores.

Step 4: The fourth step is pairwise alignment: each hit is extended in both directions while the alignment score is accumulated using the same substitution matrix. Extension stops when the running score falls more than a set amount below the best score reached so far, which happens as mismatches accumulate. The resulting aligned segment pair without gaps is called a high-scoring segment pair (HSP).

BLAST also calculates a statistical significance value for each alignment, called the E-value or Expect value. The E-value is the number of alignments with an equal or better score that would be expected to occur by chance in a database of the same size. The lower the E-value, the less likely the match is to be a random occurrence, and the higher its level of significance.
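The extension and significance steps can be sketched as follows. This is a simplified illustration under stated assumptions: nucleotide match and mismatch scores of +1 and -3, a drop-off limit named x_drop, and the commonly quoted form E = K * m * n * exp(-lambda * S) for the expected number of chance hits. The default values of lam and k are placeholders, since real values depend on the scoring system, and all function names are hypothetical.

```python
import math


def score_pair(a, b, match=1, mismatch=-3):
    """Match/mismatch scoring for nucleotides (assumed values)."""
    return match if a == b else mismatch


def _extend(query, subject, q, s, step, x_drop):
    """Best ungapped extension in one direction: returns (gain, length)."""
    best_gain = gain = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        gain += score_pair(query[q], subject[s])
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # score fell too far below the best seen so far
        q += step
        s += step
    return best_gain, best_len


def extend_hit(query, subject, q_pos, s_pos, word_size, x_drop=5):
    """Extend a word hit both ways; returns (score, q_start, q_end, s_start)."""
    seed = sum(
        score_pair(query[q_pos + i], subject[s_pos + i]) for i in range(word_size)
    )
    right_gain, right_len = _extend(
        query, subject, q_pos + word_size, s_pos + word_size, 1, x_drop
    )
    left_gain, left_len = _extend(query, subject, q_pos - 1, s_pos - 1, -1, x_drop)
    q_start = q_pos - left_len
    q_end = q_pos + word_size + right_len  # exclusive end
    return seed + left_gain + right_gain, q_start, q_end, s_pos - left_len


def expect_value(score, query_len, db_len, lam=1.37, k=0.71):
    """E = K * m * n * exp(-lambda * S); lam and k are placeholder values."""
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    query, subject = "TTACGTACGGA", "CCACGTACGTT"
    score, q_start, q_end, s_start = extend_hit(query, subject, 4, 4, 3)
    print(score, query[q_start:q_end], expect_value(score, len(query), 1_000_000))
```

The formula makes the behaviour described above concrete: a higher alignment score lowers the E-value exponentially, while a larger database raises it in proportion.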

Characteristics of BLAST

Several key features make BLAST a widely used tool in bioinformatics:
BLAST is fast and efficient, making it possible to handle large databases of sequences.
It is flexible and versatile, as it can be used to search for similarities in both nucleotide and protein sequences.
It is highly sensitive, which allows the identification of even small similarities between sequences.
It aims to identify regions of local similarity between the query sequence and the database sequence, rather than attempting to align the entire sequences.
It has a user-friendly interface that makes it easy to input query sequences and interpret the results.

Applications of BLAST

BLAST has a wide range of applications. Some of the most common are:
BLAST can be used to identify unknown sequences by comparing them with known sequences in a database, which helps in predicting the functions of proteins or genes.
BLAST can be used in phylogenetic analysis, which is important for understanding the evolutionary relationships between different species.
BLAST can be used to identify functionally conserved domains within proteins, which is important for predicting protein function.

FASTA

Database similarity searching is an essential technique in bioinformatics because it allows newly determined sequences to be characterized by comparing them with existing databases. FASTA was one of the first widely used database similarity search tools. FASTA (or FastA), an abbreviation for 'Fast-All', is a sequence alignment tool that takes nucleotide or protein sequences as input and compares them with existing databases. It was first developed by David J. Lipman and William R. Pearson in 1985 and has since been refined and adapted for various applications. The text-based file format for representing nucleotide or protein sequences, which originates from the FASTA program, has become a standard in bioinformatics, and many other sequence database search tools also use it.
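For readers unfamiliar with the FASTA file format mentioned above: each record starts with a header line beginning with ">", followed by one or more lines of sequence. A minimal reader might look like the sketch below; the function name and the example records are made up for illustration.

```python
def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Sequence lines belonging to one record are joined into a single string.
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    example = ">seq1 example protein\nMKVLA\nGGT\n>seq2\nACGT\n"
    print(parse_fasta(example))
```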

FASTA Programs

FASTA was originally developed for comparing protein sequences, and the original program was referred to as FASTP. It quickly became a popular tool for sequence alignment and database searching and has been continually updated and improved. Several FASTA programs are now available, each used for a different type of sequence search:
FASTA compares a DNA query sequence against a database of DNA sequences, or a protein query sequence against a database of protein sequences, using the FASTA algorithm.
SSEARCH performs protein-protein or DNA-DNA comparisons using the Smith-Waterman algorithm.
GGSEARCH/GLSEARCH compare protein or nucleotide sequences using a global alignment algorithm (GGSEARCH) or an alignment that is global in the query and local in the database sequence (GLSEARCH).
FASTX/FASTY compare a DNA sequence with a database of protein sequences by translating the DNA sequence in three frames and allowing gaps and frameshifts.
TFASTX/TFASTY compare a protein sequence with a database of DNA sequences. Each DNA sequence is translated in six frames: three in the forward direction and three in the reverse direction.
FASTF/TFASTF compare mixed peptide sequences against a protein (FASTF) or translated DNA (TFASTF) database.
FASTS/TFASTS compare a set of short peptide fragments against a protein (FASTS) or translated DNA (TFASTS) database.

How FASTA Works

FASTA compares a query sequence to a database of sequences to identify similar matches. The program uses a heuristic algorithm to search the database quickly and identify the most significant matches. Its working mechanism is described in the following steps.

Step 1: Identifying Regions. The first step, also called the hashing step, identifies regions of high similarity by creating a lookup table for the query sequence. To create the lookup table, the query sequence is broken down into short words known as k-tuples (ktup). Increasing the ktup value reduces the number of background word hits, so the algorithm can focus on the more relevant hits; this speeds up the search, at some cost in sensitivity. The ktup value is usually 2 for proteins and 6 for nucleotide sequences. Once the lookup table is created, it is used to identify matches between the k-tuples in the query sequence and the database sequences. Similar regions appear as diagonals in a two-dimensional matrix of query position against database position. The ten regions with the highest density of word matches are the high-similarity regions, and these ten best diagonals are saved.
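The idea of collecting k-tuple matches along diagonals can be sketched as follows. The sketch simply counts word hits per diagonal (subject position minus query position) and keeps the top ten, which is a simplification: it ignores how densely the hits are packed within each diagonal. The function name is hypothetical.

```python
from collections import Counter


def best_diagonals(query, subject, ktup, top=10):
    """Return up to `top` (diagonal, hit_count) pairs, most hits first."""
    # Lookup table of query k-tuples and their positions.
    table = {}
    for q in range(len(query) - ktup + 1):
        table.setdefault(query[q:q + ktup], []).append(q)

    # Every shared k-tuple adds one hit to the diagonal it lies on.
    diagonals = Counter()
    for s in range(len(subject) - ktup + 1):
        for q in table.get(subject[s:s + ktup], []):
            diagonals[s - q] += 1
    return diagonals.most_common(top)


if __name__ == "__main__":
    print(best_diagonals("ACGTACGT", "GGACGTACGT", ktup=2))
```

In the example, the query matches the subject shifted by two positions, so diagonal 2 collects the most hits.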

Step 2: Re-Scoring. In the second step, the ten best diagonals are rescored using a suitable scoring matrix. For proteins, a BLOSUM50 or PAM matrix is used; for DNA sequences, an identity matrix is used. The subregion with the highest score is identified within each rescored diagonal. These high-scoring subregions are called initial regions.

Step 3: Joining Threshold. Next, a score cutoff, the joining threshold, is applied to exclude segments unlikely to be part of the final alignment. The library sequences are ranked based on their initial scores. Regions with initial scores above the pre-set threshold are selected and checked to see whether they can be joined together. Joining regions from different diagonals introduces gaps, and a penalty is subtracted for each gap. The resulting gapped alignment score is used to rank the database sequences by similarity.

Step 4: Final Alignment. Finally, the gapped alignment is refined to produce the final alignment. This is done with the banded Smith-Waterman algorithm, a dynamic programming algorithm restricted to a band around the initial region, which calculates the optimal score (opt) for the alignment. This score is used for statistical calculations.
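Two of these steps lend themselves to a short sketch: rescoring a diagonal to find its best subregion, and a banded Smith-Waterman score. The code below is illustrative only. It assumes simple match, mismatch and linear gap scores instead of BLOSUM50 or PAM, returns scores without tracing back the alignment, and uses hypothetical function names and parameters.

```python
def rescore_diagonal(query, subject, diagonal, match=2, mismatch=-1):
    """Score of the best ungapped subregion on one diagonal (subject - query)."""
    best = running = 0
    for q in range(len(query)):
        s = q + diagonal
        if not 0 <= s < len(subject):
            continue
        pair = match if query[q] == subject[s] else mismatch
        running = max(0, running + pair)  # restart when the score goes negative
        best = max(best, running)
    return best


def banded_smith_waterman(query, subject, diagonal, band=3,
                          match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells within `band` of `diagonal`."""
    best = 0
    prev = {}  # scores of the previous row, keyed by subject column
    for i in range(1, len(query) + 1):
        cur = {}
        center = i + diagonal
        lo = max(1, center - band)
        hi = min(len(subject), center + band)
        for j in range(lo, hi + 1):
            pair = match if query[i - 1] == subject[j - 1] else mismatch
            # Cells outside the band count as 0, i.e. a fresh local start.
            score = max(
                0,
                prev.get(j - 1, 0) + pair,  # align the two residues
                prev.get(j, 0) + gap,       # gap in the subject
                cur.get(j - 1, 0) + gap,    # gap in the query
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


if __name__ == "__main__":
    q, s = "ACGTACGT", "ACGTTACGT"
    print(rescore_diagonal(q, s, 0), rescore_diagonal(q, s, 1))
    print(banded_smith_waterman(q, s, diagonal=0, band=2))
```

In the example, the subject contains one extra base. Each diagonal alone covers only part of the match, while the banded alignment joins the two parts across a single gap and scores higher than either.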

Statistical Significance and FASTA

FASTA also provides an estimate of the statistical significance of each alignment found. Significance is evaluated using the E-value, which is the number of alignments with an equal or better score expected to occur by chance in a search of the database. The smaller the E-value, the more significant the alignment. The E-value is not the only statistical parameter: FASTA also reports the bit score and the similarity score, which is based on the scoring matrix and gap penalties. The FASTA output includes a further statistical parameter, the Z-score, which represents the number of standard deviations by which a score lies above the mean score of the database search. A higher Z-score indicates a more significant match.
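The Z-score idea can be illustrated with a plain mean and standard deviation over all scores from a search. This is only a sketch of the definition given above: the function name and the example scores are made up, and any further normalization that a real search program applies is not modelled.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Return {name: (score - mean) / standard deviation} for a set of scores."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        return {name: 0.0 for name in scores}  # all scores identical
    return {name: (value - mu) / sigma for name, value in scores.items()}


if __name__ == "__main__":
    # Hypothetical similarity scores for three database sequences.
    print(z_scores({"seqA": 10, "seqB": 20, "seqC": 30}))
```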

Applications of FASTA

FASTA has a wide range of applications. Some are:
FASTA can be used in sequence alignment to identify regions of similarity. This is useful for finding conserved regions in DNA or protein sequences, which can point to functional domains or motifs and so provide insights into the biological function of the sequence.
FASTA can be used to search large databases of sequences for matches to a given query sequence. This helps to identify homologous sequences, which can help to predict the function of a newly identified sequence.
FASTA can support the construction of phylogenetic trees: aligning sequences from different species helps identify the evolutionary relationships between them.