FASTA

6,246 views 14 slides Oct 04, 2020

About This Presentation

In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede...


Slide Content

FASTA
Amandeep Singh
Assistant Professor
Department of Biotechnology
GSSDGS Khalsa College, Patiala

Introduction
FASTA is a similarity-search algorithm: it compares a nucleotide or protein
query sequence against the sequences in a biological database.
Nucleotide Sequence (Query)
Protein Sequence (Query)
Nucleotide Sequence (Database)
Protein Sequence (Database)

FASTA Algorithm
It starts from a dot plot (dot matrix).
[Figure: dot plot. Rows: first sequence (query), A B M D L F. Columns: second sequence (database), A B C D E F.]
Regions of similarity between the two sequences appear as diagonals.
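The dot-plot idea can be sketched in a few lines of Python. This is a minimal illustration using the two example sequences from the slide, not the code FASTA itself runs; the function names `dot_plot` and `render` are hypothetical.

```python
def dot_plot(query, database):
    """Return a boolean matrix: True where query[i] equals database[j]."""
    return [[q == d for d in database] for q in query]


def render(query, database):
    """Draw the matrix as text, with '*' for a match and '.' for a mismatch."""
    matrix = dot_plot(query, database)
    lines = ["  " + " ".join(database)]
    for residue, row in zip(query, matrix):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    # Query on the rows, database sequence on the columns, as on the slide.
    print(render("ABMDLF", "ABCDEF"))
```

Matching positions that line up along a diagonal correspond to a stretch of the two sequences that is similar without gaps.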

FASTA Algorithm
•FASTA goes a step beyond the dot plot.
•It calculates the sum of dots along each diagonal.
•It is a “word”-based method.
•It looks for matching “words”, short sequence patterns called “k-tuples”.
Tuple: a finite ordered list of elements
Sequence patterns: 1 or 2 amino acids, or 5 or 6 nucleotides
•Build local alignments using these “words” or “k-tuples”:
•Match identical “words”.
•Create diagonals by joining adjacent matches.
•Rescore the highest-scoring regions using a PAM or BLOSUM matrix.
•The best of these scores is called init1.
•Join segments using gaps; the best score from this step is called initn.
•Use dynamic programming (the Smith-Waterman algorithm) to create the optimal alignment.
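The first two steps (matching identical words and summing the matches along each diagonal) can be sketched as follows. This is a simplified illustration, not the FASTA source code: the function names are hypothetical, and the later steps (rescoring with PAM or BLOSUM, joining segments into initn, Smith-Waterman) are left out.

```python
from collections import defaultdict


def index_ktuples(seq, k):
    """Map every k-letter word in seq to the positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def diagonal_hits(query, target, k=2):
    """Count identical k-tuple matches on each diagonal.

    A match of query position i with target position j lies on the
    diagonal i - j, so matches sharing an offset share a diagonal.
    """
    lookup = index_ktuples(query, k)
    hits = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in lookup.get(target[j:j + k], []):
            hits[i - j] += 1
    return dict(hits)


def best_diagonal(query, target, k=2):
    """Return (offset, match_count) for the densest diagonal, or None."""
    hits = diagonal_hits(query, target, k)
    if not hits:
        return None
    return max(hits.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Made-up protein fragments sharing the stretch "AAQI".
    print(best_diagonal("HARFYAAQIVL", "VDMAAQIA", k=2))
```

Building the lookup table once for the query means each database sequence is scanned in a single pass, which is why word-based searching is much cheaper than running full dynamic programming on every database entry.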

FASTA Algorithm

FASTA Implementation
FASTA3 (https://www.ebi.ac.uk/Tools/sss/fasta/) at the EBI is one of
the most popular FASTA implementations.

FASTA Output
•The Histogram
•The Sequence listing
•The Local alignments

FASTA Output
The Histogram
•The first part of the FASTA output is the histogram.
•The predicted extreme-value distribution is marked with asterisks (*).
•The numbers actually obtained are shown with equal signs (=).
•First column: z-opt score
•Second column: number of sequences with this z-opt score
•Third column: expected number of alignments
The histogram is used to judge whether the statistical theory is valid for the search.
•If the equal signs follow the predicted values: valid
•If the equal signs do not follow the predicted values: invalid

FASTA Output: The Histogram

FASTA Output: The Sequence listing
•Listing of the best scoring sequences in the database.
•Best sequence: reported first
•Worst sequence: reported last
•First column: database, database accession number and database identifier
•Second column: total length of the database sequence
•Opt column: final score
•Last column: E-value

FASTA Output: The Sequence listing

FASTA Output: The Local alignments
Display:
The local alignment
Init1 & Initn scores
E-value
Opt-score
Z-score
Percent identity

Significance of E-Value
•The E-value, or expected value, is the number of alignments with at least
this score that are expected to occur by chance.
•The smaller the E-value, the less likely it is that a given alignment
occurred by chance.
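To make the relationship concrete, here is a minimal sketch of how an E-value can be derived from a standardized score. It assumes the score `z` has already been normalized to mean 0 and unit variance and follows an extreme value (Gumbel) distribution, and that the E-value is the tail probability multiplied by the number of database sequences. The normalization FASTA applies to raw scores is not reproduced here, and the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def tail_probability(z):
    """P(Z > z) for a Gumbel distribution with mean 0 and variance 1."""
    exponent = -math.pi / math.sqrt(6.0) * z - EULER_GAMMA
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-math.exp(exponent))


def e_value(z, database_size):
    """Expected number of chance hits scoring at least z (assumed model)."""
    return database_size * tail_probability(z)


if __name__ == "__main__":
    for z in (2.0, 5.0, 10.0):
        print(z, e_value(z, database_size=100_000))
```

Under this model the E-value falls quickly as the score rises and grows in proportion to database size, so the same alignment is less significant when found in a larger database.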

Variants of FASTA
•FASTA - Compares a DNA query sequence to a DNA database, or a
protein query to a protein database, detecting the sequence type
automatically.
•FASTX - Compares a DNA query to a protein database. It may
introduce gaps only between codons.
•FASTY - Compares a DNA query to a protein database, optimizing
gap location, even within codons.
•TFASTA - Compares a protein query to a DNA database.