In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.
Added: Oct 04, 2020
FASTA
Amandeep Singh
Assistant Professor
Department of Biotechnology
GSSDGS Khalsa College, Patiala
Introduction
FASTA uses a heuristic algorithm to search a biological database for
nucleotide or protein sequences that are similar to a query sequence.
Nucleotide sequence (query) is searched against nucleotide sequences (database)
Protein sequence (query) is searched against protein sequences (database)
FASTA Algorithm
It starts from a dot plot (dot matrix).
[Figure: dot plot. Columns: second sequence (database), A B C D E F. Rows: first sequence (query), A B M D L F.]
The dot plot shows regions of similarity between the 2 sequences, represented as diagonals.
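As a minimal sketch, the dot plot for the two example sequences on this slide (query ABMDLF, database ABCDEF) can be built by marking every cell where the two residues are identical. The function names `dot_matrix` and `render` are illustrative only and are not part of any FASTA program.

```python
def dot_matrix(query, subject):
    """Return a grid where cell [i][j] is True when query[i] == subject[j]."""
    return [[q == s for s in subject] for q in query]


def render(query, subject):
    """Draw the matrix as text: database sequence across the top, query down the side."""
    lines = ["  " + " ".join(subject)]
    for q, row in zip(query, dot_matrix(query, subject)):
        lines.append(q + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("ABMDLF", "ABCDEF"))
#   A B C D E F
# A * . . . . .
# B . * . . . .
# M . . . . . .
# D . . . * . .
# L . . . . . .
# F . . . . . *
```

The four dots all lie on the main diagonal, which is the kind of diagonal run that signals a region of similarity.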
•FASTA goes a step beyond the dot plot.
•It calculates the sum of dots along each diagonal.
•It is a “word”-based method.
•It looks for matching “words”, short sequence patterns called “k-tuples”.
Tuple: a finite ordered list of elements
Sequence patterns: 1 or 2 amino acids, or 5 or 6 nucleotides
•It builds local alignments from these “words” or “k-tuples”:
•Match identical “words”.
•Create diagonals by joining adjacent matches.
•Rescore the highest-scoring diagonal regions using a PAM or BLOSUM matrix.
•The best of these scores is called init1.
•Join segments, allowing gaps; the best score from this step is called initn.
•Use dynamic programming (Smith-Waterman algorithm) to create the optimal alignment.
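The first and last of the steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the FASTA program's own code: `diagonal_hits` only counts identical k-tuples per diagonal (the PAM/BLOSUM rescoring and the init1/initn steps are left out), and `smith_waterman_score` uses toy match/mismatch scores and a linear gap penalty in place of a substitution matrix. All function names and scoring values are hypothetical.

```python
from collections import defaultdict


def ktuple_index(seq, k):
    """Map every word of length k in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k):
    """Count identical k-tuple matches on each diagonal (subject pos - query pos)."""
    index = ktuple_index(query, k)
    hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            hits[j - i] += 1
    return dict(hits)


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score; toy scoring values, linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


query, subject = "HEAGAWGHEE", "PAWHEAE"
hits = diagonal_hits(query, subject, k=2)
print(hits)                        # {-3: 1, 3: 2, -4: 1}
print(max(hits, key=hits.get))     # 3, the diagonal with the most word hits
print(smith_waterman_score(query, subject))
```

Indexing the query words once and then scanning the database sequence is what makes the word-based search much cheaper than filling a full dynamic programming matrix for every database entry.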
FASTA Implementation
FASTA3 (https://www.ebi.ac.uk/Tools/sss/fasta/) at the EBI is one of
the most popular FASTA implementations.
FASTA Output
•The Histogram
•The Sequence listing
•The Local alignments
The Histogram
•The first part of the FASTA output is the histogram.
•The number of sequences predicted by the extreme value distribution is marked with an asterisk (*).
•The number of sequences actually observed is shown with equals signs (=).
•First column: z-opt score
•Second column: number of sequences with this z-opt score
•Third column: expected number of alignments
The histogram is used to judge whether the statistical theory is valid for the search.
•If the equals signs follow the predicted values: valid
•If the equals signs do not follow the predicted values: invalid
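To make the comparison of equals signs and asterisks concrete, the sketch below renders one histogram row from a z-opt score, an observed count and an expected count. The row layout, the `per_symbol` scaling and the function name `histogram_row` are assumptions made for illustration; they do not reproduce the exact output format of any FASTA release, and the counts in the example are made up.

```python
def histogram_row(z_opt, observed, expected, per_symbol=1):
    """Format one row: '=' marks observed counts, '*' marks the expected count."""
    n_obs = observed // per_symbol
    n_exp = expected // per_symbol
    bar = ["="] * n_obs
    if n_exp > 0:
        # Pad with spaces when the expected position lies beyond the observed bar.
        bar.extend(" " * (n_exp - len(bar)))
        bar[n_exp - 1] = "*"
    return f"{z_opt:>5} {observed:>5} {expected:>5}:{''.join(bar)}"


# Made-up counts, for illustration only.
for row in [(40, 3, 3), (50, 6, 4), (60, 2, 5)]:
    print(histogram_row(*row))
```

When the end of each run of equals signs sits close to the asterisk, the observed counts follow the prediction; large gaps between the two in many rows suggest the statistical estimates should not be trusted.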
FASTA Output: The Sequence listing
•Listing of the best scoring sequences in the database.
•Best sequence: reported first
•Worst sequence: reported last
Each entry in the listing shows:
•First and second columns: the database, the database identifier and the database accession number
•Total length of the database sequence
•Opt column: final score
•Last column: E-value
FASTA Output: The Local alignments
Display:
The local alignment
init1 and initn scores
E-value
Opt-score
Z-score
Percent identity
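Of the values displayed, percent identity is the easiest to reproduce by hand. The sketch below assumes the denominator is the number of alignment columns, gaps included; other conventions exist, and this is not necessarily the one FASTA reports. The function name `percent_identity` is illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns that hold identical residues.

    Assumption: every column counts in the denominator, including gap columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)


print(percent_identity("AC-GT", "ACTGA"))  # 60.0 (A, C and G match in 5 columns)
```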
Significance of E-Value
•The E-value, or expected value, is the number of alignments with a score at least this good that are expected to occur by chance.
•The smaller the E-value, the less likely it is that a given alignment occurred by chance.
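One way to see how an E-value arises is to multiply the probability of reaching a score by chance by the number of database sequences searched. The sketch below assumes that z-scores are scaled to a mean of 50 and a standard deviation of 10 and that they follow an extreme value (Gumbel) distribution; the scaling defaults and the function name `evalue_from_z` are assumptions for illustration, not the program's exact code.

```python
import math

EULER_GAMMA = 0.5772156649015329


def evalue_from_z(z, db_size, mean=50.0, sd=10.0):
    """Expected number of chance hits with z-score >= z among db_size sequences."""
    x = (z - mean) / sd  # standardize to mean 0, standard deviation 1
    # Upper tail of a Gumbel distribution with zero mean and unit variance:
    # P(X >= x) = 1 - exp(-exp(-(pi / sqrt(6)) * x - gamma))
    t = math.exp(-(math.pi / math.sqrt(6.0)) * x - EULER_GAMMA)
    p = -math.expm1(-t)  # numerically stable form of 1 - exp(-t)
    return db_size * p


for z in (50, 100, 120, 150):
    print(z, evalue_from_z(z, db_size=10_000))
```

The E-value falls steeply as the z-score rises, and for the same z-score it grows in proportion to the size of the database.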
Variants of FASTA
•FASTA: Compares a DNA query sequence to a DNA database, or a
protein query to a protein database, detecting the sequence type
automatically.
•FASTX: Compares a DNA query to a protein database. It may
introduce gaps only between codons.
•FASTY: Compares a DNA query to a protein database, optimizing
gap location, even within codons.
•TFASTA: Compares a protein query to a DNA database.