How to Apply Bioinformatics in Proteomics. Seyed Mohammad Motevalli, December 2013
Outline: Introduction to bioinformatics; Biological databases; Sequence alignment and their algorithms; Structural prediction; Web-based tools; Stand-alone software
Introduction to bioinformatics: What is bioinformatics? Bioinformatics is an interdisciplinary research area at the interface between computer science and biological science.
Introduction to bioinformatics: What are the differences between bioinformatics and informatics? What are the differences between bioinformatics and computational biology? What is an algorithm?
What is proteomics?
Biological databases: Database: a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. Entry: each record (entry) should contain a number of fields that hold the actual data items. Value: a particular piece of information held in a field. Making a query: to retrieve a particular record from the database, a user specifies a value to be found in a particular field and expects the computer to retrieve the whole data record.
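The entry / field / value / query vocabulary can be illustrated with a minimal in-memory sketch. The accessions, field names, and the `query` helper are invented for illustration and do not correspond to any real database or API.

```python
# Toy "database": each entry (record) is a dict mapping field -> value.
# Accessions and field names are made up for illustration only.
records = [
    {"accession": "X0001", "organism": "Homo sapiens", "length": 120},
    {"accession": "X0002", "organism": "Mus musculus", "length": 98},
    {"accession": "X0003", "organism": "Homo sapiens", "length": 310},
]


def query(db, field, value):
    """Return every whole record whose `field` holds exactly `value`."""
    return [rec for rec in db if rec.get(field) == value]


hits = query(records, "organism", "Homo sapiens")
print([rec["accession"] for rec in hits])  # ['X0001', 'X0003']
```

The query names one field and one value, and the whole matching records come back, which is the behaviour described on the slide.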
Biological databases: Primary databases: GenBank (NCBI) www.ncbi.nlm.nih.gov; EMBL www.ebi.ac.uk/embl/index.html; DDBJ www.ddbj.nig.ac.jp. Secondary databases: ExPASy http://web.expasy.org; PIR http://pir.georgetown.edu/pirwww/pirhome3.shtml; Swiss-Prot www.ebi.ac.uk/swissprot/access.html
Biological databases Interconnection between Biological Databases
Biological databases: Pitfalls of biological databases. The causes of redundancy include repeated submission of identical or overlapping sequences by the same or different authors, revision of annotations, and dumping of expressed sequence tag (EST) data. Redundant sequences versus non-redundant sequences (RefSeq).
Sequence alignment and their algorithms: Pairwise sequence alignment. Pairwise sequence alignment is the process of aligning two sequences and is the basis of database similarity searching and multiple sequence alignment. Sequence similarity versus sequence homology: when two sequences are descended from a common evolutionary origin, they are said to have a homologous relationship or to share homology. A related but different term is sequence similarity, which is the percentage of aligned residues that are similar in physicochemical properties such as size, charge, and hydrophobicity. Sequence similarity versus sequence identity: in a protein sequence alignment, sequence identity refers to the percentage of matches of the same amino acid residues between two aligned sequences. Similarity refers to the percentage of aligned residues that have similar physicochemical characteristics and can be more readily substituted for each other.
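The difference between identity and similarity can be made concrete with a small sketch. The residue grouping below is one simplified physicochemical classification chosen for illustration (an assumption, not taken from the slides); real tools derive similarity from a substitution matrix.

```python
# Hypothetical grouping of amino acids into physicochemical classes.
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
CLASS_OF = {aa: i for i, grp in enumerate(GROUPS) for aa in grp}


def identity_and_similarity(a, b):
    """Percent identity and percent similarity of two aligned sequences.

    `a` and `b` must have equal length; '-' marks a gap. Percentages are
    taken over the full alignment length (other denominators are in use).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # gap positions count toward neither measure
        if x == y:
            identical += 1
            similar += 1  # identical residues are also similar
        elif x in CLASS_OF and CLASS_OF.get(x) == CLASS_OF.get(y):
            similar += 1
    n = len(a)
    return 100.0 * identical / n, 100.0 * similar / n


print(identity_and_similarity("AKDEL", "AREQL"))  # (40.0, 80.0)
```

In the example, K/R and D/E count as similar but not identical, so similarity exceeds identity.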
Sequence alignment and their algorithms: Sequence alignment strategies. Global alignment: in global alignment, the two sequences to be aligned are assumed to be generally similar over their entire length. Alignment is carried out from beginning to end of both sequences to find the best possible alignment across the entire length of the two sequences. Local alignment: local alignment does not assume that the two sequences in question are similar over their entire length. It only finds the local regions with the highest level of similarity between the two sequences and aligns these regions without regard for the alignment of the remaining sequence regions.
Sequence alignment and their algorithms: Linear gap penalty: the cost for creation and extension of gaps is the same, W(I) = g × I, where g is the cost of each gap position and I is the gap length. Affine gap penalty: creation and extension have different costs, W(I) = g_open + g_ext × (I − 1), where opening a gap is penalized more heavily than extending it (|g_open| > |g_ext|).
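The two gap models can be compared numerically with a short sketch. Penalties are written here as positive costs, and the default values (2, 10, 1) are arbitrary illustrative assumptions; the usual convention that opening a gap costs more than extending it is assumed.

```python
def linear_gap(length, g=2):
    """Linear model: every gap position costs the same, W(I) = g * I."""
    return g * length


def affine_gap(length, g_open=10, g_ext=1):
    """Affine model: W(I) = g_open + g_ext * (I - 1) for a gap of length I >= 1."""
    if length == 0:
        return 0
    return g_open + g_ext * (length - 1)


# One gap of length 3 versus three separate gaps of length 1:
print(linear_gap(3), 3 * linear_gap(1))  # 6 6   (linear model cannot tell them apart)
print(affine_gap(3), 3 * affine_gap(1))  # 12 30 (affine model favours one long gap)
```

Under the affine model a single long gap is much cheaper than several short ones, which reflects the idea that one insertion or deletion event is more plausible than many.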
Sequence alignment and their algorithms: Alignment algorithms and methods: the dot matrix method; the word method; the dynamic programming method
Sequence alignment and their algorithms: Alignment algorithms: The dot matrix method. The most basic sequence alignment method is the dot matrix method, also known as the dot plot method.
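A dot plot can be sketched in a few lines: one sequence runs along each axis and a dot marks every pair of identical residues, so similar regions show up as diagonal runs. This minimal version uses no window or stringency filter, and the helper names are illustrative.

```python
def dot_matrix(seq1, seq2):
    """Boolean matrix: cell [i][j] is True when seq1[i] == seq2[j]."""
    return [[a == b for b in seq2] for a in seq1]


def render(seq1, seq2):
    """Text rendering of the dot plot ('*' = match, '.' = no match)."""
    rows = ["  " + " ".join(seq2)]
    for a, row in zip(seq1, dot_matrix(seq1, seq2)):
        rows.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(rows)


print(render("GATTACA", "GATCACA"))
```

Practical dot-plot tools typically compare windows of several residues rather than single residues to reduce background noise.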
Sequence alignment and their algorithms: Alignment algorithms: The word method. It works by finding short stretches of identical or nearly identical letters in two sequences. These short strings of characters are called words, which are similar to the windows used in the dot matrix method.
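The word-finding step can be sketched as a lookup of short identical substrings. This is a simplified illustration that handles exact matches only; the function name and the word length k = 3 are assumptions, and "nearly identical" words are not covered.

```python
from collections import defaultdict


def word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where identical k-letter words occur."""
    # Index every word of the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    # Look up each query word in the index.
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


print(word_hits("ACGTAC", "TTACGT"))  # [(0, 2), (1, 3), (3, 1)]
```

Hits that fall on the same diagonal (equal difference between the two positions) are the candidates a word-based method would go on to extend into an alignment.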
Sequence alignment and their algorithms: Alignment algorithms: The dynamic programming method. Dynamic programming is a method that determines the optimal alignment by matching two sequences for all possible pairs of characters between the two sequences.
Sequence alignment and their algorithms: Alignment algorithms: The dynamic programming method. Global alignment: the classical global pairwise alignment algorithm using dynamic programming is the Needleman–Wunsch algorithm. In this algorithm, an optimal alignment is obtained over the entire lengths of the two sequences. Local alignment: the first application of dynamic programming to local alignment is the Smith–Waterman algorithm. In this algorithm, positive scores are assigned for matching residues, and any cell of the scoring matrix whose score would fall below zero is set to zero, so the matrix contains no negative scores.
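Both algorithms fill the same kind of scoring matrix and differ mainly in initialization and in whether cell scores are floored at zero. The sketch below returns only the best score (no traceback) and uses a linear gap penalty; the scoring values are illustrative assumptions.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming.

    local=False: Needleman-Wunsch style global alignment.
    local=True:  Smith-Waterman style local alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                score = max(score, 0)  # local: never let a cell go negative
                best = max(best, score)
            H[i][j] = score
    # Global score sits in the bottom-right cell; local score is the matrix maximum.
    return best if local else H[-1][-1]


print(align_score("GAT", "GT"))                                   # 0
print(align_score("AAAGATTACATTT", "CCGATTACAGG", local=True))    # 7
```

The local call recovers the shared GATTACA core and ignores the unrelated flanking residues, which a global alignment would be forced to include.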
Sequence alignment and their algorithms: Substitution matrices: PAM matrices (point accepted mutation). The PAM matrices were derived from the evolutionary divergence between sequences of the same cluster. One PAM unit is defined as 1% of the amino acid positions having been changed. Because very closely related homologs were used, the observed mutations were not expected to significantly change the common function of the proteins.
Sequence alignment and their algorithms: Substitution matrices: BLOSUM matrices. This is the series of blocks amino acid substitution matrices (BLOSUM), all of which are derived from direct observation of every possible amino acid substitution in multiple sequence alignments.
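Substitution matrix entries of this kind are log-odds scores: the observed frequency of a residue pair in the alignments is compared with the frequency expected by chance. The sketch below shows only that final step, with invented frequencies and the half-bit scaling (factor 2, log base 2) assumed; how the observed and expected frequencies are counted from blocks is not shown.

```python
import math


def log_odds_score(observed_freq, expected_freq, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected)."""
    return round(scale * math.log2(observed_freq / expected_freq))


# Invented example frequencies (not taken from any real matrix):
print(log_odds_score(0.04, 0.01))   # pair seen 4x more often than chance -> 4
print(log_odds_score(0.005, 0.01))  # pair seen half as often as chance   -> -2
```

Positive scores mark substitutions seen more often than chance and negative scores mark those seen less often.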
Sequence alignment and their algorithms: Which matrices should be used, and when?
Comparison: PAM is based on an evolutionary model using phylogenetic trees. BLOSUM assumes no evolutionary model and is instead based on conserved “blocks” of proteins.
Sequence alignment and their algorithms: Heuristic database searching. Heuristic algorithms perform faster searches because they examine only a fraction of the possible alignments examined in regular dynamic programming. BLAST (basic local alignment search tool): BLAST uses heuristics to align a query sequence with all sequences in a database.
Sequence alignment and their algorithms: BLAST parameters: minimum score (S); neighborhood score threshold (T); threshold for stopping extension (X). Negative scores come from the scoring matrix. If the extension stops once the score has dropped by more than X, the resulting alignment is called a high-scoring segment pair (HSP). 6. Finishing
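The role of the extension threshold X can be sketched with a simplified ungapped, rightward-only extension from a seed. The match/mismatch scores, the default X, and the function name are illustrative assumptions; this is not BLAST's actual implementation.

```python
def extend_right(query, subject, q_start, s_start, x_drop=3, match=1, mismatch=-1):
    """Ungapped rightward extension from a seed position.

    Extension stops when the running score falls more than `x_drop`
    below the best score seen so far. Returns (best_score, best_length).
    """
    score = best = best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # score dropped too far: give up extending
    return best, best_len


best, length = extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0)
print(best, length)  # 7 7

S = 5  # hypothetical minimum score for keeping a segment pair
print(best >= S)     # True
```

The extension runs four residues past the last match before the drop exceeds X, then reports the best-scoring prefix, which is kept only if it reaches the minimum score S.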
Sequence alignment and their algorithms: Suggested BLAST cutoffs. For nucleotide-based searches: hits with E-values of 10^-6 or less and sequence identity of 70% or more. For protein-based searches: hits with E-values of 10^-3 or less and sequence identity of 25% or more. Chance matches are more likely in nucleotide databases than in protein databases, and a given level of identity is more informative for proteins than for nucleic acids.
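The suggested cutoffs translate directly into a filter over search hits. The hit records below are invented for illustration, and the dictionary keys are hypothetical rather than the output format of any BLAST program.

```python
def passes_cutoff(evalue, identity_pct, molecule):
    """Apply the suggested E-value and identity cutoffs for the given molecule type."""
    if molecule == "nucleotide":
        return evalue <= 1e-6 and identity_pct >= 70
    if molecule == "protein":
        return evalue <= 1e-3 and identity_pct >= 25
    raise ValueError("molecule must be 'nucleotide' or 'protein'")


# Invented example hits (hypothetical field names):
hits = [
    {"id": "hit1", "evalue": 1e-30, "identity": 45.0},
    {"id": "hit2", "evalue": 0.5, "identity": 60.0},
    {"id": "hit3", "evalue": 1e-5, "identity": 22.0},
]
kept = [h["id"] for h in hits if passes_cutoff(h["evalue"], h["identity"], "protein")]
print(kept)  # ['hit1']
```

Both conditions must hold: hit2 fails on E-value and hit3 fails on identity.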
Sequence alignment and their algorithms: BLAST (basic local alignment search tool). BLASTN queries nucleotide sequences against a nucleotide sequence database. BLASTP uses protein sequences as queries to search against a protein sequence database. BLASTX uses nucleotide sequences as queries and translates them in all six reading frames to produce translated protein sequences, which are used to query a protein sequence database. TBLASTN queries protein sequences against a nucleotide sequence database with the sequences translated in all six reading frames. TBLASTX uses nucleotide sequences, which are translated in all six frames, to search against a nucleotide sequence database that has all its sequences translated in six frames.
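Translation "in all six reading frames" means three offsets on the given strand plus three on its reverse complement. The sketch below uses the standard genetic code; the function names are illustrative, and ambiguous or incomplete codons are simply rendered as "X" (an assumption).

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations for frames +1..+3 (forward) and -1..-3 (reverse complement)."""
    dna = dna.upper()
    rev = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


print(six_frames("ATGGCCTAA"))
```

For the example sequence, frame +1 reads ATG GCC TAA and gives "MA*", while frame -1 reads the reverse complement TTAGGCCAT and gives "LGH".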
Sequence alignment and their algorithms: PSI-BLAST. Position-specific iterated BLAST (PSI-BLAST) builds profiles and performs database searches in an iterative fashion. The main feature of PSI-BLAST is that profiles are constructed automatically and are fine-tuned in each successive cycle.
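The idea of a profile can be sketched as a position-specific table of log-odds scores built from the columns of an alignment. This toy version assumes a uniform background frequency and a flat pseudocount, and it omits the sequence weighting and the iterative search loop of PSI-BLAST itself; all names are illustrative.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_seqs, pseudocount=1.0):
    """One dict of residue -> log2-odds score per alignment column."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    profile = []
    for column in zip(*aligned_seqs):
        counts = Counter(res for res in column if res in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in ALPHABET
        })
    return profile


def score(profile, seq):
    """Sum of position-specific scores for an ungapped sequence of profile length."""
    return sum(col[res] for col, res in zip(profile, seq))


profile = build_profile(["ACD", "ACE", "ACD"])
print(score(profile, "ACD") > score(profile, "WWW"))  # True
```

In an iterative scheme, sequences found with the current profile would be added to the alignment and the profile rebuilt before the next search round.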
Sequence alignment and their algorithms: Multiple sequence alignment. Exhaustive algorithms: the exhaustive alignment method involves examining all possible aligned positions simultaneously. Heuristic algorithms: because the use of dynamic programming is not feasible for routine multiple sequence alignment, faster heuristic algorithms have been developed. A heuristic is a computational strategy for finding a near-optimal solution by using rules of thumb. Essentially, this strategy takes shortcuts by reducing the search space according to certain criteria.
Sequence alignment and their algorithms: Multiple sequence alignment: Heuristic algorithms. Progressive alignment: progressive alignment depends on the stepwise assembly of a multiple alignment and is heuristic in nature. Clustal: a progressive multiple alignment program available either as a stand-alone or an on-line program. T-Coffee: T-Coffee performs progressive sequence alignments as in Clustal. The main difference is that, in processing a query, T-Coffee performs both global and local pairwise alignment for all possible pairs involved. The global pairwise alignment is performed using the Clustal program.
Sequence alignment and their algorithms: Multiple sequence alignment: Heuristic algorithms. Iterative alignment: the iterative approach is based on the idea that an optimal solution can be found by repeatedly modifying existing suboptimal solutions.
Sequence alignment and their algorithms: Multiple sequence alignment: Heuristic algorithms. Block-based alignment: the strategy identifies a block of ungapped alignment shared by all the sequences, hence the name block-based local alignment strategy.
Structural prediction: Structural prediction methods. Ab initio prediction: computational prediction based on first principles or using the most elementary information. Threading: a method of predicting the most likely protein structural fold based on secondary structure similarity with database structures and assessment of the energies of the potential fold. The term has been used interchangeably with fold recognition. Homology-based modeling: a method for predicting the three-dimensional structure of a protein of unknown structure by using an existing homologous protein structure as a template.
Hidden Markov model (HMM): a statistical model composed of a number of interconnected Markov chains, with the capability to generate the probability value of an event by taking into account the influence of hidden variables. Mathematically, it calculates probability values of connected states among the Markov chains to find an optimal path within the network of states. It requires training to obtain the probability values of state transitions. When a hidden Markov model is used to represent a multiple sequence alignment, a sequence can be generated through the model by incorporating probability values of match, insertion, and deletion states.
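Finding the optimal path through the hidden states is commonly done with the Viterbi algorithm, sketched below on a toy two-state model. The states ("helix", "coil"), the two-letter observation alphabet, and all probabilities are invented for illustration; a profile HMM with match, insertion, and deletion states would be larger but is decoded in the same way.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable hidden-state path, its probability)."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               state at step t-1 on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        layer = {}
        for s in states:
            prob, source = max((prev[r][0] * trans_p[r][s], r) for r in states)
            layer[s] = (prob * emit_p[s][obs], source)
        best.append(layer)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][state][0]
    path = [state]
    for layer in reversed(best[1:]):
        state = layer[state][1]
        path.append(state)
    return path[::-1], prob


# Toy model with made-up probabilities.
states = ("helix", "coil")
start_p = {"helix": 0.5, "coil": 0.5}
trans_p = {"helix": {"helix": 0.8, "coil": 0.2},
           "coil": {"helix": 0.2, "coil": 0.8}}
emit_p = {"helix": {"A": 0.9, "G": 0.1},
          "coil": {"A": 0.1, "G": 0.9}}

path, prob = viterbi("AAGG", states, start_p, trans_p, emit_p)
print(path)  # ['helix', 'helix', 'coil', 'coil']
```

In practice the transition and emission probabilities come from training on known examples, and log probabilities are used to avoid numerical underflow on long sequences.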
Neural network algorithm: a machine-learning algorithm for pattern recognition. It is composed of input, hidden, and output layers. Units of information in each layer are called nodes. The nodes of different layers are interconnected to form a network analogous to a biological nervous system. Between the nodes are mathematical weight parameters that can be trained with known patterns so that they can be used for later predictions. After training, the network is able to recognize the correlation between an input and an output.
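The layered structure can be sketched as a forward pass through one hidden layer. The weights below are arbitrary placeholders standing in for trained values, the sigmoid activation is an assumption, bias terms are omitted, and training itself is not shown.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_hidden, w_output):
    """Propagate inputs through the hidden layer to the output layer.

    w_hidden[h] holds the weights from every input node to hidden node h;
    w_output[o] holds the weights from every hidden node to output node o.
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(weights, inputs)))
              for weights in w_hidden]
    return [sigmoid(sum(w * h for w, h in zip(weights, hidden)))
            for weights in w_output]


# 2 input nodes -> 2 hidden nodes -> 1 output node, placeholder weights.
w_hidden = [[1.0, 1.0], [1.0, -1.0]]
w_output = [[2.0, -2.0]]
print(forward([1.0, 0.0], w_hidden, w_output))
```

Training would adjust `w_hidden` and `w_output` so that outputs for known patterns approach their expected values.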
Web-based tools: Alignment tools. Sequence-based methods: T-Coffee http://tcoffee.crg.cat/apps/tcoffee/do:regular; NCBI http://blast.ncbi.nlm.nih.gov/Blast.cgi; UniProt http://www.uniprot.org; EMBL http://coot.embl.de/Alignment. Structure-based methods: Dali server http://ekhidna.biocenter.helsinki.fi/dali_server; FSSP http://protein.hbu.cn/fssp. Signal peptide resource: http://proline.bic.nus.edu.sg/spdb/searchn.html. Active site prediction: http://www.scfbio-iitd.res.in/dock/ActiveSite.jsp