What is Sequence Alignment? In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.
A sequence alignment, produced by ClustalO, of mammalian histone proteins. Sequences are the amino acids for residues 120-180 of the proteins. Residues that are conserved across all sequences are highlighted in grey. Below the protein sequences is a key denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ).
Importance of sequence alignment Sequence determines structure, and structure determines function. By studying sequence similarities, we can find correlations between sequences, structures, functions and evolutionary linkages. Knowing the strategies for aligning nucleotide or protein sequences is therefore a fundamental area of bioinformatics. A newly determined sequence should be compared with the sequences that already exist in the databases in order to infer its structure, function and evolutionary linkages. This process of comparison is sequence alignment: sequences are compared by searching for common character patterns and establishing residue-to-residue correspondence among related sequences.
Interpretation If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggests that this region has structural or functional importance.
What is INDEL? Indel is a molecular biology term for the insertion or the deletion of bases in the DNA of an organism. Its definition differs slightly between evolutionary studies and germline and somatic mutation studies. In evolutionary studies, indel is used to mean an insertion or a deletion, and indels simply refers to the mutation class that includes insertions, deletions, and combinations thereof, including insertion and deletion events that may be separated by many years and may not be related to each other in any way. In germline and somatic mutation studies, indel describes a special mutation class, defined as a mutation resulting in both an insertion of nucleotides and a deletion of nucleotides, where both changes are nearby on the DNA and together produce a net change in the total number of nucleotides. A microindel is defined as an indel that results in a net change of 1 to 50 nucleotides.
What are point mutations? A point mutation, or single base modification, is a type of mutation that causes a single nucleotide base substitution, insertion, or deletion in the genetic material, DNA or RNA. The term frameshift mutation indicates the addition or deletion of a base pair (more generally, of a number of bases that is not a multiple of three), which shifts the reading frame of all downstream codons.
Nonsense mutations Code for a stop, which can truncate the protein. A nonsense mutation converts an amino acid codon into a termination codon. This causes the protein to be shortened because the stop codon interrupts its normal code. How much of the protein is lost determines whether or not the protein is still functional.
Missense mutations Code for a different amino acid. A missense mutation changes a codon so that a different amino acid is incorporated and an altered protein is created; it is a non-synonymous change.
Conservative mutations Result in an amino acid change in which the properties of the amino acid remain the same (e.g., hydrophobic, hydrophilic, etc.). At times, a change to one amino acid in the protein is not detrimental to the organism as a whole. Most proteins can withstand one or two point mutations before their function changes.
Non-conservative mutations Result in an amino acid change that has different properties than the wild type. The protein may lose its function, which can result in a disease in the organism. For example, sickle-cell disease is caused by a single point mutation (a missense mutation) in the beta-hemoglobin gene that converts a GAG codon into GUG, which encodes the amino acid valine rather than glutamic acid. The protein may also exhibit a "gain of function" or become activated, as is the case with the mutation changing a valine to glutamic acid in the BRAF gene; this leads to an activation of the RAF protein, which causes unlimited proliferative signalling in cancer cells. These are both examples of a non-conservative (missense) mutation.
Silent mutations Code for the same amino acid. A silent mutation does not change the amino acid sequence of the protein. A single nucleotide can change, but the new codon specifies the same amino acid, resulting in an unmutated protein. This type of change is called a synonymous change, since the old and new codons code for the same amino acid. This is possible because the genetic code is redundant: of the 64 codons, 61 specify only 20 amino acids and 3 are stop codons. Different codons can lead to differential protein expression levels, however.
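The mutation classes above can be illustrated with a short Python sketch. It builds the standard genetic code from the usual compact TCAG-ordered string and labels a single-codon change as silent, missense or nonsense. The function name classify_substitution is an illustrative choice, not part of any library, and the sketch assumes the reference codon codes for an amino acid (it does not handle loss of a stop codon or frameshifts).

```python
# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(ref_codon, alt_codon):
    """Label a codon change as 'silent', 'missense' or 'nonsense'.

    Assumption: ref_codon is a sense codon. RNA codons (with U) are accepted.
    """
    ref_aa = CODON_TABLE[ref_codon.upper().replace("U", "T")]
    alt_aa = CODON_TABLE[alt_codon.upper().replace("U", "T")]
    if ref_aa == alt_aa:
        return "silent"      # synonymous: same amino acid
    if alt_aa == "*":
        return "nonsense"    # amino acid codon became a stop codon
    return "missense"        # a different amino acid is encoded


# The sickle-cell change quoted in the text: GAG (Glu) to GUG (Val).
print(classify_substitution("GAG", "GUG"))
```

Whether a missense change is conservative or non-conservative additionally depends on how the amino acids are grouped by their properties, which this sketch deliberately leaves out.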
DNA and protein are the products of evolution. DNA and proteins are biological macromolecules composed of nucleotides and amino acids, respectively, arranged in linear sequences, and these sequences determine the primary structure of the molecules. These molecules also store the history of evolution. Evolutionary traces are present in the sequences because residues that perform key functional and structural roles tend to be preserved by natural selection, while residues that are less crucial tend to mutate more frequently. For example, the active-site residues of an enzyme family tend to be conserved because they are responsible for catalytic function. Hence, by alignment, we can identify conserved and variable regions (patterns of conservation and variation).
The degree of sequence conservation in the alignment reflects the evolutionary relatedness among different species, while variations between sequences reflect the changes that have occurred during evolution in the form of substitutions, insertions and deletions. We can infer the function of unknown sequences by identifying the evolutionary relationships between sequences. If we find a “significant similarity” among sequences, we can say that they belong to the same family (protein family). By using the information (structure and function) of a known protein sequence, we can predict the structure and function of uncharacterized sequences. If two sequences have significant similarity, we can infer that they descended from a common ancestor.
Significance of sequence alignment It is helpful in the determination of function, structure, and evolutionary relationships.
Methods of Sequence Alignment There are two methods: pairwise sequence alignment and multiple sequence alignment (MSA).
Bioinformatics tools for sequence alignment Tools for pairwise sequence alignment include Needle (EMBOSS), Stretcher (EMBOSS), Water (EMBOSS), Matcher (EMBOSS), LALIGN, Wise2DBA, GeneWise and PromoterWise. Tools for MSA include Clustal Omega, Kalign, MAFFT, MUSCLE, Mview, T-Coffee and WebPRANK.
Pairwise Sequence Alignment It is the process of comparing two sequences. An alignment has three aspects. Quantity: to what degree the sequences are similar (%). Quality: the regions of similarity in a given sequence. Optimal alignment: the alignment with the maximum similarity and the fewest differences.
We can compare a diseased genome or proteome with a healthy one, or an unknown gene with a known gene. We can compare the sequence of one organism with those of other organisms to see how closely they are related. We can also compare two proteins of the same family. In pairwise sequence alignment, the concepts of homology, similarity and identity are important.
Homology vs. Similarity vs. Identity Sequence Homology When two sequences have arisen from a common evolutionary origin (the same biological ancestor), their relationship is said to be homologous. Sequences that share a common evolutionary origin are homologous, and they often, though not always, perform the same function. It is a qualitative term: two sequences either are homologous or are not.
Sequence Similarity It is the percentage of aligned residues that are similar in physicochemical properties such as size, charge and hydrophobicity. It is a quantitative term. Similarity does not by itself imply homology: for example, subtilisin and chymotrypsin share a similar catalytic mechanism but differ in structure and size, a classic case of convergent evolution rather than common ancestry. At the nucleotide level, ATCGGC and ATCGCG are similar but not identical. For proteins, similarity is based on how close the physicochemical and biochemical properties of the aligned amino acids are.
Sequence Identity It refers to the percentage of positions at which two aligned sequences have the same residue. ATCGGC and ATCGGC are identical. Safe zone: if two protein sequences are 30% or more identical, they are in the safe zone, meaning that they can be regarded as homologous, that is, descended from a common ancestor.
Twilight zone: when the identity of two sequences falls below the safe zone but above 20%, they are in the twilight zone. Such sequences may be homologous (similarity inherited from a common ancestor) or non-homologous (similarity that arose without a common ancestor), and identity alone cannot tell which. Midnight zone: if the identity of two sequences is less than 20%, a common ancestor cannot be reliably inferred from the sequences alone; this range is called the midnight zone.
The three zones of protein alignments. Two protein sequences can be regarded as homologous if the percentage sequence identity falls in the safe zone. Sequence identity values below the zone boundary, but above 20%, are considered to be in the twilight zone, where homologous relationships are less certain. The region below 20% is the midnight zone, where homologous relationships cannot be reliably determined.
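As a rough illustration of identity and the three zones, the following Python sketch computes percent identity from two already-aligned rows and maps it to a zone. The cut-offs (30% and 20%) are the ones quoted in this text; in practice the safe-zone boundary also depends on alignment length. The choice of denominator (columns without gaps) is an assumption, since several conventions are in use, and both function names are illustrative.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of gap-free aligned columns that hold the same residue.

    Assumption: identity is computed over columns where neither row has a gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    columns = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def identity_zone(pct, safe_cutoff=30.0, midnight_cutoff=20.0):
    """Map a percent identity to 'safe', 'twilight' or 'midnight'."""
    if pct >= safe_cutoff:
        return "safe"
    if pct >= midnight_cutoff:
        return "twilight"
    return "midnight"


# The two example strings from the text: 4 of 6 positions agree.
pct = percent_identity("ATCGGC", "ATCGCG")
print(round(pct, 1), identity_zone(pct))
```

The zone labels are meant for protein alignments; the nucleotide strings are used here only because they are the examples given in the text.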
Types of pairwise sequence alignment There are two types of pairwise sequence alignment: local alignment and global alignment.
Local alignment Local alignment aligns only the most similar regions between sequences. Local alignments are more suitable for aligning sequences that are dissimilar overall or differ in length but share regions of similarity. It determines the local regions with the highest level of similarity or identity between two sequences. In local alignment, the alignment stops at the end of the regions of similarity. Dashes in alignments indicate that these parts of the sequences are not included. Local alignment is used to find conserved domains and conserved nucleotide patterns, and is also applied in protein and DNA sequencing.
Local alignments
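A minimal Python sketch of local alignment scoring follows. It uses the dynamic-programming recurrence usually attributed to Smith and Waterman, a name this text does not use, so treat the attribution and the scoring values (match +1, mismatch -1, gap -1) as assumptions made for illustration. Cell values are never allowed to drop below zero, which is what lets the alignment start and stop at the edges of a similar region.

```python
def local_align_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best_score, (end_i, end_j)) of the best local alignment.

    end_i and end_j are 1-based positions in a and b where the best
    local alignment ends. Only two rows of the grid are kept in memory.
    """
    best, best_end = 0, (0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score of 0 restarts the alignment: the local-alignment rule.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            if cell > best:
                best, best_end = cell, (i, j)
        prev = curr
    return best, best_end


# Two otherwise unrelated strings that share the region ACGT.
print(local_align_score("TTTACGTAAA", "GGACGTCC"))
```

A negative mismatch score matters here: with a mismatch score of 0 the local alignment would have no reason to stop at the end of the similar region.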
Global alignment In this alignment, the entire sequence is aligned. The two sequences are assumed to be similar over their entire length. Sequences of high similarity and approximately the same length are suitable candidates for global alignment. The alignment is carried out from the beginning to the end of both sequences.
Significance of pairwise alignment We can identify shared domains, identify duplicated regions, identify important features such as catalytic domains and disulphide bridges, and compare a gene with its product.
Methods of pairwise sequence alignment There are four methods: alignment by hand, dot plot, dynamic programming (slow, optimal), and the k-tuple word method (FASTA and BLAST).
Align by Hand It uses a scoring system. If the aligned characters are identical, a positive score is given; if the characters are different, a negative score is given. The total score of the alignment is usually called its quality.
Dot plot Also called the dot matrix method. It is a graphical presentation of a pairwise sequence alignment. The two sequences are placed along the two axes of a graph. Where the residues of the two sequences match, a dot is placed; where they do not, the position is left blank. Afterwards, a diagonal is drawn that covers most of the dots. Interruptions in the middle of the diagonal indicate insertions or deletions. A major problem with dot plots is noise: small diagonals that do not provide significant or meaningful information.
Advantages Can be used to align protein and nucleotide sequences. Helpful for analyzing long insertions, deletions and repeats. Provides a pictorial statement of the relationship between two sequences. Disadvantages This method does not give an optimal alignment. It is difficult to scale the method up to multiple alignment. Web tools for dot plot Webserver, dotmatcher, dottup, dot helix, matrix plot.
Dynamic programming It determines the optimal alignment by matching pairs of residues between two sequences. It creates a two-dimensional alignment grid (matrix) in which the two sequences are compared. In a simple scheme, an identical match is assigned a score of 1, a mismatch 0 and a gap a penalty of -1. Gaps are due to insertions or deletions. It is a time-consuming procedure.
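The grid-filling procedure can be sketched in Python using the simple scores quoted above (match 1, mismatch 0, gap -1). This is the global-alignment recurrence commonly known as Needleman-Wunsch; the text does not name the algorithm, so the name and the function global_align are assumptions for illustration. The sketch uses a linear gap penalty and returns one optimal alignment, although several may exist.

```python
def global_align(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = global_align("ATCG", "ATG")
print(score)
print(top)
print(bottom)
```

The grid has (n + 1) x (m + 1) cells and each cell is filled in constant time, which is why the method is exact but slow for long sequences or database-scale searches.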
An alignment procedure requires a scoring system, called a substitution matrix. Scoring matrices are used to determine the relative score made by matching two characters in a sequence alignment. These are usually log-odds of the likelihood of two characters being derived from a common ancestral character. There are many flavors of scoring matrices for amino acid sequences, nucleotide sequences, and codon sequences, and each is derived from the alignment of "known" homologous sequences. These alignments are then used to determine the likelihood of one character being at the same position in the sequence as another character.
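The log-odds idea can be made concrete with a small Python sketch. The score compares how often two residues are seen aligned in trusted alignments with how often they would be paired by chance given their background frequencies. All frequencies below are made-up numbers for illustration, and the scaling factor of 2 (half-bit units with a base-2 logarithm) is an assumption; published matrices use their own scaling and rounding.

```python
import math


def log_odds_score(pair_freq, freq_i, freq_j, scale=2.0):
    """Rounded, scaled log-odds score for aligning residues i and j.

    pair_freq: observed frequency of the aligned pair (hypothetical here)
    freq_i, freq_j: background frequencies of the two residues
    """
    odds = pair_freq / (freq_i * freq_j)
    return round(scale * math.log2(odds))


# Pair seen twice as often as chance predicts: positive score.
print(log_odds_score(0.02, 0.1, 0.1))
# Pair seen half as often as chance predicts: negative score.
print(log_odds_score(0.005, 0.1, 0.1))
```

A positive score therefore marks a substitution that is seen more often than expected among related sequences, and a negative score marks one that is seen less often.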
The scoring matrix for nucleotide sequences is simple: a positive value or high score is given for a match and a negative value or low score for a mismatch. Scoring matrices for amino acids are more complicated because a score is given not only to identical amino acid residues but also to amino acids that have similar biochemical properties.
Substitution matrices are used to score aligned positions in a sequence alignment procedure, usually of amino acid or nucleotide sequences. There are two main types of scoring matrix: PAM (Point Accepted Mutation) by Margaret Dayhoff and BLOSUM (Blocks Substitution Matrix) by Steven and Jorja Henikoff.
PAM "Point" refers to a mutation that replaces a single residue, and "accepted" means that the mutation has been accepted by natural selection, that is, fixed in the population. These mutations are not drastic. Good for global alignment. Based on evolutionary relationships. A PAM scoring matrix is a table of odds, where the odds are the ratio of the likelihoods of two events: here, that two residues are aligned because of common ancestry versus by chance. PAM matrices are derived from an analysis of observed amino acid substitutions in families of closely related sequences. PAM1 is a scoring system for sequences in which 1% of the residues have undergone mutation. Dayhoff measured the empirical probability of such substitutions and constructed PAM1, in which each element represents the likely rate of substitution if 1% of the amino acids in a protein were to change. A PAM2 matrix is created by assuming that there was another 1% change that could possibly include some of the same amino acids that have already changed once. Repetition of this process 250 times results in the PAM250 matrix, the likely rates of amino acid substitutions (or changes) after a 250% amino acid turnover.
PAM250 is the most popular PAM matrix
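The step from PAM1 to PAM250 amounts to multiplying the PAM1 mutation probability matrix by itself repeatedly. The Python sketch below shows this on a toy three-letter alphabet. TOY_PAM1 is a hypothetical matrix invented for illustration (each residue has a 1% chance of changing per step); it is not Dayhoff's data, and real PAM matrices are 20 x 20 and are converted to log-odds scores afterwards.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_power(matrix, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = matrix
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical PAM1-like matrix: 99% chance a residue is unchanged after one step.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam2 = mat_power(TOY_PAM1, 2)
toy_pam250 = mat_power(TOY_PAM1, 250)
print(toy_pam2[0][0])    # slightly above 0.98: some sites changed twice
print(toy_pam250[0][0])  # still above 1/3 for this toy matrix
```

In this toy case the chance of seeing the original residue after 250 steps is still about 0.35, because many of the changes hit sites that had already changed. This is why a 250% turnover does not mean that every position differs.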
BLOSUM Matrix It is suited to local alignment and is built from “blocks”. Blocks are locally conserved regions (regions that are related in terms of structure and function). It is based on multiple sequence comparison.
K-tuple More efficient than dynamic programming, e.g. BLAST and FASTA. BLAST: after discovering an unknown gene in the mouse, a scientist will typically perform a BLAST search against the human genome to see whether humans carry a similar gene. BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. FASTA: a k-tuple search program; its name is also used for the text-based format for representing nucleotide or protein sequences.
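The core of the k-tuple (word) approach can be sketched in Python: index every word of length k in one sequence, then look up the words of the other sequence to find exact word matches, often called seeds. This sketch covers only that lookup step; real programs such as BLAST and FASTA add scoring thresholds, seed extension and statistics that are not shown. The function names and the default k=3 are assumptions for illustration.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map each word of length k in seq to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = index_words(subject, k)
    seeds = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for s_pos in index.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds


# Both seeds lie on the same diagonal (query_pos - subject_pos == 2),
# which hints at one longer matching region.
print(find_seeds("GATTACA", "TTAC"))
```

Because the index is built once and each lookup is a dictionary access, the search avoids filling a full alignment grid, which is the source of the speed advantage over dynamic programming.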