Introduction to sequence alignment part II

6,443 views 23 slides Dec 22, 2019

About This Presentation

gaps and gap penalty


Slide Content

INTRODUCTION TO SEQUENCE ALIGNMENT PART 2

CONTENT
1. Method to write an alignment
2. A match, a gap, and indels
3. Representation of substitution, deletion, and insertion in traces
4. Features of gap
5. Causes of gaps
6. Occurrence of gaps
7. Types of gaps and gap penalties
8. Constant gap penalty
9. Linear
10. Affine
11. Convex
12. Profile-based variable gap penalties
13. Highlights of gap and gap penalty
14. Example of assigning gaps and gap penalties

METHOD TO WRITE AN ALIGNMENT
When two symbolic representations of DNA or protein sequences are arranged next to one another so that their most similar elements are juxtaposed, they are said to be aligned. Alignments are conventionally shown as traces. In a symbolic sequence, each base or residue monomer is represented by a single-letter code. The convention is to print the codes for the constituent monomers in order in a fixed-width font (from the N-terminal to the C-terminal end of the protein sequence in question, or from 5' to 3' of a nucleic acid molecule). This assumes that the monomers are evenly spaced along the single dimension of the molecule's primary structure.

A MATCH, A GAP, AND INDELS
Every element in a trace is either a match or a gap.
MATCH: Where a residue in one of two aligned sequences is identical to its counterpart in the other, the corresponding letter codes in the two sequences are vertically aligned in the trace.
GAP: When a residue in one sequence seems to have been deleted since the assumed divergence of the sequence from its counterpart, its "absence" is labelled by a dash in the derived sequence. These dashes represent "gaps" in one or other sequence, i.e. such mutations are annotated as gaps in the sequences.
GAPPING: The action of inserting such spacers is known as gapping. A deletion in one sequence is symmetric with an insertion in the other: when a residue appears to have been inserted to produce a longer sequence 'A', a dash appears opposite it in the unaugmented sequence 'B'.
INDELS: Because of this symmetry, the two types of mutation are referred to together as indels.
If we imagine that at some point one of the sequences was identical to its primitive homologue, then a trace can represent the three ways in which the sequences can diverge through mutation.

REPRESENTATION OF SUBSTITUTION, DELETION, AND INSERTION IN TRACES
1. A trace can represent a substitution (such as a point accepted mutation): e.g. amino acid V changes to I because of a change in the corresponding codon in the DNA.
2. A trace can represent a deletion: a residue or subsequence is deleted from a sequence, e.g. amino acid E is deleted from the sequence because its codon is absent from the DNA.
   VCGED
   VCG-D
3. A trace can represent an insertion: a residue or subsequence is inserted into a sequence, e.g. amino acid L is inserted into the sequence because its codon has been added to the DNA.
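The three cases can be told apart column by column once a trace is written as two equal-length rows. The following is a minimal sketch, not a standard library routine: the function name `classify_columns` and the convention that the first row is the ancestral sequence and the second the derived one are assumptions made for illustration. It uses the VCGED / VCG-D trace from the slide.

```python
def classify_columns(ancestral: str, derived: str) -> list[str]:
    """Label each column of a two-row trace.

    Both rows must already be aligned: equal length, with '-' marking a gap.
    The first row is treated as the ancestral sequence (an assumption).
    """
    if len(ancestral) != len(derived):
        raise ValueError("aligned rows must have equal length")
    labels = []
    for a, d in zip(ancestral, derived):
        if a == "-" and d == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if d == "-":
            labels.append("deletion")      # residue missing from the derived row
        elif a == "-":
            labels.append("insertion")     # residue added in the derived row
        elif a == d:
            labels.append("match")
        else:
            labels.append("substitution")  # e.g. V replaced by I
    return labels


print(classify_columns("VCGED", "VCG-D"))
# ['match', 'match', 'match', 'deletion', 'match']
```

Swapping the two rows turns every deletion into an insertion, which is the symmetry that motivates the single term "indel".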

FEATURES AND IMPORTANCE OF GAP AND GAP PENALTY
A gap is a maximal consecutive run of spaces in a single string of a given alignment. It corresponds to an atomic insertion or deletion of a substring. Insertions or deletions often comprise an entire subsequence and often arise from a single mutational event. Because a single mutational event can create gaps of different sizes, a gap needs to be scored as a whole when two sequences are aligned. The Gap program considers all possible alignments and gap positions between two sequences and creates a global alignment that maximizes the number of matched residues and minimizes the number and size of gaps.

A scoring matrix is used to assign values for symbol matches. In addition, a gap creation penalty and a gap extension penalty are required to limit the insertion of gaps into the alignment. Gap uses the alignment method of Needleman and Wunsch (1970), which has been shown to be equivalent to that of Sellers (1974). The algorithm of Needleman and Wunsch finds the alignment of two complete sequences that maximizes the number of matches. Treating several adjacent gap positions as one larger gap avoids assigning an excessively high cost to what may be a single mutation. For instance, two protein sequences may be broadly similar yet differ over certain intervals because one protein has a subunit that the other lacks. Representing these differing subsequences as gaps allows such cases to be treated as "good matches" even though the alignment contains long consecutive runs of indel operations. A good gap penalty model therefore avoids unduly low alignment scores and improves the chances of finding the true alignment.
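To make the global alignment idea concrete, here is a minimal sketch of Needleman-Wunsch style dynamic programming. It is not the Gap program itself: the scoring values (+1 match, -1 mismatch, -1 per gap position) and the function name `needleman_wunsch` are assumptions chosen for illustration, and a simple per-position gap cost is used instead of separate creation and extension penalties.

```python
def needleman_wunsch(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    # score[i][j] holds the best score for aligning s[:i] with t[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # s[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # t[:j] aligned against gaps only

    def sub(i: int, j: int) -> int:
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align s[i-1] with t[j-1]
                score[i - 1][j] + gap,            # gap in t
                score[i][j - 1] + gap,            # gap in s
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


print(needleman_wunsch("VCGED", "VCGD"))
# (3, 'VCGED', 'VCG-D')
```

With these assumed scores the four matches contribute 4 and the single gap position costs 1, giving 3.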

CAUSES OF GAPS
1. A single mutation can create a gap (very common).
2. An error in DNA replication can result in the repetition of strings of bases.
3. Unequal crossover in meiosis can lead to insertion or deletion of strings of bases.
4. Translocation of DNA between chromosomes.
5. Retrovirus insertion.

OCCURRENCE OF GAPS
A gap can occur:
1. Before the first character of a string.
2. Inside the string.
3. After the last character of a string.
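The three positions can be illustrated on a single aligned row. This is a small sketch under stated assumptions: the row `--ACG-T-` is a made-up example (the slide's own examples did not survive extraction), and the function name `gap_positions` is hypothetical. Each gap is taken as a maximal run of '-' characters.

```python
import re


def gap_positions(aligned_row: str) -> list[tuple[str, int, int]]:
    """Return (where, start, length) for every maximal run of '-' in a row."""
    found = []
    for run in re.finditer(r"-+", aligned_row):
        if run.start() == 0:
            where = "leading"       # before the first character
        elif run.end() == len(aligned_row):
            where = "trailing"      # after the last character
        else:
            where = "internal"      # inside the string
        found.append((where, run.start(), run.end() - run.start()))
    return found


print(gap_positions("--ACG-T-"))
# [('leading', 0, 2), ('internal', 5, 1), ('trailing', 7, 1)]
```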

TYPES OF GAPS AND GAP PENALTIES
Types of gap penalties are as follows:
1. Constant
2. Linear
3. Affine
4. Convex
5. Profile-based variable gap penalties

CONSTANT GAP PENALTY
This is the simplest type of gap penalty: a fixed negative score is given to every gap, regardless of its length. Example: two short DNA sequences are aligned, with each '-' depicting a gap position of one base pair. If each match is worth 1 point and the whole gap -1, the total score is 7 - 1 = 6 (seven matches, one gap).
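The arithmetic can be checked with a short sketch. The sequences below are an assumption: the slide's figure is missing, so a pair with seven matches and one three-base gap was chosen to reproduce the 7 - 1 = 6 total. The function name `constant_gap_score` is hypothetical, and mismatched columns are simply scored 0 here.

```python
import re


def constant_gap_score(row_a: str, row_b: str, match: int = 1, gap_penalty: int = 1) -> int:
    """Score an alignment with a fixed penalty per gap, whatever its length."""
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    # Each maximal run of '-' counts as one gap.
    gaps = len(re.findall(r"-+", row_a)) + len(re.findall(r"-+", row_b))
    return matches * match - gaps * gap_penalty


# Assumed example rows: seven matches and a single gap of three positions.
print(constant_gap_score("ATTGACCTGA", "AT---CCTGA"))
# 6
```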

LINEAR GAP PENALTY
Unlike the constant gap penalty, the linear gap penalty takes the length (L) of each insertion/deletion into account. If the penalty for each inserted/deleted element is B and the length of the gap is L, the total gap penalty is the product of the two, B × L. This method favors shorter gaps, with the total score decreasing with each additional gap position. With a match scoring 1 and each gap position -1, the score here is 7 - 3 = 4.
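A short sketch of the linear scheme follows. The sequences are an assumption (the slide's figure is missing); they were chosen to have seven matches and one gap of three positions so that the 7 - 3 = 4 total is reproduced. The function name `linear_gap_score` is hypothetical, and mismatched columns are scored 0 for simplicity.

```python
def linear_gap_score(row_a: str, row_b: str, match: int = 1, per_position: int = 1) -> int:
    """Score an alignment where every gap position costs the same amount (B)."""
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    # Total penalty is B x L summed over all gaps, i.e. B times the dash count.
    gap_length = row_a.count("-") + row_b.count("-")
    return matches * match - per_position * gap_length


# Assumed example rows: seven matches, one gap with L = 3.
print(linear_gap_score("ATTGACCTGA", "AT---CCTGA"))
# 4
```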

AFFINE GAP PENALTY
The most widely used gap penalty function is the affine gap penalty, which combines the components of the constant and linear gap penalties and takes the form A + (B × L). Here A is the gap opening penalty, B the gap extension penalty, and L the length of the gap. Gap opening is the cost required to open a gap of any length, and gap extension is the cost of extending an existing gap by 1. The affine gap penalty encourages the extension of existing gaps rather than the introduction of new ones.
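The formula A + (B × L) is easy to tabulate. In this sketch the values A = 5 and B = 1 are arbitrary assumptions chosen only to show the pattern, and the function name `affine_gap_penalty` is hypothetical.

```python
def affine_gap_penalty(length: int, opening: int, extension: int) -> int:
    """Penalty A + (B x L) for one gap of the given length."""
    if length < 1:
        raise ValueError("a gap has length at least 1")
    return opening + extension * length


A, B = 5, 1  # assumed values, for illustration only
for L in (1, 2, 3, 4):
    print(L, affine_gap_penalty(L, A, B))
# 1 6
# 2 7
# 3 8
# 4 9
```

Each extra position adds only B, whereas starting a second gap would add A + B again. That difference is why the affine form favors extending an existing gap over opening a new one.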

AFFINE GAP PENALTY (cont.)
It is often unclear what the values of A and B should be, as they differ according to purpose. In general, if the interest is in finding closely related matches (e.g. removal of vector sequence during genome sequencing), a higher gap penalty should be used to reduce gap openings. Conversely, the gap penalty should be lowered when looking for a more distant match. The relationship between A and B also influences gap size. If the size of the gap is important, a small A and a large B (making gaps costlier to extend) are used, and vice versa.
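The effect of the A-to-B relationship can be seen by costing two hypothetical candidate alignments under two parameter settings. Everything here is an illustrative assumption: the gap-length lists, the parameter pairs (10, 1) and (1, 5), and the function name `total_affine_penalty`.

```python
def total_affine_penalty(gap_lengths: list[int], opening: int, extension: int) -> int:
    """Sum A + (B x L) over every gap in a candidate alignment."""
    return sum(opening + extension * length for length in gap_lengths)


one_long_gap = [6]             # candidate 1: a single gap of six positions
several_short_gaps = [1, 1, 1]  # candidate 2: three separate one-position gaps

for opening, extension in [(10, 1), (1, 5)]:
    long_cost = total_affine_penalty(one_long_gap, opening, extension)
    short_cost = total_affine_penalty(several_short_gaps, opening, extension)
    print(f"A={opening}, B={extension}: one long gap={long_cost}, several short gaps={short_cost}")
# A=10, B=1: one long gap=16, several short gaps=33
# A=1, B=5: one long gap=31, several short gaps=18
```

With a large A and small B the single long gap is cheaper; with a small A and large B the short gaps win, which is the trade-off described above.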

CONVEX GAP PENALTY
Using the affine gap penalty requires assigning fixed penalty values for both opening and extending a gap, which can be too rigid in a biological context. The logarithmic gap penalty takes the form G(L) = A + C ln L and was proposed because studies had shown that the distribution of indel sizes obeys a power law. Another issue raised with affine gaps is that they favor alignments with shorter gaps. The logarithmic gap penalty was devised to modify the affine penalty so that long gaps become more acceptable. In practice, however, logarithmic models have been found to produce poorer alignments than affine models.
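The difference in growth between G(L) = A + C ln L and the affine form can be seen numerically. In this sketch the parameter values (A = 2, C = 1, B = 1) are arbitrary assumptions and both function names are hypothetical.

```python
import math


def logarithmic_gap_penalty(length: int, a: float, c: float) -> float:
    """Penalty G(L) = A + C ln L for one gap of the given length."""
    if length < 1:
        raise ValueError("a gap has length at least 1")
    return a + c * math.log(length)


def affine_gap_penalty(length: int, a: float, b: float) -> float:
    """Penalty A + (B x L), shown for comparison."""
    return a + b * length


# Assumed parameter values, for illustration only.
for L in (1, 10, 100):
    log_cost = logarithmic_gap_penalty(L, a=2.0, c=1.0)
    aff_cost = affine_gap_penalty(L, a=2.0, b=1.0)
    print(f"L={L}: logarithmic={log_cost:.2f}, affine={aff_cost:.2f}")
```

Under the logarithmic form each additional position costs less than the one before, so very long gaps are penalized far less than under the affine form.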

PROFILE-BASED VARIABLE GAP PENALTIES
Profile-profile alignment algorithms are powerful tools for detecting protein homology relationships with improved alignment accuracy. Profile-profile alignments are based on the statistical indel frequency profiles from multiple sequence alignments generated by PSI-BLAST searches. Rather than using substitution matrices to measure the similarity of amino acid pairs, profile-profile alignment methods require a profile-based scoring function to measure the similarity of profile vector pairs. Profile-profile alignments employ gap penalty functions. The gap information is usually used in the form of indel frequency profiles, which are more specific to the sequences to be aligned.

PROFILE-BASED VARIABLE GAP PENALTIES (cont.)
ClustalW and MAFFT adopted this kind of gap penalty determination for their multiple sequence alignments. Alignment accuracy can be improved using this model, especially for proteins with low sequence identity. Some profile-profile alignment algorithms also use secondary structure information as one term in their scoring functions, which improves alignment accuracy.
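The general idea of a position-specific gap penalty can be sketched as follows. This is purely illustrative: the scaling rule (opening penalty multiplied by one minus the indel frequency) is an assumption made for this example and is not the formula used by ClustalW, MAFFT, or any particular profile-profile method. The function name and the frequency values are also hypothetical.

```python
def position_specific_opening(base_opening: float, indel_frequency: list[float]) -> list[float]:
    """Lower the gap opening penalty at columns where indels are frequent.

    indel_frequency[i] is the assumed fraction of profile sequences that have
    an indel at column i (between 0 and 1). The linear scaling is illustrative.
    """
    penalties = []
    for freq in indel_frequency:
        if not 0.0 <= freq <= 1.0:
            raise ValueError("frequencies must lie between 0 and 1")
        penalties.append(base_opening * (1.0 - freq))
    return penalties


# Hypothetical profile: indels never seen at column 0, common at column 2.
print(position_specific_opening(10.0, [0.0, 0.25, 0.5]))
# [10.0, 7.5, 5.0]
```

Columns where the profile shows frequent indels become cheaper places to open a gap, while conserved columns keep the full penalty.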

HIGHLIGHTS OF GAP AND GAP PENALTY
1. By inserting an element (a gap) into a sequence alignment, it is possible to achieve a good residue-to-residue alignment at some other neighboring point in the sequence.
2. Insertion of gaps into pairwise sequence alignments allows the alignment to be extended into regions where one sequence may have lost or gained sequence characters not found in the other.
3. A penalty is subtracted for each gap introduced into an alignment because the gap increases the uncertainty of the alignment.
4. The gap penalty is used to help decide whether or not to accept a gap.

HIGHLIGHTS OF GAP AND GAP PENALTY (cont.)
5. If the gap penalty is very low or not included (so that gaps can be introduced at any position), a good alignment score is achievable even between unrelated or random sequences, which is not desired.
6. Genetically, it is expected that a protein will more readily accept a different residue at a position than have parts of its sequence removed or inserted.
7. Gaps (insertions or deletions) should therefore be rarer than point mutations or substitutions. Gaps are nevertheless introduced into alignments to optimize the alignment score.
8. It may be concluded that a variety of gap penalties (from zero to some significant penalty) must be tried, and the effect of each on the result determined.

EXAMPLE OF ASSIGNING GAPS AND GAP PENALTIES
This is an extension to Advanced Dynamic Programming. The scores used are +2 for a match, -2 for a gap, and -1 for a mismatch.
[Figure: alignment scored with the regular gap penalty.]
[Figure: assignment of affine gap penalties to the first alignment.]

EXAMPLE OF ASSIGNING GAPS AND GAP PENALTIES (cont.)
[Figure: the regular gap penalty alignment written a second way, without changing the score.]
[Figure: rescoring of the second alignment using affine gap penalties.]
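Since the figures are not available, the following sketch only imitates the kind of comparison they describe. The match, mismatch, and gap values (+2, -1, -2) come from the slide. The sequences, the affine values (opening 2, extension 1), and the function names are assumptions chosen so that two alignments of the same pair tie under the regular penalty but differ under the affine one.

```python
import re

MATCH, MISMATCH, GAP = 2, -1, -2  # values stated on the slide


def column_score(row_a: str, row_b: str) -> int:
    """Score the match and mismatch columns only, skipping gap columns."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            continue
        total += MATCH if x == y else MISMATCH
    return total


def regular_score(row_a: str, row_b: str) -> int:
    """Regular scheme: every gap position costs the same."""
    gap_positions = row_a.count("-") + row_b.count("-")
    return column_score(row_a, row_b) + GAP * gap_positions


def affine_score(row_a: str, row_b: str, opening: int = 2, extension: int = 1) -> int:
    """Affine scheme: each gap run costs opening + extension x length (assumed values)."""
    penalty = 0
    for row in (row_a, row_b):
        for run in re.findall(r"-+", row):
            penalty += opening + extension * len(run)
    return column_score(row_a, row_b) - penalty


# Hypothetical alignments of the same two sequences (TAAAT and TAT).
first = ("TAAAT", "TA--T")    # one gap of length 2
second = ("TAAAT", "T-A-T")   # two gaps of length 1

print(regular_score(*first), regular_score(*second))
# 2 2
print(affine_score(*first), affine_score(*second))
# 2 0
```

Both alignments score the same under the regular penalty, but the affine scheme prefers the one that groups the gap positions into a single run.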
