protein structure prediction in bioinformatics.ppt
DrSudha2
Sep 14, 2024
Protein structure prediction
Prediction in bioinformatics
Important prediction problems:
Protein sequence from genomic DNA.
Protein 3D structure from sequence.
Protein function from structure.
Protein function from sequence.
From DNA to Cell Function
DNA sequence (split into genes) → codes for → amino acid sequence → folds into → protein 3D structure → dictates → protein function → determines → cell activity
Example: the amino acid sequence
MNIFEMLRID EGLRLKIYKD TEGYYTIGIG
HLLTKSPSLN AAKSELDKAI GRNCNGVITK
DEAEKLFNQD VDAAVRGILR NAKLKPVYDS
LDAVRRCALI NMVFQMGETG VAGFTNSLRM
LQQKRWDEAA VNLAKSRWYN QTPNRAKRVI
TTFRTGTWDA YKNL
has which 3D structure?
Protein structure: Limitations
•Not all proteins or parts of proteins assume a well-defined
3D structure in solution.
•Protein structure is not static; different parts of the structure
undergo different degrees of thermal motion.
•There may be a number of slightly different
conformations in solution.
•Some proteins undergo conformational changes when
interacting with certain substances.
•Expected best residue-by-residue accuracies for secondary
structure prediction from multiple protein sequence
alignment.
•To address detailed functional biological questions.
Experimental Protein Structure Determination
•X-ray crystallography
–the most advanced method available for obtaining high-resolution
structural information about biological macromolecules
–in vitro
–needs crystals
–~$100-200K per structure
•NMR
–fairly accurate
–in solution, close to physiological conditions
–no need for crystals
–limited to relatively small proteins
•Cryo-electron-microscopy
–imaging technology
–historically low resolution; modern instruments can reach near-atomic resolution
Why predict protein structure?
•Millions of known sequences, but only 125,309 known structures.
•Structural knowledge brings understanding of function and
mechanism of action.
•Predicted structures can be used in structure-based drug design.
•It can help us understand the effects of mutations on structure and
function.
•It helps to analyze the sequence-structure gap.
•It can help in the prediction of function.
•It is a very interesting scientific problem that has seen some 50 years of effort.
•Prediction in one dimension
–Secondary structure prediction
–Surface accessibility prediction
• Historically, the first structure prediction methods predicted
secondary structure.
• Can be used to improve alignment accuracy.
• Can be used to detect domain boundaries within proteins
with remote sequence homology.
• Often the first step towards 3D structure prediction.
• Informative for mutagenesis studies.
Secondary structure prediction
Predicting Secondary Structure From Primary Structure
•Accuracy is 64-75%.
•Accuracy is higher for α-helices than for β-sheets.
•Accuracy depends on the protein family.
•Predictions for engineered (artificial) proteins are less accurate.
Assumptions
• The entire information for forming secondary structure is contained
in the primary sequence.
• Side groups of residues will determine structure.
• Examining windows of 13-17 residues is sufficient to predict secondary
structure.
–α-helices are 5–40 residues long
–β-strands are 5–10 residues long
Why Secondary Structure Prediction?
•Simply easier problem than 3D structure prediction.
•Accurate secondary structure prediction provides important
information for tertiary structure prediction.
•Improving alignment accuracy.
•Protein function prediction.
•Protein classification.
Protein structure prediction
•The inference of the three-dimensional structure of
a protein from its amino acid sequence.
–i.e. the prediction of its folding and its secondary and tertiary
structure from its primary structure.
•Structure prediction is fundamentally different from the
inverse problem of protein design.
•Protein structure prediction is one of the most important
goals pursued by bioinformatics and theoretical chemistry.
•It is highly important in medicine (in drug design)
and biotechnology (in the design of novel enzymes).
Methods of structure prediction
Ab initio protein folding approaches
Comparative (homology) modelling
Fold recognition/threading
History of protein secondary structure prediction
First generation
Based on single residue statistics.
Examples: Chou-Fasman method, LIM method, GOR I, etc.
Accuracy: low
Second generation
Based on segment statistics.
Examples: ALB method, GOR III, etc.
Accuracy: ~60%
Third generation
Based on long-range interaction, homology based
Examples: PHD
Accuracy: ~70%
First generation methods: single residue statistics
Chou & Fasman (1974 & 1978):
Some residues have particular secondary-structure preferences.
Based on experimental frequencies of residues in α-helices, β-sheets,
and coils.
Examples: Glu → α-helix
Val → β-strand
Accuracy: Q3 ~50-60%
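The Q3 measure quoted here is the percentage of residues assigned to the correct one of three states (helix, strand, coil). A minimal sketch of that calculation follows; the one-letter labels H, E, C and the two example strings are illustrative assumptions, not data from the slides.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Percentage of residues whose three-state label (H, E or C) matches."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical per-residue labels, made up for illustration only.
observed = "CHHHHHHCCEEEEC"
predicted = "CHHHHCCCCEEECC"
print(f"Q3 = {q3_score(predicted, observed):.1f}%")
```

Q3 treats all three states equally, so a method can score well overall while still predicting one state (typically β-strands) poorly.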
Chou-Fasman statistics
•R – amino acid, S – secondary structure
•f(R,S) – number of occurrences of R in S
•f(R) – total number of occurrences of R
•Ns – total number of amino acids in conformation S
•N – total number of amino acids
•P(R,S) – propensity of amino acid R to be in structure S
•P(R,S) = (f(R,S)/f(R))/(Ns/N)
Example
•#residues=20,000,
•#helix=4,000,
•#Ala=2,000,
•#Ala in helix=500
•f(Ala, α) = 500/20,000
•f(Ala) = 2,000/20,000
•p(α) = Nα/N = 4,000/20,000
•P(Ala, α) = (500/2,000) / (4,000/20,000) = 1.25
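The propensity calculation in this example can be checked with a few lines of Python. This is a minimal sketch of the formula P(R,S) = (f(R,S)/f(R))/(Ns/N) using the counts from the slide; the function and parameter names are illustrative assumptions.

```python
def propensity(count_r_in_s: int, count_r: int, count_s: int, count_total: int) -> float:
    """Chou-Fasman style propensity of residue R for structure S.

    count_r_in_s : occurrences of R in structure S, f(R,S)
    count_r      : occurrences of R overall, f(R)
    count_s      : residues in structure S, Ns
    count_total  : all residues, N
    """
    fraction_of_r_in_s = count_r_in_s / count_r      # how often R sits in S
    fraction_of_all_in_s = count_s / count_total     # how often any residue sits in S
    return fraction_of_r_in_s / fraction_of_all_in_s


# Counts from the example: 20,000 residues, 4,000 in helices,
# 2,000 alanines, 500 of them in helices.
p_ala_helix = propensity(500, 2000, 4000, 20000)
print(round(p_ala_helix, 2))  # 1.25
```

A value above 1 means the residue occurs in that structure more often than an average residue does, so alanine is counted as a helix former here.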
Second generation methods: segment statistics
•Similar to single-residue methods, but incorporating
additional information (adjacent residues, segmental
statistics).
•Problems:
–Low accuracy: Q3 below 66%.
–Q3 of β-strands (E): 28%-48%.
–Predicted structures were too short.
The GOR method
•Developed by Garnier, Osguthorpe & Robson
•Builds on Chou-Fasman Pij values
•Evaluates each residue plus the adjacent 8 N-terminal and 8
C-terminal residues
•Sliding window of 17 residues.
•Underpredicts β-strand regions
•GOR method accuracy: Q3 ~64%
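The 17-residue sliding window described above can be sketched as follows. This only shows how each residue is paired with its 8 neighbours on either side; it does not implement GOR scoring. Padding the chain ends with "-" is an assumption made for illustration, since the slides do not say how the termini are handled.

```python
from typing import Iterator, Tuple


def sliding_windows(sequence: str, flank: int = 8, pad: str = "-") -> Iterator[Tuple[int, str, str]]:
    """Yield (index, central residue, window) for every residue.

    Each window holds the central residue plus `flank` residues on each side,
    i.e. 2 * flank + 1 = 17 positions for the default flank of 8.
    """
    padded = pad * flank + sequence + pad * flank
    width = 2 * flank + 1
    for i, residue in enumerate(sequence):
        yield i, residue, padded[i:i + width]


# First 30 residues of the example sequence shown earlier in the deck.
sequence = "MNIFEMLRIDEGLRLKIYKDTEGYYTIGIG"
for index, residue, window in list(sliding_windows(sequence))[:3]:
    print(index, residue, window)
```

A window-based predictor would then score each window against its statistics and assign the resulting state to the central residue.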
Third generation methods
•Third generation methods reached 77% accuracy.
•They consist of two new ideas:
1. A biological idea –
Using evolutionary information based on
conservation analysis of multiple sequence
alignments.
2. A technological idea –
Using neural networks.
Artificial Neural Networks
An attempt to imitate the human brain (assuming that
this is the way it works).
Neural network models
-machine learning approach
-provide training sets of structures (e.g. α-helices, non-α-helices)
-computers are trained to recognize patterns in known
secondary structures
-provide test set (proteins with known structures)
-accuracy ~ 70 –75%
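Correlation coefficient
For the helix state (α), every residue falls into one of four classes:
•True positive: $p_\alpha$
•False positive (overpredicted): $o_\alpha$
•True negative: $n_\alpha$
•False negative (underpredicted): $u_\alpha$
$$C_\alpha = \frac{p_\alpha n_\alpha - u_\alpha o_\alpha}{\sqrt{(n_\alpha + u_\alpha)(n_\alpha + o_\alpha)(p_\alpha + u_\alpha)(p_\alpha + o_\alpha)}}$$
A perfect prediction gives $C_\alpha = 1$ (= 100%).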
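As a rough illustration, the correlation coefficient can be computed from the four counts p (true positives), n (true negatives), o (overpredicted) and u (underpredicted). The sketch below assumes the slide's coefficient is the standard Matthews correlation coefficient; the function name and the choice to return 0.0 when the denominator vanishes are assumptions.

```python
import math


def helix_correlation(p: int, n: int, o: int, u: int) -> float:
    """Matthews-style correlation coefficient for one state (e.g. helix).

    p: true positives, n: true negatives,
    o: false positives (overpredicted), u: false negatives (underpredicted).
    """
    denominator = math.sqrt((n + u) * (n + o) * (p + u) * (p + o))
    if denominator == 0:
        # Assumption: treat the undefined case as "no correlation".
        return 0.0
    return (p * n - u * o) / denominator


print(helix_correlation(p=5, n=5, o=0, u=0))  # 1.0, a perfect prediction
print(helix_correlation(p=3, n=4, o=1, u=2))
```

Unlike Q3, this coefficient penalizes both over- and underprediction of a single state, so it stays near 0 for a predictor that labels everything as helix.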
Reasons for improved accuracy
•Align sequence with other related proteins of the
same protein family.
•Find members that have a known structure.
•If there are significant matches between structure and sequence,
assign secondary structures to the corresponding
residues.
New and Improved Third-Generation Methods
Exploit evolutionary information. Based on conservation
analysis of multiple sequence alignments.
• PHD (Q3 ~ 70%)
Rost, B., Sander, C. (1993) J. Mol. Biol. 232, 584-599.
• PSIPRED (Q3 ~ 77%)
Jones, D. T. (1999) J. Mol. Biol. 292, 195-202.
Arguably remains the top secondary structure prediction method.
Protein 3D structure data
The structure of a protein consists of the 3D (X,Y,Z) coordinates of each
non-hydrogen atom of the protein.
Some protein structures also include coordinates of covalently linked
prosthetic groups, non-covalently linked ligand molecules, or metal ions.
For some purposes (e.g. structural alignment) only the Cα coordinates are
needed.
Example of PDB format (the last five columns are X, Y, Z, occupancy, temperature factor):
ATOM 18 N GLY 27 40.315 161.004 11.211 1.00 10.11
ATOM 19 CA GLY 27 39.049 160.737 10.462 1.00 14.18
ATOM 20 C GLY 27 38.729 159.239 10.784 1.00 20.75
ATOM 21 O GLY 27 39.507 158.484 11.404 1.00 21.88
Note: the ATOM records shown here carry no explicit information about
connectivity between atoms. The last two numbers (occupancy, temperature
factor) relate to disorder of atomic positions in crystals.
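The simplified ATOM lines in the example can be read with a short parser. This is a minimal sketch that splits on whitespace, which works for the ten-field lines shown on the slide; real PDB files use fixed column positions and carry extra fields (such as a chain identifier), so a whitespace split is an assumption that does not hold in general. The `Atom` type and `parse_atom_line` function are hypothetical names.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    residue_number: int
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float


def parse_atom_line(line: str) -> Atom:
    """Parse one simplified ATOM line of the form shown on the slide."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "ATOM":
        raise ValueError(f"not a simplified ATOM line: {line!r}")
    return Atom(
        serial=int(fields[1]),
        name=fields[2],
        residue=fields[3],
        residue_number=int(fields[4]),
        x=float(fields[5]),
        y=float(fields[6]),
        z=float(fields[7]),
        occupancy=float(fields[8]),
        temp_factor=float(fields[9]),
    )


atom = parse_atom_line("ATOM 19 CA GLY 27 39.049 160.737 10.462 1.00 14.18")
print(atom.name, atom.residue, (atom.x, atom.y, atom.z))
```

Keeping only the atoms whose name is CA gives the Cα trace mentioned above, which is enough for tasks such as structural alignment.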
Protein structure: Some computational tasks
•Building a protein structure model from X-ray data
•Building a protein structure model from NMR data
•Computing the energy for a given protein structure (conformation)
•Energy minimization: finding the structure with the minimal energy according
to some empirical “force field”
•Simulating the protein folding process (molecular dynamics)
•Structure visualization
•Computing secondary structure from atomic coordinates
•Protein superposition, structural alignment
•Protein fold classification
•Threading: finding a fold (prototype structure) that fits a sequence
•Docking: fitting ligands onto a protein surface by molecular dynamics or energy
minimization
•Protein 3D structure prediction from sequence
Viewing protein structures
When looking at a protein structure, we may ask the following types of
questions:
•Is a particular residue on the inside or outside of a protein?
•Which amino acids interact with each other?
•Which amino acids are in contact with a ligand (DNA, peptide
hormone, small molecule, etc.)?
•Is an observed mutation likely to disturb the protein structure?
Standard capabilities of protein structure software:
•Display of protein structures in different ways (wireframe, backbone,
sticks, spacefill, ribbon)
•Highlighting of individual atoms, residues or groups of residues
•Calculation of interatomic distances
•Advanced feature: Superposition of related structures
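The interatomic distance calculation listed above is the Euclidean distance between two coordinate triples. A minimal sketch, using the N and CA coordinates of GLY 27 from the PDB example earlier in the deck (coordinates are in ångströms); the function name is an illustrative assumption.

```python
import math
from typing import Tuple

Coordinate = Tuple[float, float, float]


def interatomic_distance(a: Coordinate, b: Coordinate) -> float:
    """Euclidean distance between two atoms given as (x, y, z) tuples."""
    return math.dist(a, b)  # available in Python 3.8 and later


# N and CA atoms of GLY 27 from the example ATOM records.
n_atom = (40.315, 161.004, 11.211)
ca_atom = (39.049, 160.737, 10.462)
print(f"N-CA distance: {interatomic_distance(n_atom, ca_atom):.3f} Å")
```

The same function answers questions such as which residues are in contact with a ligand: compute all pairwise distances and keep the pairs below a chosen cutoff.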