Quantitative Structure-Activity Relationship
Elvis A. F. Martis, Graduate Student (Ph.D.)
Department of Pharmaceutical Chemistry, Bombay College of Pharmacy
Research in Prof. Coutinho's Lab
Developing new QSAR methodologies: CoRIA and its variants, HomoSAR, LISA, eCoRIA and eQSAR, CoOAN
Solving protein structures (using NMR)
Computational prediction of resistance and QMAR
Lead optimization strategies for anti-TB, dengue, AD, etc.
Studies on reaction pathways and transition states using ab initio and quantum mechanics
Molecular dynamics of drug-cyclodextrin complexes
Molecular Modeling in Drug Design
What is QSAR?
Compounds + biological activity → QSAR → new compounds with improved biological activity
Why QSAR?
The number of compounds that would have to be synthesized in order to place 10 different groups in 4 positions of a benzene ring is 10^4 (10,000).
Solution: synthesize a small number of compounds and from their data derive rules to predict the biological activity of other compounds.
QSAR dates back to the 19th century
A.F.A. Cros (University of Strasbourg; 1863): the toxicity of alcohols increased as their water solubility decreased
H. H. Meyer (University of Marburg; 1890s) and Charles Ernest Overton (University of Zurich; 1890s), working independently: the toxicity of organic compounds depended on their lipophilicity
Crum-Brown and Fraser: the physiological action of a substance was a function of its chemical composition and constitution
Richet: an inverse relationship between the cytotoxicities of a diverse set of simple organic molecules and their water solubilities
Hammett: the "sigma-rho" culture, to understand the effect of substituents on organic reactions
Taft: devised a way to separate polar, steric, and resonance effects and introduced the first steric parameter, Es
The contributions of Hammett and Taft together laid the mechanistic basis for the development of the QSAR paradigm by Hansch and Fujita
Hammett Equation: Linear Free Energy Relationships
Louis Hammett (1894-1987) correlated the electronic properties of organic acids and bases with their equilibrium constants and reactivity.
The equation measures the electron-withdrawing or electron-donating effect of a substituent in comparison to benzoic acid, that is, how the substituent affects its ionization.
Consider the dissociation of benzoic acid:
Hammett Equation
m-NO2 increases the dissociation constant (the nitro group is an EWG that stabilizes the negative charge).
p-NO2 exhibits a greater electron-withdrawing effect.
A p-C2H5 group on benzoic acid, being electron-donating, has the opposite effect and lowers the dissociation constant.
Hammett Equation
Hammett observed similar substituent effects on the dissociation of other organic acids and bases, such as phenylacetic acid.
A linear free-energy relationship is said to exist if "the same series of changes in conditions affects the rate or equilibrium of a second reaction in exactly the same way as the first".
The free energy change is proportional to the logarithm of the equilibrium constant.
[Figure: graph for a linear free energy relationship]
Because the relationship is linear, it can be written as an equation in which ρ is the slope of the line; the abscissa values are always those for benzoic acid and are given the symbol σ (substituent constant). The equation simplifies to:
log(K_X / K_H) = ρσ
where K_X and K_H are the equilibrium constants of the substituted and unsubstituted compounds.
ρ (reaction constant) relates the effect of substituents on that equilibrium to the effect of those substituents on the benzoic acid equilibrium.
The reaction constant depends on the nature of the chemical reaction as well as the reaction conditions (solvent, temperature, etc.).
The sign and magnitude of the reaction constant are indicative of the extent of charge build-up during the reaction progress.
Reactions with ρ > 0 are favored by electron-withdrawing groups (i.e., the stabilization of negative charge).
Reactions with ρ < 0 are favored by electron-donating groups (i.e., the stabilization of positive charge).
For benzoic acid, ρ is equal to 1.00 in pure water at 25 °C.
σ is a descriptor of the substituents; the magnitude of σ gives the relative strength of the electron-withdrawing or -donating properties of the substituents.
σ is positive if the substituent is electron-withdrawing and negative if the substituent is electron-donating.
The relationships as developed by Hammett are termed linear free energy relationships.
Hammett Constant
By definition, σ for hydrogen is zero.
A positive σ for the NO2 group indicates an electron-withdrawing effect: m-NO2 acts through the inductive effect, while p-NO2 acts through inductive and resonance effects.
Electronegative chlorine produces an inductive electron-withdrawing effect, the magnitude of the effect in the p-Cl position being less than in the m-Cl position; only the inductive effect is possible with chlorine.
The CH3O- group can be electron-donating or -withdrawing, depending on the position of substitution.
For m-CH3O, an inductive electron-withdrawing effect is seen.
For p-CH3O, only a small inductive effect is expected; an electron-donating resonance effect occurs, giving an overall electron-donating effect.
Applications of the Hammett Equation
The prediction of the pKa of ionization equilibria.
Since pKa = -log Ka, for benzoic acids (ρ = 1.00) the equation is:
pKa(substituted) = pKa(benzoic acid) - Σσ
Consider a substituted benzoic acid: given σmeta = 0.71 for NO2 and σpara = -0.13 for CH3 groups, the calculated pKa = 2.91, compared to the experimental value of 2.97.
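As a hedged check of the arithmetic on this slide, the sketch below applies the Hammett relation in its pKa form, pKa(substituted) = pKa(parent) - ρ·Σσ. The quoted 2.91 is reproduced only under two assumptions that are not stated in the text: a pKa of 4.20 for unsubstituted benzoic acid, and a substitution pattern of two meta-NO2 groups plus one para-CH3 group. The function name `hammett_pka` is illustrative.

```python
def hammett_pka(pka_parent, sigmas, rho=1.00):
    """Predict a pKa from the Hammett relation log(K_X / K_H) = rho * sum(sigma)."""
    return pka_parent - rho * sum(sigmas)


# Substituent constants quoted on the slide.
SIGMA_META_NO2 = 0.71
SIGMA_PARA_CH3 = -0.13

# Assumptions (not given in the text): pKa of benzoic acid = 4.20, and the
# acid carries two meta-NO2 groups and one para-CH3 group.
PKA_BENZOIC_ACID = 4.20
sigmas = [SIGMA_META_NO2, SIGMA_META_NO2, SIGMA_PARA_CH3]

pka = hammett_pka(PKA_BENZOIC_ACID, sigmas)
print(f"calculated pKa = {pka:.2f}")  # to be compared with the experimental 2.97
```

With a single meta-NO2 group the same function gives a noticeably higher pKa, which is why the substitution pattern is flagged as an assumption.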
Applications of the Hammett Equation
Hammett's electronic descriptors can be applied in a QSAR relating the inhibition of bacterial growth by a series of sulfonamides, where X represents various substituents.
A QSAR was developed based on the σ values of the substituents, where C is the minimum concentration of compound that inhibited growth of E. coli.
The electron-withdrawing substituents favor inhibition of growth.
Hansch's Approach
Log P is a measure of the drug's hydrophobicity, which was selected as a measure of its ability to pass through cell membranes.
The log P (or log Po/w) value reflects the relative solubility of the drug in octanol (representing the lipid bilayer of a cell membrane) and water (the fluid within the cell and in blood).
Log P values may be measured experimentally or, more commonly, calculated.
Hansch’s Approach
The Hammett substituent constant (σ) reflects the drug molecule's intrinsic reactivity, related to electronic factors caused by aryl substituents.
In chemical reactions, aromatic ring substituents can alter the rate of reaction by up to 6 orders of magnitude!
For example, the rate of the reaction below is ~10^5 times slower when X = NO2 than when X = CH3.
Free-Wilson Analysis
log 1/C = Σ a_i + μ
where log 1/C = predicted activity, a_i = contribution per group, and μ = activity of the reference compound.
log 1/C = -0.30 [m-F] + 0.21 [m-Cl] + 0.43 [m-Br] + 0.58 [m-I] + 0.45 [m-Me] + 0.34 [p-F] + 0.77 [p-Cl] + 1.02 [p-Br] + 1.43 [p-I] + 1.26 [p-Me] + 7.82
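The Free-Wilson equation on this slide can be evaluated as a simple lookup and sum. The sketch below is illustrative only: the coefficients are copied from the equation, while the function name and the convention that an unlisted (hydrogen) position contributes nothing are assumptions.

```python
# Group contributions a_i and reference activity mu, copied from the equation.
CONTRIBUTIONS = {
    "m-F": -0.30, "m-Cl": 0.21, "m-Br": 0.43, "m-I": 0.58, "m-Me": 0.45,
    "p-F": 0.34, "p-Cl": 0.77, "p-Br": 1.02, "p-I": 1.43, "p-Me": 1.26,
}
MU = 7.82


def free_wilson_log_inv_c(substituents):
    """Return the predicted log 1/C for a list of substituents such as ['m-Cl', 'p-Br'].

    Assumption: positions left unsubstituted (H) add no contribution.
    """
    return MU + sum(CONTRIBUTIONS[s] for s in substituents)


activity = free_wilson_log_inv_c(["m-Cl", "p-Br"])
print(f"log 1/C (m-Cl, p-Br) = {activity:.2f}")
```

Each indicator variable in the equation is either 0 or 1, so the prediction reduces to adding the contributions of the groups that are actually present.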
Topliss Scheme
Used to decide which substituents to use if optimising compounds one by one (where synthesis is complex and slow).
Example: aromatic substituents
Rationale: replace H with para-Cl (+π and +σ), then compare the activity (Act.) with that of the parent compound.
Act. increases (+π and/or +σ advantageous): add a second Cl to increase π and σ further.
Act. shows little change (favourable π, unfavourable σ): replace with Me (+π and -σ).
Act. decreases (+π and/or +σ disadvantageous): replace with OMe (-π and -σ).
Further changes are suggested based on arguments of π, σ and steric strain.
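The first decision in the Topliss scheme for aromatic substituents can be sketched as a small lookup table. This is a minimal illustration of the branch that follows the para-Cl analogue; the outcome labels and the function name are hypothetical, and later branches of the tree are not covered.

```python
# First branch of the Topliss tree for an aromatic ring: the para-Cl analogue
# has been made and its activity compared with the unsubstituted parent (H).
NEXT_ANALOGUE = {
    "more active": ("add second Cl", "+pi and/or +sigma advantageous"),
    "little change": ("Me", "favourable pi, unfavourable sigma: try +pi, -sigma"),
    "less active": ("OMe", "+pi and/or +sigma disadvantageous: try -pi, -sigma"),
}


def topliss_next_step(outcome):
    """Return (next substituent change, rationale) for the observed outcome."""
    return NEXT_ANALOGUE[outcome]


for outcome in NEXT_ANALOGUE:
    change, rationale = topliss_next_step(outcome)
    print(f"{outcome}: {change} ({rationale})")
```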
Chemometrics in QSAR
Contents
Basics of regression analysis: linear and multiple linear regression; introduction to PCA and PCR, PLS, ANN and GFA.
Validation of QSAR models: correlation coefficients (r² and r²pred), F-test, standard error, cross-validation by calculation of q², bootstrap analysis and randomization.
Applicability domain for predictions using a QSAR model.
Design of training and test sets using factorial design.
Linear and Multiple Linear Regression
[Figure: linear data versus non-linear data. Image courtesy: CAMO Software AS]
Data structure
[Figure: a Y-variable column and an X-variable column, with the same number of objects in the x and y columns]
Simple linear regression
y = b0 + b1·x + e, where b0 is the intercept, b1 is the slope and e is the error.
Least squares (LS) is used for estimation of the regression coefficients.
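The least-squares estimates for simple linear regression have a closed form, sketched below in plain Python. The data values are made up for illustration and the function name is hypothetical.

```python
def fit_simple_linear(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1*x + e."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Illustrative data only (not taken from the slides).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]

b0, b1 = fit_simple_linear(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print("residuals:", [round(e, 3) for e in residuals])
```

A useful sanity check is that the residuals of a least-squares fit with an intercept sum to zero, up to rounding error.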
What does regression analysis do?
[Figure: workflow. Data (X, Y), after pre-processing and a check for outliers, go into regression analysis to give a model; the model is used for interpretation and, with future X, for prediction]
Linear and Multiple Linear Regression
When to use: when the number of observations is greater than the number of variables.
Not used in current QSAR formalisms.
Limitations: inaccurate when inter-correlated variables are present; cannot be applied when the number of variables exceeds the number of observations.
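Principal Component Analysis (PCA)
PCA addresses both limitations of linear regression (inter-correlated variables, and more variables than observations) by data compression.
Basic Principle of Principal Components
Variable matrix (X) = score matrix (T) × loading matrix (P, transposed) + error or residual (E)
Regression by data compression
PCA compresses the data (x1, x2, x3) onto PC1, giving a t-score t_i for each object.
Regression on scores: y is then regressed on the t-scores, with regression coefficient q.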
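The decomposition into scores and loadings, followed by regression on the scores, can be sketched with NumPy's singular value decomposition. This is a minimal illustration under stated assumptions: the data are invented, the descriptors are mean-centered but not scaled, and the helper names `pca`, `pcr_fit` and `pcr_predict` are hypothetical.

```python
import numpy as np


def pca(X, n_components):
    """Return (column means, scores T, loadings P, explained-variance fractions)."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # mean-center each descriptor
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]      # score matrix
    P = Vt[:n_components].T                         # loading matrix
    explained = s**2 / np.sum(s**2)
    return mean, T, P, explained[:n_components]


def pcr_fit(X, y, n_components):
    """Principal component regression: regress centered y on the first scores."""
    mean, T, P, _ = pca(X, n_components)
    y_mean = y.mean()
    q, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    return mean, P, q, y_mean


def pcr_predict(model, X_new):
    mean, P, q, y_mean = model
    return (X_new - mean) @ P @ q + y_mean


# Illustrative data: x2 is almost collinear with x1, which troubles plain MLR.
X = np.array([[1.0, 2.1, 0.5],
              [2.0, 3.9, 1.5],
              [3.0, 6.2, 1.0],
              [4.0, 7.8, 2.5],
              [5.0, 10.1, 2.0]])
y = np.array([1.2, 2.3, 2.9, 4.4, 4.8])

mean, T_full, P_full, explained = pca(X, 3)
print("explained variance per PC:", np.round(explained, 3))

model = pcr_fit(X, y, n_components=1)
y_hat = pcr_predict(model, X)
print("PCR fitted values (1 PC):", np.round(y_hat, 2))
```

With all components retained the error matrix vanishes and T P' reproduces the centered data exactly; dropping components moves the discarded variance into E.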
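More than one Principal Component
[Figure: PC1 and PC2, with percentage labels 75%, 15%, 15% and 100%]
Comparison of MLR, PCA and PLS
[Figure: three panels labelled MLR, PCR and PLS. In MLR, y is related directly to x1, x2, x3 and x4; in PCR and PLS, x1 to x4 are first compressed into the scores t1 and t2, which are then related to y]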
Genetic Function Approximation (GFA) and Genetic/Partial Least Squares (G/PLS)
Artificial Neural Networks (ANN)
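Backpropagation Networks
Attributed to Rumelhart and McClelland, mid-1980s.
To bypass the linear classification problem, we can construct multilayer networks.
Typically we have fully connected, feedforward networks.
[Figure: a network with an input layer (I1, I2, I3), one hidden layer (H1, H2) and an output layer (O1, O2); weights W i,j connect the input layer to the hidden layer and weights W j,k connect the hidden layer to the output layer; the nodes fixed at 1 supply the bias]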
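A fully connected feedforward network of the kind drawn on this slide (three inputs, one hidden layer of two units, two outputs, bias nodes fixed at 1) can be trained by backpropagation in a few lines of NumPy. The sketch below is illustrative only: the sigmoid activation, squared-error loss, learning rate and training data are all assumptions, and the names `W_ij` and `W_jk` simply mirror the weight labels in the figure.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def add_bias(A):
    """Append a column of 1's (the bias node) to a layer's activations."""
    return np.hstack([A, np.ones((A.shape[0], 1))])


def forward(X, W_ij, W_jk):
    H = sigmoid(add_bias(X) @ W_ij)   # hidden-layer activations
    O = sigmoid(add_bias(H) @ W_jk)   # output-layer activations
    return H, O


def train(X, Y, n_hidden=2, lr=0.5, epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    W_ij = rng.normal(scale=0.5, size=(X.shape[1] + 1, n_hidden))
    W_jk = rng.normal(scale=0.5, size=(n_hidden + 1, Y.shape[1]))
    losses = []
    for _ in range(epochs):
        H, O = forward(X, W_ij, W_jk)
        err = O - Y
        losses.append(float(np.mean(err**2)))
        # Backpropagate: output deltas first, then hidden deltas (bias row excluded).
        delta_out = err * O * (1 - O)
        delta_hid = (delta_out @ W_jk[:-1].T) * H * (1 - H)
        # Gradient-descent updates; constant factors are absorbed into lr.
        W_jk -= lr * add_bias(H).T @ delta_out / len(X)
        W_ij -= lr * add_bias(X).T @ delta_hid / len(X)
    return W_ij, W_jk, losses


# Illustrative data: the two outputs encode whether the first input is 0 or 1.
X = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
Y = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])

W_ij, W_jk, losses = train(X, Y)
_, outputs = forward(X, W_ij, W_jk)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The hidden deltas are computed before the output weights are updated, so both updates use the weights from the same forward pass.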
Validation of QSAR Models
Internal validation:
Pearson's correlation coefficient, r, and the coefficient of determination, r²
Cross-validation (CV): leave-one-out, leave-few-out
Bootstrapping
Randomization or y-scrambling
Fisher statistic (F value): full, sequential
External validation:
Predictive correlation coefficient (r²pred)
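Three of the statistics listed on this slide can be sketched for a one-descriptor linear model. The definitions used here are common conventions and are stated as assumptions: q² = 1 - PRESS/SS with SS taken about the mean of all training activities, and r²pred = 1 - PRESS(test)/SS(test about the training mean). The data and function names are illustrative.

```python
def fit_line(x, y):
    """Least-squares (b0, b1) for y = b0 + b1*x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def r_squared(x, y):
    b0, b1 = fit_line(x, y)
    y_mean = sum(y) / len(y)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - rss / tss


def q_squared_loo(x, y):
    """Leave-one-out cross-validated q2: refit without each compound in turn."""
    y_mean = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    tss = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - press / tss


def r_squared_pred(x_train, y_train, x_test, y_test):
    """External predictive r2, referenced to the mean of the training activities."""
    b0, b1 = fit_line(x_train, y_train)
    y_train_mean = sum(y_train) / len(y_train)
    press = sum((yt - (b0 + b1 * xt)) ** 2 for xt, yt in zip(x_test, y_test))
    ss = sum((yt - y_train_mean) ** 2 for yt in y_test)
    return 1 - press / ss


# Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r2 = r_squared(x, y)
q2 = q_squared_loo(x, y)
r2_pred = r_squared_pred(x, y, [6.0, 7.0], [12.2, 13.9])
print(f"r2 = {r2:.3f}, q2 = {q2:.3f}, r2_pred = {r2_pred:.3f}")
```

Because each left-out compound is predicted by a model that never saw it, q² cannot exceed r² for a least-squares fit, and a large gap between the two is a warning sign of overfitting.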
Practical Considerations for QSAR modeling How to Begin? What to do? What to Expect? How to Conclude?
Selection of training and test set using factorial designs
In factorial designs the investigated factors are varied at fixed levels.
Each factor (chemical feature or descriptor) is investigated at levels based on the type of factorial experiment.
A full factorial design for K chemical features/descriptors at n levels gives n^K compounds; at two levels this is 2^K.
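A full factorial design is simply the Cartesian product of the level settings, which the sketch below enumerates with `itertools.product`. The factor names (π, Es, MR) and the coded levels -1 and +1 are illustrative assumptions; mapping each design point to a real substituent with matching descriptor values is a separate step that is not shown.

```python
from itertools import product


def full_factorial(factors, levels=(-1, +1)):
    """Return every combination of levels for the named factors."""
    return [dict(zip(factors, combo))
            for combo in product(levels, repeat=len(factors))]


# Two levels (low = -1, high = +1) for three descriptors: 2^3 = 8 design points.
design = full_factorial(["pi", "Es", "MR"])
for run in design:
    print(run)
print(len(design), "design points")
```

Changing `levels` to three values per factor gives 3^K points, which shows how quickly the number of compounds grows with the number of levels.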
Experiments in a design with three variables (2^3 factorial design)

Group        π       Es       MR
H            0.00    0.00     1.03
CH3          0.56   -1.24     5.65
C2H5         1.02   -1.31    10.30
n-C3H7       1.55   -1.60    14.96
i-C3H7       1.53   -1.71    14.96
n-C4H9       2.13   -1.63    19.61
t-C4H9       1.98   -2.78    19.62
H2C=CH**     0.82            10.99
C6H5**       1.96   -3.82    25.36
CH2Cl        0.17   -1.48    10.49
CF3          0.88   -2.40     5.02
CN          -0.57   -0.51     6.33
F            0.14   -0.46     0.92
Cl           0.71   -0.97     6.03
Br           0.86   -1.16     8.88
I            1.12   -1.40    13.94
OH          -0.67   -0.55     2.85
OCH3        -0.02   -0.55     7.87
OCH2CH3      0.38            12.47
SH           0.39   -1.07     9.22
SCH3         0.61   -1.07    13.82
NO2**       -0.28   -2.52     7.36
Applicability Domain in QSAR
OECD definition: the applicability domain (AD) of a QSAR model is the physico-chemical, structural or biological space, knowledge or information on which the training set of the model has been developed, and for which it is applicable to make predictions for new compounds.
A new European legislation on chemicals, REACH (Registration, Evaluation, Authorisation and Restriction of Chemicals), came into force in 2007.
Purpose: reliable application of (Q)SAR models; interpolation is better than extrapolation.
What are the key aspects in defining the AD of QSAR models?
Identification of the subspace of chemical structures.
The defined AD determines the degree of generalization of a given predictive model.
A well-defined AD indicates whether the endpoint for the chemical structures under evaluation can be reliably predicted.
Characterization of the interpolation space is very significant in defining the AD for a given QSAR model.
How can the AD of a model be defined?
Range-based methods: bounding box or convex hull; PCA bounding box
Distance-based methods
Geometric methods
Probability density distribution-based methods
[Figure: bounding box or convex hull drawn around the training set, showing a dense region and an empty region]
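The simplest range-based check, the bounding box, can be sketched as follows. A query compound is treated as inside the domain when every descriptor lies within the minimum and maximum seen in the training set. The descriptor values and function names are illustrative assumptions.

```python
def bounding_box(training):
    """Per-descriptor (min, max) ranges over the training compounds."""
    return [(min(col), max(col)) for col in zip(*training)]


def in_domain(box, compound):
    """True if every descriptor of the compound lies inside the training range."""
    return all(lo <= value <= hi for (lo, hi), value in zip(box, compound))


# Illustrative training set: two descriptors per compound (for example pi and MR).
training = [
    (0.56, 5.65),
    (0.71, 6.03),
    (0.86, 8.88),
    (1.12, 13.94),
]
box = bounding_box(training)

inside = in_domain(box, (0.80, 7.00))    # both descriptors within range
outside = in_domain(box, (0.88, 5.02))   # second descriptor below the training minimum
print("box:", box)
print("inside:", inside, "| outside:", outside)
```

A bounding box cannot see empty regions inside its own limits, so a compound flagged as inside may still sit far from any training compound; distance-based or density-based methods are the usual way to probe that.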
Concluding remark
Is it correct to say:
"the prediction result is always reliable for a point within the application region"?
"the prediction is always unreliable if the point is outside the application region"?
NO!