"Optimizing Drug Discovery (ADMET) using Machine Learning" involves leveraging advanced algorithms to enhance the drug development process. By analyzing Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) data with ML models, researchers can predict a drug candidate's...
"Optimizing Drug Discovery (ADMET) using Machine Learning" involves leveraging advanced algorithms to enhance the drug development process. By analyzing Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) data with ML models, researchers can predict a drug candidate's properties, safety, and efficacy. This approach accelerates the identification of potential drugs, reduces costs, and minimizes the likelihood of late-stage failures. Machine learning aids in the selection of promising compounds, ultimately improving the efficiency and success of drug discovery, benefiting both pharmaceutical companies and patients by delivering safer and more effective medications.
Size: 1.03 MB
Language: en
Added: Nov 03, 2023
Slides: 18 pages
Slide Content
Optimizing Drug Discovery using ADMET
Translating Data into Actionable Insights and Decisions using ML
Santu Chall, ME, MCA
Molecular Representation: 0D/1D, 2D, 3D, 4D
SMILES: SMILES (Simplified Molecular Input Line Entry System) is a concise notation for representing chemical structures in a line of text. For example: OC(=O)CN1C(=O)Cc2c1cccc2 (molecular formula C10H9NO3).
Descriptors: Molecular descriptors are quantitative values that characterize chemical structures, aiding structure-property relationship studies and computational chemistry analysis. For example: MW, HBA, HBD, no_of_atom, etc.
Software: Various software packages can calculate and analyze chemical properties, such as RDKit, ChemAxon, Dragon, PaDEL, MOE, etc.
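As a small illustration of a 0D descriptor, the sketch below computes the molecular weight (MW) directly from the formula C10H9NO3 shown on this slide. It is a minimal standard-library example, not the deck's own code: the helper names `parse_formula` and `molecular_weight` are hypothetical, the atomic-weight table covers only the four elements needed here, and in practice a toolkit such as RDKit would derive descriptors from the SMILES string itself.

```python
import re

# Standard atomic weights (g/mol); only the elements used in this example.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C10H9NO3'."""
    counts = {}
    # Each match is an element symbol followed by an optional count.
    for element, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(digits) if digits else 1)
    return counts

def molecular_weight(formula):
    """Sum atomic weights over all atoms in the formula."""
    return sum(ATOMIC_WEIGHT[el] * n for el, n in parse_formula(formula).items())

counts = parse_formula("C10H9NO3")
mw = molecular_weight("C10H9NO3")
print(counts, round(mw, 3))
```

The parser handles only flat formulas without brackets or charges, which is enough for the example molecule.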
Molecular Fingerprint
Binary representation of a molecule for fast, objective, and compact comparison. Each bit of a "keyed" fingerprint indicates the presence or absence of a structural feature.
Tasks: search and comparison, prediction, and clustering
Types of fingerprint
Selecting the right fingerprint
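Search and comparison on binary fingerprints is commonly done with the Tanimoto (Jaccard) coefficient: the number of bits set in both molecules divided by the number set in either. The sketch below assumes each fingerprint is given as a set of "on" bit positions; the bit positions and the helper names `tanimoto` and `to_bit_vector` are made up for illustration and do not come from a real fingerprint scheme.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as 0
    return len(a & b) / len(union)

def to_bit_vector(on_bits, n_bits):
    """Expand a set of on-bit positions into an explicit 0/1 list."""
    return [1 if i in on_bits else 0 for i in range(n_bits)]

# Hypothetical fingerprints: positions of the structural keys that are present.
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 7, 12, 20}

similarity = tanimoto(fp_a, fp_b)  # 3 shared bits / 6 bits in the union
print(similarity, to_bit_vector({0, 2}, 4))
```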
ADMET: Absorption, Distribution, Metabolism, Excretion/Elimination, Toxicity
Data Selection
Online Databases: ChEMBL, PubChem, ChemDB, ChemSpider, DrugBank, etc.
Reputed Scientific Journals: Journal of Chemical Information and Modeling, Journal of Cheminformatics, Journal of Computer-Aided Molecular Design, etc.
Data Retrieval from the Literature: PubMed, ScienceDirect, Google Scholar, ACS Publications, open-access journals, etc.
Data Division
Random Division: train_test_split(X, y, test_size=0.3, random_state=42)
Kennard-Stone Division: Start with the two data points that are farthest apart in the feature space, then repeatedly add the point that is farthest from the points already selected.
Activity Based Division: Split on the activity or property being modeled so that each set represents the full range of activity levels in the dataset.
Euclidean Distance Based: Compute the Euclidean distance between all pairs of data points in a multidimensional space. euclidean_distances = np.linalg.norm(X[:, np.newaxis] - X, axis=2)
K-Medoids Based: Clustering algorithm that divides the data into groups. clusterer = KMedoids(n_clusters=K, random_state=0)
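The Kennard-Stone division can be sketched in a few lines. This is a minimal standard-library version under stated assumptions: Euclidean distance, at least two training points requested, and a toy dataset invented for the example. The function name `kennard_stone` is hypothetical rather than taken from a library.

```python
import math

def kennard_stone(X, n_train):
    """Return (train_indices, test_indices) chosen by the Kennard-Stone rule."""
    n = len(X)
    dist = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]

    # Step 1: seed the training set with the two most distant points.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    i0, j0 = max(pairs, key=lambda p: dist[p[0]][p[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]

    # Step 2: repeatedly add the point whose nearest selected neighbour
    # is farthest away (max-min distance criterion).
    while len(selected) < n_train and remaining:
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining

# Toy feature matrix (values made up for illustration).
X = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0], [5.0, 0.0]]
train_idx, test_idx = kennard_stone(X, n_train=3)
print(train_idx, test_idx)
```

Because the training set is spread across the feature space, the held-out points tend to lie inside the region the model has seen, which makes the test error an optimistic estimate compared with a random split.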
Feature Selection
Genetic Algorithm: GA-based feature selection chooses a subset of the most relevant features (variables) from the original feature set to improve model performance and reduce computational complexity. ga = GeneticAlgorithm(num_features=X.shape[1], fitness_func=fitness_function)
Lasso Feature Selection: Lasso (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty term to the linear regression or logistic regression cost function. The penalty encourages the model to set the coefficients of some features to zero, effectively removing them from the model. lasso = sklearn.linear_model.Lasso(alpha=1.0)
Stepwise Selection: select the most relevant features step by step (Forward Selection, Backward Elimination, Bidirectional Selection, Stopping Criteria). rfe = sklearn.feature_selection.RFE(LogisticRegression(), n_features_to_select=10)  # recursive feature elimination, a backward-style method; select the top 10 features
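To show why the Lasso penalty removes features, here is a minimal coordinate-descent sketch in plain Python. It is not scikit-learn's implementation. It assumes centered data with no intercept and no all-zero feature column, and it uses the objective (1/(2n))·||y − Xw||² + alpha·||w||₁; the function names and the toy data are invented for the example.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Fit Lasso weights by cycling over one coefficient at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n)
    return w

# Toy data: y depends strongly on feature 0 and only weakly on feature 1.
X = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
y = [-3.2, -2.8, 2.8, 3.2]  # y = 3.0 * x0 + 0.2 * x1

w = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, coef in enumerate(w) if coef != 0.0]
print(w, selected)
```

The weak feature's coefficient is set exactly to zero because its correlation with the residual (0.2) is below the penalty (0.5), while the strong feature survives with a shrunken coefficient.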
Learning Algorithm
Supervised
Regression: build predictive models for tasks where the goal is to predict a continuous numeric value. Examples: Random Forest Regression (RF), Support Vector Regression (SVR), Decision Tree Regression, K-Nearest Neighbors Regression, Neural Networks for Regression, etc.
Classification: build models that categorize data into predefined classes or categories. Examples: Logistic Regression, Decision Trees, Support Vector Machines (SVM), K-Nearest Neighbors (KNN)
Unsupervised
Clustering: used to group data into clusters based on inherent patterns or similarities in the data. Examples: K-Means Clustering, X-Means, Gaussian Mixture Models (GMM)
Dimensionality Reduction: used to reduce the number of features or dimensions in a dataset while preserving important information and patterns. Examples: Principal Component Analysis (PCA), Independent Component Analysis (ICA), Autoencoders
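As one concrete instance of supervised regression, the sketch below implements K-Nearest Neighbors regression with the standard library only. It predicts a property value as the unweighted mean of the k closest training compounds in descriptor space. The function name `knn_regress`, the one-dimensional descriptor, and the numbers are all made up for illustration.

```python
import math

def knn_regress(X_train, y_train, query, k=3):
    """Predict a continuous value as the mean target of the k nearest neighbours."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], query))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / len(nearest)

# Toy descriptor matrix and property values (illustrative only).
X_train = [[1.0], [2.0], [3.0], [10.0]]
y_train = [1.0, 2.0, 3.0, 10.0]

prediction = knn_regress(X_train, y_train, query=[2.5], k=2)
print(prediction)
```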
Absorption
Property | Definition | Used Model and Method
%Abs | Percentage absorbed through the (intestinal) barrier | RF and MACCS keys
%HIA | Percentage absorbed through the human GI tract | RF and MACCS keys
Caco2 | Permeability across a Caco-2 cell monolayer (a human colon carcinoma cell line), used to predict absorption including paracellular and active transport | RF and Descriptor
Pgp | Inhibition of P-glycoprotein (P-gp) function, which can enhance drug absorption | SVM and ECFP4
Amount absorbed | Amount of compound absorbed per kilogram of body weight | RF and Descriptor
Distribution
Property | Definition | Used Model and Method
BBB partitioning | Blood-brain barrier partitioning: ratio of brain concentration to blood (serum/plasma) concentration | SVM and ECFP2
%PPB | Percentage of the compound bound to plasma proteins | RF and Descriptor
Vd | Volume of distribution within the body | RF and Descriptor
Fbt | Fraction bound in tissues | SVM and Descriptor
Ktb | Tissue-blood partition coefficient: measures the distribution of a substance between a specific tissue and the blood | SVM and PubChem FP
Metabolism
Property | Definition | Used Model and Method
Primary enzyme | Predominant enzyme accountable for metabolism (CYP P450 1A2, 2C9, 2C19, 2D6, 3A4, etc.) | 1A2: SVM and ECFP4; 2C9: RF and ECFP2; 2C19: SVM and ECFP2; 2D6: RF and ECFP4; 3A4: SVM and ECFP4
% metabolised | Overall percentage of metabolism | SVM and MACCS
% excreted | Proportion of the compound excreted unchanged in urine | RF and Descriptor
Vmax | Maximum velocity of the metabolic reaction | SVM and MACCS
Cliv | Clearance rate in the liver | RF and Descriptor
Excretion/Elimination
Property | Definition | Used Model and Method
Clr | Renal clearance | RF and Descriptor
Cltot | Total clearance across all routes | SVM and MACCS keys
AUC | Area under the concentration-time curve | RF and Descriptor
t1/2 | Half-life: time for the compound concentration to fall by 50% | RF and Descriptor
Tmax | Time to reach peak concentration | RF and Descriptor
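Three of these endpoints (AUC, t1/2, Tmax) can be read directly off a concentration-time profile, which is how the experimental values used to train such models are typically obtained. The sketch below is a simplified non-compartmental calculation on made-up data. Two assumptions are worth flagging: AUC uses the linear trapezoidal rule over the sampled interval only, and the half-life is estimated from the last two points assuming first-order elimination, whereas a real analysis would fit several terminal points.

```python
import math

def auc_trapezoid(times, conc):
    """Area under the concentration-time curve by the linear trapezoidal rule."""
    return sum((t1 - t0) * (c0 + c1) / 2.0
               for t0, t1, c0, c1 in zip(times, times[1:], conc, conc[1:]))

def t_max(times, conc):
    """Time at which the highest concentration was observed."""
    return times[conc.index(max(conc))]

def half_life(times, conc):
    """Half-life from the slope between the last two points (first-order decay)."""
    k = math.log(conc[-2] / conc[-1]) / (times[-1] - times[-2])  # elimination rate
    return math.log(2) / k

# Hypothetical plasma profile: time in hours, concentration in mg/L.
times = [0.0, 1.0, 2.0, 4.0, 8.0]
conc = [0.0, 8.0, 6.0, 3.0, 0.75]

auc = auc_trapezoid(times, conc)
tmax = t_max(times, conc)
t_half = half_life(times, conc)
print(auc, tmax, t_half)
```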
Toxicity
Property | Definition | Used Model and Method
hERG | hERG encodes a cardiac potassium ion channel; blocking it can cause adverse effects on the heart's electrical activity | RF and Descriptor and MACCS
LD50 | Acute toxicity of a substance, meaning its potential to cause harm within a short period after exposure, expressed as the dose lethal to 50% of test animals | RF and Descriptor
DILI | Drug-induced liver injury: damage, injury, or dysfunction of the liver caused by ingesting a drug or medication | RF and MACCS keys
Hepatotoxicity | Harmful effects on or damage to the liver caused by drugs | RF and Descriptor
SkinSen | Skin sensitization: the skin's response to certain allergens | RF and MACCS
Model Analysis and Performance
Predictive Variance: measures prediction variability; high variance means less precision. Calculate MAPE (Mean Absolute Percentage Error) and MAE (Mean Absolute Error).
Model Quality: the effectiveness, reliability, and performance of a machine learning model. Calculate the confusion matrix and the metrics derived from it (Accuracy, Precision, Recall (Sensitivity), Specificity, F1 Score).
Error Analysis: investigate and analyze model errors to identify patterns or areas where the model may need improvement, then fine-tune the model or collect more relevant data.
Check response times and throughput to ensure the model can handle the required workload without causing delays.
Model Versioning: keep track of different model versions to understand which versions perform best and to allow easy rollback in case of issues.
Scheduled Retraining: set up a retraining schedule to periodically update the model with new data. This is essential for adapting to changing patterns in the data.
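The metrics named on this slide follow directly from their definitions. The sketch below computes them with plain Python on small made-up label and value lists. The function names are hypothetical, and the code assumes binary labels coded 0/1, non-zero true values for MAPE, and at least one positive and one negative in both the true and predicted labels.

```python
def regression_errors(y_true, y_pred):
    """Return (MAE, MAPE in percent) for paired true and predicted values."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mape = 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n
    return mae, mape

def classification_metrics(y_true, y_pred):
    """Confusion-matrix counts and the metrics derived from them."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Illustrative data only.
mae, mape = regression_errors([2.0, 4.0, 5.0], [2.5, 3.0, 5.0])
metrics = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 0, 1, 0, 1, 0, 1, 0])
print(mae, mape, metrics)
```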
Model Monitoring
Data Processing Issues: Data Quality Checks, Data Consistency, Input Validation, Pipeline Monitoring, Logging and Alerting
Data Schema Changes: Validate Incoming Data, Automated Alerts, Data Transformation Monitoring
Data Loss at the Source: Recovery Mechanisms, Data Ingestion Monitoring, Logging and Auditing
Anomaly Detection: unusual behavior in model outputs or predictions that may indicate a problem, such as a sudden increase in errors
Model Documentation: Data Sources, Testing and Validation, Model Performance
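One simple way to flag "a sudden increase in errors" is to compare each new error rate against a baseline window using a z-score. The sketch below is one possible rule, not a prescribed monitoring setup. The threshold of 3 standard deviations, the function name `is_anomalous`, and the error rates are assumptions chosen for illustration.

```python
import statistics

def is_anomalous(baseline, value, z_threshold=3.0):
    """Flag value if it lies more than z_threshold std devs from the baseline mean."""
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline)  # sample standard deviation
    if std == 0:
        return value != mean  # flat baseline: any change counts as anomalous
    return abs(value - mean) / std > z_threshold

# Hypothetical daily error rates observed while the model behaved normally.
baseline_error_rates = [0.10, 0.12, 0.11, 0.09, 0.10]

spike_flagged = is_anomalous(baseline_error_rates, 0.30)   # sudden jump in errors
normal_flagged = is_anomalous(baseline_error_rates, 0.11)  # within usual range
print(spike_flagged, normal_flagged)
```

A flagged value would typically trigger the logging and alerting step listed on the slide rather than an automatic rollback.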
Current Work
Generating molecules (or similar molecules) with (almost) the desired properties using generative AI (RNN, GNN, etc.)
Checking fit score for compatibility
Working on automated energy minimisation of structures
Working on DEL, EGFR VIII data analysis
Working on various biological data analysis (NGS, PacBio) projects
GitHub: https://github.com/santuchal/ADMET
Medium: https://medium.com/@santuchal/admet-an-essential-component-in-drug-discovery-and-development-f503a5aae5dd
Streamlit: https://hav8whwegtyvgwjixnhxqw.streamlit.app/