Optimizing Drug Discovery using
ADMET
Translating Data into Actionable Insights and Decisions using ML
Santu Chall
ME, MCA
C10H9NO3
SMILES : SMILES (Simplified Molecular Input Line Entry System) is a concise notation
for representing chemical structures in a line of text.
For example : OC(=O)CN1C(=O)Cc2c1cccc2
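The link between the SMILES string and the formula C10H9NO3 can be checked with a small sketch. It only counts heavy atoms and assumes a bracket-free SMILES from the organic subset; hydrogens are implicit in SMILES and are not counted. The function name heavy_atom_counts is hypothetical, and in practice a toolkit such as RDKit would do the parsing.

```python
import re
from collections import Counter

# Atoms of the SMILES organic subset; two-letter halogens are matched first.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count heavy atoms in a bracket-free SMILES string (hydrogens are implicit)."""
    if "[" in smiles:
        raise ValueError("bracket atoms are outside the scope of this sketch")
    # Aromatic atoms are written in lowercase, so normalise the case before counting.
    return Counter(token.capitalize() for token in ATOM_PATTERN.findall(smiles))


counts = heavy_atom_counts("OC(=O)CN1C(=O)Cc2c1cccc2")
print(dict(counts))
```

For the example molecule this gives 10 carbon, 1 nitrogen and 3 oxygen atoms, matching the heavy-atom part of C10H9NO3.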
Molecular Representation
0D/1D 2D 3D 4D
Descriptors : Molecular descriptors are quantitative values that characterize chemical
structures, aiding in structure-property relationships and computational
chemistry analysis. For example: MW (molecular weight), HBA (hydrogen-bond acceptors),
HBD (hydrogen-bond donors), no_of_atom (atom count) etc.
Software : Several software packages can calculate and analyze chemical properties,
such as RDKit, ChemAxon, Dragon, PaDEL and MOE.
Molecular Fingerprint
• Binary representation of a molecule that is fast,
objective and compact
• A “keyed” fingerprint indicates the presence or
absence of specific structural features
• Tasks: search and comparison, prediction and
clustering
• Types of fingerprint
• Selecting the right Fingerprint
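Search and comparison with binary fingerprints usually come down to a similarity score between two bit vectors. The sketch below uses the Tanimoto coefficient, a common choice that the slides do not name explicitly. The fingerprints are hypothetical sets of "on" bit positions, not output from a real fingerprinting tool.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit positions."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Hypothetical keyed fingerprints: each integer is the index of a structural key that is present.
fp_query = {1, 4, 7, 9, 15}
fp_candidate = {1, 4, 8, 9, 20, 31}

similarity = tanimoto(fp_query, fp_candidate)
print(similarity)  # 3 shared keys out of 8 distinct keys -> 0.375
```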
ADMET
Absorption
Distribution
Metabolism
Excretion/Elimination
Toxicity
Data Selection
• Online Databases : ChEMBL, PubChem, ChemDB,
ChemSpider, DrugBank etc.
• Reputed Scientific Journals : Journal of Chemical
Information and Modeling, Journal of
Cheminformatics, Journal of Computer-Aided
Molecular Design etc.
• Data Retrieval from the Literature : PubMed,
ScienceDirect, Google Scholar, ACS Publications,
open-access journals etc.
Data Division
• Random Division : assign compounds to the training and test sets at random.
(X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42))
• Kennard-Stone Division : start with the two data points that are
farthest apart in the feature space, then repeatedly add the point that is
farthest from those already selected.
• Activity Based Division : split on the specific activity or property being
predicted or modeled, so that both sets represent the full range of activity levels in the dataset.
• Euclidean Distance Based : compute the Euclidean distance
between all pairs of data points in the multidimensional feature space.
(euclidean_distances = np.linalg.norm(X[:, np.newaxis] - X, axis=2))
• K-Medoids Based : clustering algorithm that divides the data into groups around
representative points (medoids).
(clusterer = KMedoids(n_clusters=K, random_state=0), with KMedoids from sklearn_extra.cluster)
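The Kennard-Stone idea can be sketched in plain Python. It seeds the training set with the two most distant points and then repeatedly adds the point whose nearest already-selected neighbour is farthest away. The function name kennard_stone and the six 2D points are hypothetical; a real split would run on descriptor or fingerprint vectors.

```python
import math


def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone procedure."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Seed with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - set(selected)
    while len(selected) < n_select:
        # Add the candidate whose nearest selected neighbour is farthest away.
        best = max(sorted(remaining), key=lambda i: min(dist[i][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical 2D feature vectors for six compounds.
X = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (10.0, 10.0), (9.0, 10.0), (0.0, 1.0)]
train_idx = kennard_stone(X, n_select=4)
test_idx = [i for i in range(len(X)) if i not in train_idx]
print(train_idx, test_idx)
```

The selected points form the training set and the rest form the test set, so the training set covers the extremes of the feature space.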
Feature Selection
• Genetic Algorithm : GA-based feature selection searches for a
subset of the most relevant features (variables) from the original feature set to
improve model performance and reduce computational complexity.
ga = GeneticAlgorithm(num_features=X.shape[1], fitness_func=fitness_function) # GeneticAlgorithm is a user-defined or third-party class, not part of scikit-learn
• Lasso Feature Selection : Lasso (Least Absolute Shrinkage and
Selection Operator) adds an L1 penalty term to the linear regression or logistic
regression cost function, which encourages the model to set the coefficients of some
features to zero, effectively removing them from the model.
lasso = sklearn.linear_model.Lasso(alpha=1.0)
• Stepwise Selection : select the most relevant features step by step (Forward
Selection, Backward Elimination, Bidirectional Selection, Stopping Criteria).
rfe = sklearn.feature_selection.RFE(LogisticRegression(), n_features_to_select=10) # Recursive feature elimination: keep the top 10 features
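How an L1 penalty removes features can be shown with the soft-thresholding operator. Under the simplifying assumption of standardised, uncorrelated features, the Lasso coefficient is the least-squares coefficient shrunk toward zero by the penalty, and it becomes exactly zero if its magnitude is below the penalty. The coefficient values and the penalty below are hypothetical.

```python
def soft_threshold(coefficient: float, penalty: float) -> float:
    """Shrink a coefficient toward zero by `penalty`; small coefficients become exactly 0.0."""
    if coefficient > penalty:
        return coefficient - penalty
    if coefficient < -penalty:
        return coefficient + penalty
    return 0.0


# Hypothetical least-squares coefficients for four descriptors.
ols_coefficients = {"MW": 0.90, "HBA": -0.15, "HBD": 0.40, "no_of_atom": 0.05}
penalty = 0.2

lasso_coefficients = {
    name: soft_threshold(value, penalty) for name, value in ols_coefficients.items()
}
# Features whose coefficient survived the shrinkage are the selected ones.
selected = [name for name, value in lasso_coefficients.items() if value != 0.0]
print(selected)
```

Here only MW and HBD keep a non-zero coefficient; a larger penalty would drop more features.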
Learning Algorithm
• Supervised
– Regression : build predictive models for tasks where the goal is to predict a
continuous numeric value. Example: Random Forest Regression (RF), Support
Vector Regression (SVR), Decision Tree Regression, K-Nearest Neighbors Regression,
Neural Networks for Regression etc.
– Classification : build models that categorize data into predefined classes
or categories. Example: Logistic Regression, Decision Trees, Support Vector
Machines (SVM), K-Nearest Neighbors (KNN)
• Unsupervised
– Clustering : used to group data into clusters based on inherent patterns or
similarities in the data. Example: K-Means Clustering, X-Means, Gaussian Mixture
Models (GMM)
– Dimensionality Reduction : used to reduce the number of features or
dimensions in a dataset while preserving important information and
patterns. Example: Principal Component Analysis (PCA), Independent Component
Analysis (ICA), Autoencoders
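K-Nearest Neighbors Regression, one of the regression examples above, is simple enough to sketch without a library. The prediction is the mean target value of the k closest training compounds. The descriptor vectors and property values are hypothetical, and a production model would normally come from a library such as scikit-learn.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict a continuous value as the mean target of the k nearest training points."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank training points by Euclidean distance to the query and keep the k closest.
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))[:k]
    return sum(train_y[i] for i in nearest) / k


# Hypothetical data: two descriptor values per compound and one measured property.
train_X = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.2, 4.8)]
train_y = [2.0, 2.2, 1.8, 6.0, 6.4]

prediction = knn_predict(train_X, train_y, query=(1.0, 1.0), k=3)
print(prediction)
```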
Absorption
Property | Definition | Used Model and Method
%Abs | Percentage absorption rate through the intestinal barrier | RF and MACCS keys
%HIA | Percentage absorbed through the human GI tract | RF and MACCS keys
Caco2 | Permeability across a Caco-2 cell monolayer, an in vitro model that predicts absorption including paracellular and active transport | RF and Descriptor
Pgp | Inhibition of P-glycoprotein (P-gp) function to enhance drug absorption | SVM and ECFP4
Amount absorbed | Amount of compound absorbed per kilogram of body weight | RF and Descriptor
Distribution
Property | Definition | Used Model and Method
BBB partitioning | Blood-brain barrier partitioning: brain vs. blood (serum/plasma) concentration ratio | SVM and ECFP2
%PPB | Percentage of the compound bound to plasma proteins | RF and Descriptor
Vd | Volume of distribution within the body | RF and Descriptor
Fbt | Fraction bound in tissues | SVM and Descriptor
Ktb | Tissue-blood partition coefficient: measures the distribution of a substance between a specific tissue and the blood | SVM and PubChem FP
Metabolism
Property | Definition | Used Model and Method
Primary enzyme | Predominant enzyme accountable for metabolism (CYP P450 1A2, 2C9, 2C19, 2D6, 3A4 etc.) | 1A2: SVM and ECFP4; 2C9: RF and ECFP2; 2C19: SVM and ECFP2; 2D6: RF and ECFP4; 3A4: SVM and ECFP4
% metabolised | Overall percentage of metabolism | SVM and MACCS
% excreted | Proportion of the compound excreted unchanged in urine | RF and Descriptor
Vmax | Maximum velocity of the metabolic reaction | SVM and MACCS
Cliv | Clearance rate in the liver | RF and Descriptor
Excretion/Elimination
Property | Definition | Used Model and Method
Clr | Renal clearance | RF and Descriptor
Cltot | Total clearance across all routes | SVM and MACCS keys
AUC | Area under the concentration-time curve | RF and Descriptor
t1/2 | Half-life: time for the compound concentration to reduce by 50% | RF and Descriptor
Tmax | Time to achieve peak concentration | RF and Descriptor
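Several of the properties in the table above are derived from a measured concentration-time profile. The sketch below computes AUC with the linear trapezoidal rule (from the first to the last sample) and reads off Tmax. It also gives the half-life under the additional assumption of first-order elimination with a known rate constant. The function names and the sampling data are hypothetical.

```python
import math


def auc_and_tmax(times, concentrations):
    """Return (AUC by the linear trapezoidal rule, time of the peak concentration)."""
    if len(times) != len(concentrations) or len(times) < 2:
        raise ValueError("need at least two matching time/concentration samples")
    auc = sum(
        (t1 - t0) * (c0 + c1) / 2.0
        for t0, t1, c0, c1 in zip(times, times[1:], concentrations, concentrations[1:])
    )
    tmax = times[concentrations.index(max(concentrations))]
    return auc, tmax


def half_life(elimination_rate_constant):
    """Half-life for first-order elimination: t1/2 = ln(2) / k."""
    return math.log(2) / elimination_rate_constant


# Hypothetical plasma profile: sampling times in hours, concentrations in mg/L.
times_h = [0.0, 1.0, 2.0, 4.0, 8.0]
conc_mg_per_l = [0.0, 4.0, 6.0, 3.0, 1.0]

auc, tmax = auc_and_tmax(times_h, conc_mg_per_l)
print(auc, tmax)  # 24.0 (mg*h/L) and 2.0 (h)
```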
Toxicity
Property | Definition | Used Model and Method
hERG | hERG encodes a potassium ion channel; its inhibition can adversely affect the heart's electrical activity | RF and Descriptor and MACCS
LD50 | Median lethal dose: a measure of the acute toxicity of a substance, i.e. its potential to cause harm within a short period after exposure | RF and Descriptor
DILI | Drug-induced liver injury: ingestion of a drug or medication leads to damage, injury, or dysfunction of the liver | RF and MACCS keys
Hepatotoxicity | Harmful effects on or damage to the liver caused by drugs | RF and Descriptor
SkinSen | Skin sensitization: the skin's response to certain allergens | RF and MACCS
Model Analysis and Performance
• Predictive Variance : measures prediction variability; high variance
means less precision. Calculate MAPE (Mean Absolute Percentage Error) and MAE
(Mean Absolute Error).
• Model Quality : refers to the effectiveness, reliability, and performance of a
machine learning model. Calculate the confusion matrix and the metrics derived from it
(Accuracy, Precision, Recall (Sensitivity), Specificity, F1 Score).
• Error Analysis : investigate and analyze model errors to identify patterns or
areas where the model may need improvement, then fine-tune the model or collect
more relevant data. Check response times and throughput to ensure the model can
handle the required workload without causing delays.
• Model Versioning: keep track of different model versions to understand
which versions are performing best and to facilitate easy rollback in case of issues.
• Scheduled Retraining: set up a retraining schedule to periodically
update the model with new data. This is essential to adapt to changing patterns in the
data.
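The error and quality metrics named above can be written out directly. This is a plain-Python sketch on hypothetical values; it assumes binary labels coded as 0/1 and non-zero true values for MAPE. Libraries such as scikit-learn provide equivalent, more robust functions.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent; undefined if any true value is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def classification_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical regression and classification results.
reg_mae = mae([10.0, 20.0, 40.0], [12.0, 18.0, 40.0])
reg_mape = mape([10.0, 20.0, 40.0], [12.0, 18.0, 40.0])
quality = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 0, 1, 0, 1, 0, 1, 0])
print(reg_mae, reg_mape, quality)
```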
Model Deployment
• Source Code Management (Git)
• CI/CD (Jenkins)
• Container (Docker)
• Orchestration (Ansible)
• Log Analysis (ELK, Grafana)
Model Monitoring
• Data Processing Issue: Data Quality Checks, Data Consistency, Input
Validation, Pipeline Monitoring, Logging and Alerting
• Data Schema Changes: Validate Incoming Data, Automated Alerts,
Data Transformation Monitoring.
• Data Loss at the Source: Recovery Mechanisms, Data Ingestion
Monitoring, Logging and Auditing
• Anomaly Detection : unusual behavior in model outputs or predictions
that may indicate a problem, such as a sudden increase in errors
• Model Documentation : Data Sources, Testing and Validation,
Model Performance
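Input validation against an expected schema, as listed above, might look like the following sketch. The schema (field names, types and ranges) is hypothetical and would need to match the descriptors the deployed model actually expects.

```python
# Hypothetical schema: descriptor name -> (expected type, minimum, maximum).
EXPECTED_SCHEMA = {
    "MW": (float, 0.0, 2000.0),
    "HBA": (int, 0, 50),
    "HBD": (int, 0, 50),
}


def validate_record(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    for name, (expected_type, low, high) in EXPECTED_SCHEMA.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
        elif not low <= value <= high:  # also catches NaN, which fails every comparison
            problems.append(f"out of range for {name}: {value}")
    # Fields that are not in the schema may signal an upstream schema change.
    unexpected = sorted(set(record) - set(EXPECTED_SCHEMA))
    problems.extend(f"unexpected field: {name}" for name in unexpected)
    return problems


good = validate_record({"MW": 191.18, "HBA": 3, "HBD": 1})
bad = validate_record({"MW": -5.0, "HBA": "3", "extra": 1})
print(good, bad)
```

A non-empty result can be logged and used to trigger an alert before the record reaches the model.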
Current Work
• Generating molecules (or similar molecules)
with (almost) the desired properties using generative
AI (RNN, GNN etc.)
• Checking the fit score for compatibility
• Working on automated energy minimisation of
structures
• Working on DEL, EGFR VIII data analysis
• Working on various biological data
analysis (NGS, PacBio) projects
Github: https://github.com/santuchal/ADMET
Medium: https://medium.com/@santuchal/admet-an-essential-component-in-drug-discovery-and-development-f503a5aae5dd
Streamlit: https://hav8whwegtyvgwjixnhxqw.streamlit.app/
THANK YOU
