This presentation was given at SETAC 2022 on November 16 and describes our work on Evaluating Multiple Machine Learning Models for Biodegradation and Aquatic Toxicity.
We generated many models, which are available to license in our MegaTox software. After assessing many algorithms for both classification and regression, we found that support vector machines performed best.
The authors of this work are Thomas R Lane, Fabio Urbina and Sean Ekins.
The contact is sean@collaborationspharma.com
Evaluating Multiple Machine Learning Models for Biodegradation and Aquatic Toxicity
1. 1
Evaluating Multiple Machine Learning Models for Biodegradation and Aquatic Toxicity
Sean Ekins, Thomas R. Lane and Fabio Urbina
2. 2
What are Biodegradation and Aquatic Toxicity?
Biodegradation
The ability of a material to decompose after interactions with biological elements
Aquatic Toxicity
Toxicity of industrial chemicals to organisms living in the water body to which the chemicals are discharged
Mansouri et al., PMID: 29520515
Teixodó et al., PMID: 31598995
3. 3
Importance of Biodegradation and Aquatic Toxicity & Challenges
Globally, industrial chemicals end up in all types of bodies of water, including freshwater
Testing the biodegradation and aquatic toxicity of every compound is unrealistic, so alternative toxicological methods are required
One alternative is prediction of these properties using Machine Learning models
4. 4
Machine Learning: Assay Central® Introduction
Multiple algorithms (Deep Neural Networks, k-Nearest Neighbors, Bernoulli Naïve Bayes, Linear Logistic Regression, AdaBoost Decision Tree, Random Forest, XGBoost, Support Vector Machine, Elastic Net Regression)
Fingerprint descriptors (ECFP4-8) with nested, 5-fold cross-validation
Applicability domain is calculated based on the reliability-density neighborhood (RDN) method
Lane et al., PMID: 33325717
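To make the nested, 5-fold cross-validation idea concrete, here is a minimal pure-Python sketch. It is not the Assay Central implementation: the k-nearest-neighbor classifier with Tanimoto similarity stands in for any of the listed algorithms, the fingerprints are synthetic sets of on-bits rather than real ECFP descriptors, and every function name (`nested_cv`, `toy_dataset`, and so on) is hypothetical. The point it illustrates is that the inner folds pick the hyperparameter using training data only, while the outer folds estimate performance.

```python
import random

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def knn_predict(train, query, k):
    """Majority vote (labels 0/1) over the k most similar training fingerprints."""
    nearest = sorted(train, key=lambda item: tanimoto(item[0], query), reverse=True)[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def k_fold(indices, n_folds):
    """Yield (train_idx, test_idx) pairs for a simple k-fold split."""
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def accuracy(data, train_idx, test_idx, k):
    """Fraction of held-out compounds whose label is predicted correctly."""
    train = [data[i] for i in train_idx]
    hits = sum(knn_predict(train, data[i][0], k) == data[i][1] for i in test_idx)
    return hits / len(test_idx)

def nested_cv(data, k_grid=(1, 3, 5), n_folds=5, seed=0):
    """Outer folds estimate performance; inner folds choose k on training data only."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    outer_scores = []
    for outer_train, outer_test in k_fold(indices, n_folds):
        def inner_score(k):
            scores = [accuracy(data, tr, va, k) for tr, va in k_fold(outer_train, n_folds)]
            return sum(scores) / len(scores)
        best_k = max(k_grid, key=inner_score)  # hyperparameter chosen without the outer test fold
        outer_scores.append(accuracy(data, outer_train, outer_test, best_k))
    return outer_scores

def toy_dataset(n=60, seed=1):
    """Synthetic stand-in for fingerprints (assumption): each class favors its own bit range."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        class_bits = range(0, 20) if label else range(20, 40)
        bits = set(rng.sample(class_bits, 8)) | set(rng.sample(range(40, 60), 3))
        data.append((frozenset(bits), label))
    return data

scores = nested_cv(toy_dataset())
print("outer-fold accuracies:", scores)
```

In practice the same loop structure applies when the inner search tunes, for example, the regularization of a support vector machine instead of k.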
5. 5
Data Mining: Biodegradation and Ecotoxicity Datasets
Biodegradation: compounds classified using OECD guidelines (ECHA REACH, EPA’s Biowin software, literature)
Classification → 3428 unique compounds (962 readily biodegradable, RB); Regression → ~200
Ecotoxicity: Multiple sources for acute and chronic ecotoxicity (ECOTOX, Ministry of the Environment of Japan, literature) using EPA’s ECOSAR defined testing parameters:
Organism | Acute Toxicity Values | Chronic Toxicity Values
Fish (Aquatic Vertebrate) | 96-hour LC50 | Chronic Value (ChV)
Daphnid (Aquatic Invertebrate) | 48-hour LC50 | Chronic Value (ChV)
Algae (Aquatic Plant) | 72- or 96-hour EC50 | Chronic Value (ChV)
6. 6
Biodegradation Classification and Regression Models
Classification models
Prediction of readily/non-biodegradable compounds
All algorithms performed well using nested, 5-fold cross-validation (CV), with SVC outperforming the others
Regression models
Prediction of “ultimate” biodegradation of compounds
CV suggested a predictive model, with low MAE/RMSE and a moderate R² (SVR example)
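For readers less familiar with the regression metrics named here, the following sketch computes MAE, RMSE and R² from observed and predicted values. The function name `regression_metrics` and the example numbers are illustrative assumptions, not values from the biodegradation models.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for paired observed and predicted values."""
    n = len(y_true)
    errors = [pred - obs for obs, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n                 # mean absolute error
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    rmse = math.sqrt(ss_res / n)                          # root mean squared error
    mean_obs = sum(y_true) / n
    ss_tot = sum((obs - mean_obs) ** 2 for obs in y_true) # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                            # coefficient of determination
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Made-up observed and predicted values, purely for illustration.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(regression_metrics(observed, predicted))
```

MAE and RMSE are in the units of the endpoint, so "low" is relative to the range of the measured values, whereas R² is unitless.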
7. 7
Aquatic Acute Toxicity Classification Models
Organism | Endpoint | High (≤1 mg/L) | Low (≥100 mg/L) | Total Compounds
Fish | LC50 | 880 | 664 | 2983
Daphnid | LC50 | 347 | 484 | 1379
Green Algae | EC50 | 390 | 126 | 1130
Thresholds based on the EPA’s ECOSAR defined toxicity
Aquatic toxicity concern level
High Concern: any acute value <1 mg/L
Moderate Concern: lowest acute value >1 and <100 mg/L
Low Concern: all acute values are >100 mg/L
Only the most sensitive species is considered
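The concern-level rule above can be sketched as a small function that looks only at the most sensitive species, that is, the lowest acute value. The function name `concern_level` is hypothetical, and the handling of values exactly at 1 or 100 mg/L is an assumption taken from the table header (≤1 mg/L is High, ≥100 mg/L is Low), since the rule text uses strict inequalities.

```python
def concern_level(acute_values_mg_per_l):
    """Classify a compound from its acute values (mg/L), keyed by organism.

    Only the most sensitive species (the lowest value) drives the result.
    Boundary handling (<= 1 is High, >= 100 is Low) is an assumption.
    """
    if not acute_values_mg_per_l:
        raise ValueError("at least one acute value is required")
    lowest = min(acute_values_mg_per_l.values())
    if lowest <= 1.0:
        return "High"
    if lowest >= 100.0:
        return "Low"
    return "Moderate"

# Illustrative values only, not measurements from the datasets.
example = {"fish": 12.0, "daphnid": 300.0, "green algae": 120.0}
print(concern_level(example))
```

Compounds in the Moderate band account for the gap between the High plus Low counts and the total compounds in the table.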
10. 10
Aquatic Chronic Toxicity Classification Models
Organism | Endpoint | High (≤0.1 mg/L) | Low (≥10 mg/L) | Total Compounds
Fish | ChV | 458 | 217 | 1087
Daphnid | ChV | 231 | 80 | 566
Green Algae | ChV | 77 | 90 | 321
Data from the EPA’s ECOTOX website and literature
ChV requires calculation of the geometric mean of NOEC & LOEC
LOEC = lowest observed effect concentration, NOEC = no observed effect concentration
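The ChV calculation is the geometric mean of the two concentrations, ChV = sqrt(NOEC × LOEC). A minimal sketch follows; the function name `chronic_value` and the input checks are assumptions added for illustration.

```python
import math

def chronic_value(noec_mg_per_l, loec_mg_per_l):
    """Chronic Value (ChV) as the geometric mean of NOEC and LOEC, both in mg/L."""
    if noec_mg_per_l <= 0 or loec_mg_per_l <= 0:
        raise ValueError("concentrations must be positive")
    if noec_mg_per_l > loec_mg_per_l:
        # Sanity check (assumption): the no-effect level should not exceed the effect level.
        raise ValueError("NOEC should not exceed LOEC")
    return math.sqrt(noec_mg_per_l * loec_mg_per_l)

# Illustrative concentrations only.
print(chronic_value(0.1, 0.4))
```

The resulting ChV can then be compared with the chronic thresholds in the table (≤0.1 mg/L for High, ≥10 mg/L for Low).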
13. 13
Summary
• Biodegradation Models - SVC performs the best
• Aquatic chronic and acute toxicity models - Datasets from EPA’s ECOTOX and literature
• Support Vector Machine outperformed the other algorithms
• Future: Many more descriptors to try!
• Implementation of conformal predictors to add a reliable confidence scoring system
Sheffield and Judson PMID: 31560848
14. 14
Improving the Quality of Predictions - Conformal Predictors for Biodegradation
Angelopoulos and Bates arXiv:2107.07511
Uses a calibration dataset to determine the optimal prediction score threshold for each class
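One common way to derive a per-class threshold from a calibration set is class-conditional split conformal prediction, sketched below. This is an illustration of the general technique, not necessarily the variant used for the biodegradation models: the nonconformity score (1 minus the predicted probability of the class), the function names, the class labels "RB" and "NRB", and the calibration numbers are all assumptions.

```python
import math

def class_thresholds(calibration, alpha=0.25):
    """Compute one nonconformity threshold per class from calibration data.

    calibration: list of (probs, true_label), where probs maps class -> probability.
    alpha: target error rate per class (assumed value).
    """
    scores = {}
    for probs, label in calibration:
        # Nonconformity score: how little probability the model gave the true class.
        scores.setdefault(label, []).append(1.0 - probs[label])
    thresholds = {}
    for label, values in scores.items():
        values.sort()
        n = len(values)
        rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected quantile rank
        # With too few calibration points the quantile is unbounded, so always include.
        thresholds[label] = values[rank - 1] if rank <= n else 1.0
    return thresholds

def prediction_set(probs, thresholds):
    """Return every class whose nonconformity score is within its threshold."""
    return {label for label, q in thresholds.items() if 1.0 - probs[label] <= q}

# Made-up calibration predictions for readily (RB) and non-readily (NRB) biodegradable classes.
calibration = [
    ({"RB": 0.875, "NRB": 0.125}, "RB"),
    ({"RB": 0.75, "NRB": 0.25}, "RB"),
    ({"RB": 0.5, "NRB": 0.5}, "RB"),
    ({"RB": 0.125, "NRB": 0.875}, "NRB"),
    ({"RB": 0.25, "NRB": 0.75}, "NRB"),
    ({"RB": 0.375, "NRB": 0.625}, "NRB"),
]
thresholds = class_thresholds(calibration, alpha=0.25)
print(prediction_set({"RB": 0.75, "NRB": 0.25}, thresholds))
```

A prediction set with one class is a confident call, a set with both classes signals ambiguity, and an empty set flags a compound unlike anything in the calibration data.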
15. 15
Acknowledgments
Josh Harris
Scott Snyder
Discussions with:
Diedrich Bermudez
Daniel Mucs
Funding
NIGMS: R44GM122196-04A1
NIEHS: 1R43ES031038-02A1
Contact me at: sean@collaborationspharma.com