The document discusses using random forest ensembles to predict cytotoxicity and animal toxicity using screening data from the MLSCN. Key points:
- Random forest models were built to predict cytotoxicity from Scripps and NCGC screening data, achieving up to 71% accuracy. Feature selection identified substructures characteristic of toxic compounds.
- The models struggled to predict animal toxicity, achieving only 29% accuracy overall; differences in feature distributions between the screening and animal datasets made transfer difficult.
- Larger, more standardized datasets, along with consideration of mechanisms of action, are needed to improve predictive ability across domains. The lack of standardization of assay data in PubChem makes it challenging to use for modeling.
Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection
1. Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection
Rajarshi Guha (School of Informatics, Indiana University) and Stephan Schürer (Scripps, FL)
23rd August, 2007
2. Broad Goals
Understand and possibly predict cytotoxicity
− Utilizing MLSCN screening data and external data
− Characterize and visualize various screening results
− Relate screening data to known information
Model and predict acute toxicity in animals
− Relate large cytotoxicity data sets to animal toxicity(?)
Modelling protocols to handle the characteristics of HTS data
− Large datasets, imbalanced classes, applicability
Make models publicly available
− For use in multiple scenarios and accessible by a variety of methods
3. Datasets
Animal acute toxicity data was extracted from the ToxNet database (available from MDL)
− Selected only LD50 data for mouse and rat and three routes of administration
− Summarized LD50 data by structure, species and route (140,808 LD50 data points, 103,040 structures)
− Classified into Toxic/Nontoxic using a cutoff
Cytotoxicity data was taken as published in PubChem from Scripps and NCGC
− Scripps Jurkat cytotoxicity assay (59,805 structures with %Inhib, 801 IC50 values)
− NCGC data from PubChem for 13 cell lines (non-MLSMR structures): summarized multiple sample data by unique structures and extracted IC50 data: 1,334 structures, 13 x 1,334 IC50 values for different cell lines
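The summarization step described above (group LD50 values by structure, species, and route, then apply a toxicity cutoff) can be sketched as follows. The record fields, the use of the median as the summary statistic, and the 500 mg/kg cutoff are all illustrative assumptions; the talk does not specify these details.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (structure_id, species, route, LD50 in mg/kg).
# Field names and values are illustrative, not actual ToxNet data.
records = [
    ("cmpd1", "mouse", "IP", 12.0),
    ("cmpd1", "mouse", "IP", 18.0),
    ("cmpd2", "rat", "oral", 2400.0),
]

def summarize(records):
    """Group LD50 values by (structure, species, route); summarize by median."""
    groups = defaultdict(list)
    for struct, species, route, ld50 in records:
        groups[(struct, species, route)].append(ld50)
    return {key: median(vals) for key, vals in groups.items()}

def classify(summary, cutoff=500.0):
    """Label an entry Toxic if its summarized LD50 falls below the cutoff."""
    return {key: ("Toxic" if ld50 < cutoff else "Nontoxic")
            for key, ld50 in summary.items()}

summary = summarize(records)   # cmpd1 -> 15.0 mg/kg, cmpd2 -> 2400.0 mg/kg
labels = classify(summary)     # cmpd1 -> Toxic, cmpd2 -> Nontoxic
```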
4. Scripps Cytotoxicity Models
57,469 valid structures
775 structures with measured IC50
− Skipped 26 structures that BCI could not parse
How do we model this dataset?
− Use all data. Very poor results
− Use a sampling procedure to get an ensemble of
models
− Consider just the 775 structures
5. Scripps Cytotoxicity Models
First considered the 775 structures
Evaluated 1052 bit BCI fingerprints
Selected a cutoff pIC50
− >= 5.5 - toxic
− < 5.5 – nontoxic
Used sampling to create a 10-member ensemble
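A minimal sketch of the two steps above: assigning toxic/nontoxic labels at the pIC50 cutoff of 5.5, and drawing resampled training sets for a 10-member ensemble. The exact sampling scheme used in the talk is not specified; bootstrap-style resampling shown here is one common choice, and the pIC50 values are toy numbers.

```python
import random

def label(pic50, cutoff=5.5):
    """Slide 5 definition: >= cutoff -> toxic, < cutoff -> nontoxic."""
    return "toxic" if pic50 >= cutoff else "nontoxic"

def sample_training_sets(compounds, n_members=10, seed=0):
    """Return one index list per ensemble member, by resampling with replacement."""
    rng = random.Random(seed)
    n = len(compounds)
    return [[rng.randrange(n) for _ in range(n)] for _ in range(n_members)]

pic50s = [4.2, 5.8, 6.1, 5.0, 7.3]
labels_ = [label(p) for p in pic50s]
# labels_ == ['nontoxic', 'toxic', 'toxic', 'nontoxic', 'toxic']
sets_ = sample_training_sets(pic50s)   # 10 resampled index lists
```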
7. Scripps Cytotoxicity Models
% correct (ensemble average) = 69%
% correct (consensus) = 71%

            Nontoxic  Toxic
  Nontoxic        39     17
  Toxic           15     41

Not very good performance
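The two accuracy figures above differ in how the ensemble is scored: averaging each member's individual accuracy, versus scoring the ensemble's majority (consensus) vote. A sketch with toy predictions, not the actual model output:

```python
def member_accuracies(member_preds, truth):
    """Per-member fraction of correct predictions."""
    return [sum(p == t for p, t in zip(preds, truth)) / len(truth)
            for preds in member_preds]

def consensus_accuracy(member_preds, truth):
    """Accuracy of the majority vote across ensemble members."""
    correct = 0
    for i, t in enumerate(truth):
        votes = [preds[i] for preds in member_preds]
        majority = max(set(votes), key=votes.count)
        correct += (majority == t)
    return correct / len(truth)

truth = [1, 0, 1, 1, 0]
member_preds = [
    [1, 0, 1, 0, 0],   # member 1: 4/5 correct
    [1, 1, 1, 1, 0],   # member 2: 4/5 correct
    [0, 0, 1, 1, 0],   # member 3: 4/5 correct
]
avg = sum(member_accuracies(member_preds, truth)) / 3   # 0.8
cons = consensus_accuracy(member_preds, truth)          # 1.0
```

With an odd number of members voting on two classes, the consensus can correct individual members' scattered errors, which is why the consensus figure can exceed the ensemble average.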
8. Do More Negatives Help?
Include 10,000 structures, randomly selected
− Primary data, assumed to be nontoxic
Selected a cutoff pIC50
− >= 5.0 - toxic
− < 5.0 – nontoxic
Used sampling to create a 10-member ensemble
9. Expanded Cytotoxicity Dataset
% correct (averaged over the ensemble) = 69%
% correct (consensus prediction) = 71%

            Nontoxic  Toxic
  Nontoxic        79     24
  Toxic           35     68

Not much improvement
Insufficient sampling of the nontoxics
10. We Need More Positives
The two datasets (775 vs 10,775 compounds)
are quite similar in terms of bit spectrum
Normalized Manhattan distance = 0.016
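The distance quoted above can be computed as the mean absolute difference between the two datasets' bit spectra, i.e. the fraction of compounds in each dataset that sets each fingerprint bit. A sketch with toy 8-bit fingerprints rather than 1052-bit BCI fingerprints:

```python
def bit_spectrum(fps, n_bits):
    """Fraction of fingerprints in the set with each bit turned on."""
    return [sum(fp[b] for fp in fps) / len(fps) for b in range(n_bits)]

def norm_manhattan(spec_a, spec_b):
    """Manhattan distance between two spectra, normalized by bit count."""
    return sum(abs(a - b) for a, b in zip(spec_a, spec_b)) / len(spec_a)

set_a = [[1, 0, 1, 0, 0, 1, 0, 0],
         [1, 1, 1, 0, 0, 0, 0, 0]]
set_b = [[1, 0, 1, 0, 0, 1, 0, 1],
         [1, 0, 1, 0, 0, 0, 0, 0]]
d = norm_manhattan(bit_spectrum(set_a, 8), bit_spectrum(set_b, 8))  # 0.125
```

A value near zero, as on this slide, indicates that the two datasets occupy very similar regions of the fingerprint space.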
11. Selecting Important Features
Identifying the important features in an ensemble
− Each RF model can rank the input features
− A consistent ensemble should have similar, but not
necessarily identical, features highly ranked
− Consider the unique set of top N
− The size of the unique set of top N features provides
information about the robustness of the ensemble
Example: ranked features from each member of a four-member ensemble

  Member 1: 4, 3, 15, 23
  Member 2: 4, 8, 16, 15
  Member 3: 3, 23, 15, 8
  Member 4: 4, 5, 28, 23

Unique set of top-4 features: 3, 4, 5, 8, 15, 16, 23, 28
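The unique top-N computation is a simple set union over each member's top-ranked features; a small union relative to N times the ensemble size signals a consistent, robust ensemble. A sketch using the four top-4 lists from the slide's example:

```python
def unique_top_n(ranked_features_per_member, n=4):
    """Union of the top-n ranked features across all ensemble members."""
    top = set()
    for ranking in ranked_features_per_member:
        top.update(ranking[:n])
    return sorted(top)

# The four ranked top-4 lists from the example on this slide.
members = [
    [4, 3, 15, 23],
    [4, 8, 16, 15],
    [3, 23, 15, 8],
    [4, 5, 28, 23],
]
unique_top_n(members)   # [3, 4, 5, 8, 15, 16, 23, 28]
```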
12. Important Structural Features
The 10 most important features for predictive ability, taken across the ensemble, lead to 43 unique important bits
This is a total of 66 structural features
− The toxic compounds are characterized by having a
slightly larger number of these features, on average
13. Predicting Animal Toxicity
Should we use cytotoxicity model to predict
animal toxicity?
Normalized Manhattan distance = 0.037
15. NCGC Toxicity Dataset
Considered 13 cell lines (pIC50 values)
1334 compounds, including
− metals
− inorganics
Classified into toxic / nontoxic using a cutoff
− mean + 2 * SD
Built models for each cell line
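The mean + 2·SD cutoff above can be sketched as follows, using `statistics` and toy pIC50 values rather than NCGC data. Compounds whose pIC50 exceeds the cell line's cutoff are called toxic; with one outlier among otherwise similar values, almost nothing exceeds the cutoff, which is consistent with the severe class imbalance noted on the next slide.

```python
from statistics import mean, stdev

def toxicity_cutoff(pic50s):
    """Per-cell-line cutoff: mean + 2 * sample standard deviation."""
    return mean(pic50s) + 2 * stdev(pic50s)

def classify_cell_line(pic50s):
    """Flag each compound as toxic (True) if its pIC50 exceeds the cutoff."""
    cut = toxicity_cutoff(pic50s)
    return [p > cut for p in pic50s], cut

# Toy pIC50 values for one cell line; only the last compound is active.
pic50s = [3.5, 4.0, 4.2, 3.8, 4.1, 4.0, 3.9, 4.3, 4.0, 7.0]
flags, cut = classify_cell_line(pic50s)   # exactly one compound flagged toxic
```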
16. NCGC – Class Distributions
Cutoff values ranged from 3.56 to 4.72
Classes are severely imbalanced
Developed ensembles of RF models
20. NCGC – Using the Models
Predicted toxicity class for the Scripps Cytotoxicity dataset (775 compounds) using the model built for the NCGC Jurkat cell line

Predictions for the Scripps Cytotox dataset, using the original cutoffs (32% correct):

            Nontoxic  Toxic
  Nontoxic        67     49
  Toxic          432    227

Predictions for the Scripps Cytotox dataset, using the NCGC cutoff (75% correct):

            Nontoxic  Toxic
  Nontoxic        26     90
  Toxic          109    550
21. Comparing NCGC & Scripps Datasets
Comparing the datasets as a whole
23. Important Features
We consider the NCGC Jurkat cell line
The 10 most important features for predictive ability, taken across the ensemble, lead to 53 unique important bits
This is a total of 72 structural features
− The toxic compounds are characterized by having a
larger number of these features, on average
24. Feature matches for example structure
[Structure of digoxin with the matched features highlighted]

Digoxin

Matching SMARTS patterns:
  [C;D2](-;@[C,c])-;@[C,c]
  [O,o;D2](~[C,c])~[C,c]
  [O,o]!@[C,c]@[C,c]@[C,c]!@[C,c]
  [*]-;@[A]-;@[A]=;@[A]-;!@[*]
  [O,o]~[C,c]~[O,o]~[C,c]~[C,c]~[O,o]
  [*]!@[*]@[*]@[*]@[*]!@[*]
  [O,o]!@[C,c]@[C,c]@[C,c]@[C,c]!@[O,o]
  [*]1@[*]@[*]@[*]@[*]@[*]@1
  [O,o,S,s,Se,Te,Po]!@[C,c,Si,si,Ge,Sn,Pb]@[C,c,Si,si,Ge,Sn,Pb]@[C,c,Si,si,Ge,Sn,Pb]
25. Important Features
Animal Toxicity vs. Cytotoxicity
The ToxNet (Mouse/IP) and NCGC Jurkat models
have 130 important features in common
These features are more common in the NCGC toxic
compounds than in the NCGC nontoxic compounds
The average number of these features present in the
NCGC dataset, overall, is 18.8
− Very low, might indicate that the NCGC model is not going to
be applicable to the ToxNet data
26. Toxicity vs No. of Features - NCGC Dataset
[Structures of dactinomycin, a Leukerin derivative, and daunorubicin, shown against their feature counts]

The important features focus on the toxic class
No correlation between number of features and pIC50 for the nontoxic class
27. Predicting Animal Toxicity
            Nontoxic  Toxic
  Nontoxic     12182    558
  Toxic        32638   1265

Predictions for the ToxNet Mouse/IP dataset: 29% correct overall, 70% correct on the toxic class

Overall predictive performance is poor
Possible causes
− Poor sampling of the nontoxics during training
− Differences in feature distributions between the two datasets
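The quoted figures can be recovered from the slide's confusion matrix. Reading the rows as predicted class and the columns as actual class is the interpretation consistent with "70% correct on the toxic class" (the computed value is about 0.69, which the slide rounds up):

```python
# Confusion-matrix counts from this slide; only the orientation
# (rows = predicted, columns = actual) is an inference.
matrix = {
    ("pred_nontoxic", "actual_nontoxic"): 12182,
    ("pred_nontoxic", "actual_toxic"):      558,
    ("pred_toxic",    "actual_nontoxic"): 32638,
    ("pred_toxic",    "actual_toxic"):     1265,
}

total = sum(matrix.values())
correct = (matrix[("pred_nontoxic", "actual_nontoxic")]
           + matrix[("pred_toxic", "actual_toxic")])
overall = correct / total                 # ~0.29 overall accuracy

toxic_total = (matrix[("pred_nontoxic", "actual_toxic")]
               + matrix[("pred_toxic", "actual_toxic")])
toxic_acc = matrix[("pred_toxic", "actual_toxic")] / toxic_total  # ~0.69
```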
29. What's Next?
Relate structural features to mechanisms of toxicity
Incorporate these into models / build class models
− Different cell-lines vs. animal toxicity
− Structural features vs. mechanisms?
Based on prediction confidence and model
applicability, can we suggest alternative assays?
Use the vote fraction & common bit counts to prioritize compounds that may be toxic
− Improve assessment of model applicability
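One way the prioritization above could work is to rank compounds by the fraction of ensemble members voting "toxic", breaking ties with the number of important bits each compound carries. The compound names, vote counts, and tie-breaking rule below are illustrative assumptions, not the talk's actual procedure:

```python
def prioritize(compounds):
    """Rank by vote fraction, then by important-bit count (both descending).

    compounds: list of (id, toxic_votes, n_members, important_bit_count).
    """
    return sorted(
        compounds,
        key=lambda c: (c[1] / c[2], c[3]),
        reverse=True,
    )

# Hypothetical compounds: (id, toxic votes, ensemble size, important bits).
compounds = [
    ("cmpdA", 9, 10, 12),
    ("cmpdB", 6, 10, 30),
    ("cmpdC", 9, 10, 25),
]
order = [c[0] for c in prioritize(compounds)]   # ['cmpdC', 'cmpdA', 'cmpdB']
```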
30. Summary
Applying models to predict other datasets is a
tricky affair
− Are the features distributed in a similar manner
between training data & the new data?
− Do toxic/nontoxic labels transfer between datasets?
More secondary data required
− But this is not the final solution since the NCGC
dataset is small but leads to (some) good models
31. Summary
Fingerprints may not be the optimal way to get the best
predictive ability
− They do let us look at structural features easily
We have investigated Molconn-Z descriptors
− Preliminary results don't indicate significant improvements
We cannot globally model animal toxicity based on
cytotoxicity
− Animal data sets are biased to toxic compounds
− Different structural classes behave differently (mechanism of action, metabolic effects)
32. Acknowledgements
MLSCN data sets / PubChem
NCGC
Scripps
− Screening (Peter Hodder)
− Informatics (Nick Tsinoremas, Chris Mader)
− Hugh Rosen
Alex Tropsha, UNC
Digital Chemistry
Tudor Oprea, UNM
NIH
34. Structure Sets: Fingerprint Similarity
Only a small fraction of MLSMR structures are similar to ToxNet structures, and vice versa: 4–5% of MLSMR and ToxNet structures have at least one >50% similar structure in the other set
NCGC structures are much more similar to ToxNet (86% with >50% maximum similarity) than to MLSMR (9% with >50% maximum similarity)
35. Important Features - Distributions
[*]1@[*]@[*]@[*]@[*]@[*]@1
[N,n]~[C,c]~[C,c]~[C,c]
[C,c]~[N,n]~[C,c]~[C,c]
[A]=;@[A]-;@[A]-;@[A]=;@[A]
[C,c;D1]-;!@[N,n]
36. Are The Cytotox Classes Distinct?
Poor predictive ability may be explained by the
lack of separation between toxic & nontoxic
Normalized Manhattan distance = 0.017
37. Are The Cytotox Classes Distinct?
But the situation is a little better if we just look
at the important bits
Normalized Manhattan distance = 0.06
38. Standardization Issues - Data
Extracting data sets out of PubChem requires manual curation, post-processing, and aggregation of data
− No standard measures or column definitions
− Activity score and outcome only valid within one experiment
− Assay results are not globally comparable
− No standardization of assay format (e.g. type, readout, etc.)
− Limited ability to query PubChem for specific data sets
rpubchem package for R is one option
− Need better way to access specific bulk data sets
− No aggregation of assay (sample) data by compound
PubChem seems better suited to browse individual data
than access large standardized data sets
39. Model Deployment
Final models are deployed in our R WS
infrastructure
− Currently the Scripps Jurkat model is available
Model file can be downloaded
− http://www.chembiogrid.org/cheminfo/rws/mlist
A web page client is available at
− http://www.chembiogrid.org/cheminfo/rws/scripps
Incorporated the model into a Pipeline Pilot
workflow
40. Toxicity vs No. of Features - Mouse/IP
[Structures of batrachotoxin, ciguatoxin CTX 3C, and conotoxin GI, shown against their feature counts]

No correlation between the number of important features and the pLD50 for the toxic class