SlideShare a Scribd company logo
1 of 1
Download to read offline
DEEP LEARNING ON NVIDIA GPUS FOR QSAR, QSPR AND QNAR PREDICTIONS
Boris Sattarov1,2, Artem Mitrofanov1,3, Alexandru Korotcov1, Valery Tkachenko1
1SCIENCE DATA SOFTWARE, LLC, Rockville, MD, United States
2Chemistry Department, Kazan State University, Kazan, Russian Federation
3Chemistry Department, Moscow State University, Moscow, Russian Federation
Abstract
While we have seen a tremendous growth in machine learning methods over the last two
decades there is still no one fits all solution. The next era of cheminformatics and
pharmaceutical research in general is focused on mining the heterogeneous big data, which
is accumulating at ever growing pace, and this will likely use more sophisticated algorithms
such as Deep Learning (DL). There has been increasing use of DL recently which has shown
powerful advantages in learning from images and languages as well as many other areas.
However the accessibly of this technique for cheminformatics is hindered as it is not
available readily to non-experts. It was therefore our goal to develop a DL framework
embedded into a general research data management platform (Open Science Data
Repository) which can be used as an API, standalone tool or integrated in new software as
an autonomous module.
In this work we are sharing the results of comparing performance of classic machine
learning methods (Naïve Bayes, Logistic Regression, Support Vector Machines etc.) with
Deep Learning and discussing challenges associated with Deep Learning Neural Networks
(DNN). The DNN learning models of different complexity (up to 8 hidden layers) were built
and tuned (different number of hidden units per layer, multiple activation functions,
optimizers, drop out fraction, regularization parameters, and learning rate) using Keras
(https://keras.io/) and TensorFlow (www.tensorflow.org) and applied to various use cases
connected to prediction of physicochemical properties, ADME, toxicity and calculating
properties of materials. It was also shown that using NVIDIA GPUs significantly accelerates
calculations, although memory consumption puts some limits on performance and
applicability of standard toolkits 'as is’.
Summary
The lack of a commercial machine learning platform that is accessible for users to build,
validate and test machine learning models based on their own data, is holding back smaller
organizations and academic groups from using these approaches.
The OSDR is built to seamlessly integrate chemical operations, datasets manipulation, and
machine learning models (e.g., DL, as well as classical ML) within one framework. As deep
learning methods have not been widely assessed using prospective validation we have used
an approach of taking previously published and novel data input in software platforms, then
enable building of models, and evaluate.
The datasets used in this study represent whole cell screens, individual proteins,
physicochemical properties as well as a dataset with a complex endpoint. We utilized an
array of model performance evaluation metrics including AUC, F1 score, Cohen’s kappa,
Matthews correlation coefficient and others. SVM models outperformed the other classic
ML models, while DNN performed slightly better than SVM in most cases. The results of this
study show there is still work to be done to improve the models. Also it is apparent that
some of the metrics are less sensitive than others, and it might be more useful to test a
wide set of metrics rather than traditional AUC.
It is worth to mention that the NVIDIA GPUs allow significantly accelerates computations
especially allowing DNN models hyperparametes search within reasonable time frame and
final models training.
Acknowledgements
We thank our collaborators Sean Ekins and Daniel P. Russo from Collaborations Pharmaceuticals who provided insight
and expertise that greatly assisted with this study.
We also thank the entire Science Data Software team for great support and hard work on implementation of Machine
learning pipelines in Open Science Data Repository Software.
Machine learning
Two general prediction pipelines were developed and implemented in OSDR. The first
pipeline solely built using only classic Machine Learning (CML) methods, such as Bernoulli
Naive Bayes, Linear Logistic Regression, AdaBoost Decision Tree, Random Forest, and Support
Vector Machine. Open source Scikit-learn (http://scikit-learn.org/stable/, CPU for training and
prediction) ML python library was used for building, tuning, and validating all CML models
included in this pipeline. The second pipeline was built using Deep Neural Networks (DNN)
learning models of different complexity using Keras (https://keras.io/), a deep learning library,
and TensorFlow(www.tensorflow.org, GPU training and CPU for prediction) as a backend. The
developed pipeline consists of a random splitting of the input dataset into training (80%) and
test (20%) datasets, while maintaining equal proportions of active to inactive class ratios in
each split (stratified splitting). Thus, all the models’ tuning and hyper parameters search
were conducted solely through 4-fold cross validation on training data for better model
generalization.
Figure 1. A 4-layer neural network with four inputs, three hidden layers of 4 neurons each
and one output layer. Notice that connections are between neurons across layers, but not
within a layer.
Deep Neural Networks
Deep artificial neural networks solves the central problem in representation learning by
introducing representations that are expressed in terms of other, simpler representations. N-
layer neural networks are shown in Figure 1.
DNN hyperparameters
Two basic approaches to avoid DNN model overfitting are used in training including L2 norm
and drop out regularization for all hidden layers. The following hyperparameters optimization
was performed using a 3 hidden layers DNN (Keras with TensorFlow backend) and the grid
search method from Scikit-learn:
• optimization algorithm: SGD, Adam, Nadam.
• learning rate: 0.05, 0.025, 0.01, 0.001
• network weight initialization: uniform, lecun_uniform, normal, glorot_normal, he_normal
• hidden layers activation function: relu, tanh, LeakyReLU, SReLU
• output function: softmax, softplus, sigmoid
• L2 regularization: 0.05, 0.01, 0.005, 0.001, 0.0001
• dropout regularization: 0.2, 0.3, 0.5, 0.8
• number of nodes in hidden layer (all hidden layers): 512, 1024, 2048, 4096
The following hyperparameters were used for further DNN training: SGD, learning rate 0.01
(automatically 10% reduced on plateu of 50 epochs), weight initialization he_normal, hidden
layers activation SReLU, output layer function sigmoid, L2 regularization 0.001, dropout is 0.5.
The binary crossentropy was used as a loss function. An early training termination were
implemented by stopping training if no change in loss were observed after 200 epochs. The
number of hidden nodes in all hidden layers were set equal to number of input features
(number of bins in fingerprints).
Computing resources:
Single dual-processor, quad-core (Intel E5640) server running CentOS 7 with 96GB memory
and two NVIDIA Tesla K20c GPU.
Datasets and Descriptors
Diverse drug discovery datasets that are publicly available for different type of activity
prediction were used to develop prediction pipelines. The same datasets have been used in
Clark et al. (J Chem Inf Model 2015) for exploring applicability of an array of Bayesian models
for ADME/Tox and other physicochemical properties prediction. In the current study the
FCFP6 fingerprints, 1024 bins datasets were computed from SDF files using RDKit
(http://www.rdkit.org/).
Data analysis
In this study, several traditional measurements of model performance were used, including
precision, recall, F1-Score, accuracy, Receiver Operating Characteristic (ROC) curve and the
area under it (AUC), Cohen’s Kappa coefficient, and the Matthews correlation coefficient.
Table 1. Representative datasets used for evaluating multiple computational methods for
activity chemical properties prediction.
Model
Datasets used and
references
Cutoff for active
Number of molecules
and ratio
Solubility Huuskonen J. J Chem
Inf Comput Sci 2000
Log solubility = −5 1144 active,
155 inactive,
ratio 7.38
Chagas disease
(Typanosoma
cruzi)
Pubchem BioAssay:
AID 2044
with EC50 <1 μM, >10-fold
difference in cytotoxicity
1692 active,
2363 inactive,
ratio 0.72
Figure 2. Representative polar plots of the model evaluation metrics for the Solubility and
Chagas disease training and testing datasets. BNB - Bernoulli Naive Bayes, LLR - Logistic linear
regression, ABDT - AdaBoost Decision Trees, RF - Random Forest, SVM - Support Vector
Machines, DNN-N (N is number of hidden layers).
Solubility
Chagas disease
Science Data
Software

More Related Content

What's hot

Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Greg Landrum
 
Large scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent dataLarge scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent dataGreg Landrum
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
Accelerating GWAS epistatic interaction analysis methods
Accelerating GWAS epistatic interaction analysis methodsAccelerating GWAS epistatic interaction analysis methods
Accelerating GWAS epistatic interaction analysis methodsPriscill Orue Esquivel
 
Multi-omics methods and resources for Bioconductor
Multi-omics methods and resources for BioconductorMulti-omics methods and resources for Bioconductor
Multi-omics methods and resources for BioconductorLevi Waldron
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
 
Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...
Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...
Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...Ahmad C. Bukhari
 
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsMining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsSean Ekins
 
Rescuing Data from Decaying and Moribund Clinical Information Systems
Rescuing Data from Decaying and Moribund Clinical Information SystemsRescuing Data from Decaying and Moribund Clinical Information Systems
Rescuing Data from Decaying and Moribund Clinical Information SystemsHealth Informatics New Zealand
 
Use of CEDAR Technology for Ontology-based Submission of Biomedical Data to ...
 Use of CEDAR Technology for Ontology-based Submission of Biomedical Data to ... Use of CEDAR Technology for Ontology-based Submission of Biomedical Data to ...
Use of CEDAR Technology for Ontology-based Submission of Biomedical Data to ...Syed Ahmad Chan Bukhari, PhD
 
CEDAR: Easing Authoring of Metadata to Make Biomedical Data Sets More Findabl...
CEDAR: Easing Authoring of Metadata to Make Biomedical Data Sets More Findabl...CEDAR: Easing Authoring of Metadata to Make Biomedical Data Sets More Findabl...
CEDAR: Easing Authoring of Metadata to Make Biomedical Data Sets More Findabl...Syed Ahmad Chan Bukhari, PhD
 
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio  Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFabricio  Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFlávio Codeço Coelho
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...BrianDeCost
 

What's hot (20)

OpenTox Europe 2013
OpenTox Europe 2013OpenTox Europe 2013
OpenTox Europe 2013
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...
 
Large scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent dataLarge scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent data
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Accelerating GWAS epistatic interaction analysis methods
Accelerating GWAS epistatic interaction analysis methodsAccelerating GWAS epistatic interaction analysis methods
Accelerating GWAS epistatic interaction analysis methods
 
An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...
 
Multi-omics methods and resources for Bioconductor
Multi-omics methods and resources for BioconductorMulti-omics methods and resources for Bioconductor
Multi-omics methods and resources for Bioconductor
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...
Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...
Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...
 
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsMining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
 
CSHALS 2013
CSHALS 2013CSHALS 2013
CSHALS 2013
 
Rescuing Data from Decaying and Moribund Clinical Information Systems
Rescuing Data from Decaying and Moribund Clinical Information SystemsRescuing Data from Decaying and Moribund Clinical Information Systems
Rescuing Data from Decaying and Moribund Clinical Information Systems
 
NETTAB 2013
NETTAB 2013NETTAB 2013
NETTAB 2013
 
Use of CEDAR Technology for Ontology-based Submission of Biomedical Data to ...
 Use of CEDAR Technology for Ontology-based Submission of Biomedical Data to ... Use of CEDAR Technology for Ontology-based Submission of Biomedical Data to ...
Use of CEDAR Technology for Ontology-based Submission of Biomedical Data to ...
 
CEDAR: Easing Authoring of Metadata to Make Biomedical Data Sets More Findabl...
CEDAR: Easing Authoring of Metadata to Make Biomedical Data Sets More Findabl...CEDAR: Easing Authoring of Metadata to Make Biomedical Data Sets More Findabl...
CEDAR: Easing Authoring of Metadata to Make Biomedical Data Sets More Findabl...
 
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio  Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFabricio  Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...
 

Similar to Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions

Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Metaphorical Analysis of diseases in Tomato leaves using Deep Learning Algori...
Metaphorical Analysis of diseases in Tomato leaves using Deep Learning Algori...Metaphorical Analysis of diseases in Tomato leaves using Deep Learning Algori...
Metaphorical Analysis of diseases in Tomato leaves using Deep Learning Algori...IRJET Journal
 
Performance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and MindsporePerformance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and Mindsporeijdms
 
Automatic Selection of Open Source Multimedia Softwares Using Error Back-Prop...
Automatic Selection of Open Source Multimedia Softwares Using Error Back-Prop...Automatic Selection of Open Source Multimedia Softwares Using Error Back-Prop...
Automatic Selection of Open Source Multimedia Softwares Using Error Back-Prop...IJERA Editor
 
1st review android malware.pptx
1st review  android malware.pptx1st review  android malware.pptx
1st review android malware.pptxNambiraju
 
The MADlib Analytics Library
The MADlib Analytics Library The MADlib Analytics Library
The MADlib Analytics Library EMC
 
Network Intrusion Detection System using Machine Learning
Network Intrusion Detection System using Machine LearningNetwork Intrusion Detection System using Machine Learning
Network Intrusion Detection System using Machine LearningIRJET Journal
 
SENTIMENT ANALYSIS FOR MOVIES REVIEWS DATASET USING DEEP LEARNING MODELS
SENTIMENT ANALYSIS FOR MOVIES REVIEWS DATASET USING DEEP LEARNING MODELSSENTIMENT ANALYSIS FOR MOVIES REVIEWS DATASET USING DEEP LEARNING MODELS
SENTIMENT ANALYSIS FOR MOVIES REVIEWS DATASET USING DEEP LEARNING MODELSIJDKP
 
A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...IOSR Journals
 
DATA AUGMENTATION TECHNIQUES AND TRANSFER LEARNING APPROACHES APPLIED TO FACI...
DATA AUGMENTATION TECHNIQUES AND TRANSFER LEARNING APPROACHES APPLIED TO FACI...DATA AUGMENTATION TECHNIQUES AND TRANSFER LEARNING APPROACHES APPLIED TO FACI...
DATA AUGMENTATION TECHNIQUES AND TRANSFER LEARNING APPROACHES APPLIED TO FACI...ijaia
 
IRJET-Breast Cancer Detection using Convolution Neural Network
IRJET-Breast Cancer Detection using Convolution Neural NetworkIRJET-Breast Cancer Detection using Convolution Neural Network
IRJET-Breast Cancer Detection using Convolution Neural NetworkIRJET Journal
 
ON THE PERFORMANCE OF INTRUSION DETECTION SYSTEMS WITH HIDDEN MULTILAYER NEUR...
ON THE PERFORMANCE OF INTRUSION DETECTION SYSTEMS WITH HIDDEN MULTILAYER NEUR...ON THE PERFORMANCE OF INTRUSION DETECTION SYSTEMS WITH HIDDEN MULTILAYER NEUR...
ON THE PERFORMANCE OF INTRUSION DETECTION SYSTEMS WITH HIDDEN MULTILAYER NEUR...IJCNCJournal
 
On The Performance of Intrusion Detection Systems with Hidden Multilayer Neur...
On The Performance of Intrusion Detection Systems with Hidden Multilayer Neur...On The Performance of Intrusion Detection Systems with Hidden Multilayer Neur...
On The Performance of Intrusion Detection Systems with Hidden Multilayer Neur...IJCNCJournal
 
Scientific
Scientific Scientific
Scientific marpierc
 
IRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET- A Survey on Medical Image Interpretation for Predicting PneumoniaIRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET- A Survey on Medical Image Interpretation for Predicting PneumoniaIRJET Journal
 
1645 goldenberg using our laptop
1645 goldenberg using our laptop1645 goldenberg using our laptop
1645 goldenberg using our laptopRising Media, Inc.
 

Similar to Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions (20)

[IJET V2I3P11] Authors: Payal More, Rohini Pandit, Supriya Makude, Harsh Nirb...
[IJET V2I3P11] Authors: Payal More, Rohini Pandit, Supriya Makude, Harsh Nirb...[IJET V2I3P11] Authors: Payal More, Rohini Pandit, Supriya Makude, Harsh Nirb...
[IJET V2I3P11] Authors: Payal More, Rohini Pandit, Supriya Makude, Harsh Nirb...
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Poster (1)
Poster (1)Poster (1)
Poster (1)
 
Metaphorical Analysis of diseases in Tomato leaves using Deep Learning Algori...
Metaphorical Analysis of diseases in Tomato leaves using Deep Learning Algori...Metaphorical Analysis of diseases in Tomato leaves using Deep Learning Algori...
Metaphorical Analysis of diseases in Tomato leaves using Deep Learning Algori...
 
Performance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and MindsporePerformance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and Mindspore
 
Automatic Selection of Open Source Multimedia Softwares Using Error Back-Prop...
Automatic Selection of Open Source Multimedia Softwares Using Error Back-Prop...Automatic Selection of Open Source Multimedia Softwares Using Error Back-Prop...
Automatic Selection of Open Source Multimedia Softwares Using Error Back-Prop...
 
1st review android malware.pptx
1st review  android malware.pptx1st review  android malware.pptx
1st review android malware.pptx
 
The MADlib Analytics Library
The MADlib Analytics Library The MADlib Analytics Library
The MADlib Analytics Library
 
Network Intrusion Detection System using Machine Learning
Network Intrusion Detection System using Machine LearningNetwork Intrusion Detection System using Machine Learning
Network Intrusion Detection System using Machine Learning
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
SENTIMENT ANALYSIS FOR MOVIES REVIEWS DATASET USING DEEP LEARNING MODELS
SENTIMENT ANALYSIS FOR MOVIES REVIEWS DATASET USING DEEP LEARNING MODELSSENTIMENT ANALYSIS FOR MOVIES REVIEWS DATASET USING DEEP LEARNING MODELS
SENTIMENT ANALYSIS FOR MOVIES REVIEWS DATASET USING DEEP LEARNING MODELS
 
A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...
 
DATA AUGMENTATION TECHNIQUES AND TRANSFER LEARNING APPROACHES APPLIED TO FACI...
DATA AUGMENTATION TECHNIQUES AND TRANSFER LEARNING APPROACHES APPLIED TO FACI...DATA AUGMENTATION TECHNIQUES AND TRANSFER LEARNING APPROACHES APPLIED TO FACI...
DATA AUGMENTATION TECHNIQUES AND TRANSFER LEARNING APPROACHES APPLIED TO FACI...
 
FC Brochure & Insert
FC Brochure & InsertFC Brochure & Insert
FC Brochure & Insert
 
IRJET-Breast Cancer Detection using Convolution Neural Network
IRJET-Breast Cancer Detection using Convolution Neural NetworkIRJET-Breast Cancer Detection using Convolution Neural Network
IRJET-Breast Cancer Detection using Convolution Neural Network
 
ON THE PERFORMANCE OF INTRUSION DETECTION SYSTEMS WITH HIDDEN MULTILAYER NEUR...
ON THE PERFORMANCE OF INTRUSION DETECTION SYSTEMS WITH HIDDEN MULTILAYER NEUR...ON THE PERFORMANCE OF INTRUSION DETECTION SYSTEMS WITH HIDDEN MULTILAYER NEUR...
ON THE PERFORMANCE OF INTRUSION DETECTION SYSTEMS WITH HIDDEN MULTILAYER NEUR...
 
On The Performance of Intrusion Detection Systems with Hidden Multilayer Neur...
On The Performance of Intrusion Detection Systems with Hidden Multilayer Neur...On The Performance of Intrusion Detection Systems with Hidden Multilayer Neur...
On The Performance of Intrusion Detection Systems with Hidden Multilayer Neur...
 
Scientific
Scientific Scientific
Scientific
 
IRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET- A Survey on Medical Image Interpretation for Predicting PneumoniaIRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
 
1645 goldenberg using our laptop
1645 goldenberg using our laptop1645 goldenberg using our laptop
1645 goldenberg using our laptop
 

More from Valery Tkachenko

Evolution of public chemistry databases: past and the future
Evolution of public chemistry databases: past and the futureEvolution of public chemistry databases: past and the future
Evolution of public chemistry databases: past and the futureValery Tkachenko
 
In silico design of new functional materials
In silico design of new functional materialsIn silico design of new functional materials
In silico design of new functional materialsValery Tkachenko
 
Metal-organic frameworks: from database to supramolecular effects in complexa...
Metal-organic frameworks: from database to supramolecular effects in complexa...Metal-organic frameworks: from database to supramolecular effects in complexa...
Metal-organic frameworks: from database to supramolecular effects in complexa...Valery Tkachenko
 
Abstract recommendation system: beyond word-level representations
Abstract recommendation system: beyond word-level representationsAbstract recommendation system: beyond word-level representations
Abstract recommendation system: beyond word-level representationsValery Tkachenko
 
Machine learning methods for chemical properties and toxicity based endpoints
Machine learning methods for chemical properties and toxicity based endpointsMachine learning methods for chemical properties and toxicity based endpoints
Machine learning methods for chemical properties and toxicity based endpointsValery Tkachenko
 
Chemical workflows supporting automated research data collection
Chemical workflows supporting automated research data collectionChemical workflows supporting automated research data collection
Chemical workflows supporting automated research data collectionValery Tkachenko
 
Need and benefits for structure standardization to facilitate integration and...
Need and benefits for structure standardization to facilitate integration and...Need and benefits for structure standardization to facilitate integration and...
Need and benefits for structure standardization to facilitate integration and...Valery Tkachenko
 
Development and comparison of deep learning toolkit with other machine learni...
Development and comparison of deep learning toolkit with other machine learni...Development and comparison of deep learning toolkit with other machine learni...
Development and comparison of deep learning toolkit with other machine learni...Valery Tkachenko
 
Living in a world of federated knowledge challenges, principles, tools and ...
Living in a world of federated knowledge   challenges, principles, tools and ...Living in a world of federated knowledge   challenges, principles, tools and ...
Living in a world of federated knowledge challenges, principles, tools and ...Valery Tkachenko
 
Using the structured product labeling format to index versatile chemical data
Using the structured product labeling format to index versatile chemical dataUsing the structured product labeling format to index versatile chemical data
Using the structured product labeling format to index versatile chemical dataValery Tkachenko
 
Tools and approaches for data deposition into nanomaterial databases
Tools and approaches for data deposition into nanomaterial databasesTools and approaches for data deposition into nanomaterial databases
Tools and approaches for data deposition into nanomaterial databasesValery Tkachenko
 
Chemistry Validation and Standardization Platform v2.0
Chemistry Validation and Standardization Platform v2.0Chemistry Validation and Standardization Platform v2.0
Chemistry Validation and Standardization Platform v2.0Valery Tkachenko
 
Open Science Data Repository - the platform for materials research
Open Science Data Repository - the platform for materials researchOpen Science Data Repository - the platform for materials research
Open Science Data Repository - the platform for materials researchValery Tkachenko
 
Opportunities in chemical structure standardization
Opportunities in chemical structure standardizationOpportunities in chemical structure standardization
Opportunities in chemical structure standardizationValery Tkachenko
 
OpenPHACTS - Chemistry Platform Update and Learnings
OpenPHACTS - Chemistry Platform Update and LearningsOpenPHACTS - Chemistry Platform Update and Learnings
OpenPHACTS - Chemistry Platform Update and LearningsValery Tkachenko
 
Evolution of open chemical information
Evolution of open chemical informationEvolution of open chemical information
Evolution of open chemical informationValery Tkachenko
 
OMPOL – visualisation of large chemical spaces
OMPOL – visualisation of large chemical spacesOMPOL – visualisation of large chemical spaces
OMPOL – visualisation of large chemical spacesValery Tkachenko
 
Not just another reaction database
Not just another reaction databaseNot just another reaction database
Not just another reaction databaseValery Tkachenko
 
Implementing chemistry platform for OpenPHACTS
Implementing chemistry platform for OpenPHACTSImplementing chemistry platform for OpenPHACTS
Implementing chemistry platform for OpenPHACTSValery Tkachenko
 
Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Valery Tkachenko
 

More from Valery Tkachenko (20)

Evolution of public chemistry databases: past and the future
Evolution of public chemistry databases: past and the futureEvolution of public chemistry databases: past and the future
Evolution of public chemistry databases: past and the future
 
In silico design of new functional materials
In silico design of new functional materialsIn silico design of new functional materials
In silico design of new functional materials
 
Metal-organic frameworks: from database to supramolecular effects in complexa...
Metal-organic frameworks: from database to supramolecular effects in complexa...Metal-organic frameworks: from database to supramolecular effects in complexa...
Metal-organic frameworks: from database to supramolecular effects in complexa...
 
Abstract recommendation system: beyond word-level representations
Abstract recommendation system: beyond word-level representationsAbstract recommendation system: beyond word-level representations
Abstract recommendation system: beyond word-level representations
 
Machine learning methods for chemical properties and toxicity based endpoints
Machine learning methods for chemical properties and toxicity based endpointsMachine learning methods for chemical properties and toxicity based endpoints
Machine learning methods for chemical properties and toxicity based endpoints
 
Chemical workflows supporting automated research data collection
Chemical workflows supporting automated research data collectionChemical workflows supporting automated research data collection
Chemical workflows supporting automated research data collection
 
Need and benefits for structure standardization to facilitate integration and...
Need and benefits for structure standardization to facilitate integration and...Need and benefits for structure standardization to facilitate integration and...
Need and benefits for structure standardization to facilitate integration and...
 
Development and comparison of deep learning toolkit with other machine learni...
Development and comparison of deep learning toolkit with other machine learni...Development and comparison of deep learning toolkit with other machine learni...
Development and comparison of deep learning toolkit with other machine learni...
 
Living in a world of federated knowledge challenges, principles, tools and ...
Living in a world of federated knowledge   challenges, principles, tools and ...Living in a world of federated knowledge   challenges, principles, tools and ...
Living in a world of federated knowledge challenges, principles, tools and ...
 
Using the structured product labeling format to index versatile chemical data
Using the structured product labeling format to index versatile chemical dataUsing the structured product labeling format to index versatile chemical data
Using the structured product labeling format to index versatile chemical data
 
Tools and approaches for data deposition into nanomaterial databases
Tools and approaches for data deposition into nanomaterial databasesTools and approaches for data deposition into nanomaterial databases
Tools and approaches for data deposition into nanomaterial databases
 
Chemistry Validation and Standardization Platform v2.0
Chemistry Validation and Standardization Platform v2.0Chemistry Validation and Standardization Platform v2.0
Chemistry Validation and Standardization Platform v2.0
 
Open Science Data Repository - the platform for materials research
Open Science Data Repository - the platform for materials researchOpen Science Data Repository - the platform for materials research
Open Science Data Repository - the platform for materials research
 
Opportunities in chemical structure standardization
Opportunities in chemical structure standardizationOpportunities in chemical structure standardization
Opportunities in chemical structure standardization
 
OpenPHACTS - Chemistry Platform Update and Learnings
OpenPHACTS - Chemistry Platform Update and LearningsOpenPHACTS - Chemistry Platform Update and Learnings
OpenPHACTS - Chemistry Platform Update and Learnings
 
Evolution of open chemical information
Evolution of open chemical informationEvolution of open chemical information
Evolution of open chemical information
 
OMPOL – visualisation of large chemical spaces
OMPOL – visualisation of large chemical spacesOMPOL – visualisation of large chemical spaces
OMPOL – visualisation of large chemical spaces
 
Not just another reaction database
Not just another reaction databaseNot just another reaction database
Not just another reaction database
 
Implementing chemistry platform for OpenPHACTS
Implementing chemistry platform for OpenPHACTSImplementing chemistry platform for OpenPHACTS
Implementing chemistry platform for OpenPHACTS
 
Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...
 

Recently uploaded

Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRlizamodels9
 
Solution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsSolution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsHajira Mahmood
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |aasikanpl
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
TOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptxTOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptxdharshini369nike
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptArshadWarsi13
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett SquareIsiahStephanRadaza
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaPraksha3
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
Forest laws, Indian forest laws, why they are important
Forest laws, Indian forest laws, why they are importantForest laws, Indian forest laws, why they are important
Forest laws, Indian forest laws, why they are importantadityabhardwaj282
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 

Recently uploaded (20)

Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
Solution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsSolution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutions
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
TOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptxTOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptx
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.ppt
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett Square
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
Forest laws, Indian forest laws, why they are important
Forest laws, Indian forest laws, why they are importantForest laws, Indian forest laws, why they are important
Forest laws, Indian forest laws, why they are important
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 

Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions

  • 1. DEEP LEARNING ON NVIDIA GPUS FOR QSAR, QSPR AND QNAR PREDICTIONS Boris Sattarov1,2, Artem Mitrofanov1,3, Alexandru Korotcov1, Valery Tkachenko1 1SCIENCE DATA SOFTWARE, LLC, Rockville, MD, United States 2Chemistry Department, Kazan State University, Kazan, Russian Federation 3Chemistry Department, Moscow State University, Moscow, Russian Federation Abstract While we have seen a tremendous growth in machine learning methods over the last two decades there is still no one fits all solution. The next era of cheminformatics and pharmaceutical research in general is focused on mining the heterogeneous big data, which is accumulating at ever growing pace, and this will likely use more sophisticated algorithms such as Deep Learning (DL). There has been increasing use of DL recently which has shown powerful advantages in learning from images and languages as well as many other areas. However the accessibly of this technique for cheminformatics is hindered as it is not available readily to non-experts. It was therefore our goal to develop a DL framework embedded into a general research data management platform (Open Science Data Repository) which can be used as an API, standalone tool or integrated in new software as an autonomous module. In this work we are sharing the results of comparing performance of classic machine learning methods (Naïve Bayes, Logistic Regression, Support Vector Machines etc.) with Deep Learning and discussing challenges associated with Deep Learning Neural Networks (DNN). The DNN learning models of different complexity (up to 8 hidden layers) were built and tuned (different number of hidden units per layer, multiple activation functions, optimizers, drop out fraction, regularization parameters, and learning rate) using Keras (https://keras.io/) and TensorFlow (www.tensorflow.org) and applied to various use cases connected to prediction of physicochemical properties, ADME, toxicity and calculating properties of materials. It was also shown that using NVIDIA GPUs significantly accelerates calculations, although memory consumption puts some limits on performance and applicability of standard toolkits 'as is’. Summary The lack of a commercial machine learning platform that is accessible for users to build, validate and test machine learning models based on their own data, is holding back smaller organizations and academic groups from using these approaches. The OSDR is built to seamlessly integrate chemical operations, datasets manipulation, and machine learning models (e.g., DL, as well as classical ML) within one framework. As deep learning methods have not been widely assessed using prospective validation we have used an approach of taking previously published and novel data input in software platforms, then enable building of models, and evaluate. The datasets used in this study represent whole cell screens, individual proteins, physicochemical properties as well as a dataset with a complex endpoint. We utilized an array of model performance evaluation metrics including AUC, F1 score, Cohen’s kappa, Matthews correlation coefficient and others. SVM models outperformed the other classic ML models, while DNN performed slightly better than SVM in most cases. The results of this study show there is still work to be done to improve the models. Also it is apparent that some of the metrics are less sensitive than others, and it might be more useful to test a wide set of metrics rather than traditional AUC. It is worth to mention that the NVIDIA GPUs allow significantly accelerates computations especially allowing DNN models hyperparametes search within reasonable time frame and final models training. Acknowledgements We thank our collaborators Sean Ekins and Daniel P. Russo from Collaborations Pharmaceuticals who provided insight and expertise that greatly assisted with this study. We also thank the entire Science Data Software team for great support and hard work on implementation of Machine learning pipelines in Open Science Data Repository Software. Machine learning Two general prediction pipelines were developed and implemented in OSDR. The first pipeline solely built using only classic Machine Learning (CML) methods, such as Bernoulli Naive Bayes, Linear Logistic Regression, AdaBoost Decision Tree, Random Forest, and Support Vector Machine. Open source Scikit-learn (http://scikit-learn.org/stable/, CPU for training and prediction) ML python library was used for building, tuning, and validating all CML models included in this pipeline. The second pipeline was built using Deep Neural Networks (DNN) learning models of different complexity using Keras (https://keras.io/), a deep learning library, and TensorFlow(www.tensorflow.org, GPU training and CPU for prediction) as a backend. The developed pipeline consists of a random splitting of the input dataset into training (80%) and test (20%) datasets, while maintaining equal proportions of active to inactive class ratios in each split (stratified splitting). Thus, all the models’ tuning and hyper parameters search were conducted solely through 4-fold cross validation on training data for better model generalization. Figure 1. A 4-layer neural network with four inputs, three hidden layers of 4 neurons each and one output layer. Notice that connections are between neurons across layers, but not within a layer. Deep Neural Networks Deep artificial neural networks solves the central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. N- layer neural networks are shown in Figure 1. DNN hyperparameters Two basic approaches to avoid DNN model overfitting are used in training including L2 norm and drop out regularization for all hidden layers. The following hyperparameters optimization was performed using a 3 hidden layers DNN (Keras with TensorFlow backend) and the grid search method from Scikit-learn: • optimization algorithm: SGD, Adam, Nadam. • learning rate: 0.05, 0.025, 0.01, 0.001 • network weight initialization: uniform, lecun_uniform, normal, glorot_normal, he_normal • hidden layers activation function: relu, tanh, LeakyReLU, SReLU • output function: softmax, softplus, sigmoid • L2 regularization: 0.05, 0.01, 0.005, 0.001, 0.0001 • dropout regularization: 0.2, 0.3, 0.5, 0.8 • number of nodes in hidden layer (all hidden layers): 512, 1024, 2048, 4096 The following hyperparameters were used for further DNN training: SGD, learning rate 0.01 (automatically 10% reduced on plateu of 50 epochs), weight initialization he_normal, hidden layers activation SReLU, output layer function sigmoid, L2 regularization 0.001, dropout is 0.5. The binary crossentropy was used as a loss function. An early training termination were implemented by stopping training if no change in loss were observed after 200 epochs. The number of hidden nodes in all hidden layers were set equal to number of input features (number of bins in fingerprints). Computing resources: Single dual-processor, quad-core (Intel E5640) server running CentOS 7 with 96GB memory and two NVIDIA Tesla K20c GPU. Datasets and Descriptors Diverse drug discovery datasets that are publicly available for different type of activity prediction were used to develop prediction pipelines. The same datasets have been used in Clark et al. (J Chem Inf Model 2015) for exploring applicability of an array of Bayesian models for ADME/Tox and other physicochemical properties prediction. In the current study the FCFP6 fingerprints, 1024 bins datasets were computed from SDF files using RDKit (http://www.rdkit.org/). Data analysis In this study, several traditional measurements of model performance were used, including precision, recall, F1-Score, accuracy, Receiver Operating Characteristic (ROC) curve and the area under it (AUC), Cohen’s Kappa coefficient, and the Matthews correlation coefficient. Table 1. Representative datasets used for evaluating multiple computational methods for activity chemical properties prediction. Model Datasets used and references Cutoff for active Number of molecules and ratio Solubility Huuskonen J. J Chem Inf Comput Sci 2000 Log solubility = −5 1144 active, 155 inactive, ratio 7.38 Chagas disease (Typanosoma cruzi) Pubchem BioAssay: AID 2044 with EC50 <1 μM, >10-fold difference in cytotoxicity 1692 active, 2363 inactive, ratio 0.72 Figure 2. Representative polar plots of the model evaluation metrics for the Solubility and Chagas disease training and testing datasets. BNB - Bernoulli Naive Bayes, LLR - Logistic linear regression, ABDT - AdaBoost Decision Trees, RF - Random Forest, SVM - Support Vector Machines, DNN-N (N is number of hidden layers). Solubility Chagas disease Science Data Software