Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions
Boris Sattarov1,2, Artem Mitrofanov1,3, Alexandru Korotcov1, Valery Tkachenko1
1SCIENCE DATA SOFTWARE, LLC, Rockville, MD, United States
2Chemistry Department, Kazan State University, Kazan, Russian Federation
3Chemistry Department, Moscow State University, Moscow, Russian Federation
Abstract
While we have seen tremendous growth in machine learning methods over the last two
decades, there is still no one-size-fits-all solution. The next era of cheminformatics, and of
pharmaceutical research in general, is focused on mining heterogeneous big data, which
is accumulating at an ever-growing pace, and this will likely rely on more sophisticated
algorithms such as Deep Learning (DL). DL has seen increasing use recently and has shown
powerful advantages in learning from images and language, as well as in many other areas.
However, the accessibility of this technique for cheminformatics is hindered because it is not
readily available to non-experts. It was therefore our goal to develop a DL framework
embedded into a general research data management platform (Open Science Data
Repository) that can be used as an API, as a standalone tool, or integrated into new software
as an autonomous module.
In this work we share the results of comparing the performance of classic machine
learning methods (Naïve Bayes, Logistic Regression, Support Vector Machines, etc.) with
Deep Learning, and discuss challenges associated with Deep Neural Networks
(DNN). DNN models of different complexity (up to 8 hidden layers) were built
and tuned (varying the number of hidden units per layer, activation functions,
optimizers, dropout fraction, regularization parameters, and learning rate) using Keras
(https://keras.io/) and TensorFlow (www.tensorflow.org), and applied to various use cases
connected to the prediction of physicochemical properties, ADME, toxicity, and the
calculation of materials properties. We also show that NVIDIA GPUs significantly accelerate
the calculations, although memory consumption puts some limits on the performance and
applicability of standard toolkits 'as is'.
Summary
The lack of a commercial machine learning platform that lets users build, validate, and
test machine learning models on their own data is holding back smaller organizations
and academic groups from using these approaches.
The OSDR is built to seamlessly integrate chemical operations, dataset manipulation, and
machine learning models (e.g., DL as well as classical ML) within one framework. As deep
learning methods have not been widely assessed using prospective validation, we took
previously published and novel data, loaded it into the software platform, built models,
and evaluated them.
The datasets used in this study represent whole cell screens, individual proteins,
physicochemical properties as well as a dataset with a complex endpoint. We utilized an
array of model performance evaluation metrics including AUC, F1 score, Cohen’s kappa,
Matthews correlation coefficient, and others. SVM models outperformed the other classic
ML models, while DNN performed slightly better than SVM in most cases. The results of this
study show there is still work to be done to improve the models. It is also apparent that
some metrics are less sensitive than others, and it may be more useful to evaluate a
wide set of metrics rather than relying on the traditional AUC alone.
It is also worth mentioning that NVIDIA GPUs significantly accelerate the computations,
in particular making it possible to run the DNN hyperparameter search and final model
training within a reasonable time frame.
Acknowledgements
We thank our collaborators Sean Ekins and Daniel P. Russo from Collaborations Pharmaceuticals, who provided insight
and expertise that greatly assisted this study.
We also thank the entire Science Data Software team for their great support and hard work on the implementation of the
machine learning pipelines in the Open Science Data Repository software.
Machine learning
Two general prediction pipelines were developed and implemented in OSDR. The first
pipeline was built solely with classic Machine Learning (CML) methods: Bernoulli
Naive Bayes, Linear Logistic Regression, AdaBoost Decision Tree, Random Forest, and Support
Vector Machine. The open-source Scikit-learn Python ML library (http://scikit-learn.org/stable/,
CPU for training and prediction) was used for building, tuning, and validating all CML models
in this pipeline. The second pipeline was built with Deep Neural Network (DNN)
models of different complexity using Keras (https://keras.io/), a deep learning library,
with TensorFlow (www.tensorflow.org, GPU for training and CPU for prediction) as a backend.
Both pipelines randomly split the input dataset into training (80%) and
test (20%) sets while maintaining equal active-to-inactive class ratios in
each split (stratified splitting). All model tuning and hyperparameter searches
were conducted solely through 4-fold cross-validation on the training data for better model
generalization.
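As an illustration, the stratified 80/20 split described above can be sketched in plain Python. The function name is illustrative, and the class sizes are taken from the Chagas dataset in Table 1; the actual pipeline would use Scikit-learn's own stratified splitting utilities.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), preserving per-class ratios in each split."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)  # 20% of each class to the test set
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Class sizes from the Chagas disease dataset (Table 1): 1692 active, 2363 inactive
labels = [1] * 1692 + [0] * 2363
train_idx, test_idx = stratified_split(labels)
```

Because the split is done per class, the active-to-inactive ratio in the test set matches the full dataset, which keeps evaluation metrics comparable across splits.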
Figure 1. A 4-layer neural network with four inputs, three hidden layers of 4 neurons each
and one output layer. Notice that connections are between neurons across layers, but not
within a layer.
Deep Neural Networks
Deep artificial neural networks solve the central problem in representation learning by
introducing representations that are expressed in terms of other, simpler representations.
An N-layer neural network is shown in Figure 1.
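To make the layered structure concrete, a minimal forward pass through the network of Figure 1 (four inputs, three hidden layers of 4 neurons, one output) can be sketched in pure Python. The weights are random placeholders and the helper names are illustrative; this is not the trained model from the study.

```python
import math
import random

def dense(x, W, b, act):
    # y_j = act(sum_i x_i * W[i][j] + b_j): a fully connected layer
    return [act(sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j])
            for j in range(len(b))]

relu = lambda z: max(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

random.seed(0)
def init(n_in, n_out):
    W = [[random.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]
    return W, [0.0] * n_out

# 4 inputs, three hidden layers of 4 neurons each, 1 output (as in Figure 1)
sizes = [4, 4, 4, 4, 1]
layers = [init(a, b) for a, b in zip(sizes, sizes[1:])]

x = [0.2, -0.1, 0.5, 0.3]
for k, (W, b) in enumerate(layers):
    act = sigmoid if k == len(layers) - 1 else relu  # sigmoid only on the output
    x = dense(x, W, b, act)
```

Note that each neuron connects only to neurons in the adjacent layers, exactly as the caption of Figure 1 describes.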
DNN hyperparameters
Two basic approaches were used during training to avoid DNN model overfitting: L2-norm
and dropout regularization on all hidden layers. The following hyperparameter optimization
was performed on a 3-hidden-layer DNN (Keras with TensorFlow backend) using the grid
search method from Scikit-learn:
• optimization algorithm: SGD, Adam, Nadam
• learning rate: 0.05, 0.025, 0.01, 0.001
• network weight initialization: uniform, lecun_uniform, normal, glorot_normal, he_normal
• hidden layers activation function: relu, tanh, LeakyReLU, SReLU
• output function: softmax, softplus, sigmoid
• L2 regularization: 0.05, 0.01, 0.005, 0.001, 0.0001
• dropout regularization: 0.2, 0.3, 0.5, 0.8
• number of nodes in hidden layer (all hidden layers): 512, 1024, 2048, 4096
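For scale, the grid above can be enumerated with itertools.product; the dictionary keys here are illustrative labels for the bullet points, not Scikit-learn or Keras parameter names.

```python
from itertools import product

# Hyperparameter grid mirroring the bullet list above (illustrative key names)
grid = {
    "optimizer": ["SGD", "Adam", "Nadam"],
    "learning_rate": [0.05, 0.025, 0.01, 0.001],
    "weight_init": ["uniform", "lecun_uniform", "normal", "glorot_normal", "he_normal"],
    "hidden_activation": ["relu", "tanh", "LeakyReLU", "SReLU"],
    "output_activation": ["softmax", "softplus", "sigmoid"],
    "l2": [0.05, 0.01, 0.005, 0.001, 0.0001],
    "dropout": [0.2, 0.3, 0.5, 0.8],
    "hidden_units": [512, 1024, 2048, 4096],
}

keys = list(grid)
combos = [dict(zip(keys, vals)) for vals in product(*grid.values())]
# 3 * 4 * 5 * 4 * 3 * 5 * 4 * 4 = 57,600 candidate configurations
```

The combinatorial size of even this modest grid (57,600 configurations, each requiring cross-validated training) is why GPU acceleration matters for the hyperparameter search.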
The following hyperparameters were used for further DNN training: SGD optimizer with
learning rate 0.01 (automatically reduced by 10% on a plateau of 50 epochs), he_normal
weight initialization, SReLU hidden-layer activation, sigmoid output function, L2
regularization 0.001, and dropout 0.5. Binary cross-entropy was used as the loss function.
Early termination was implemented by stopping training if no change in loss was observed
for 200 epochs. The number of hidden nodes in all hidden layers was set equal to the
number of input features (the number of bins in the fingerprints).
Computing resources:
A single dual-processor, quad-core (Intel E5640) server running CentOS 7 with 96 GB of
memory and two NVIDIA Tesla K20c GPUs.
Datasets and Descriptors
Diverse, publicly available drug discovery datasets for different types of activity
prediction were used to develop the prediction pipelines. The same datasets were used in
Clark et al. (J Chem Inf Model 2015) to explore the applicability of an array of Bayesian
models for predicting ADME/Tox and other physicochemical properties. In the current study,
FCFP6 fingerprint (1024-bin) datasets were computed from SDF files using RDKit
(http://www.rdkit.org/).
Data analysis
In this study, several traditional measurements of model performance were used, including
precision, recall, F1-Score, accuracy, Receiver Operating Characteristic (ROC) curve and the
area under it (AUC), Cohen’s Kappa coefficient, and the Matthews correlation coefficient.
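Several of these metrics follow directly from the 2x2 confusion matrix. The sketch below (illustrative helper name, not the evaluation code used in OSDR) shows how precision, recall, F1, accuracy, Matthews correlation coefficient, and Cohen's kappa are computed from the raw counts.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute classification metrics from a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / n
    # Matthews correlation coefficient: balanced even for skewed class ratios
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_obs = accuracy
    p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mcc": mcc, "kappa": kappa}

m = binary_metrics(tp=40, fp=20, fn=10, tn=30)
```

Unlike accuracy, MCC and kappa correct for chance and class imbalance, which is one reason a wide set of metrics is more informative than AUC alone for the imbalanced datasets in Table 1.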
Table 1. Representative datasets used for evaluating multiple computational methods for
chemical activity and property prediction.

Model | Dataset used and reference | Cutoff for active | Number of molecules and ratio
Solubility | Huuskonen J., J Chem Inf Comput Sci 2000 | Log solubility = −5 | 1144 active, 155 inactive, ratio 7.38
Chagas disease (Trypanosoma cruzi) | PubChem BioAssay AID 2044 | EC50 < 1 μM, >10-fold difference in cytotoxicity | 1692 active, 2363 inactive, ratio 0.72
Figure 2. Representative polar plots of the model evaluation metrics for the Solubility and
Chagas disease training and testing datasets. BNB - Bernoulli Naive Bayes, LLR - Linear
Logistic Regression, ABDT - AdaBoost Decision Trees, RF - Random Forest, SVM - Support
Vector Machines, DNN-N (N is the number of hidden layers).