The document summarizes research using disease association data from Open Targets to predict novel drug targets. A positive-unlabeled learning approach was used to train classifiers on features from Open Targets data. The best-performing neural network achieved 71% accuracy on the test set. Predictions were validated through literature mining and showed targets with clearer disease links had higher predictive scores. While limitations exist, the results demonstrate machine learning can aid drug target discovery by predicting targets from gene-disease association data.
Prediction of novel targets using disease association data
1. Prediction of novel targets using disease association data from Open Targets
Enrico Ferrero, PhD, Associate GSK Fellow
Scientific Leader, Computational Biology, Target Sciences
GSK
BioData World Congress
03.11.2017
@enricoferrero
2. Data + AI = drugs?
BBC News, 2017 Nature Biotechnology, 2017
3. The pharma AI space is getting crowded
6. Rethink the drug discovery pipeline
Manhattan Institute, 2012
▪ Late phase failures cost (a lot) more
▪ Spend more time and resources in target discovery
▪ Reduce attrition in later phases
7. But how do we find good targets?
Nelson et al., Nat Genet, 2015
9. Could it be as easy as spotting spam emails?
▪ Is it possible to predict novel therapeutic targets using available
gene – disease association data?
▪ Is Open Targets just a catalogue of gene – disease associations
or can we learn from it what makes a good target?
10. A positive – unlabelled (PU) semi-supervised learning approach
▪ Obtain all gene – disease associations and supporting evidence from the Open
Targets platform. For each gene, create one numeric feature per evidence type
by taking the mean score across all diseases:
▪ Genetic associations (germline)
▪ Somatic mutations
▪ Significant gene expression changes
▪ Disease-relevant phenotype in animal model
▪ Pathway-level evidence
▪ Gather positive labels from Pharmaprojects, considering only targets with
drugs currently on the market, in clinical trials or in preclinical studies.
Because only positive labels are available, a semi-supervised framework is
used: targets listed in Pharmaprojects constitute the positive class (P), while
the rest of the proteome forms the unlabelled class (U), which contains both
negatives and yet-to-be-discovered positives.
▪ All positive cases (1421) and an equal number of randomly selected
unlabelled cases (2842 cases in total) are set aside and split into training
(80%) and test (20%) sets. The remaining unlabelled cases are kept as a
prediction set, to which the final model is applied.
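The data preparation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the evidence-type names, the toy records, the encoding of missing evidence as 0.0, and the helper names `build_features` and `pu_split` are all hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical evidence-type names and toy records (gene, disease, evidence type, score).
EVIDENCE_TYPES = ["genetic_association", "somatic_mutation", "rna_expression",
                  "animal_model", "affected_pathway"]
ASSOCIATIONS = [
    ("GENE_A", "asthma", "genetic_association", 0.75),
    ("GENE_A", "copd", "genetic_association", 0.25),
    ("GENE_A", "asthma", "animal_model", 0.5),
    ("GENE_B", "melanoma", "somatic_mutation", 1.0),
]


def build_features(associations):
    """Average each evidence type across diseases, giving one numeric vector per gene."""
    scores = defaultdict(lambda: defaultdict(list))
    for gene, _disease, evidence, score in associations:
        scores[gene][evidence].append(score)
    # Assumption: an evidence type with no records for a gene is encoded as 0.0.
    return {
        gene: [mean(by_type[e]) if e in by_type else 0.0 for e in EVIDENCE_TYPES]
        for gene, by_type in scores.items()
    }


def pu_split(genes, positives, train_fraction=0.8, seed=0):
    """Return (train, test, prediction_set) for a positive-unlabelled setup."""
    rng = random.Random(seed)
    pos = sorted(g for g in genes if g in positives)
    unl = sorted(g for g in genes if g not in positives)
    sampled = rng.sample(unl, len(pos))  # as many unlabelled cases as positives
    labelled = [(g, 1) for g in pos] + [(g, 0) for g in sampled]
    rng.shuffle(labelled)
    cut = round(train_fraction * len(labelled))
    held_out = set(sampled)
    # Unlabelled genes not sampled for training or testing form the prediction set.
    prediction_set = [g for g in unl if g not in held_out]
    return labelled[:cut], labelled[cut:], prediction_set
```

Calling `build_features(ASSOCIATIONS)` yields one five-element vector per gene, and `pu_split` returns labelled training and test pairs plus the leftover unlabelled genes. The split here is a plain shuffle rather than a stratified one, which is a simplification.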
11. Finding structure and most important features
t-SNE dimensionality reduction reveals structured observations
Most important features according to chi-squared test and information gain
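As an illustration of the information gain criterion named above, here is a minimal sketch for a single feature and binary labels. The slides do not say how continuous features were discretised, so splitting at a fixed threshold is an assumption, and `entropy` and `information_gain` are hypothetical helper names.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(values, labels, threshold):
    """Reduction in label entropy after splitting one feature at a threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    conditional = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - conditional
```

A feature that separates the classes perfectly at the threshold scores 1 bit on balanced labels, while a feature unrelated to the labels scores close to 0. Ranking features by this value gives one possible importance ordering.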
12. Nested cross-validation and bagging for tuning and model selection
Bischl et al., 2012
Wikipedia
Four classifiers are independently tuned, trained and tested on the training
set using a nested cross-validation strategy (4 inner rounds for parameter
tuning and 4 outer rounds to assess performance):
▪ Random forest
▪ Feed-forward neural network with single hidden layer
▪ Support vector machine with radial kernel
▪ Gradient boosting machine with AdaBoost exponential loss
function
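The nested cross-validation strategy can be sketched as below: the inner loop picks a parameter value, and the outer loop estimates the performance of the tuned model on data the tuning never saw. This is a sketch under assumptions: a small k-nearest-neighbour classifier stands in for the four classifiers listed above, and all function names are hypothetical.

```python
import random
from statistics import mean


def k_folds(indices, k, rng):
    """Shuffle indices and deal them into k roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def knn_predict(X, y, train_idx, row, k):
    """Stand-in model: majority vote among the k nearest training rows."""
    nearest = sorted(
        train_idx,
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], row)),
    )[:k]
    return int(2 * sum(y[i] for i in nearest) > len(nearest))


def accuracy(X, y, train_idx, test_idx, k):
    """Fraction of test rows classified correctly by k-NN fitted on train_idx."""
    return mean(int(knn_predict(X, y, train_idx, X[i], k) == y[i]) for i in test_idx)


def nested_cv(X, y, param_grid, outer_k=4, inner_k=4, seed=0):
    """Return (mean outer accuracy, per-fold outer accuracies)."""
    rng = random.Random(seed)
    outer = k_folds(range(len(X)), outer_k, rng)
    outer_scores = []
    for o, outer_test in enumerate(outer):
        outer_train = [i for f, fold in enumerate(outer) if f != o for i in fold]
        inner = k_folds(outer_train, inner_k, rng)

        def inner_score(param):
            # Mean validation accuracy of one parameter value over the inner folds.
            return mean(
                accuracy(X, y,
                         [i for f, fold in enumerate(inner) if f != j for i in fold],
                         val, param)
                for j, val in enumerate(inner)
            )

        best = max(param_grid, key=inner_score)  # tuning uses inner folds only
        outer_scores.append(accuracy(X, y, outer_train, outer_test, best))
    return mean(outer_scores), outer_scores
```

For example, `nested_cv(X, y, param_grid=[1, 3, 5])` returns the mean accuracy over the 4 outer folds together with the per-fold values. Because each outer test fold is never used for tuning, the estimate is not biased by parameter selection.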
In PU learning, U contains both positive and negative cases, which makes classifiers
unstable. Bagging (bootstrap aggregating) can improve the performance of unstable
classifiers by randomly resampling P and U with replacement (bootstrap), training one
model per resample, and aggregating their predictions by majority voting:
▪ Bagging with 100 iterations was applied to the neural network, the support vector
machine and the gradient boosting machine.
▪ Random forests are already a special case of bagging.
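A minimal sketch of bagging for PU learning follows. It assumes a nearest-centroid base learner in place of the neural network, support vector machine and gradient boosting machine, and it draws as many unlabelled rows as positive rows in each bootstrap. Both choices are assumptions, as are the function names.

```python
import random


def centroid(rows):
    """Column-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_nearest_centroid(pos_rows, unl_rows):
    """Stand-in base learner: predict 1 if a row is closer to the positive centroid."""
    c_pos, c_unl = centroid(pos_rows), centroid(unl_rows)

    def predict(row):
        d_pos = sum((a - b) ** 2 for a, b in zip(row, c_pos))
        d_unl = sum((a - b) ** 2 for a, b in zip(row, c_unl))
        return int(d_pos < d_unl)

    return predict


def pu_bagging(P, U, n_iterations=100, seed=0):
    """Train one base learner per bootstrap of P and U; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_iterations):
        boot_p = rng.choices(P, k=len(P))  # sample P with replacement
        boot_u = rng.choices(U, k=len(P))  # assumption: balanced sample of U
        models.append(fit_nearest_centroid(boot_p, boot_u))

    def predict(row):
        votes = sum(model(row) for model in models)
        return int(2 * votes > len(models))

    return predict
```

Each bootstrap sees a different mix of hidden positives and true negatives from U, so individual models disagree near the class boundary. The majority vote smooths out that variation.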
13. Assessing performance and investigating results
Neural network classifier achieves 71% accuracy (0.76 AUC) on test set
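For reference, the two reported metrics can be computed as in this short sketch. The helper names are hypothetical, and the AUC uses the pairwise-ranking definition: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counting as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Accuracy depends on the decision threshold applied to the scores, whereas AUC summarises ranking quality across all thresholds, which is why the two numbers can differ. Note that in a PU setting the "negatives" in the test set are unlabelled cases, so both metrics are measured against imperfect labels.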
More advanced targets have higher disease association evidence
14. Validation of predictions with literature mining
Significant overlap between neural network predictions and text mining results (p = 5.05e-172)
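The slide does not state which test produced this p-value. One common choice for assessing the overlap between two gene sets is a one-sided hypergeometric test, sketched below as an assumption. The function name is hypothetical, and the p-value above is not reproduced here because the underlying counts are not given.

```python
from math import comb


def overlap_p_value(overlap, population, size_a, size_b):
    """P(X >= overlap) when size_b genes are drawn at random from a population
    that contains size_a genes belonging to the other set."""
    total = comb(population, size_b)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, k) * comb(population - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total  # exact integer arithmetic, converted to float at the end
```

Here `population` would be the number of genes in the prediction set, `size_a` the number of predicted targets, `size_b` the number of genes found by text mining, and `overlap` the number of genes in both sets.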
15. Automating drug target discovery with machine learning
▪ The gene – disease association data from Open Targets contain enough
information to predict whether a protein can make a therapeutic target
with reasonable accuracy (71% on the test set, 0.76 AUC).
▪ According to our model, the most informative evidence types are animal
models showing disease-relevant phenotypes, dysregulated gene
expression in disease tissue and genetic associations between gene and
disease.
▪ The ability to predict late-stage targets with greater accuracy supports the view that
a clear link between target and disease is essential to maximise the chances
of success in the clinic.
▪ Limitations:
▪ Lack of prediction on indication;
▪ No tractability considerations.
16. Thank you!
▪ Philippe Sanseau
▪ Ian Dunham
▪ Gautier Koscielny
▪ Giovanni Dall’Olio
▪ Pankaj Agarwal
▪ Mark Hurle
▪ Steven Barrett
▪ Nicola Richmond
▪ Jin Yao