Text mining the relations between chemicals and proteins is an increasingly important task. The CHEMPROT track at BioCreative VI aims to promote the development and evaluation of systems that can automatically detect chemical-protein relations in running text (PubMed abstracts). This manuscript describes our submission, an ensemble of three systems: a Support Vector Machine, a Convolutional Neural Network, and a Recurrent Neural Network. Their outputs are combined by majority voting or stacking. Our CHEMPROT system obtained a precision of 0.7266 and a recall of 0.5735, for an F-score of 0.6410, demonstrating the effectiveness of machine learning-based approaches for automatic relation extraction from the biomedical literature. Our submission achieved the highest performance in the task during the 2017 challenge.
Chemical-protein relation extraction with ensembles of SVM, CNN, and RNN models
1. Yifan Peng, Anthony Rios, Ramakanth Kavuluru, Zhiyong Lu Chemical-protein relation extraction with ensembles of SVM, CNN, and RNN models 1
Chemical-protein relation extraction with ensembles of SVM, CNN, and RNN models
Yifan Peng1, Anthony Rios1,2, Ramakanth Kavuluru2,3, Zhiyong Lu1
1National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
2Department of Computer Science, University of Kentucky
3Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky
Outline
• Individual models
  • SVM
  • CNN
  • RNN
• Ensembles of three models
  • Majority voting
  • Stacking
• Experiments
  • 5-fold cross validation on training + dev set
  • Results on test set
Chemical-protein relations
• A multiclass classification problem
• Chemical-protein relations occurring within a single sentence
SVM with rich feature vector
SVM
• Linear kernel
• One-vs-rest scheme
Rich feature vector
• Words/part-of-speech tags surrounding the chemical and gene mentions
• Bag-of-words between the chemical and gene mentions
• Distance between the two entity mentions
• Shortest path in a dependency graph
Reference: Miwa, M.; Sætre, R.; Miyao, Y. & Tsujii, J. A rich feature vector for protein-protein interaction extraction from multiple corpora. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, 1, 121-130.
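As a rough illustration of the feature types listed above, the sketch below builds a sparse, string-keyed feature dictionary for one candidate chemical-gene pair. This is a hypothetical simplification, not the actual extractor; the real system also uses part-of-speech and dependency-path features.

```python
# Sketch of rich-feature extraction for one chemical-gene candidate pair
# (illustrative helper; the actual system follows Miwa et al., 2009).

def extract_features(tokens, chem_idx, gene_idx, window=2):
    """Return a sparse feature dict for one candidate entity pair."""
    feats = {}
    lo, hi = sorted((chem_idx, gene_idx))
    # Words surrounding the chemical and gene mentions
    for name, idx in (("chem", chem_idx), ("gene", gene_idx)):
        for off in range(-window, window + 1):
            j = idx + off
            if 0 <= j < len(tokens) and off != 0:
                feats[f"{name}_w{off}={tokens[j]}"] = 1
    # Bag-of-words between the two mentions
    for w in tokens[lo + 1:hi]:
        feats[f"between={w}"] = 1
    # Distance between the two entity mentions
    feats["distance"] = hi - lo
    return feats

tokens = "CHEMICAL inhibits the induction of GENE".split()
f = extract_features(tokens, 0, 5)
```

Such dictionaries can then be vectorized and fed to a linear SVM trained in a one-vs-rest scheme.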
Shortest path in a dependency graph
• Obtained using the Bllip parser + Stanford dependencies converter
• Vertex walks
  • CHEMICAL – nsubj – inhibits
  • inhibits – dobj – induction
  • induction – nmod:of – GENE
• Edge walks
  • nsubj – inhibits – dobj
  • dobj – induction – nmod:of
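The vertex and edge walks above can be read directly off the shortest-path sequence. A minimal sketch (helper names are illustrative), where the path alternates vertices and dependency-edge labels:

```python
# Extract vertex walks (vertex-edge-vertex) and edge walks
# (edge-vertex-edge) from a shortest dependency path, represented
# as an alternating vertex/edge sequence.

def vertex_walks(path):
    # path[i] is a vertex for even i, an edge label for odd i
    return [tuple(path[i:i + 3]) for i in range(0, len(path) - 2, 2)]

def edge_walks(path):
    return [tuple(path[i:i + 3]) for i in range(1, len(path) - 2, 2)]

path = ["CHEMICAL", "nsubj", "inhibits", "dobj", "induction", "nmod:of", "GENE"]
vw = vertex_walks(path)
ew = edge_walks(path)
```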
Convolutional Neural Network
Reference: Peng, Y. & Lu, Z. Deep learning for extracting protein-protein interactions from biomedical literature. Proceedings of BioNLP 2017, 2017, 29-38.
Convolutional Neural Network
• Word embedding dimension: 300
  • Trained on PubMed using word2vec
• Part-of-speech, chunk, and named-entity features: one-hot encoding
  • Obtained using the GENIA Tagger
• Convolutional window sizes: 3 and 5
• Filters: 300
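The convolution-plus-max-pooling step can be sketched in plain Python with toy dimensions (the slides use 300-dimensional embeddings and 300 filters per window size; this is an illustrative sketch, not the actual implementation):

```python
import random

def conv_max_pool(sent, filters, window):
    """1-D convolution over a token-embedding sequence followed by
    max-over-time pooling; returns one value per filter."""
    pooled = []
    for w in filters:  # each filter: flat weights over `window` embeddings
        scores = []
        for i in range(len(sent) - window + 1):
            region = [x for tok in sent[i:i + window] for x in tok]
            scores.append(max(0.0, sum(a * b for a, b in zip(w, region))))  # ReLU
        pooled.append(max(scores))
    return pooled

random.seed(0)
dim, n_filters = 4, 6          # toy sizes; 300 and 300 in the slides
sent = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(10)]
feats = []
for window in (3, 5):          # the two window sizes from the slide
    filters = [[random.gauss(0, 0.1) for _ in range(window * dim)]
               for _ in range(n_filters)]
    feats += conv_max_pool(sent, filters, window)
```

The pooled features from both window sizes are concatenated before the output layer.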
Recurrent Neural Network
Reference: Kavuluru, R.; Rios, A. & Tran, T. Extracting Drug-Drug Interactions with Word and Character-Level Recurrent Neural Networks. 2017 IEEE International Conference on Healthcare Informatics (ICHI), 2017, 5-12.
Recurrent Neural Network
• Pairwise ranking loss
  • The output layer has 5 positive classes
  • If all 5 class scores are negative, we predict the negative class
• Preprocessing
  • Replace words occurring fewer than 5 times with an UNK token
• Word embedding dimension: 300
  • Obtained from GloVe
Reference: Santos, C. N.; Xiang, B. & Zhou, B. Classifying Relations by Ranking with Convolutional Neural Networks. ACL, 2015, 626-634.
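The prediction rule for the ranking loss can be sketched as follows (the class labels are the five CHEMPROT relation groups; scores here are made-up illustrative values):

```python
# At prediction time: if all 5 positive-class scores are negative,
# output the negative class; otherwise output the highest-scoring
# positive class.

CLASSES = ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9"]

def predict(scores):
    best = max(range(len(scores)), key=scores.__getitem__)
    return CLASSES[best] if scores[best] > 0 else "negative"

p1 = predict([-0.2, 1.3, 0.4, -1.0, 0.1])    # one clear positive class
p2 = predict([-0.2, -1.3, -0.4, -1.0, -0.1])  # all scores negative
```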
Ensembles of SVM, CNN, and RNN models
Ensembles of SVM, CNN, and RNN models
Majority voting
• Select the relations predicted by at least two of the three models
Stacking
• Random Forest classifier
• 17 features:
  • 6 from SVM
  • 6 from CNN
  • 5 from RNN (pairwise ranking loss)
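The majority-voting rule can be sketched as (a minimal illustration; labels are example CHEMPROT classes):

```python
from collections import Counter

def majority_vote(preds):
    """Keep a relation label if at least two of the three models agree;
    otherwise predict the negative class."""
    label, count = Counter(preds).most_common(1)[0]
    return label if count >= 2 and label != "negative" else "negative"

v1 = majority_vote(["CPR:3", "CPR:3", "CPR:4"])  # two models agree
v2 = majority_vote(["CPR:3", "CPR:4", "CPR:9"])  # no agreement
```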
Results for 5-fold cross validation
• Combine training and development sets
• 5-fold cross validation
  • 60% for training
  • 20% for development (also used to train the stacking systems)
  • 20% for test
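One way to realize the 60/20/20 split per fold is to rotate five equal partitions of the combined data; this is a sketch of the scheme, not necessarily the exact partitioning used:

```python
def five_fold_splits(n_parts=5):
    """For each fold, rotate the partitions: 3 for training (60%),
    1 for development (20%), 1 for held-out test (20%)."""
    folds = []
    for k in range(n_parts):
        parts = [(k + i) % n_parts for i in range(n_parts)]
        folds.append({"train": parts[:3], "dev": parts[3], "test": parts[4]})
    return folds

folds = five_fold_splits()
```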
Results of 5-fold cross validation

Models           P      R      F
SVM              0.629  0.478  0.543
CNN              0.641  0.571  0.602
RNN              0.608  0.614  0.609
Majority voting  0.741  0.552  0.632
Stacking         0.755  0.552  0.638
Results on test set

Run  System           P       R       F
1    Majority voting  0.7437  0.5529  0.6343
2    Majority voting  0.7283  0.5503  0.6269
3    Stacking         0.7426  0.5382  0.6241
4    Stacking         0.7311  0.5685  0.6397
5    Stacking         0.7266  0.5735  0.6410
Results: 5-fold CV vs. test set

           System           P       R       F
5-fold CV  Majority voting  0.7408  0.5517  0.6319
5-fold CV  Stacking         0.7554  0.5524  0.6378
Test       Majority voting  0.7437  0.5529  0.6343
Test       Stacking         0.7266  0.5735  0.6410
Summary and future work
Summary
• Ensemble systems of three models: SVM, CNN, and RNN
• Results are consistent between the training + development set and the test set
• Ensemble methods improved precision
• Performance of the CNN and RNN is comparable
Future work
• Error analysis
• Fair comparisons between the CNN and RNN
• Effects of different parts of the deep learning models
Acknowledgements
• The organizers of the BioCreative VI CHEMPROT task
• Members
  • Yifan Peng, NCBI
  • Anthony Rios, Department of Computer Science, University of Kentucky
  • Ramakanth Kavuluru, Department of Internal Medicine, University of Kentucky
  • Zhiyong Lu, NCBI
Thank You!
yifan.peng@nih.gov
Editor's Notes
I am Yifan Peng from the bio text mining group at NCBI. Our group participated in the chemical protein relation extraction task.
Luckily, our submissions achieved the highest performance in the task during the 2017 challenge
I am glad to have a chance to report our ensemble system and share some observations.
This is the architecture of our system.
I will start with three individual models we built for this task. SVM, CNN and RNN
How we built the ensembles of these three models.
Results by 5-fold CV on the training and development sets.
We used 5-fold cross-validation to reduce variance and to better understand how these models perform on the data.
We built the SVM with a rich feature vector. Here we use a linear kernel and a one-vs-rest scheme. The features were selected based on Miwa's work in 2009.
A walk is defined as an alternating sequence of vertices and edges, v_i, e_{i,i+1}, v_{i+1}, e_{i+1,i+2}, …, v_{i+n-1}, beginning and ending with a vertex. The length of a walk is the number of edges it uses. We take walks of length 1 into consideration.
An e-walk starts and ends with an edge (e.g. e_{i,i+1}, v_{i+1}, e_{i+1,i+2}). It is not actually a walk as defined in graph theory, but an ad hoc concept that provides syntactic structure for the learning model, because contexts given by syntactic relations are crucial.
There are two input layers. One is a matrix of the whole sentence; the other is a matrix of the shortest path between the two entities.
Each word in either the sentence or the shortest path is represented by concatenating the embeddings of the word, its part-of-speech tag, chunk, named entity, dependencies, and positions relative to the two mentions of interest.
dimensionality
Max pooling across the hidden states of the LSTM.
We concatenate the two representations to obtain the final representation of each word.
To obtain a representation of the sentence, we use max-over-time (1-max) pooling across hidden state word representations.
the output layer only has 5 classes, where we completely discard the negative class. Specifically, we use the pairwise ranking loss proposed by Santos et al (16). At prediction time, if all 5 class scores are negative, then we predict the negative class. Otherwise, we predict the class with the largest positive score.
There are many aspects we need to explore after BioCreative VI.
That's how we built the three models. Now I will describe two ensemble methods to combine the results from the 3 individual models.
One way is majority voting: if at least two models predict the same relation type, we pick it; otherwise the pair is predicted negative.
The other way is stacking: we trained a random forest classifier using all the predictions of the three models.
For SVM: we use the distance of the samples X to the separating hyperplane.
For CNN, we use the unnormalized scores for each class before passing them through the softmax function
For RNN, we use the scores after softmax
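The three sources of stacking features described above can be assembled into the 17-dimensional input vector as sketched below (an illustration; the Random Forest training itself is omitted):

```python
def stacking_features(svm_margins, cnn_logits, rnn_scores):
    """Concatenate per-class scores from the three models into the
    17-dimensional feature vector fed to the Random Forest.
    svm_margins: 6 distances to the separating hyperplanes
    cnn_logits:  6 unnormalized class scores (pre-softmax)
    rnn_scores:  5 scores for the positive classes (ranking loss)"""
    assert len(svm_margins) == 6 and len(cnn_logits) == 6 and len(rnn_scores) == 5
    return list(svm_margins) + list(cnn_logits) + list(rnn_scores)

x = stacking_features([0.1] * 6, [0.2] * 6, [0.3] * 5)
```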
To reduce variability, 5-fold cross-validation was performed using different partitions of the data.
The SVM could be improved if we spent more time on feature engineering.
Results of CNN and RNN are comparable on F-score
Using the ensemble system, either majority voting or stacking, the precision is greatly improved, by at least 10 percent. As a result, the F-score is also improved.
So the ensemble really pushed us over the top in final submissions
The results are quite consistent on the training+dev and on the test set.
The second highest precision is ~0.67.
The second highest f-score is 0.6141.
Results by 5-fold cross validation
The results are quite consistent on the training+dev and on the test set.
The models did not appear to overfit.