Yifan Peng, Anthony Rios, Ramakanth Kavuluru, Zhiyong Lu Chemical-protein relation extraction with ensembles of SVM, CNN, and RNN models 1
Chemical-protein relation extraction with ensembles of SVM, CNN, and RNN models
Yifan Peng1, Anthony Rios1,2, Ramakanth Kavuluru2,3, Zhiyong Lu1
1National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
2Department of Computer Science, University of Kentucky
3Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky
Outline
• Individual models
  • SVM
  • CNN
  • RNN
• Ensembles of three models
  • Majority voting
  • Stacking
• Experiments
  • 5-fold cross validation on training + dev set
  • Results on test set
Chemical-protein relations
• A multiclass classification problem
• The chemical-protein relations occurring in a single sentence
SVM with rich feature vector
SVM
• Linear kernel
• One-vs-rest scheme
Rich Feature Vector
• Words/Part-of-speech tags surrounding the chemical and gene mentions
• Bag-of-words between the chemical and gene mentions
• Distance between two entity mentions
• Shortest path in a dependency graph
Miwa, M.; Sætre, R.; Miyao, Y. & Tsujii, J. A rich feature vector for protein-protein interaction extraction from multiple corpora. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, 1, 121-130
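Two of the features listed on this slide, the bag-of-words between the mentions and the distance between them, can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `between_features`, the feature-key format, the use of token distance, and the example sentence are hypothetical and are not taken from the authors' system.

```python
from collections import Counter


def between_features(tokens, chem_idx, gene_idx):
    """Bag-of-words between two mentions plus their token distance.

    The key format ("bow_between=...") and the use of token offsets as the
    distance measure are assumptions made for this sketch.
    """
    lo, hi = sorted((chem_idx, gene_idx))
    between = tokens[lo + 1:hi]  # tokens strictly between the two mentions
    feats = Counter(f"bow_between={word.lower()}" for word in between)
    feats["distance"] = hi - lo  # number of token steps between the mentions
    return dict(feats)


# Hypothetical sentence with the entity mentions already replaced by placeholders.
tokens = ["CHEMICAL", "inhibits", "the", "induction", "of", "GENE"]
features = between_features(tokens, chem_idx=0, gene_idx=5)
```

A sparse dictionary like this can then be vectorized and passed to a linear classifier.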
Shortest path in a dependency graph
• Obtained using Bllip parser + Stanford dependencies converter
• Vertex walks
  • CHEMICAL – nsubj – inhibits
  • inhibits – dobj – induction
  • induction – nmod:of – GENE
• Edge walks
  • nsubj – inhibits – dobj
  • dobj – induction – nmod:of
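The vertex walks and edge walks listed on this slide can be read off the shortest path mechanically. The sketch below assumes the path is already available as an alternating list of vertices and dependency labels; the function names are hypothetical, and parsing and path finding are out of scope.

```python
def vertex_walks(path):
    """Return (vertex, edge, vertex) triples from an alternating path."""
    return [tuple(path[i:i + 3]) for i in range(0, len(path) - 2, 2)]


def edge_walks(path):
    """Return (edge, vertex, edge) triples from an alternating path."""
    return [tuple(path[i:i + 3]) for i in range(1, len(path) - 2, 2)]


# Shortest path from the slide: vertices at even positions, edge labels at odd ones.
path = ["CHEMICAL", "nsubj", "inhibits", "dobj", "induction", "nmod:of", "GENE"]

v_walks = vertex_walks(path)
e_walks = edge_walks(path)
```

Each triple can be turned into a string feature, for example by joining its parts with a separator.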
Convolutional Neural Network
Peng, Y. & Lu, Z. Deep learning for extracting protein-protein interactions from biomedical literature. Proceedings of BioNLP 2017, 2017, 29-38
Convolutional Neural Network
• Word embedding dimension: 300
  • Trained on PubMed using word2vec
• Part-of-speech, chunk and named entities: one-hot encoding
  • Obtained using Genia Tagger
• Convolutional window size: 3 and 5
• Filters: 300
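To make the window sizes and filter count concrete, here is a toy convolution with max-over-time pooling in plain Python. It is a sketch only: the tiny dimensions, the random weights, the absence of bias, nonlinearity, and padding, and the choice to concatenate the pooled outputs of the two window sizes are all assumptions for illustration, not details confirmed by the slides.

```python
import random


def conv_max_pool(sentence, filters, window):
    """Slide each filter over the sentence and keep its maximum response.

    sentence: list of word vectors; filters: list of (window x dim) weight matrices.
    Assumes len(sentence) >= window (no padding in this sketch).
    """
    n_positions = len(sentence) - window + 1
    pooled = []
    for filt in filters:
        responses = []
        for start in range(n_positions):
            total = 0.0
            for offset in range(window):
                word_vec = sentence[start + offset]
                total += sum(w * x for w, x in zip(filt[offset], word_vec))
            responses.append(total)
        pooled.append(max(responses))  # max-over-time pooling
    return pooled


rng = random.Random(0)
dim, n_filters, length = 4, 3, 7  # toy sizes; the slide lists 300 and 300
sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(length)]

features = []
for window in (3, 5):  # the two window sizes from the slide
    filters = [
        [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(window)]
        for _ in range(n_filters)
    ]
    features.extend(conv_max_pool(sentence, filters, window))
```

Under these assumptions the sentence representation has one value per filter per window size, so 3 filters and 2 window sizes give 6 values here.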
Recurrent Neural Network
Kavuluru, R.; Rios, A. & Tran, T. Extracting Drug-Drug Interactions with Word and Character-Level Recurrent Neural Networks. 2017 IEEE International Conference on Healthcare Informatics (ICHI), 2017, 5-12
Recurrent Neural Network
• Pairwise ranking loss
  • The output layer has 5 positive classes
  • If all 5 class scores are negative, then we predict the negative class
• Preprocessing
  • Replace words that occur fewer than 5 times with an UNK token
• Word embedding dimension: 300
  • Obtained from GloVe
Santos, C. N.; Xiang, B. & Zhou, B. Classifying Relations by Ranking with Convolutional Neural Networks. ACL, 2015, 626-634
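The prediction rule on this slide, together with a pairwise ranking loss in the style of Santos et al., might look roughly as follows. This is a sketch under assumptions: the label names, the function names, and the margin and scaling values are placeholders chosen for illustration and do not come from the slides.

```python
import math

CLASSES = ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9"]  # assumed label names
NEGATIVE = "NEGATIVE"


def predict(scores):
    """Predict the negative class only if every class score is negative."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] < 0:
        return NEGATIVE
    return CLASSES[best]


def ranking_loss(scores, gold, margin_pos=2.5, margin_neg=0.5, gamma=2.0):
    """Pairwise ranking loss for one example.

    gold is an index into scores, or None for a negative example.
    The margins and gamma are placeholder hyperparameters.
    """
    loss = 0.0
    if gold is not None:
        # Push the gold class score above the positive margin.
        loss += math.log1p(math.exp(gamma * (margin_pos - scores[gold])))
    # Push the best-scoring wrong class below the negative margin.
    wrong = [s for i, s in enumerate(scores) if i != gold]
    loss += math.log1p(math.exp(gamma * (margin_neg + max(wrong))))
    return loss


label_a = predict([-1.0, -0.2, -3.0, -0.5, -2.0])
label_b = predict([-1.0, 0.7, 0.2, -0.5, -2.0])
```

Because a negative example contributes only the second term, training drives all five scores down for negative pairs, which is what makes the "all scores negative" rule usable at prediction time.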
Ensembles of SVM, CNN, and RNN models
Majority voting
• Select the relations that are predicted by more than 2 models
Stacking
• Random Forest classifier
• 17 features:
  • 6 from SVM
  • 6 from CNN
  • 5 from RNN (pairwise ranking loss)
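Both ensemble strategies can be sketched without any learning library. The code below is an illustration under assumptions: the function names and label strings are hypothetical, and because the slide's wording "more than 2 models" can be read as either at least two or all three of the models, the vote threshold is exposed as the parameter `min_votes`, whose default of 2 is an assumption. The stacking part only assembles the 17-dimensional input vector; training the Random Forest on it is not shown.

```python
from collections import Counter

NEGATIVE = "NEGATIVE"


def majority_vote(predictions, min_votes=2):
    """Return the positive label chosen by at least min_votes models."""
    counts = Counter(p for p in predictions if p != NEGATIVE)
    if counts:
        label, votes = counts.most_common(1)[0]
        if votes >= min_votes:
            return label
    return NEGATIVE


def stacking_features(svm_scores, cnn_scores, rnn_scores):
    """Concatenate per-model scores into one 17-dimensional feature vector."""
    if (len(svm_scores), len(cnn_scores), len(rnn_scores)) != (6, 6, 5):
        raise ValueError("expected 6 SVM, 6 CNN, and 5 RNN scores")
    return list(svm_scores) + list(cnn_scores) + list(rnn_scores)


vote = majority_vote(["CPR:4", "CPR:4", NEGATIVE])
stacked = stacking_features([0.0] * 6, [0.0] * 6, [0.0] * 5)
```

The RNN contributes one score fewer than the other two models because its output layer has no negative class.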
Results for 5-fold cross validation
• Combine training and development sets
• 5-fold cross validation
  • 60% for training
  • 20% for development (also used to train the stacking systems)
  • 20% for test
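One way to obtain a 60/20/20 split in every fold is to cut the data into five parts and rotate which part serves as the test set and which as the development set. The sketch below does exactly that; the rotation scheme and the function name are assumptions, since the slides do not say how the development fold was chosen. In practice the items would typically be shuffled first.

```python
def five_fold_splits(items, n_folds=5):
    """Yield (train, dev, test) lists; each fold is the test set exactly once."""
    folds = [items[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        dev_k = (k + 1) % n_folds  # assumed: the next fold serves as the dev set
        test = folds[k]
        dev = folds[dev_k]
        train = [
            item
            for j, fold in enumerate(folds)
            if j not in (k, dev_k)
            for item in fold
        ]
        yield train, dev, test


# 100 placeholder example ids standing in for the combined training + dev data.
splits = list(five_fold_splits(list(range(100))))
```

Under this scheme the base models are trained on the 60% portion, while the 20% development portion is used for tuning and for fitting the stacking classifier, as the slide describes.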
Results of 5-fold cross validation
Models           P      R      F
SVM              0.629  0.478  0.543
CNN              0.641  0.571  0.602
RNN              0.608  0.614  0.609
Majority voting  0.741  0.552  0.632
Stacking         0.755  0.552  0.638
Results on test set
Run  System           P       R       F
1    Majority voting  0.7437  0.5529  0.6343
2    Majority voting  0.7283  0.5503  0.6269
3    Stacking         0.7426  0.5382  0.6241
4    Stacking         0.7311  0.5685  0.6397
5    Stacking         0.7266  0.5735  0.6410
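Assuming the F column is the balanced F-score (the harmonic mean of precision and recall), it can be recomputed from the P and R columns. The short check below uses runs 1 and 5 from the test-set table; the function name is hypothetical. Because the reported P and R are themselves rounded, a recomputed value for other rows may differ in the last digit.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


run1 = round(f_score(0.7437, 0.5529), 4)  # 0.6343, as reported for run 1
run5 = round(f_score(0.7266, 0.5735), 4)  # 0.641, reported as 0.6410 for run 5
```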
Results on test set
Setting    System           P       R       F
5-fold CV  Majority voting  0.7408  0.5517  0.6319
5-fold CV  Stacking         0.7554  0.5524  0.6378
Testing    Majority voting  0.7437  0.5529  0.6343
Testing    Stacking         0.7266  0.5735  0.6410
Summary and future work
Summary
• Ensemble systems of three models: SVM, CNN, and RNN
• Results are consistent on the training + development set and on the test set
• Ensemble methods improved precision
• Performance of CNN and RNN is comparable
Future work
• Error analysis
• Fair comparisons between CNN and RNN
• Effects of different parts of deep learning models
Acknowledgement
• The organizers of the BioCreative VI CHEMPROT task
• Members
  • Yifan Peng, NCBI
  • Anthony Rios, Department of Computer Science, University of Kentucky
  • Ramakanth Kavuluru, Department of Internal Medicine, University of Kentucky
  • Zhiyong Lu, NCBI
Thank You!
yifan.peng@nih.gov


Editor's Notes

  1. I am Yifan Peng from the bio text mining group at NCBI. Our group participated in the chemical-protein relation extraction task. Luckily, our submissions achieved the highest performance in the task during the 2017 challenge. I am glad to have a chance to report our ensemble system and share some observations.
  2. This is the architecture of our system. I will start with the three individual models we built for this task: SVM, CNN, and RNN. Then I will describe how we built the ensembles of these three models, and the results of 5-fold cross validation on the training and development sets. We used 5-fold cross validation to reduce the variability across the dataset and to better understand how these models perform on the data.
  3. We built the SVM with a rich feature vector. Here we use a linear kernel and a one-vs-one scheme. The features were selected based on Miwa's work in 2009.
  4. A walk is defined as an alternating sequence of vertices and edges, $v_i, e_{i,i+1}, v_{i+1}, e_{i+1,i+2}, \ldots, v_{i+n-1}$, beginning with a vertex and ending with a vertex. The length of a walk is the number of edges that it uses. We take into consideration walks of length 1. An e-walk starts and ends with an edge (e.g. $e_{i,i+1}, v_{i+1}, e_{i+1,i+2}$). It is actually not a walk as defined in graph theory but an ad hoc concept to provide a syntactic structure for the learning model, because contexts given by syntactic relations are crucial.
  5. There are two input layers. One is a matrix of the whole sentence. The other is a matrix of the shortest path between the two entities. Each word in either a sentence or a shortest path is represented by concatenating the embeddings of the word, its part-of-speech tag, chunk, named entity, dependency, and positions relative to the two mentions of interest.
  6. dimensionality
  7. We apply max pooling across the hidden states of the LSTM. We concatenate the two representations to obtain the final representation of each word. To obtain a representation of the sentence, we use max-over-time (1-max) pooling across the hidden state word representations.
  8. The output layer only has 5 classes, where we completely discard the negative class. Specifically, we use the pairwise ranking loss proposed by Santos et al. (16). At prediction time, if all 5 class scores are negative, then we predict the negative class. Otherwise, we predict the class with the largest positive score. There are many aspects we need to explore after BioCreative VI.
  9. That's how we built the three models. Now I will describe two ensemble methods that combine the results from the 3 individual models.
  10. One way is majority voting: if more than 2 models predict the same relation type, we pick it; otherwise, the pair is labeled negative. The other way is stacking. We trained a random forest classifier using all the predictions of the three models. For SVM, we use the distance of the samples X to the separating hyperplane. For CNN, we use the unnormalized scores for each class before passing them through the softmax function. For RNN, we use the scores after softmax.
  11. To reduce variability, 5-fold cross-validation was performed using different partitions of the data.
  12. We could improve the SVM if we spent more time on feature engineering. The results of CNN and RNN are comparable on F-score. Using the ensemble system, either majority voting or stacking, the precision is greatly improved, by at least 10 percentage points. As a result, the F-score is also improved. So the ensemble really pushed us over the top in the final submissions.
  13. The results are quite consistent on the training+dev and on the test set. The second highest precision is ~0.67. The second highest f-score is 0.6141.
  14. These are the results by 5-fold cross validation. The results are quite consistent on the training+dev set and on the test set. The model didn't encounter overfitting.