EVALUATING DISTRIBUTIONAL SEMANTIC
AND FEATURE SELECTION FOR
EXTRACTING RELATIONSHIPS FROM
BIOLOGICAL TEXT

Ehsan Emadzadeh*
Siddhartha Jonnalagadda†
Graciela Gonzalez*
*Department of Biomedical Informatics, Arizona State University
†Department of Health Sciences Research, Mayo Clinic

PROBLEMS
• Which features are useful for biological relation extraction?
• The contribution of word-level semantic features, such as
  distributional semantics, to relationship extraction is unknown
• Which method for calculating semantic features works best?

CORPUS
• We used the BioNLP 2011 GENIA corpus
• Training set: 800 abstracts and 5 full papers
• Test set: 150 abstracts and 5 full papers

[Figure: bar chart comparing the train set and the test set; y-axis from 0 to 4000]

OUR CONTRIBUTION
• Evaluated distributional semantic features and different ways
  to calculate them
• Evaluated all features in detail
• Measured how much a greedy Feature Selection (FS) can improve
  classification results

RELATION REPRESENTATION
• A relation consists of two parts:
  ◦ Trigger: the main evidence of a relation (or an event)
  ◦ Argument(s): complementary information about the relation
    and the entities involved

Trigger: Binding   Argument: Protein
Cross-linking of CD30 induces HIV in chronically infected T cells

Trigger: Binding
Upon engagement of CD40 by CD40 ligand (CD40L)

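A minimal sketch of how the trigger/argument representation could be held in code. The class and helper names are hypothetical, and it assumes the "Trigger: Binding" label in the first example points at "Cross-linking" and the "Argument: Protein" label at "CD30"; the slide only shows the labels above the sentence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """A piece of text with character offsets into its sentence."""
    text: str
    start: int  # inclusive
    end: int    # exclusive


@dataclass
class Relation:
    """A relation (event): one trigger plus zero or more arguments."""
    event_type: str
    trigger: Span
    arguments: List[Span] = field(default_factory=list)


def find_span(sentence: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase in sentence (hypothetical helper)."""
    start = sentence.index(phrase)
    return Span(phrase, start, start + len(phrase))


sentence = "Cross-linking of CD30 induces HIV in chronically infected T cells"

# Assumed alignment of the slide's labels to words in the sentence.
binding = Relation(
    event_type="Binding",
    trigger=find_span(sentence, "Cross-linking"),
    arguments=[find_span(sentence, "CD30")],
)
```

Keeping character offsets alongside the text makes it possible to map a predicted trigger back to its position in the sentence.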
PREVIOUS WORKS: FEATURE SELECTION
• Forman 2003: extensive survey of FS metrics for text
  classification
• Eom et al. 2004: Feature Dimension Reduction Filter (FDRF)
  for protein relationship extraction from PubMed articles
• Saeys et al. 2007: survey of FS techniques in bioinformatics
• Landeghem et al. 2008: an FS technique based on the concept
  of gain ratio for extracting protein-protein interactions
• Landeghem et al. 2010: ensemble feature selection for
  biomolecular text mining

METHOD OUTLINE
• Preprocessing
• Building classifier instances
• Trigger extraction

PREPROCESSING
• How to convert string values to numeric features?
• For each string attribute, one new attribute is created per
  distinct value

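The string-to-numeric conversion described above reads like one-hot encoding. The sketch below illustrates that idea under this assumption; the function name and the sample values are hypothetical, and only the feature names POS and PorterStem come from the slides.

```python
from typing import Dict, List, Tuple


def one_hot_encode(rows: List[Dict[str, str]]) -> Tuple[List[str], List[List[int]]]:
    """Create one binary attribute per distinct (attribute, value) pair."""
    columns = sorted({f"{attr}={value}" for row in rows for attr, value in row.items()})
    index = {name: i for i, name in enumerate(columns)}
    matrix = []
    for row in rows:
        vector = [0] * len(columns)
        for attr, value in row.items():
            vector[index[f"{attr}={value}"]] = 1
        matrix.append(vector)
    return columns, matrix


# Hypothetical instances with two string attributes each.
rows = [
    {"POS": "NN", "PorterStem": "bind"},
    {"POS": "VBZ", "PorterStem": "induc"},
    {"POS": "NN", "PorterStem": "express"},
]
columns, matrix = one_hot_encode(rows)
```

Each row of `matrix` now has exactly one 1 per original string attribute, which is the numeric form a classifier such as an SVM can consume.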
BUILDING CLASSIFIER INSTANCES
• Create a bag of triggers for each relation type from the
  training set
• If a phrase appears in the bag of triggers, a training
  example is created for it
• For training, label the examples based on the annotation
• When training, at most 3 negative examples were generated
  for each positive example

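The instance-building steps above can be sketched roughly as follows. The function names are hypothetical, and two details are assumptions the slides do not state: phrases are matched case-insensitively, and negatives beyond the cap are dropped by random sampling.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_trigger_bags(annotated: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect, per relation type, the trigger phrases annotated in the training set."""
    bags: Dict[str, Set[str]] = defaultdict(set)
    for phrase, relation_type in annotated:
        bags[relation_type].add(phrase.lower())  # assumption: case-insensitive bag
    return bags


def build_training_instances(
    candidates: List[Tuple[str, bool]],
    trigger_bag: Set[str],
    max_neg_per_pos: int = 3,
    seed: int = 0,
) -> List[Tuple[str, bool]]:
    """Keep phrases found in the bag; cap negatives at max_neg_per_pos per positive."""
    in_bag = [(p, label) for p, label in candidates if p.lower() in trigger_bag]
    positives = [c for c in in_bag if c[1]]
    negatives = [c for c in in_bag if not c[1]]
    limit = max_neg_per_pos * len(positives)
    if len(negatives) > limit:
        # Assumption: surplus negatives are discarded at random.
        negatives = random.Random(seed).sample(negatives, limit)
    return positives + negatives


bags = build_trigger_bags([("binding", "Binding"), ("engagement", "Binding")])
# One annotated trigger, five unannotated occurrences, one phrase outside the bag.
candidates = [("binding", True)] + [("binding", False)] * 5 + [("expression", False)]
instances = build_training_instances(candidates, bags["Binding"])
```

With one positive and a cap of 3, the five unannotated occurrences of "binding" are reduced to three negatives, and "expression" is never considered because it is not in the bag.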
TRIGGER EXTRACTION
• Double-layered machine learning approach for trigger
  classification
  ◦ First layer: one-vs-rest binary classification for each
    relation type (SVM only)
  ◦ Second layer: one classifier combines the votes of the
    first-layer classifiers to make the final decision
    (evaluated SVM, Decision Tree and Logistic Regression)

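The data flow of the two layers might look like the sketch below. It only illustrates how first-layer outputs feed the second layer: the lambdas and the hand-written `second_layer` rule are hypothetical stand-ins for trained models, not the SVM, Decision Tree or Logistic Regression classifiers that were evaluated.

```python
from typing import Callable, Dict, List

Features = Dict[str, str]


def first_layer_scores(
    features: Features, binary_classifiers: Dict[str, Callable[[Features], float]]
) -> List[float]:
    """Run every one-vs-rest classifier; scores are ordered by sorted type name."""
    return [binary_classifiers[t](features) for t in sorted(binary_classifiers)]


def classify_trigger(
    features: Features,
    binary_classifiers: Dict[str, Callable[[Features], float]],
    second_layer: Callable[[List[float]], str],
) -> str:
    """Feed the first-layer votes to the second-layer classifier."""
    return second_layer(first_layer_scores(features, binary_classifiers))


# Stand-ins for trained first-layer binary classifiers (positive score = "yes").
binary_classifiers = {
    "Binding": lambda f: 1.0 if f["PorterStem"] == "bind" else -1.0,
    "Localization": lambda f: 1.0 if f["PorterStem"] == "local" else -1.0,
}
TYPES = sorted(binary_classifiers)


def second_layer(scores: List[float]) -> str:
    """Stand-in for a trained second-layer model: pick the strongest positive vote."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return TYPES[best] if scores[best] > 0 else "None"


label = classify_trigger({"PorterStem": "bind"}, binary_classifiers, second_layer)
```

In a trained system the second layer would learn how to weigh the first-layer votes instead of applying a fixed rule.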
SEMANTIC FEATURE
• Distributional semantic similarity
  ◦ Random Indexing
  ◦ Semantic Vectors Java package
• Semantic similarity of a trigger candidate to the bag of
  words (BOW) of known triggers
• Semantic features can be calculated in different ways:
  ◦ Maximum similarity to the relation type’s BOW
  ◦ Average similarity to the relation type’s BOW

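MAXIMUM VS. AVERAGE
[Figure: for the “Binding” event type, the candidate “interacting” is compared with each word in the “Binding” triggers BOW (recruitment, binding, ligation, …); the similarity values shown are 0.6, 0.9 and 0.8]
• MaxSimilarity = 0.9
• AverageSimilarity = 0.76
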
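A minimal sketch of the two ways of computing the semantic feature, assuming cosine similarity between word vectors. The two-dimensional vectors are hypothetical toy values chosen for easy arithmetic; they are not Random Indexing output and do not reproduce the numbers in the figure.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_features(
    candidate: List[float], bow_vectors: Dict[str, List[float]]
) -> Tuple[float, float]:
    """Return (maximum, average) similarity of the candidate to the BOW words."""
    sims = [cosine(candidate, vec) for vec in bow_vectors.values()]
    return max(sims), sum(sims) / len(sims)


# Hypothetical vectors for a candidate word and a three-word trigger BOW.
candidate_vec = [1.0, 0.0]
binding_bow = {
    "binding": [1.0, 0.0],
    "ligation": [0.0, 1.0],
    "recruitment": [1.0, 1.0],
}
max_sim, avg_sim = semantic_features(candidate_vec, binding_bow)
```

The maximum rewards a single close match in the BOW, while the average is pulled down by every dissimilar trigger word.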
ONE VS. ALL SEMANTIC FEATURES
• One Semantic Feature: each relation type's classifier
  includes only the similarity to its own BOW
• All Semantic Features: every relation type's classifier
  includes the similarities to the BOWs of all relation types

[Figure: similarity to “Binding”, similarity to “Regulation” and similarity to “Localization” feeding the “Binding”, “Regulation” and “Localization” classifiers]

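The difference between the two settings can be sketched as a choice of which similarity values go into a classifier's feature vector. The function name and the similarity values are hypothetical; the mode names follow the ONE_SF and ALL_SF labels used in the results chart.

```python
from typing import Dict, List

RELATION_TYPES = ["Binding", "Regulation", "Localization"]


def semantic_feature_vector(
    similarities: Dict[str, float], classifier_type: str, mode: str
) -> List[float]:
    """Select the semantic features seen by one relation type's classifier."""
    if mode == "ONE_SF":
        # Only the similarity to the classifier's own BOW.
        return [similarities[classifier_type]]
    if mode == "ALL_SF":
        # Similarities to every relation type's BOW, in a fixed order.
        return [similarities[t] for t in RELATION_TYPES]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical similarities of one trigger candidate to each BOW.
similarities = {"Binding": 0.9, "Regulation": 0.4, "Localization": 0.2}
one_sf = semantic_feature_vector(similarities, "Regulation", "ONE_SF")
all_sf = semantic_feature_vector(similarities, "Regulation", "ALL_SF")
```

With ALL_SF the "Regulation" classifier also sees that the candidate is much closer to the "Binding" BOW, which is information ONE_SF withholds.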
FEATURE SELECTION
• Start from all features (29 features)
• Sort each feature into one of three groups:
  ◦ Improving features: keep
  ◦ Neutral features: further evaluation
  ◦ Worsening features: remove
• Continue with further rounds that split the features into
  the same three groups
• Repeat this process for each event type

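One possible reading of the greedy procedure in the diagram is sketched below. The slides do not give the exact algorithm, so several choices here are assumptions: features are judged by ablation (score with versus without the feature), neutral features are re-evaluated in the next round, and neutral features still undecided when nothing changes are retained. The `evaluate` callback and the additive toy score are hypothetical; in practice it would train and score a classifier for one event type.

```python
from typing import Callable, List


def greedy_feature_selection(
    features: List[str],
    evaluate: Callable[[List[str]], float],
    tolerance: float = 1e-9,
    max_rounds: int = 10,
) -> List[str]:
    """Keep improving features, drop worsening ones, re-check neutral ones."""
    kept: List[str] = []
    undecided = list(features)
    for _ in range(max_rounds):
        active = kept + undecided
        baseline = evaluate(active)
        improving, neutral, dropped_any = [], [], False
        for f in undecided:
            without = evaluate([g for g in active if g != f])
            if without < baseline - tolerance:
                improving.append(f)   # removing it hurts: keep
            elif without > baseline + tolerance:
                dropped_any = True    # removing it helps: remove
            else:
                neutral.append(f)     # no effect: evaluate again later
        kept += improving
        undecided = neutral
        if not undecided or not (improving or dropped_any):
            break
    return kept + undecided


# Toy score: each feature contributes a fixed, hypothetical amount.
weights = {"POS": 5.0, "PorterStem": 3.0, "HasDigit": 0.0, "CommaLeftCount": -2.0}
selected = greedy_feature_selection(
    list(weights), lambda fs: sum(weights[f] for f in fs)
)
```

Running the same routine once per event type, each with its own `evaluate`, would yield a separate feature set for every event type.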
SELECTED FEATURES
Event type           Count  Selected features
Localization             2  PhraseText, SentenceTFIDF
Protein catabolism       3  POS, PorterStem, SentenceTFIDF
Phosphorylation          3  POS, PorterStem, SentenceTFIDF
Positive regulation      6  POS, POSNext1, POSNext2, POSPre1, POSPre2, PorterStem
Gene expression          8  PorterStem, WordnetStem, PhraseText, AllUpperCase, HasDigit,
                            QuoteRightCount, ProteinCountInWindow, ProteinCountInSentence
Binding                  8  WordnetStem, OriginalWord, StartWithUppercase, AllUppercase,
                            AllLowercase, HasDigit, ProteinCountInWindow, ProteinCountInSentence
Transcription            9  POS, POSNext1, POSNext2, PhraseText, CommaRightCount, POSPre1,
                            POSPre2, ProteinCountInSentence, SentenceTFIDF
Negative regulation      9  PorterStem, AllUppercase, AllLowercase, HasSpecialChars, HasDigit,
                            QuoteRightCount, NameEntity, ProteinCountInWindow,
                            ProteinCountInSentence
Regulation              15  AllLowercase, AllUppercase, CommaLeftCount, HasDigit, MESHHeading,
                            NameEntity, OriginalWord, PorterStem, POSNext2, POSPre2,
                            ProteinCountInSentence, ProteinCountInWindow, SentenceTFIDF,
                            StartWithUppercase, WordnetStem

RESULTS: FEATURE SELECTION
Event type             Before FS   After FS   Change
Protein catabolism         16.51      70.47    53.96
Negative regulation        21.67      43.62    21.95
Localization               31.11      51.98    20.87
Phosphorylation            60.59      79.52    18.93
Regulation                  1.90      14.32    12.42
Gene expression            66.43      68.62     2.19
Transcription              29.17      30.48     1.31
Binding                    54.49      54.42    -0.07
Positive regulation        40.39      35.37    -5.02

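The Change column is simply After FS minus Before FS. The short check below recomputes it from the table's own numbers and counts how many event types improved; the variable names are hypothetical.

```python
# (before FS, after FS) per event type, copied from the table above.
results = {
    "Protein catabolism": (16.51, 70.47),
    "Negative regulation": (21.67, 43.62),
    "Localization": (31.11, 51.98),
    "Phosphorylation": (60.59, 79.52),
    "Regulation": (1.90, 14.32),
    "Gene expression": (66.43, 68.62),
    "Transcription": (29.17, 30.48),
    "Binding": (54.49, 54.42),
    "Positive regulation": (40.39, 35.37),
}

# Change = After FS - Before FS, rounded to two decimals as in the table.
changes = {name: round(after - before, 2) for name, (before, after) in results.items()}

# Event types whose score went up after feature selection.
improved = [name for name, change in changes.items() if change > 0]

print(f"{len(improved)} of {len(results)} event types improved")
```

Seven of the nine event types improve, and only Binding and Positive regulation get worse after feature selection.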
RESULTS: SEMANTIC FEATURES
[Figure: bar chart comparing Without SF, ONE_SF and ALL_SF; y-axis from 0 to 90]

RESULTS: MAX VS. AVE
[Figure: bar chart comparing Maximum and Average; y-axis from 0 to 90]

RESULTS: SVM VS. LOGISTIC REGRESSION
[Figure: bar chart comparing SVM and Logistic Regression; y-axis from 0 to 90]

CONCLUSION
• We found a different optimized feature set for each event
  type, and FS can improve the classification of triggers
  (7 out of 9 event types)
• Semantic features, namely distributional semantics, can
  improve classification results by up to 19.37% F-Measure
• Using all semantic features for all classifiers is better
  than using only the related semantic feature (for most of
  the event types)
• “Maximum” is better when BOWs are very different, but
  “Average” is better when BOWs are very similar

FUTURE WORKS
• Compare different semantic similarity kernels
• Compare other FS methods to the one we used in this work
• Try manually created trigger BOWs

Special thanks to: Dr. Trevor Cohen, Robert Leaman, Azadeh Nikfarjam
and Nate Sutton. This work is supported by funding from NLM Contract
HHSN276201000031C.

Relationship extraction tool:
BioEvent (http://bioevent.sf.net)
Distributional semantic similarity:
The Semantic Vectors Package (http://code.google.com/p/semanticvectors/)
