Relationship Extraction from Text

   Extending the Espresso Method for Greater Recall


                  Derek Springer
         UCLA Computer Science Department
                November 19, 2009
Related Works

• Ganapathi, Swathi. “Relationship Extraction from Text:
  Comparison and Experimental Evaluation of the State-of-
  the-Art.” UCLA comp exam. March 2009.
• Chu, A., Sakurai, S., Cárdenas, A. F., "Automatic Detection
  of Treatment Relationships in Patent Retrieval." 2008 CIKM
  Patent Information Retrieval Workshop. October 2008.
Related Works, cont'd

• Girju, R. "Automatic Detection of Causal Relations for
  Question Answering." In the proceedings of the 41st Annual
  Meeting of the Association for Computational Linguistics
  (ACL 2003). Workshop on "Multilingual Summarization and
  Question Answering - Machine Learning and Beyond".
  2003.
• Pantel, Patrick and Pennacchiotti, Marco. "Espresso:
  Leveraging Generic Patterns for Automatically Harvesting
  Semantic Relations." In Proceedings of Conference on
  Computational Linguistics / Association for Computational
  Linguistics (COLING/ACL-06). pp. 113-120.
  Sydney, Australia. 2006.
Relationship Extraction

• The task of recognizing the assertion of a
  particular relationship between two or more
  entities in text.
• Can aid in the development of
  standalone, intelligent, automated and adaptable
  user-specific content retrieval systems.
• We focus on extracting treatment relationships
  → A (subject) used to treat B (object).
Goals and Contributions

• Extended the state-of-the-art Espresso relationship
  extraction system originally implemented by
  Ganapathi.
• Performed an in-depth experimental evaluation of the
  developed system, comparing it to prior
  work (Chu, Ganapathi).
• Future goal is to use the system developed here
  as a plug-in relationship feature extractor for
  iScore.
Integration Into iScore

• iScore presents additional articles based on an
  aggregate score of “interestingness.”
• We believe filtering articles based on
  relationships can improve the results of iScore.
• We hypothesize that extending the Espresso
  system implemented by Swathi Ganapathi will
  improve the ability of a system such as iScore to
  utilize relationship extraction as a feature.
Comparison Criteria

• Performance: Want system to have high
  precision and recall
• Minimal Supervision: Want system to require
  little to no human supervision
• Breadth: Want system to extract relations from
  varying corpus sizes, domains and formats.
• Generality: Want system to extract wide variety
  of relation types without losing its edge in any of
  the above criteria.
The Espresso Algorithm

• General purpose algorithm which can be used to
  extract a wide variety of binary relations.
• Requires minimal supervision. Only input is a
  small seed set of known relations.
• Detects relationships by looking at individual
  sentences, so it works well on all kinds of corpora.
• In tests conducted by the creators of the
  algorithm, Espresso generated balanced
  precision and recall.
The Espresso Method
Extending Espresso

• Ganapathi's implementation: 37.8% of relationships
• Extension: 91.2% of relationships
Ganapathi's Implementation

• Ganapathi's approach uses lexico-syntactic
  patterns of the form NP1 VP NP2 (Verb category
  in Table 1).
• VP contains treatment verb or pattern and the
  two NPs would contain the subject and object.
• This structure is very common, accounting for
  37.8% of all relationships.
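
A minimal sketch of matching the NP1 VP NP2 pattern described above, assuming sentences have already been split into labeled chunks by some chunker. The chunk format, the TREATMENT_VERBS set, and the function name are hypothetical illustrations and are not taken from Ganapathi's implementation.

```python
from typing import List, Tuple

# Hypothetical verb list for illustration; the actual system selects
# relevant verbs automatically, starting from seed relations.
TREATMENT_VERBS = {"treat", "treats", "treated", "cure", "cures", "relieves"}

Chunk = Tuple[str, str]  # (chunk label, chunk text), e.g. ("NP", "Ibuprofen")


def match_np_vp_np(chunks: List[Chunk]) -> List[Tuple[str, str, str]]:
    """Return (subject, verb phrase, object) for each adjacent NP VP NP run
    whose VP contains one of the treatment verbs."""
    matches = []
    # Slide a window of three consecutive chunks over the sentence.
    for (l1, t1), (l2, t2), (l3, t3) in zip(chunks, chunks[1:], chunks[2:]):
        if (l1, l2, l3) != ("NP", "VP", "NP"):
            continue
        if any(word.lower() in TREATMENT_VERBS for word in t2.split()):
            matches.append((t1, t2, t3))
    return matches


example = [("NP", "Ibuprofen"), ("VP", "is used to treat"), ("NP", "arthritis")]
print(match_np_vp_np(example))
```

Here the first NP plays the role of the subject (A) and the second NP the object (B) of a treatment relationship.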
Extension

• A large number of relationships are not covered by the
  Verb pattern and may provide fruitful results.
• The implementation is expanded to include:
  - Noun+Prep, e.g. "X settlement with Y"
  - Verb+Prep, e.g. "X moved to Y"
  - Infinitive, e.g. "X plans to acquire Y"
  - Modifier, e.g. "X is Y winner"
• Retrieves 91.2% of common relationships.
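
One way to picture the pattern categories listed above is as short part-of-speech sequences linking X and Y. The sketch below is only an illustration under simplifying assumptions: the coarse tag names (VERB, NOUN, PREP, TO) and the function classify_pattern are hypothetical, and the extended system may recognize these categories differently.

```python
from typing import Sequence


def classify_pattern(middle: Sequence[str], trailing: Sequence[str] = ()) -> str:
    """Classify the text linking X and Y from coarse POS tags.

    middle:   tags of the words between X and Y
    trailing: tags of the words right after Y (used only for Modifier)
    """
    m = tuple(middle)
    if m == ("VERB", "TO", "VERB"):
        return "Infinitive"      # "X plans to acquire Y"
    if m == ("NOUN", "PREP"):
        return "Noun+Prep"       # "X settlement with Y"
    if m in (("VERB", "PREP"), ("VERB", "TO")):
        return "Verb+Prep"       # "X moved to Y"
    if m == ("VERB",) and tuple(trailing[:1]) == ("NOUN",):
        return "Modifier"        # "X is Y winner"
    if m == ("VERB",):
        return "Verb"            # the original NP1 VP NP2 category
    return "Other"


print(classify_pattern(["NOUN", "PREP"]))
print(classify_pattern(["VERB", "TO"]))
print(classify_pattern(["VERB", "TO", "VERB"]))
print(classify_pattern(["VERB"], trailing=["NOUN"]))
```

The four calls correspond to the four example phrases on the slide, in order.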
Test Corpora

• Patent Corpus: Developed by Shige
   o 50,000 drug patent documents from 2008 from Class 424 & 514 of
     the U.S. Patents Classification: “drug, bio-affecting and body
     treating compositions” and their subclasses.
   o Patents were pre-filtered to only contain keywords
     “diabetes”, “metastatic”, “cancer”, “tuberculosis”, “lung”, “bronchitis”,
      “coronary artery”
   o All sentences from each document added to a sentence table in the
     schema
• PubMed Corpus: Developed by Gustavo
   o Comprised of medical abstracts from PubMed
   o Each abstract was parsed and all sentences from each abstract
     were stored as individual tuples in the sentence table
Performance Measures
Seed Treatment Relationships

•   (Xanax, Anxiety)           •   (Glycoside, Depression)
•   (Ambien, Insomnia)         •   (Ibuprofen, Arthritis)
•   (Effexor, Depression)      •   (Ibuprofen, Headache)
•   (Paxil, Depression)        •   (Tylenol, Fever)
•   (Lexapro, Depression)      •   (Tylenol, Headache)
•   (Caffeine, Depression)     •   (Antibody, Inflammation)
•   (Zoloft, Depression)       •   (Ibuprofen, Inflammation)
•   (Imipramine, Depression)   •   (Surgery, Glaucoma)
Procedure

1. Re-tag original data set to incorporate extended
   relationship types.
2. Re-run Ganapathi's baseline Espresso
   implementation to compare against updated data
   set.
3. Run extended Espresso implementation to
   compare against updated data set.
Experiment #1: Extraction on Drug Patent Corpus
• Drug Patent corpus used.
• Algorithm was run with seed relations and 12 verbs were extracted as
  being relevant (verbs with rπ greater than 0.2).
• These treatment verbs were used to create a test sentence set of 120
  sentences i.e. 10 sentences containing a treatment verb for every
  relevant treatment verb.
• 358 possible relations were extracted, and the ri score was
  calculated for each of them.
• 208 relations were obtained with ri score greater than the threshold out
  of which 126 were actually correct (through manual tagging).
• Of the original 358 relations, manual tagging determined that 213 of
  them were correct treatment relations.
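
The counts above are enough to work out precision, recall and F-score by hand. The sketch below assumes the standard definitions and assumes that the 213 manually confirmed relations form the recall denominator; the function name is hypothetical.

```python
def precision_recall_f1(extracted: int, correct_extracted: int, correct_total: int):
    """Standard precision, recall and balanced F-score from raw counts."""
    precision = correct_extracted / extracted
    recall = correct_extracted / correct_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from Experiment #1: 208 relations above the threshold,
# 126 of them correct, 213 correct relations among all 358 candidates.
p, r, f = precision_recall_f1(extracted=208, correct_extracted=126, correct_total=213)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
# precision=0.606 recall=0.592 F1=0.599
```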
Experiment #1 Results
Experiment #2: Number of Relationships and Performance
• Drug Patent corpus used.
• Test the performance of the system under
  smaller and larger data loads.
• Started with initial set of 120 sentences obtained
  from Drug Patent corpus (10 sentences for each
  verb, 12 verbs as in Experiment #1)
• Increased the number of sentences for each
  verb by 10 in each case, so that we had
  sentence sets of 240 and 360 sentences each
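
A small sketch of how the three sentence sets could be assembled. The slides do not say how sentences were chosen for each verb, so taking the first N per verb is an assumption, and the toy corpus and function name are hypothetical.

```python
from typing import Dict, List


def build_test_set(sentences_by_verb: Dict[str, List[str]], per_verb: int) -> List[str]:
    """Take the first `per_verb` sentences for each verb (selection rule assumed)."""
    test_set: List[str] = []
    for sentences in sentences_by_verb.values():
        test_set.extend(sentences[:per_verb])
    return test_set


# Toy corpus: 12 placeholder verbs with 30 placeholder sentences each.
corpus = {f"verb{v}": [f"verb{v} sentence {n}" for n in range(30)] for v in range(12)}
sizes = [len(build_test_set(corpus, k)) for k in (10, 20, 30)]
print(sizes)  # [120, 240, 360]
```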
Experiment #2 Results
Experiment #2 Analysis

• Performance of the system and the number of
  relationships are inversely related.
• ri scores are inversely affected by the max pmi across
  all relationship instances, so having more relationship
  instances in a set may lower the ri for all of those
  relationships.
• More relationships => chance of a greater max pmi =>
  lowered ri for all relationship instances.
• Not worried → articles likely won't have 200 relations of
  the same type.
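
The effect described above follows from the instance reliability definition in the Espresso paper (Pantel and Pennacchiotti, 2006), where each pmi value is divided by the maximum pmi. The sketch below assumes the implementation here uses that same definition; the pattern names and all numeric values are made up for illustration.

```python
from typing import Dict


def instance_reliability(pmi_by_pattern: Dict[str, float],
                         pattern_reliability: Dict[str, float],
                         max_pmi: float) -> float:
    """r_i = (1 / |P|) * sum over patterns p of (pmi(i, p) / max_pmi) * r_pi(p)."""
    total = sum(
        (pmi / max_pmi) * pattern_reliability[p]
        for p, pmi in pmi_by_pattern.items()
    )
    return total / len(pattern_reliability)


patterns = {"treats": 0.9, "relieves": 0.6}  # made-up pattern reliabilities
pmi = {"treats": 2.0, "relieves": 1.0}       # made-up pmi values for one instance

# Same instance, same patterns; only the max pmi over the whole set changes.
small_set = instance_reliability(pmi, patterns, max_pmi=2.0)
large_set = instance_reliability(pmi, patterns, max_pmi=4.0)
print(small_set, large_set)
```

Doubling the max pmi halves the score of an otherwise unchanged instance, which matches the inverse relationship observed in the experiment.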
Experiment #3: Extraction on PubMed Corpus
• PubMed corpus used.
• Want to test the performance of the system on a corpus of a
  different type and size
• Algorithm was run with input seed relations on this corpus
  and the 10 verbs with the highest rπ values were extracted
• We constructed a test sentence set of 80 sentences (8
  sentences for every relevant verb)
• We then extracted a total of 162 relations from this test set
  and calculated their ri scores.
• The average ri score was used as the threshold value
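
Using the average ri score as the cutoff can be sketched as follows. The relations and scores are made up, the function name is hypothetical, and keeping only scores strictly greater than the mean is an assumption based on the wording used for Experiment #1.

```python
from statistics import mean
from typing import Dict, List, Tuple

Relation = Tuple[str, str]  # (subject, object)


def filter_by_mean_score(scores: Dict[Relation, float]) -> List[Relation]:
    """Keep relations whose score is above the average score of the set."""
    threshold = mean(list(scores.values()))
    return [rel for rel, score in scores.items() if score > threshold]


# Made-up ri scores for three candidate relations.
scores = {
    ("Tylenol", "Fever"): 0.8,
    ("Water", "Fever"): 0.1,
    ("Zoloft", "Depression"): 0.6,
}
print(filter_by_mean_score(scores))
```

With these values the average is 0.5, so the two higher-scoring relations are kept and the low-scoring one is dropped.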
Experiment #3 Results
Comparison Over Both Corpora
Experiment #3 Analysis

• Performance is worse on PubMed corpus.
• Patent corpus dealt with drugs and cures for diseases.
• Therefore, there was an abundance of treatment type
  relations in patent corpus.
• PubMed had more general medical data and only
  contained abstracts => less info.
• Therefore, there were fewer treatment relations in
  PubMed which affected performance.
Comparison with Previous Work




               * signifies our contribution
Analysis

• F-score of Ganapathi's version of Espresso fell
  nearly 10% → due to lower recall, as predicted.
• Results of extension over the re-tagged data are
  on par with Ganapathi's original results.
• Given that Ganapathi's system dropped nearly 10% on
  the re-tagged data, this suggests that the extension
  is more general purpose than the original version.
Success

• Recall of system is more important than
  precision, especially when it comes to using
  relationships as a feature in iScore.
• Method is almost completely automated.
• Easily expanded to extract other relationship types by
  changing the input seed relations.
• Initial results seem insignificant, but analysis indicates
  that the extended system has the potential to be a general-
  purpose relationship extraction feature.
Future Work

• Development of a relationship feature extractor
  for iScore.
• Relations will have to be syntactically and
  semantically compared with relations present in
  other articles and the best article matches will be
  returned as “interesting” choices for a user.
• Optimizations: algorithm design
  improvements, database connection
  optimizations and parallelization.

Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 

Extending the Espresso Method for Greater Recall

  • 1. Relationship Extraction from Text Extending the Espresso Method for Greater Recall Derek Springer UCLA Computer Science Department November 19, 2009
  • 2. Related Works • Ganapathi, Swathi. “Relationship Extraction from Text: Comparison and Experimental Evaluation of the State-of-the-Art.” UCLA comp exam. March 2009. • Chu, A., Sakurai, S., Cárdenas, A. F., "Automatic Detection of Treatment Relationships in Patent Retrieval." 2008 CIKM Patent Information Retrieval Workshop. October 2008.
  • 3. Related Works, cont'd • Girju, R. "Automatic Detection of Causal Relations for Question Answering." In the proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003). Workshop on "Multilingual Summarization and Question Answering - Machine Learning and Beyond". 2003. • Pantel, Patrick and Pennacchiotti, Marco. "Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations." In Proceedings of Conference on Computational Linguistics / Association for Computational Linguistics (COLING/ACL-06). pp. 113-120. Sydney, Australia. 2006.
  • 4. Relationship Extraction • The task of recognizing the assertion of a particular relationship between two or more entities in text. • Can aid in the development of standalone, intelligent, automated and adaptable user-specific content retrieval systems. • We focus on extracting treatment relationships → A (subject) used to treat B (object).
  • 5. Goals and Contributions • Extended the state-of-the-art Espresso relationship extraction system originally implemented by Ganapathi. • Performed an in-depth experimental evaluation of the developed system, comparing it to prior work (Chu, Ganapathi). • Future goal is to use the system developed here as a plug-in relationship feature extractor for iScore.
  • 6. Integration Into iScore • iScore presents additional articles based on an aggregate score of “interestingness.” • We believe filtering articles based on relationships can improve the results of iScore. • We hypothesize that extending the Espresso system implemented by Swathi Ganapathi will improve the ability of a system such as iScore to utilize relationship extraction as a feature.
  • 7. Comparison Criteria • Performance: Want system to have high precision and recall • Minimal Supervision: Want system to require little to no human supervision • Breadth: Want system to extract relations from varying corpus sizes, domains and formats. • Generality: Want system to extract wide variety of relation types without losing its edge in any of the above criteria.
  • 8. The Espresso Algorithm • General purpose algorithm which can be used to extract a wide variety of binary relations. • Requires minimal supervision. Only input is a small seed set of known relations. • By looking at individual sentences in detecting relationships, works well on all kinds of corpora. • On tests conducted by the creators of the algorithm, Espresso generated balanced precision and recall.
  • 10. Extending Espresso • Ganapathi's implementation: 37.8% • Extension: 91.2%
  • 11. Ganapathi's Implementation • Ganapathi's approach uses lexico-syntactic patterns of the form NP1 VP NP2 (Verb category in Table 1). • VP contains treatment verb or pattern and the two NPs would contain the subject and object. • This structure is a very common relationship, accounting for 37.8% of all relationships.
  • 12. Extension • A large number of relationships remain that may provide fruitful results. • Expanding the implementation to include the following relationship types: - Noun+Prep, e.g. "X settlement with Y" - Verb+Prep, e.g. "X moved to Y" - Infinitive, e.g. "X plans to acquire Y" - Modifier, e.g. "X is Y winner" • Retrieves 91.2% of common relationships.
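The pattern categories listed on slides 11 and 12 can be illustrated with a toy classifier. This is only a sketch under strong assumptions: the coarse part-of-speech tags are supplied by hand, the tag names (`VERB`, `NOUN`, `PREP`, `TO`) and the function `classify_pattern` are hypothetical, and the slides do not describe how the actual system matches these patterns.

```python
def classify_pattern(between, after=()):
    """Map hand-assigned coarse POS tags to a pattern category.

    between: tags of the tokens between entity X and entity Y.
    after:   tags of the tokens that directly follow Y.
    Returns the category name, or None if nothing matches.
    """
    between, after = tuple(between), tuple(after)
    # "X is Y winner": a single verb, with a noun right after Y.
    if between == ("VERB",) and after[:1] == ("NOUN",):
        return "Modifier"
    # "NP1 VP NP2": the Verb category of the baseline implementation.
    if between == ("VERB",):
        return "Verb"
    # "X settlement with Y"
    if between == ("NOUN", "PREP"):
        return "Noun+Prep"
    # "X moved to Y"
    if between == ("VERB", "PREP"):
        return "Verb+Prep"
    # "X plans to acquire Y"
    if between == ("VERB", "TO", "VERB"):
        return "Infinitive"
    return None


# The example phrases from the slide, tagged by hand.
print(classify_pattern(["NOUN", "PREP"]))          # X settlement with Y
print(classify_pattern(["VERB", "PREP"]))          # X moved to Y
print(classify_pattern(["VERB", "TO", "VERB"]))    # X plans to acquire Y
print(classify_pattern(["VERB"], after=["NOUN"]))  # X is Y winner
```

A real system would obtain the tags from a part-of-speech tagger or chunker rather than from hand-written lists.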
  • 13. Test Corpora • Patent Corpus: Developed by Shige o 50,000 drug patent documents from 2008 from Class 424 & 514 of the U.S. Patents Classification: “drug, bio-affecting and body treating compositions” and their subclasses. o Patents were pre-filtered to only contain keywords “diabetes”, “metastatic”, “cancer”, “tuberculosis”, “lung”, “bronchitis”, “coronary artery” o All sentences from each document were added to a sentence table in the schema • PubMed Corpus: Developed by Gustavo o Comprised of medical abstracts from PubMed o Each abstract was parsed and all sentences from each abstract were stored as individual tuples in the sentence table
  • 15. Seed Treatment Relationships • (Xanax, Anxiety) • (Glycoside, Depression) • (Ambien, Insomnia) • (Ibuprofen, Arthritis) • (Effexor, Depression) • (Ibuprofen, Headache) • (Paxil, Depression) • (Tylenol, Fever) • (Lexapro, Depression) • (Tylenol, Headache) • (Caffeine, Depression) • (Antibody, Inflammation) • (Zoloft, Depression) • (Ibuprofen, Inflammation) • (Imipramine, Depression) • (Surgery, Glaucoma)
  • 16. Procedure 1. Re-tag original data set to incorporate extended relationship types. 2. Re-run Ganapathi's baseline Espresso implementation to compare against updated data set. 3. Run extended Espresso implementation to compare against updated data set.
  • 17. Experiment #1: Extraction on Drug Patent Corpus • Drug Patent corpus used. • Algorithm was run with the seed relations, and 12 verbs were extracted as relevant (verbs with rπ greater than 0.2). • These treatment verbs were used to create a test set of 120 sentences, i.e. 10 sentences containing each relevant treatment verb. • 358 possible relations were extracted, and the ri score was calculated for each. • 208 relations had an ri score greater than the threshold, of which 126 were correct according to manual tagging. • Of the original 358 relations, manual tagging determined that 213 were correct treatment relations.
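The counts on this slide are enough to work out precision, recall and F-score under the standard definitions. The sketch below assumes that the 126 correct relations above the threshold are the true positives, the 208 relations above the threshold are all predictions, and the 213 manually tagged correct relations are all actual positives. The function name is illustrative, and the slides do not state that the reported figures were computed in exactly this way.

```python
def precision_recall_f1(true_positives, predicted, actual):
    """Standard precision, recall and balanced F-score from raw counts."""
    precision = true_positives / predicted
    recall = true_positives / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from Experiment #1 (interpretation assumed, see the lead-in).
p, r, f = precision_recall_f1(true_positives=126, predicted=208, actual=213)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# prints: precision=0.606 recall=0.592 f1=0.599
```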
  • 19. Experiment #2: Number of Relationships and Performance • Drug Patent corpus used. • Test the performance of the system under smaller and larger data loads. • Started with initial set of 120 sentences obtained from Drug Patent corpus (10 sentences for each verb, 12 verbs as in test #1) • Increased the number of sentences for each verb by 10 in each case, so that we had sentence sets of 240 and 360 sentences each
  • 21. Experiment #2 Analysis • Performance of the system and the number of relationships are inversely related. • ri scores are inversely affected by the max pmi across all relationship instances, so having more relationship instances in a set may lower the ri for all of those relationships. • More relationships => chance of a greater max pmi => lower ri for all relationship instances. • Not worried → articles likely won't have 200 relations of the same type.
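The effect of the max pmi on ri can be seen in a small numeric sketch. It assumes the instance-reliability definition from the Espresso paper (Pantel and Pennacchiotti 2006), in which each pattern's pmi with the instance is normalized by the max pmi, weighted by the pattern reliability rπ, and averaged over the patterns. The slides do not spell out the formula, and all numbers and names below are hypothetical.

```python
def instance_reliability(pmi_with_patterns, pattern_reliability, max_pmi):
    """r_i = (1 / |P|) * sum over p of (pmi(i, p) / max_pmi) * r_pi(p)."""
    total = sum(
        (pmi / max_pmi) * pattern_reliability[pattern]
        for pattern, pmi in pmi_with_patterns.items()
    )
    return total / len(pattern_reliability)


# Hypothetical pattern reliabilities and pmi values for one instance.
r_pi = {"treats": 0.9, "is used for": 0.5}
pmi_i = {"treats": 2.0, "is used for": 1.0}

# Same instance, scored in a small set and in a larger set whose
# max pmi happens to be twice as high.
small_set = instance_reliability(pmi_i, r_pi, max_pmi=2.0)
large_set = instance_reliability(pmi_i, r_pi, max_pmi=4.0)
print(small_set, large_set)
```

Doubling the max pmi halves the score of an otherwise unchanged instance, which is the mechanism the analysis above suggests.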
  • 22. Experiment #3: Extraction on PubMed Corpus • PubMed corpus used. • Want to test the performance of the system on a corpus of a different type and size • Algorithm was run with the input seed relations on this corpus, and the 10 verbs with the highest rπ values were extracted • We constructed a test sentence set of 80 sentences (8 sentences for every relevant verb) • We then extracted a total of 162 relations from this test set and calculated their ri scores. • The average ri score was used as the threshold value
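Using the average ri score as the cut-off can be sketched as follows. The relation tuples and scores are made up for illustration, the function name is hypothetical, and keeping only scores strictly greater than the mean is an assumption based on the "greater than the threshold" wording of Experiment #1.

```python
from statistics import mean


def filter_by_mean_score(scored_relations):
    """Keep relations whose score is strictly above the mean score."""
    threshold = mean(score for _, score in scored_relations)
    kept = [rel for rel, score in scored_relations if score > threshold]
    return threshold, kept


# Hypothetical (relation, ri score) pairs.
scored = [
    (("Ibuprofen", "Headache"), 0.75),
    (("Caffeine", "Depression"), 0.25),
    (("Tylenol", "Fever"), 0.5),
]

threshold, kept = filter_by_mean_score(scored)
print(threshold, kept)
```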
  • 25. Experiment #3 Analysis • Performance is worse on PubMed corpus. • Patent corpus dealt with drugs and cures for diseases. • Therefore, there was an abundance of treatment type relations in patent corpus. • PubMed had more general medical data and only contained abstracts => less info. • Therefore, there were fewer treatment relations in PubMed which affected performance.
  • 26. Comparison with Previous Work * signifies our contribution
  • 27. Analysis • F-score of Ganapathi's version of Espresso fell nearly 10% → due to lower recall, as predicted. • Results of extension over the re-tagged data are on par with Ganapathi's original results. • When you consider that Ganapathi's system dropped nearly 10%, it seems to indicate the increased general purpose nature of the extension over the original version.
  • 28. Success • Recall of system is more important than precision, especially when it comes to using relationships as a feature in iScore. • Method is almost completely automated. • Easily expanded to extract other relationship types by changing the input seed relations. • Initial results seem insignificant, but analysis indicates that extended system has the potential to be a general-purpose relationship extraction feature.
  • 29. Future Work • Development of a relationship feature extractor for iScore. • Relations will have to be syntactically and semantically compared with relations present in other articles and the best article matches will be returned as “interesting” choices for a user. • Optimizations: algorithm design improvements, database connection optimizations and parallelization.