
Seed Selection for Distantly Supervised Web-Based Relation Extraction

Slides of my presentation on "Seed Selection for Distantly Supervised Web-Based Relation Extraction" at the Semantic Web and Information Extraction (SWAIE) workshop at COLING 2014.
Download link for the paper: http://staffwww.dcs.shef.ac.uk/people/I.Augenstein/SWAIE2014-Seed.pdf


  1. Seed Selection for Distantly Supervised Web-Based Relation Extraction
     Isabelle Augenstein
     Department of Computer Science, University of Sheffield, UK
     i.augenstein@dcs.shef.ac.uk
     August 24, 2014
     Semantic Web and Information Extraction (SWAIE) Workshop, COLING 2014
  2. Motivation
     • Goal: extraction of relations in text on Web pages (e.g. Mashable) with respect to a knowledge base (e.g. Freebase)
     Seed Selection for Distantly Supervised Web-Based Relation Extraction (SWAIE Workshop), Isabelle Augenstein
  3. Motivation
     • Goal: extraction of relations in text on Web pages (e.g. Mashable) with respect to a knowledge base (e.g. Freebase)
     • What are possible methodologies?
       • Supervised learning: manually annotate text, train machine learning classifier
       • Unsupervised learning: extract language patterns, cluster similar ones
       • Semi-supervised learning: start with a small number of language patterns, iteratively learn more (bootstrapping)
       • Distant supervision: automatically label text with relations from knowledge base, train machine learning classifier
  4. Motivation
     • Goal: extraction of relations in text on Web pages (e.g. Mashable) with respect to a knowledge base (e.g. Freebase)
     • What are possible methodologies?
       • Supervised learning: manually annotate text, train machine learning classifier -> manual effort
       • Unsupervised learning: extract language patterns, cluster similar ones -> difficult to map to KB, lower precision than supervised methods
       • Semi-supervised learning: start with a small number of language patterns, iteratively learn more (bootstrapping) -> still manual effort, semantic drift (unwanted shift in meaning)
       • Distant supervision: automatically label text with relations from knowledge base, train machine learning classifier -> allows extracting relations with respect to the KB, reasonably high precision, no manual effort
  5. Distant Supervision
     Creating positive & negative training examples -> Feature Extraction -> Classifier Training -> Prediction of New Relations
  6. Distant Supervision
     Creating positive & negative training examples -> Feature Extraction -> Classifier Training -> Prediction of New Relations
     Distant supervision = Supervised learning + Automatically generated training data
  7. Generating training data
     "If two entities participate in a relation, any sentence that contains those two entities might express that relation." (Mintz et al., 2009)
     • Amy Jade Winehouse was a singer and songwriter known for her eclectic mix of musical genres including R&B, soul and jazz.
     • Blur helped to popularise the Britpop genre.
     • Beckham rose to fame with the all-female pop group Spice Girls.

     Name                                       Genre
     Amy Winehouse / Amy Jade Winehouse / Wino  R&B, soul, jazz
     Blur                                       Britpop
     Spice Girls                                pop
     (the name column lists different lexicalisations of the same entity)
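The labelling heuristic on this slide can be sketched in a few lines of Python. This is a hypothetical minimal version: the `KB` structure, function name, and example sentences are illustrative, not the paper's actual implementation, and matching is done by naive substring search.

```python
# Minimal sketch of the distant supervision labelling heuristic
# (illustrative only; the KB data structure is an assumption).

KB = [
    # (subject lexicalisations, {relation: object lexicalisations})
    (["Amy Winehouse", "Amy Jade Winehouse", "Wino"],
     {"genre": ["R&B", "soul", "jazz"]}),
    (["Blur"], {"genre": ["Britpop"]}),
    (["Spice Girls"], {"genre": ["pop"]}),
]

def label_sentences(kb, sentences):
    """Label a sentence as a positive example for a relation if it
    mentions both a subject lexicalisation and one of that subject's
    object lexicalisations for the relation."""
    examples = []
    for sentence in sentences:
        for subj_lexes, relations in kb:
            if not any(s in sentence for s in subj_lexes):
                continue
            for relation, obj_lexes in relations.items():
                for obj in obj_lexes:
                    if obj in sentence:
                        examples.append((sentence, relation, obj))
    return examples

sentences = [
    "Amy Jade Winehouse was a singer known for R&B, soul and jazz.",
    "Blur helped to popularise the Britpop genre.",
]
labels = label_sentences(KB, sentences)
```

The first sentence yields three positive genre examples (R&B, soul, jazz), the second one (Britpop); this is exactly the automatically generated training data that replaces manual annotation.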
  8. Generating training data: is it that easy?
     Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.

     Name         Album      Track
     The Beatles  Let It Be  Let It Be
  9. Generating training data: is it that easy?
     Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.

     Name         Album      Track
     The Beatles  Let It Be  Let It Be

     • Use 'Let It Be' mentions as positive training examples for album or for track?
     • Problem: if both mentions of 'Let It Be' are used to extract features for both album and track, wrong weights are learnt
     • How can such ambiguous examples be detected?
     • Develop methods to detect, then automatically discard, potentially ambiguous positive and negative training data
  10. Seed Selection: ambiguity within an entity
      • Example: Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.
      • Let It Be can be both an album and a track of the musical artist The Beatles
      • For every relation, consisting of a subject, a property and an object (s, p, o): is the subject related to (at least) two different objects with the same lexicalisation which express two different relations?
      • Unam:
        • Retrieve the number of such senses using the Freebase API
        • Discard the lexicalisation of the object as positive training data if it has at least two different senses within an entity
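A minimal sketch of the Unam filter, assuming the Freebase API lookup has already been done and the relations of one subject are given as in-memory (relation, object lexicalisation) pairs:

```python
from collections import defaultdict

def unam_filter(entity_relations):
    """Unam: for a single subject, discard an object lexicalisation as a
    positive seed if it occurs as the object of at least two different
    relations of that subject, i.e. it has two senses within the entity."""
    senses = defaultdict(set)
    for relation, obj in entity_relations:
        senses[obj].add(relation)
    return [(relation, obj) for relation, obj in entity_relations
            if len(senses[obj]) == 1]

# The Beatles: "Let It Be" is both an album and a track, so both
# occurrences are discarded; "Abbey Road" is unambiguous and kept.
beatles = [("album", "Let It Be"), ("track", "Let It Be"),
           ("album", "Abbey Road")]
kept = unam_filter(beatles)
```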
  11. Seed Selection: ambiguity across classes
      • Example: common names of book authors or common genres, e.g. "Jack mentioned that he read On the Road", in which Jack is falsely recognised as the author Jack Kerouac.
      • Stop: remove common words that are stopwords
      • Stat: estimate how ambiguous a lexicalisation of an object is compared to other lexicalisations of objects of the same relation
        • For every lexicalisation of an object of a relation, retrieve the number of senses using the Freebase API (example: for Jack, n=1066)
        • Compute a frequency distribution per relation with min, max, median (50th percentile), lower quartile (25th percentile) and upper quartile (75th percentile) (example: for author: min=0, max=3059, median=10, lower=4, upper=32)
        • For every lexicalisation of an object of a relation, if the number of senses > upper quartile (or the lower quartile, or median, depending on the model), discard it (example: 1066 > 32 -> Jack is discarded)
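The Stat heuristic can be sketched as follows. This is a hypothetical minimal version: the sense counts would come from the Freebase API but are a plain dict here, and the nearest-rank percentile computation is my assumption since the slide does not fix an exact quartile method.

```python
import math

def stat_filter(lex_senses, percentile=75):
    """Stat: keep only object lexicalisations whose number of senses does
    not exceed the given percentile of the relation's sense-count
    distribution (percentile 75/50/25 -> Stat75/Stat50/Stat25)."""
    counts = sorted(lex_senses.values())
    # Nearest-rank percentile (an assumption about the exact method).
    idx = max(0, math.ceil(percentile / 100 * len(counts)) - 1)
    threshold = counts[idx]
    return {lex for lex, n in lex_senses.items() if n <= threshold}

# "Jack" has far more Freebase senses than typical author-name
# lexicalisations, so it is discarded as a seed for the author relation.
author_senses = {"Jack": 1066, "Kerouac": 3, "Austen": 2,
                 "Tolstoy": 4, "Atwood": 5}
kept = stat_filter(author_senses, percentile=75)
```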
  12. Seed Selection: discarding negative seeds
      • Creating negative training data: all entities which appear in the sentence with the subject, but are not in a relation with it, are used as negative training data
      • Problem: knowledge bases are incomplete
      • Idea: object lexicalisations are often shared across entities, e.g. for the relation genre
        • Check if an unknown lexicalisation is a lexicalisation of a different relation
      • Incomp: for every lexicalisation l of a property, discard it as negative training data if any of the properties of the class we examine has an object lexicalisation l
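A minimal sketch of the Incomp filter, under the assumption that `class_object_lexes` holds the object lexicalisations of all properties of the class under consideration (all names are illustrative):

```python
def incomp_filter(candidate_negatives, class_object_lexes):
    """Incomp: discard a candidate negative example if its lexicalisation
    is a known object lexicalisation of any property of the class --
    the KB may simply be incomplete for this particular subject."""
    return [neg for neg in candidate_negatives
            if neg not in class_object_lexes]

# "indie rock" is a known genre lexicalisation (of other artists), so it
# is too risky to use as a negative example; "London" is kept.
kept = incomp_filter(["indie rock", "London"],
                     {"indie rock", "Britpop", "pop"})
```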
  13. Distant Supervision system: corpus
      • Web crawl corpus, created using entity-specific search queries, consisting of 450k Web pages

      Class           Property / Relation
      Book            author, characters, publication date, genre, ISBN, original language
      Musical Artist  album, active (start), active (end), genre, record label, origin, track
      Politician      birthdate, birthplace, educational institution, nationality, party, religion, spouses
  14. Distant Supervision system: relation candidate identification
      • Two-step process: recognise entities, then check if they appear in Freebase
      • Use Stanford 7-class NERC to identify named entities (NEs)
      • Problem: domain-specific entities (e.g. album, track) are often not recognised
      • Solution: use a heuristic to recognise (but not classify) more NEs
        • Get all sequences of capitalised words, and noun sequences
        • Every subsequence of those sequences is a potential NE: "pop music" -> "pop music", "pop", "music"
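The capitalised-word half of this heuristic can be sketched as below (a hypothetical minimal version; extracting noun sequences would additionally need a POS tagger, so that part is omitted):

```python
def candidate_entities(sentence):
    """Collect maximal runs of capitalised words, then emit every
    contiguous subsequence of each run as a potential named entity."""
    runs, run = [], []
    for token in sentence.split():
        word = token.strip(".,!?;:")
        if word[:1].isupper():
            run.append(word)
        elif run:
            runs.append(run)
            run = []
    if run:
        runs.append(run)
    candidates = set()
    for run in runs:
        for i in range(len(run)):
            for j in range(i + 1, len(run) + 1):
                candidates.add(" ".join(run[i:j]))
    return candidates

cands = candidate_entities("Beckham rose to fame with the Spice Girls.")
```

The candidates would then be looked up in Freebase, which filters out spurious subsequences while still catching domain-specific entities the NERC misses.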
  15. Distant Supervision system
      • Features: standard relation extraction features, including BOW features, POS-based features, NE class
      • Classifier: first-order CRF
      • Model: one model per Freebase class to classify a relation candidate (occurrence) into one of the different properties or NONE; apply to the respective part of the corpus
      • Seed Selection: apply the different seed selection methods Unam, Stop, Stat75, Stat50, Stat25, Incomp
      • Merging and Ranking: aggregate predictions of occurrences with the same surface form
        • E.g. Dublin could have the predictions MusicalArtist:album, origin and NONE
        • Compute the mean of the confidence values per prediction, select the highest-ranked one
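The merging-and-ranking step can be sketched like this (illustrative only; the confidence values and relation names in the example are made up):

```python
from collections import defaultdict

def merge_and_rank(occurrences):
    """Aggregate predictions for occurrences with the same surface form:
    average the confidence per predicted relation, then keep the
    highest-scoring relation for each surface form."""
    by_surface = defaultdict(lambda: defaultdict(list))
    for surface, relation, confidence in occurrences:
        by_surface[surface][relation].append(confidence)
    best = {}
    for surface, rels in by_surface.items():
        means = {rel: sum(cs) / len(cs) for rel, cs in rels.items()}
        best[surface] = max(means, key=means.get)
    return best

# "Dublin" gets conflicting predictions across its occurrences;
# averaging per relation resolves the conflict.
predictions = [("Dublin", "MusicalArtist:origin", 0.8),
               ("Dublin", "MusicalArtist:album", 0.4),
               ("Dublin", "MusicalArtist:origin", 0.6),
               ("Dublin", "NONE", 0.3)]
best = merge_and_rank(predictions)
```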
  16. Evaluation
      • Split the corpus equally for training and test
      • Hand-annotate the portion of the test corpus which has a NONE prediction (no representation in Freebase)
  17. Results: precision per model, ranked by confidence
      (Plot: y-axis: precision; x-axis: minimum confidence, e.g. 0.1 covers all occurrences with confidence >= 0.1)
      Overall precision:
      • unam_stop_stat25: 0.896
      • unam_stop_stat50: 0.882
      • unam_stop_stat75: 0.873
      • unam_stop: 0.842
      • stop: 0.84
      • baseline: 0.825
      • incomp: 0.74
  18. Results Summary
      • The best-performing model (unam_stop_stat25) has a precision of 0.896, compared to the baseline model's precision of 0.825, reducing the error rate by 35%
      • However, those seed selection methods all come at a small loss in the number of extractions (20% fewer) because they reduce the amount of training data
      • Removing potentially false negative training data (incomp) does not perform well
        • Too many training examples are removed
        • The removed training examples are lexicalisations which share the same types of values; those are crucial for learning
        • Especially poor performance for numerical values
  19. Related work on dealing with noise for distant supervision
      • At-least-one models:
        • Relaxed distant supervision assumption: assume that just one of the relation mentions is a true relation mention
        • Graphical models, inference based on ranking
        • Very challenging to train
      • Hierarchical topic models:
        • Only learn from positive training examples
        • Pre-processing with a multi-layer topic model to group extraction patterns, to determine which ones are specific to each relation and which are not
      • Pattern correlations:
        • Probabilistic graphical model to group extraction patterns
      • Information Retrieval approach:
        • Pseudo relevance feedback, re-rank extraction patterns
  20. Comparison of our approach with related approaches to ambiguity
      • Related approaches all try to solve the problem of ambiguity within the machine learning model
      • Our approach deals with ambiguity as a pre-processing step for creating training data
      • While related approaches address the problem of noisy data by using more complicated models, we explored how to exploit background data from the KB even further
      • We explored how simple statistical methods based on data already present in the knowledge base can help to filter unreliable training data
  21. Conclusions
      • Simple statistical methods based on background knowledge present in the KB perform well at detecting ambiguous training examples
      • An error reduction of up to 35% can be achieved by strategically selecting seed data
      • The increase in precision is encouraging; however, it comes at the expense of the number of extractions (20% fewer extractions)
      • Higher recall could be achieved by increasing the number of training instances initially
        • Use a bigger corpus
        • Make better use of the knowledge contained in the corpus
  22. Future Work
      • Distantly supervised named entity classification for relation extraction to improve performance for non-standard entities; joint models for NER and RE
      • Relax the distant supervision assumption to achieve a higher number of extractions: extract relations across sentence boundaries, coreference resolution
      • Combined extraction models for information from text, lists and tables on Web pages to improve precision and recall
  23. Thank you for your attention! Questions?
      Isabelle Augenstein
