
Relation Extraction from the Web using Distant Supervision


Slides of my presentation on "Relation Extraction from the Web using Distant Supervision" at EKAW 2014. Download link for the paper: http://staffwww.dcs.shef.ac.uk/people/I.Augenstein/EKAW2014-Relation.pdf



  1. Relation Extraction from the Web using Distant Supervision
     Isabelle Augenstein, Diana Maynard, Fabio Ciravegna
     Department of Computer Science, University of Sheffield, UK
     {i.augenstein,d.maynard,f.ciravegna}@dcs.shef.ac.uk
     EKAW 2014, November 28, 2014
  2. Problem
     • Large knowledge bases are useful for search, question answering etc., but far from complete
     • Approach: automatic knowledge base population (KBP) methods using Web information extraction (IE)
       1) Extracting entities and relations between them from text on Web pages
       2) Combining information from several sources to populate KBs
  3. Motivation
     • Why can’t we just use existing tool X?
  4. Existing Approaches
     • Why can’t we just use existing tool X?
     • IE methods requiring manual effort
       • Manually crafted extraction patterns, e.g. “X is a professor at Y”
       • Supervised learning: statistical models with manually annotated training data as input
       → Biased towards a domain, e.g. biology, newswire, Wikipedia
     • IE methods requiring no manual effort
       • Unsupervised learning: discovering patterns, clustering
       → Difficult to map to a schema
       • Bootstrapping: learning patterns iteratively, starting with prior knowledge, e.g. a list of names
       → “Semantic drift”
  5. Proposed Approach
     • Requirements
       • Works for Web text
       • Extracts with respect to a knowledge base
       • No manual effort required
     • What can we do?
       • Use the knowledge base to train a statistical model
       • Distant supervision: automatically label text with relations from the knowledge base, then train a machine learning classifier
       → Extracts relations with respect to the KB, no manual effort
  6. Distant Supervision
     Pipeline: Creating positive & negative training examples → Feature Extraction → Classifier Training → Prediction of New Relations
  7. Distant Supervision
     Pipeline: Creating positive & negative training examples → Feature Extraction → Classifier Training → Prediction of New Relations
     Distant supervision = automatically generated training data + supervised learning
  8. Distant Supervision
     “If two entities participate in a relation, any sentence that contains those two entities might express that relation.” (Mintz et al., 2009)
     Example sentences:
     • Amy Jade Winehouse was a singer and songwriter known for her eclectic mix of musical genres including R&B, soul and jazz.
     • Blur helped to popularise the Britpop genre.
     • Beckham rose to fame with the all-female pop group Spice Girls.
     Knowledge base excerpt (note the different lexicalisations of the same entity):

     Name                                        Genre
     Amy Winehouse / Amy Jade Winehouse / Wino   R&B, soul, jazz
     Blur                                        Britpop
     Spice Girls                                 pop
     …                                           …
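The Mintz heuristic amounts to a simple matching step: pair KB triples with sentences that mention both entities. A minimal sketch in Python; the triples, lexicalisation lists and sentences below are toy illustrations, not the paper's corpus:

```python
# Distant-supervision labelling sketch: any sentence containing both the
# subject (under any lexicalisation) and an object of a KB relation is
# labelled as a positive training example for that relation.
kb = [
    # (subject, relation, object lexicalisations)
    ("Amy Winehouse", "genre", {"R&B", "soul", "jazz"}),
    ("Blur", "genre", {"Britpop"}),
    ("Spice Girls", "genre", {"pop"}),
]

# Lexicalisations of the subjects themselves (cf. the slide's
# "Amy Winehouse" / "Amy Jade Winehouse" / "Wino" example).
subject_lex = {
    "Amy Winehouse": {"Amy Winehouse", "Amy Jade Winehouse", "Wino"},
    "Blur": {"Blur"},
    "Spice Girls": {"Spice Girls"},
}

def label_sentences(sentences):
    """Return (sentence, subject, relation, object) training examples."""
    examples = []
    for sent in sentences:
        for subj, rel, objects in kb:
            if not any(lex in sent for lex in subject_lex[subj]):
                continue
            for obj in objects:
                if obj in sent:
                    examples.append((sent, subj, rel, obj))
    return examples

sents = [
    "Amy Jade Winehouse was a singer known for her mix of R&B, soul and jazz.",
    "Blur helped to popularise the Britpop genre.",
]
for ex in label_sentences(sents):
    print(ex[1], ex[2], ex[3])
```

Naive substring matching stands in for the entity recognition step described on the next slides; a real system would use a named entity recogniser here.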
  9. Distant Supervision System
     • Collect corpus
       • From the Web, using search patterns containing the relation
     • Relation identification
       • Recognise all entities in sentences
       • Check if sentences contain the subject and object of relations
     • Seed selection
       • Discover, then discard potentially noisy training data
     • Extract features
       • Standard features: context words, part-of-speech tags (noun, verb) etc.
     • Train classifier
     • Apply to held-out part of the corpus
       • Same relation identification procedure as for training data
       • Extracting relations across sentence boundaries
     • Integrate / combine results
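The "standard features" step can be illustrated with a toy context-word extractor. The window size, tokenisation and feature naming below are my own simplification for illustration (part-of-speech tags omitted), not the paper's exact feature set:

```python
def extract_features(sentence, subj, obj, window=2):
    """Bag of context-word features around the subject and object mentions."""
    tokens = sentence.replace(",", " ").split()
    feats = set()
    for target, prefix in ((subj, "subj"), (obj, "obj")):
        words = target.split()
        if words[0] not in tokens:
            continue
        i = tokens.index(words[0])
        lo = max(0, i - window)
        hi = min(len(tokens), i + len(words) + window)
        # words before and after the mention, within the window
        for tok in tokens[lo:i] + tokens[i + len(words):hi]:
            feats.add(f"{prefix}_ctx:{tok.lower()}")
    return feats

print(sorted(extract_features(
    "Blur helped to popularise the Britpop genre.", "Blur", "Britpop")))
```

Each labelled example would be mapped to such a feature set and fed to the classifier in the training step.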
  10. Distant Supervision System (research described in the paper)
      • Collect corpus: from the Web, using search patterns containing the relation
      • Relation identification: recognise all entities in sentences; check if sentences contain the subject and object of relations
      • Seed selection: discover, then discard potentially noisy training data
      • Extract features: standard features
      • Train classifier
      • Apply to held-out part of the corpus: same procedure as for training data; extracting relations across sentence boundaries
      • Integrate / combine results
  11. Evaluation: Corpus
      • Web crawl corpus, created using entity-specific search queries, consisting of 1 million Web pages

      Class                    Property / Relation
      Book                     author, characters, publication date, genre, original language
      Musical Artist           album, active (start), active (end), genre, record label, origin, track
      Film                     release date, director, producer, language, genre, actor, character
      Politician               birthdate, birthplace, educational institution, nationality, party, religion, spouses
      Organisation             industry, employees, city, country, date founded, founders
      Educational Institution  school type, mascot, colours, city, country, date founded
      River                    origin, mouth, length, basin countries, contained by
  12. Seed Selection
      Generating training data: is it that easy?
      “Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.”

      Name          Album        Track
      The Beatles   Let It Be    Let It Be
      …             …            …
  13. Seed Selection
      Generating training data: is it that easy?
      “Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.”

      Name          Album        Track
      The Beatles   Let It Be    Let It Be
      …             …            …

      • Use ‘Let It Be’ mentions as positive training examples for album or for track?
      • Problem: if both mentions of ‘Let It Be’ are used to extract features for both album and track, wrong weights are learnt
      • How can such ambiguous examples be detected?
      • Develop methods to detect, then automatically discard potentially ambiguous training data
  14. Seed Selection: Ambiguity within an entity
      • Example: “Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.”
      • Let It Be can be both an album and a track of the musical artist The Beatles
      • For every relation, consisting of a subject, a property and an object (s, p, o): is the subject related to (at least) two different objects with the same lexicalisation, expressing two different relations?
      • Unam:
        • Retrieve the number of such senses using the Freebase API
        • Discard the lexicalisation of the object as positive training data if it has at least two different senses within an entity
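The Unam check can be sketched locally. A small set of in-memory triples stands in for the Freebase API calls used in the paper (an external dependency); the data is illustrative:

```python
from collections import defaultdict

def ambiguous_within_entity(triples):
    """Return (subject, object lexicalisation) pairs that express two or more
    different relations for the same subject; Unam discards these as seeds."""
    senses = defaultdict(set)  # (subject, object lexicalisation) -> relations
    for subj, rel, obj in triples:
        senses[(subj, obj)].add(rel)
    return {key for key, rels in senses.items() if len(rels) >= 2}

triples = [
    ("The Beatles", "album", "Let It Be"),
    ("The Beatles", "track", "Let It Be"),
    ("The Beatles", "album", "Abbey Road"),
]
# 'Let It Be' is both an album and a track of The Beatles, so it is flagged
print(ambiguous_within_entity(triples))
```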
  15. Seed Selection: Ambiguity across classes
      • Example: common names of book authors or common genres, e.g. “Jack mentioned that he read On the Road”, in which Jack is falsely recognised as the author Jack Kerouac
      • Stop: remove common words that are stopwords
      • Stat: estimate how ambiguous a lexicalisation of an object is compared to other lexicalisations of objects of the same relation
        • For every lexicalisation of an object of a relation, retrieve the number of senses using the Freebase API (example: for Jack, n = 1066)
        • Compute a frequency distribution per relation with min, max, median (50th percentile), lower quartile (25th percentile) and upper quartile (75th percentile) (example: for author: min = 0, max = 3059, median = 10, lower = 4, upper = 32)
        • For every lexicalisation of an object of a relation, if the number of senses exceeds the upper quartile (or the median or lower quartile, depending on the model), discard it (example: 1066 > 32, so Jack is discarded)
  16. Results / Key Findings
      • Seed selection: statistical methods for discarding noisy training data improve precision, e.g. Musical Artist: 0.62 → 0.74; Politician: 0.85 → 0.86
      • Relation candidate recognition: using additional methods to recognise named entities which do not rely on existing tools increases the number of extractions
      • Information integration: statistical methods for information integration improve results over simple combination (overall precision: simple 0.74, strategic combination 0.86)
      • Extracting across sentence boundaries: improves precision as well as recall; up to 5 times the number of single extractions, on average twice as many extractions combined (overall precision: 0.80 → 0.86)
  17. Conclusions / Future Work
      • Distant supervision makes it possible to populate knowledge bases automatically, without manual effort
      • Distant supervision can be applied to any domain (focus of this work: Web data)
      • Seed selection, improved named entity recognition, strategies for information integration and extraction across sentence boundaries improve performance
      • Additional heuristics for named entity recognition work, but the approach still relies on existing tools for that
        → More work on unsupervised named entity recognition is needed
      • Web pages do not only contain text, but also lists, tables etc.
        → More data that can be integrated
  18. Thank you for your attention! Questions?
