Distant Supervision with Imitation Learning

Isabelle Augenstein
i.augenstein@sheffield.ac.uk
Department of Computer Science, University of Sheffield, UK
Joint work with Andreas Vlachos, Diana Maynard (EMNLP 2015)
30 November 2015
Heriot-Watt University Computer Science Seminar

2
Talk Overview
•  Relation Extraction from the Web with Distant Supervision
•  Extracting Relations from Web pages
•  Relations are used for populating Knowledge Bases
•  Distant Supervision makes it possible to automatically generate relation
extraction training data from a knowledge base
Ø  No manual effort necessary

3
Talk Overview
•  Imitation Learning for Distant Supervision
•  Relation extraction relies on recognising and classifying named entities,
but sentences only have relation annotations
•  Suitable manually labeled NERC training data can be difficult to obtain
•  Imitation Learning decomposes a task (RE) into a sequence of actions
(e.g. NEC, RE) and is able to deal with latent variables
•  Imitation Learning is a structured prediction method, also called
learning-to-search or inverse reinforcement learning
Ø  Only labels for last action (RE) needed, no additional manual effort

4
Overall Problem
•  Large knowledge bases are useful for search, question
answering etc.
[Figure: Structured Information from Google Knowledge Graph]

5
Overall Problem
•  Large knowledge bases are useful for search, question
answering etc. but far from complete
[Figure: Structured Information from Google Knowledge Graph; band members and genre missing]

6
Overall Problem
•  Large knowledge bases are useful for search, question
answering etc. but far from complete
•  Approach: automatic knowledge base population (KBP)
methods using Web information extraction (IE)
1)  Extracting entities and relations between them from text on Web pages
2)  Combining information from several sources to populate KBs

7
Relation Extraction Overview
Relation extraction for knowledge base completion
•  Given subject and name of relation, find object of relation in corpus
•  E.g. “Where was Bill Gates born?”
•  Answer: birthplace(Bill Gates, Seattle_Washington)
[Figure: “Bill Gates was born in Seattle, Washington”, annotated with the relation birthplace and the entity type LOC]

8
Existing Approaches
•  Why distant supervision for relation extraction (RE)?
•  RE methods requiring manual effort
•  Rule-based approaches: manually created patterns, e.g.
“X is a professor at Y”
•  Supervised learning: statistical models, manually annotated training data
Ø  Biased towards a domain, e.g. Biology, newswire, Wikipedia
•  RE methods requiring no manual effort
•  Bootstrapping: semi-supervised, learning patterns iteratively starting with
prior knowledge, e.g. list of names
Ø  “Semantic drift”, e.g. “X is a professor at Y” -> “X lives in Y”
•  Open Information Extraction: unsupervised learning, discovering
patterns, clustering
Ø  Difficult to map to schema

9
Distant Supervision
“If two entities participate in a relation, any sentence that contains those two
entities might express that relation.” (Mintz et al., 2009)
Example sentences:
•  Amy Jade Winehouse was a singer and songwriter known for her eclectic mix of musical genres including R&B, soul and jazz.
•  Blur helped to popularise the Britpop genre.
•  Beckham rose to fame with the all-female pop group Spice Girls.
Knowledge base:
Name | Genre | …
Amy Winehouse, Amy Jade Winehouse, Wino, … | R&B, soul, jazz, … | …
Blur, … | Britpop, … | …
Spice Girls, … | pop, … | …
The names listed for one entity are different lexicalisations.

10
Distant Supervision
Creating positive & negative training examples -> Feature Extraction -> Classifier Training -> Prediction of New Relations

11
Distant Supervision
1)  Creating positive & negative training examples
KB: album(The Beatles, Abbey Road)
Positive: The Beatles released their album Abbey Road in 1969.
Negative: The Beatles played in Edinburgh.
2)  Feature Extraction
depLemmaPath=released_OJB, possPath=VBD_PRP_album, …
3)  Classifier Training
possPath=_release+VBN=0.354677, depLemmaPath=_release=1.81213, …
4)  Prediction of New Relations
Michael Jackson’s third album is Music & Me
album(Michael Jackson, Music & Me)
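The first step of the pipeline above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name make_examples, the triple-based KB format and the plain substring matching are hypothetical and stand in for the NER-based candidate identification discussed later in the talk.

```python
def make_examples(kb, relation, subject, sentences, candidates):
    """Label (sentence, candidate) pairs using the knowledge base.

    kb is a set of (relation, subject, object) triples. A candidate that is
    a known object of the relation yields a positive example, any other
    candidate found next to the subject yields a negative one.
    """
    known = {o for (r, s, o) in kb if r == relation and s == subject}
    examples = []
    for sentence in sentences:
        if subject not in sentence:
            continue  # only sentences mentioning the subject are considered
        for cand in candidates:
            if cand != subject and cand in sentence:
                examples.append((sentence, cand, cand in known))
    return examples


kb = {("album", "The Beatles", "Abbey Road")}
sentences = [
    "The Beatles released their album Abbey Road in 1969.",
    "The Beatles played in Edinburgh.",
]
candidates = ["Abbey Road", "Edinburgh"]  # assumed output of candidate identification
examples = make_examples(kb, "album", "The Beatles", sentences, candidates)
for sentence, cand, label in examples:
    print(label, cand, "|", sentence)
```

The resulting labelled pairs would then go through feature extraction and classifier training as in steps 2) and 3).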

12
Distant Supervision
Creating positive & negative training examples -> Feature Extraction -> Classifier Training -> Prediction of New Relations
Supervised learning + automatically generated training data

13
Distant Supervision
•  Requires no manual effort
•  Automatically label text with relations from knowledge base
•  Train statistical model (not patterns)
•  Extract relations with respect to knowledge base
Ø  Combine benefits of supervised approaches (learn statistical
model) and bootstrapping RE approaches (only list of extractions
as input)

14
Evaluation: Corpus
•  Web crawl corpus, created using entity-specific search
queries, e.g. “‘The Beatles’ Musical Artist album”
Class | Property / Relation
Book | author, characters
Musical Artist | album, record label, track
Film | director, producer, actor, character
Politician | birthplace, educational institution, spouse
Business | employees, founders
Educational Institution | mascot, city
River | origin, mouth

15
NERC for Distant Supervision
•  Distant Supervision does not require manual annotation but
depends on NERC for candidate identification
[Figure: “Bill Gates was born in Seattle, Washington”, annotated with the relation birthplace and the entity type LOC]

16
•  Existing works use Stanford NER (Finkel et al. 2005) or
FIGER (Ling and Weld 2012)
Stanford NER | FIGER
Location | 14 Location (City, Country, County, Province, Railway, …)
Person | 15 Person (Actor, Architect, Artist, Musician, Terrorist, …)
Organisation | 13 Org (Airline, Company, Educational_Institution, …)
Misc | 13 Product (Car, Train, Camera, Software, Weapon, …)
 | 9 Building (Airport, Hospital, Restaurant, Theater, …)
 | 5 Art (Film, Play, Written_Work, Music, Newspaper)
 | 7 Event (Election, Military_Conflict, Terrorist_Attack, …)
 | 30 Misc (Time, Educational_Degree, Drug, Algorithm, …)

17
•  Problem 1: missing NE types even with fine-grained schemas
album
Musician ? Misc

18
•  Problem 1: missing NE types even with fine-grained schemas
•  Problem 2: domain difference between training and testing
data (e.g. newswire, Wikipedia vs. Web)
album
? Misc

19
•  Task decomposition
•  NER: Named Entity Boundary Recognition
•  NEC: Assigning Types to NEs
•  RE: Relation Extraction
•  Solution 1:
•  NER: recognise NEs with heuristics (e.g. POS-based, HTML)
•  NEC: apply trained model (e.g. Stanford, FIGER), add labels of objects
to RE features
•  RE: train model with distantly annotated data as usual
•  NER Heuristics:
•  Noun phrases, capitalised phrases
•  Phrases from HTML markup: <a href>, <li>, <h1>, <h2>, <h3>,
<strong>, <b>, <em>, <i>
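A minimal sketch of the HTML-markup heuristic, using only Python's standard html.parser module. The class name CandidateParser, the helper extract_candidates and the sample page are assumptions made for illustration; the talk does not specify how the heuristic was implemented.

```python
from html.parser import HTMLParser

# Tags whose text content is treated as an entity candidate (from the slide).
MARKUP_TAGS = {"a", "li", "h1", "h2", "h3", "strong", "b", "em", "i"}


class CandidateParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags of interest
        self.buffers = []     # text collected for each open tag
        self.candidates = []  # (tag, phrase) pairs in closing order

    def handle_starttag(self, tag, attrs):
        if tag in MARKUP_TAGS:
            self.stack.append(tag)
            self.buffers.append([])

    def handle_data(self, data):
        # Text belongs to every enclosing tag of interest.
        for buf in self.buffers:
            buf.append(data)

    def handle_endtag(self, tag):
        # Only well-nested closing tags are handled; others are ignored.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            text = " ".join("".join(self.buffers.pop()).split())
            if text:
                self.candidates.append((tag, text))


def extract_candidates(html):
    parser = CandidateParser()
    parser.feed(html)
    parser.close()
    return parser.candidates


page = ('<h1>Arctic Monkeys</h1><p>Arctic Monkeys are a <b>rock band</b> '
        'from <a href="/sheffield">Sheffield</a>.</p><ul><li>AM</li></ul>')
candidates = extract_candidates(page)
print(candidates)
```

Noun-phrase and capitalisation heuristics would need a POS tagger and are left out of this sketch.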

20
album
O
•  Solution 1:
•  NER: recognise NEs with heuristics (e.g. POS-based, HTML)
•  NEC: add object candidate labels (e.g. with Stanford, FIGER)
•  RE: train model with distantly annotated data as usual
•  RE features: ne=O, depLemmaPath=poss_album_subj,
possPath=POS_JJ_album_VBZ, …

21
•  Experiments with 16 relations (e.g. album, character, record
label, author, origin)
Recall of NER with off-the-shelf Stanford model compared to
heuristics

22
•  Solution 2:
•  NER: with heuristics
•  NEC & RE: train one-stage model
•  NEC features: obj=Music & Me, w[-1-2]=album is, …
•  RE features: depLemmaPath=poss_album_subj,
possPath=POS_JJ_album_VBZ, …
album

23
•  Solution 2:
•  NEC & RE: train one-stage model
•  Problem 3: NEC features useful for RE but
•  RE features are sparse (e.g. path between subject and object)
•  NEC features can overpower RE features
album

24
•  Problem 3: NEC features useful for RE but:
Ø  Model would incorrectly predict Steven Spielberg,
because context is stronger (w[-1]=director)
One of director Steven Spielberg’s greatest heroes
was Alfred Hitchcock, the mastermind behind
Psycho.
Candidates for director relation with subject Psycho:
Steven Spielberg, Alfred Hitchcock

25
•  Ideal Solution:
•  NEC: trained classifier
•  RE: trained classifier
Ø  That would be great, but how can we do this without NEC
training data?

26
Imitation Learning for Distant Supervision
•  Imitation learning with DAGGER (Ross et al. 2011)
•  Also called learning-to-search, inverse reinforcement learning
•  Structured prediction method
•  Able to deal with latent variables, only labels for last stage (RE) needed
•  Decompose tasks into sequence of actions made at different stages
•  Dependencies between tasks are learnt by appropriate generation of
training examples
•  Classifiers are trained iteratively
•  Relationship between Reinforcement Learning and
Imitation learning
•  In reinforcement learning, the policy is being learnt and the actions are given
•  In imitation learning, the policy is given and the actions are learnt
•  (hence inverse)

27
Imitation Learning for Distant Supervision
•  Learning from demonstrator
•  Possible actions are given
•  Correctness of actions (i.e.
costs) are assessed by
taking actions, predicting
remaining ones and
evaluating result
•  Dependencies between
actions are learnt by
observation
•  Origins of Imitation learning
•  Robotics
•  Game playing (e.g. Ortega et al. 2013)
•  Mario’s possible actions (simplified): move left, move right,
duck, run, jump, fire

28
Imitation Learning for Distant Supervision
•  Imitation Learning for NLP
•  Actions: NEC, if NEC positive followed by RE
•  Demonstrator (expert policy) tries to replicate labelled RE data
•  Base classifier: cost sensitive classification learning with PA
(passive-aggressive classifier)
•  NEC labels are needed but not specified by labelled RE data
•  Solution: look-ahead!

29
Imitation Learning for Distant Supervision
•  Iteration 1, NEC Stage
Stage | True | False | Features
NEC Stage | ? | ? | obj=Music & Me, …
RE Stage |  |  | depLemma=poss_album_subj, …
Labels shown in the figure: ?

30
Imitation Learning for Distant Supervision
•  Iteration 1, RE Stage
Stage | True | False | Features
NEC Stage | ? | ? | obj=Music & Me, …
RE Stage | 0 | 1 | depLemma=poss_album_subj, …
Labels shown in the figure: True, ?

31
Imitation Learning for Distant Supervision
•  Iteration 1, RE Stage
Stage | True | False | Features
NEC Stage | 0 | 1 | obj=Music & Me, …
RE Stage | 0 | 1 | depLemma=poss_album_subj, …
Labels shown in the figure: True, True
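The look-ahead behind the NEC costs in this walk-through can be sketched as follows. This is a simplified reading of the slides, not the exact procedure from the paper: the function names and the 0/1 cost are assumptions. Each possible NEC action is tried, the remaining RE action is predicted by a policy (here the expert, which simply returns the gold relation label), and the final result is compared with the labelled RE data.

```python
def rollout(nec_action, re_policy):
    """Final relation prediction after taking nec_action at the NEC stage.

    RE is only attempted if NEC accepted the candidate; a rejected
    candidate can never yield a relation.
    """
    return re_policy() if nec_action else False


def nec_costs(gold_relation, re_policy):
    """0/1 cost of each NEC action, assessed by look-ahead to the RE label."""
    return {action: int(rollout(action, re_policy) != gold_relation)
            for action in (True, False)}


# Candidate "Music & Me" for album(Michael Jackson, ?): the relation holds.
costs_positive = nec_costs(True, lambda: True)
# A candidate for which the relation does not hold: either NEC action is free.
costs_negative = nec_costs(False, lambda: False)
print(costs_positive, costs_negative)
```

For the positive candidate this gives cost 0 for True and 1 for False, matching the NEC row of the table, although no NEC label was ever annotated.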

32
Imitation Learning for Distant Supervision
•  Iteration 1
•  NEC and RE Stage: predict labels according to labelled data
(expert policy) with look-ahead
•  Extract features
•  Assess costs
•  CSC example: features, costs -> will be remembered for next iterations!
•  Train classifier for each stage based on CSC example (learned policy)

33
Imitation Learning for Distant Supervision
•  Iteration 1
•  NEC and RE Stage: predict labels according to labelled data
(expert policy) with look-ahead
•  Extract features
•  Assess costs
•  CSC example: features, costs -> will be remembered for next iterations!
•  Train classifier for each stage based on CSC example (learned policy)
•  Iteration >= 2
•  Predict labels according to expert policy or learned policy
•  The policy is chosen stochastically: the expert policy is used with
probability p = (1−β)^(i−1), otherwise the learned policy
(i: iteration number, β: learning rate)
•  With each iteration it is more likely that the learned policy is chosen
•  The bigger the learning rate, the faster the learner moves away from labelled
data
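A small sketch of the stochastic policy choice, assuming that p = (1−β)^(i−1) is the probability of following the expert policy in iteration i (this reading makes iteration 1 purely expert-driven, as described above). The function names are hypothetical.

```python
import random


def expert_probability(iteration, beta):
    """Probability of following the expert policy in a given iteration (1-based)."""
    return (1.0 - beta) ** (iteration - 1)


def choose_policy(iteration, beta, rng):
    """Pick 'expert' or 'learned' for one prediction."""
    if rng.random() < expert_probability(iteration, beta):
        return "expert"
    return "learned"


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for i in range(1, 6):
    p = expert_probability(i, beta=0.3)
    print(f"iteration {i}: p(expert) = {p:.3f}, chosen: {choose_policy(i, 0.3, rng)}")
```

With β = 0.3 the expert is always used in iteration 1 and its probability shrinks geometrically afterwards; a larger β moves the learner away from the labelled data faster.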

34
•  Reminder: Problem 3: NEC features useful for RE but:
Ø  Model would incorrectly predict Steven Spielberg,
because context is stronger (w[-1]=director)
One of director Steven Spielberg’s greatest heroes
was Alfred Hitchcock, the mastermind behind
Psycho.
Candidates for director relation with subject Psycho:
Steven Spielberg, Alfred Hitchcock

35
Imitation Learning for Distant Supervision
•  Multi-stage modelling compensates for mistakes
Steven Spielberg’s greatest heroes (…) Psycho
Stage | Confidence | Prediction | Features
NEC Stage | 0.629 | True | obj=Steven Spielberg, …
RE Stage | -0.571 | False | depLemma=_POSS_heroes_ …
Labels shown in the figure: False, True

36
Imitation Learning for Distant Supervision
•  Multi-stage modelling compensates for mistakes
Alfred Hitchcock, the mastermind behind Psycho
Stage | Confidence | Prediction | Features
NEC Stage | 0.629 | True | obj=Alfred Hitchcock, …
RE Stage | 0.571 | True | depLemma=_APPOS_mastermind …
Labels shown in the figure: True, True
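The two-stage decision illustrated by these two examples can be written down compactly. The sketch below assumes that a prediction is positive when the classifier confidence is above 0 (which fits the signs shown in the tables); the function name and the threshold parameter are illustrative only.

```python
def predict_relation(nec_confidence, re_confidence, threshold=0.0):
    """Two-stage prediction: RE is only consulted if NEC accepts the candidate."""
    if nec_confidence <= threshold:
        return False  # candidate rejected at the NEC stage
    return re_confidence > threshold


# Confidences taken from the two examples above.
spielberg = predict_relation(0.629, -0.571)  # NEC accepts, RE rejects
hitchcock = predict_relation(0.629, 0.571)   # NEC accepts, RE accepts
print(spielberg, hitchcock)
```

Both candidates pass the NEC stage, but only Alfred Hitchcock is extracted as director of Psycho, because the RE stage judges the path between subject and object separately from the local context.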

37
Evaluation: Corpus
•  Web crawl corpus, created using entity-specific search
queries, e.g. “‘The Beatles’ Musical Artist album”
Class | Property / Relation
Book | author, characters
Musical Artist | album, record label, track
Film | director, producer, actor, character
Politician | birthplace, educational institution, spouse
Business | employees, founders
Educational Institution | mascot, city
River | origin, mouth

38
Evaluation: NEC Features
•  Improving NEC for RE with Web Features
Example Web page, with markup cues header, link, bold and list:
Arctic Monkeys
Arctic Monkeys are a rock band from Sheffield,
famous for albums such as AM.
Albums:
- Whatever People Say I Am, That's What I'm Not
- AM

39
Evaluation: Features
•  NEC:
•  Word features: Object occurrence, POS, digit and capitalisation
pattern etc.
•  Context features: 2 words to left and right: BOW, sequence, bag of
POS, POS sequence, as 1-grams and 2-grams
•  Web features
Ø  Best F1 and P-avg achieved with all of those
•  RE:
•  Context features (as for NEC)
•  POS and words between subject and object, as seq and BOW
•  Dependency path with/without lemmas
Ø  Best F1 and P-avg with sparse dependency features and 2-gram
context features
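As an illustration of the word-based context features, the sketch below collects the two tokens to the left and right of an object candidate as 1-grams and 2-grams. The feature names (bow_l, seq_l, …) and the whitespace tokenisation are assumptions made for this example; POS-based variants would additionally require a tagger.

```python
def context_features(tokens, start, end, window=2):
    """Context features for the candidate spanning tokens[start:end]."""
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    features = []
    for side, words in (("l", left), ("r", right)):
        # 1-grams: each context word on its own (bag of words).
        features += [f"bow_{side}={w}" for w in words]
        # 2-grams: adjacent context words in their original order.
        features += [f"seq_{side}={a}_{b}" for a, b in zip(words, words[1:])]
    return features


tokens = "Michael Jackson 's third album is Music & Me".split()
# The object candidate "Music & Me" covers tokens 6 to 8.
features = context_features(tokens, start=6, end=9)
print(features)
```

For this sentence the 2-gram feature corresponds to the w[-1-2]=album is example shown earlier; the right context is empty because the candidate ends the sentence.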

40
Evaluation Setting
•  Models:
•  All models: NER with candidate identification heuristics (POS,
Web-based)
•  Rel only: one-stage, only relation features
•  Stanf: one-stage with Stanf NEC labels added to RE features
•  FIGER: one-stage with FIGER labels added to RE features
•  OS: one-stage with NEC features added to RE features
•  IL: two-stage with imitation learning

42
Conclusions EMNLP Experiments
•  Imitation learning approach outperforms baselines with
supervised NEC (Stanford NER and FIGER) by 10 points in
average precision
•  For NEC: Web features such as appearance in lists or links to
other Web pages improve average precision by 7 points
•  For RE: sparse, high-precision features (such as parse features)
outperform high-recall, low-precision features (such as BOW
features)

43
Distant Supervision Challenges
•  Automatically generating training data
•  Can lead to noisy training examples
Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.
Name | Album | Track
The Beatles | …, Let It Be, … | Let It Be, …
44
•  Use ‘Let It Be’ mentions as positive training examples for album or for
track?
•  Problem: if both mentions of ‘Let It Be’ are used to extract features for
both album and track, wrong weights are learnt
Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.
Name | Album | Track
The Beatles | …, Let It Be, … | Let It Be, …
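One simple way to surface such cases is to look for objects that the knowledge base lists under more than one relation for the same subject. The sketch below only detects the ambiguity; it is an illustration with a hypothetical function name and KB format, and what to do with the affected mentions (for example discarding them) is a separate design decision not prescribed here.

```python
from collections import defaultdict


def ambiguous_objects(kb):
    """Map (subject, object) pairs to their relations if there is more than one."""
    relations_by_pair = defaultdict(set)
    for relation, subject, obj in kb:
        relations_by_pair[(subject, obj)].add(relation)
    return {pair: rels for pair, rels in relations_by_pair.items() if len(rels) > 1}


kb = [
    ("album", "The Beatles", "Let It Be"),
    ("track", "The Beatles", "Let It Be"),
    ("album", "The Beatles", "Abbey Road"),
]
ambiguous = ambiguous_objects(kb)
print(ambiguous)
```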

45
•  Evaluation
•  If training data is generated automatically, how / on what data can
approaches be evaluated?
•  Co-Reference Resolution
•  Does training / testing data have to contain names of subj and obj
directly?
•  Named Entity Recognition and Classification
•  Supervised off-the-shelf NERC approaches are not perfect (see rest of
talk)

46
Conclusions / Future Work
•  Distant supervision makes it possible to populate
knowledge bases automatically, without manual effort
•  Distant supervision can be applied to any domain
•  Ongoing challenges:
•  Reducing errors made by automatic labeling
•  Distant supervision with co-reference resolution
•  NERC for distant supervision

47
References
•  Isabelle Augenstein, Andreas Vlachos, Diana Maynard (2015).
Extracting Relations between Non-Standard Entities using Distant
Supervision and Imitation Learning. EMNLP 2015.
•  Isabelle Augenstein, Diana Maynard, Fabio Ciravegna (2015). Distantly
Supervised Web Relation Extraction for Knowledge Base Population.
Semantic Web Journal.
•  Isabelle Augenstein, Diana Maynard, Fabio Ciravegna (2014). Relation
Extraction from the Web using Distant Supervision. EKAW 2014, nominated
for best paper award.
•  Isabelle Augenstein (2014). Joint Information Extraction from the Web using
Linked Data. ISWC 2014.
•  Isabelle Augenstein (2014). Seed Selection for Distantly Supervised Web-
Based Relation Extraction. SWAIE Workshop at COLING 2014.

48
References
Distant Supervision:
•  Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant
supervision for relation extraction without labeled data. ACL-IJCNLP.
NERC:
•  Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005.
Incorporating Non-local Information into Information Extraction Systems by
Gibbs Sampling. ACL.
•  Xiao Ling and Daniel S. Weld. 2012. Fine-Grained Entity Recognition. AAAI.
Imitation Learning:
•  Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A Reduction
of Imitation Learning and Structured Prediction to No-Regret Online
Learning. JMLR.
•  Juan Ortega, Noor Shaker, Julian Togelius and Georgios N. Yannakakis
(2013): Imitating human playing styles in Super Mario Bros. Entertainment
Computing, Elsevier.

49
Thank you for
your attention!
Questions?
