This document summarizes the MultiR model for distant-supervision relation extraction and an extension that handles missing data by relaxing the hard constraints of previous models. MultiR introduces latent variables to indicate the relation expressed by each sentence and allows an entity pair to have multiple relations; the extension incorporates the tendency of knowledge bases to include popular entities and relations. The model is trained with a perceptron-like algorithm, and inference involves finding the highest-weight assignment of relations consistent with the knowledge base.
The world is ever changing. As a result, many of the systems and phenomena we are interested in evolve over time, resulting in time-evolving datasets. Time series often display many interesting properties and levels of correlation. In this tutorial we will introduce students to the use of Recurrent Neural Networks and LSTMs to model and forecast different kinds of time series.
GitHub: https://github.com/DataForScience/RNN
Using Topological Data Analysis on your BigData (AnalyticsWeek)
Synopsis:
Topological Data Analysis (TDA) is a framework for data analysis and machine learning and represents a breakthrough in how to effectively use geometric and topological information to solve 'Big Data' problems. TDA provides meaningful summaries (in a technical sense to be described) and insights into complex data problems. In this talk, Anthony will begin with an overview of TDA and describe the core algorithm that is utilized. This talk will include both the theory and real world problems that have been solved using TDA. After this talk, attendees will understand how the underlying TDA algorithm works and how it improves on existing “classical” data analysis techniques as well as how it provides a framework for many machine learning algorithms and tasks.
Speaker:
Anthony Bak, Senior Data Scientist, Ayasdi
Prior to coming to Ayasdi, Anthony was at Stanford University, where he did a postdoc with Ayasdi co-founder Gunnar Carlsson, working on new methods and applications of Topological Data Analysis. He completed his Ph.D. work in algebraic geometry with applications to string theory at the University of Pennsylvania and, along the way, worked at the Max Planck Institute in Germany, Mount Holyoke College in Massachusetts, and the American Institute of Mathematics in California.
Memory Management for Large-Scale Link Discovery
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
GANs are the hottest new topic in the ML arena; however, they present a challenge for researchers and engineers alike. Their design and, most importantly, their code implementation have been causing headaches for ML practitioners, especially when moving to production.
Starting from the very basics of what a GAN is, passing through a TensorFlow implementation using the most cutting-edge APIs available in the framework, and finally arriving at production-ready serving at scale using Google Cloud ML Engine.
Slides for the talk: https://www.pycon.it/conference/talks/deep-diving-into-gans-form-theory-to-production
Github repo: https://github.com/zurutech/gans-from-theory-to-production
Joint contrastive learning with infinite possibilities (taeseon ryu)
Contrastive learning is a machine learning technique that learns features without any labels, based on whether two images are similar or not. It differs somewhat from supervised learning: supervised learning incurs labeling costs and, because it is task-specific, can have lower generalizability. Contrastive learning, proceeding without labels, has no labeling cost and can generalize better. This paper proposes Joint Contrastive Learning for more useful contrastive learning. https://youtu.be/0NLq-ikBP1I
This research was published in IEEE SSCI 2017 in Hawaii.
The goal of this research was to construct a learning theory of Non-negative Matrix Factorization, and we derived a tighter upper bound on the generalization error than in our previous research. Moreover, we carried out numerical experiments and arrived at a conjecture for the exact value of the generalization error.
[GAN by Hung-yi Lee] Part 1: General introduction of GAN (NAVER Engineering)
Generative Adversarial Network and its Applications on Speech and Natural Language Processing, Part 1.
Speaker: Hung-yi Lee (Professor, National Taiwan University)
Date: July 2018
Generative adversarial networks (GANs) are a new idea for training models, in which a generator and a discriminator compete against each other to improve the generation quality. Recently, GANs have shown amazing results in image generation, and a large number and a wide variety of new ideas, techniques, and applications have been developed based on them. Although there are only a few successful cases so far, GANs have great potential to be applied to text and speech generation to overcome limitations of conventional methods.
In the first part of the talk, I will give an introduction to GANs and provide a thorough review of this technology. In the second part, I will focus on the applications of GANs to speech and natural language processing. I will demonstrate applications of GANs to voice conversion, unsupervised abstractive summarization, and sentiment-controllable chat-bots. I will also talk about research directions towards unsupervised speech recognition with GANs.
Learning to automatically solve algebra word problems (Naoaki Okazaki)
Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay.
ACL-2014, pages 271–281.
(presented by Naoaki Okazaki at the paper reading organized by Preferred Infrastructure)
Linked Data Generation for Adaptive Learning Analytics Systems (Sven Lieber)
The presentation for the paper "Linked Data Generation for Adaptive Learning Analytics Systems" given at the LILE2018 – Learning & Education with Web Data workshop at the WebSci conference 2018 in Amsterdam.
Mariia Havrylovych, "Active learning and weak supervision in NLP projects" (Fwdays)
Successful artificial intelligence solutions always require a massive amount of high-quality labeled data. In most cases, we don't have a labeled set that is both large and of high quality. Weak supervision and active learning tools may help you optimize the labeling process and address the shortage of data labels.
First, we will review how active learning can significantly reduce the amount of labeled data for training with classic approaches. We will show how active learning methods can be customized for a specific (NLP) task by using text embedding.
With weak supervision, we will see how using simple rules can automatically produce a big training dataset and high model performance without any manual labeling at all.
In the end, we will combine active learning and weak supervision by taking advantage of both techniques and achieving the best metrics.
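The rule-based weak supervision described above can be sketched as follows. This is a minimal illustration, not the speaker's actual pipeline: the labeling functions (`lf_contains_great`, `lf_contains_terrible`) and the majority-vote combiner are made-up examples for a toy sentiment task.

```python
import re

# Each labeling function votes POSITIVE, NEGATIVE, or abstains.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_contains_great(text):
    # Hypothetical rule: "great" suggests a positive label.
    return POSITIVE if re.search(r"\bgreat\b", text, re.I) else ABSTAIN

def lf_contains_terrible(text):
    # Hypothetical rule: "terrible" suggests a negative label.
    return NEGATIVE if re.search(r"\bterrible\b", text, re.I) else ABSTAIN

def weak_label(text, rules):
    """Majority vote over non-abstaining rules; None if every rule abstains."""
    votes = [v for v in (rule(text) for rule in rules) if v != ABSTAIN]
    if not votes:
        return None
    return max(set(votes), key=votes.count)

rules = [lf_contains_great, lf_contains_terrible]
print(weak_label("This talk was great!", rules))  # -> 1
print(weak_label("No opinion here.", rules))      # -> None
```

Applied over a large unlabeled corpus, such rules produce a training set automatically; examples where every rule abstains are simply left unlabeled.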
Dynamic Search Using Semantics & Statistics (Paul Hofmann)
This presentation shows 3 applications of successfully combining semantics and statistics for text mining and interactive search.
1) We predict the Lehman bankruptcy using statistical topic modeling, SAP Business Objects entity extraction and associative memories (powered by Saffron Technologies).
2) We semi-automatically handle service requests at Cisco using knowledge extraction and knowledge reuse.
3) We discover user intent for interactive retrieval. User intent is defined as a latent state. The observations of this latent state are the reformulated query sequence, and the retrieved documents, together with the positive or negative feedback provided by the user. Demo shows recognizing user’s intent for health care search.
Machine Learning for Incident Detection: Getting Started (Sqrrl)
This presentation walks you through the uses of machine learning in incident detection and response, outlining some of the basic features of machine learning and specific tools you can use.
Watch the presentation with audio here: https://www.youtube.com/watch?v=4pArapSIu_w
1. Introduction and how to get into Data
2. Data Engineering and skills needed
3. Comparison of Data Analytics for static and real-time streaming data
4. Bayesian Reasoning for Data
Finding knowledge, data and answers on the Semantic Web (ebiquity)
Web search engines like Google have made us all smarter by providing ready access to the world's knowledge whenever we need to look up a fact, learn about a topic or evaluate opinions. The W3C's Semantic Web effort aims to make such knowledge more accessible to computer programs by publishing it in machine understandable form.
As the volume of Semantic Web data grows software agents will need their own search engines to help them find the relevant and trustworthy knowledge they need to perform their tasks. We will discuss the general issues underlying the indexing and retrieval of RDF based information and describe Swoogle, a crawler based search engine whose index contains information on over a million RDF documents.
We will illustrate its use in several Semantic Web related research projects at UMBC, including a distributed platform for constructing end-to-end use cases that demonstrate the Semantic Web's utility for integrating scientific data. We describe ELVIS (the Ecosystem Location Visualization and Information System), a suite of tools for constructing food webs for a given location, and Triple Shop, a SPARQL query interface which searches the Semantic Web for data relevant to a given query. ELVIS functionality is exposed as a collection of web services, and all input and output data is expressed in OWL, thereby enabling its integration with Triple Shop and other Semantic Web resources.
Building machine learning systems remains something of an art, from gathering and transforming the right data to selecting and fine-tuning the most fitting modeling techniques. If we want to make machine learning more accessible and foster skillful use, we need novel ways to share and reuse findings, and streamline online collaboration. OpenML is an open science platform for machine learning, allowing anyone to easily share data sets, code, and experiments, and collaborate with people all over the world to build better models. It shows, for any known data set, which are the best models, who built them, and how to reproduce and reuse them in different ways. It is readily integrated into several machine learning environments, so that you can share results with the touch of a button or a line of code. As such, it enables large-scale, real-time collaboration, allowing anyone to explore, build on, and contribute to the combined knowledge of the field. Ultimately, this provides a wealth of information for a novel, data-driven approach to machine learning, where we learn from millions of previous experiments to either assist people while analyzing data (e.g., which modeling techniques will likely work well and why), or automate the process altogether.
A short description of machine learning: what machine learning is, along with its specifications, categories, terminologies, and applications, everything explained briefly.
Currently, most of white-box machine learning techniques are purely data-driven and ignore prior background and expert knowledge. A lot of this knowledge has already been captured in domain models, i.e. ontologies, using Semantic Web technologies. The goal of this research proposal is to enhance the predictive performance and required training time of white-box models by incorporating the vast amount of available knowledge captured in ontologies in each of the phases of a machine learning process: feature extraction, feature selection and model construction. Moreover, it will be investigated if we can augment the initial training set with minimal user interaction by exploiting the concept of linked data.
(Gaurav Sawant & Dhaval Sawlani) BIA 678 final project report (Gaurav Sawant)
PROJECT REPORT
• Performed memory-based collaborative filtering techniques like Cosine similarities, Pearson’s r & model-based Matrix Factorization techniques like Alternating Least Squares (ALS) method
• Studied the scalability of these methods on local machines & on Hadoop clusters
Profile-based Dataset Recommendation for RDF Data Linking Mohamed BEN ELLEFI
With the emergence of the Web of Data, most notably Linked Open Data (LOD), an abundance of data has become available on the web. However, LOD datasets and their inherent subgraphs vary heavily with respect to their size, topic and domain coverage, their schemas and their data dynamicity (respectively schemas and metadata) over time. To this extent, identifying suitable datasets, which meet specific criteria, has become an increasingly important, yet challenging task to support issues such as entity retrieval or semantic search and data linking. Particularly with respect to the interlinking issue, the current topology of the LOD cloud underlines the need for practical and efficient means to recommend suitable datasets: currently, only well-known reference graphs such as DBpedia (the most obvious target), YAGO or Freebase show a high amount of in-links, while there exists a long tail of potentially suitable yet under-recognized datasets. This problem is due to the Semantic Web tradition in dealing with "finding candidate datasets to link to", where data publishers are used to identifying target datasets for interlinking.
While an understanding of the nature of the content of specific datasets is a crucial prerequisite for the mentioned issues, we adopt in this dissertation the notion of "dataset profile": a set of features that describe a dataset and allow the comparison of different datasets with regard to their represented characteristics. Our first research direction was to implement a collaborative filtering-like dataset recommendation approach, which exploits both existing dataset topic profiles, as well as traditional dataset connectivity measures, in order to link LOD datasets into a global dataset-topic-graph. This approach relies on the LOD graph in order to learn the connectivity behaviour between LOD datasets. However, experiments have shown that the current topology of the LOD cloud group is far from being complete enough to be considered as a ground truth and, consequently, as learning data.
Facing the limits of the current topology of LOD (as learning data), our research has led us to break away from the topic-profile representation of the "learn to rank" approach and to adopt a new approach for candidate dataset identification where the recommendation is based on the intensional profile overlap between different datasets. By intensional profile, we understand the formal representation of a set of schema concept labels that best describe a dataset and can be potentially enriched.
Oracle Machine Learning Overview and From Oracle Data Professional to Oracle ... (Charlie Berger)
DBAs spend too much time on routine tasks, leaving little time for innovation. Autonomous Databases free data professionals to extract more value from data. Oracle Machine Learning, in Autonomous Database, "moves the algorithms; not the data" for 100% in-database processing. Data professionals perform many supporting tasks for "data scientists", typically 80% of the work. Come learn an evolutionary path for Oracle data professionals to leverage domain knowledge and data skills and add machine learning. See how to build and deploy predictive models inside the Database. Using examples, demos and sharing experiences, Charlie will show you how to discover new insights, make predictions and become an "Oracle Data Scientist" in just 6 weeks!
What greenhouse gases are and how many gases affect the Earth (moosaasad1975)
What greenhouse gases are, how they affect the Earth and its environment, what the future of the environment and the Earth is, and how they affect the weather and the climate.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN (Sérgio Sacani)
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Richard's adventures in two entangled wonderlands (Richard Gill)
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep... (University of Maribor)
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
Cancer cell metabolism: special reference to the lactate pathway (AADYARAJPANDEY1)
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy we need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cell utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules of a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Krebs cycle. The Krebs cycle allows cells to "burn" the pyruvate made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis - Krebs cycle - oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELL:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
Introduction to the WARBURG EFFECT:
WARBURG EFFECT: Usually, cancer cells are highly glycolytic ("glucose addiction") and take up more glucose from outside than normal cells do.
Otto Heinrich Warburg (8 October 1883 – 1 August 1970) was awarded the Nobel Prize in Physiology or Medicine in 1931 for his "discovery of the nature and mode of action of the respiratory enzyme".
The tendency of cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)
1. Modeling Missing Data in Distant Supervision for Information Extraction
Alan Ritter (CMU)
Luke Zettlemoyer (University of Washington)
Mausam (University of Washington)
Oren Etzioni (Vulcan Inc.)
TACL, 1, 367-378, 2013.
Presented by Naoaki Okazaki (Tohoku University)
2014-09-05 Modeling Missing Data in Distant Supervision
2. Relation instance extraction
Example: "Steven Spielberg's film Saving Private Ryan is loosely based on the brothers' story."
From this sentence, an extractor produces an instance of the film-director relation: (Film: Saving Private Ryan, Director: Steven Spielberg).
• Fully-supervised learning (Zhou+ 05, …)
  – Uses ACE corpora to build relation-instance classifiers
  – Suffers from the limited number of training data
• Unsupervised information extraction (Banko+ 07, …)
  – Extracts relational patterns between entities, and clusters the patterns into relations
  – Difficult to map clusters into relations of interest
• Bootstrap learning (Brin 98, …)
  – Uses seed instances to extract a new set of relational patterns
  – Often suffers from low precision (semantic drift)
• Distant supervision (Mintz+ 09, …)
  – Combines the advantages of the above approaches
3. Distant supervision (Mintz+, 09)
Knowledge-base entries such as (Person: Edwin Hubble, Birthplace: Marshfield) drive automatic annotation of sentences such as "Astronomer Edwin Hubble was born in Marshfield, Missouri.", followed by feature extraction.
Mintz et al. (2009) Distant supervision for relation extraction without labeled data. ACL-2009, pages 1003–1011.
* Each row presents a single feature. Features from different sentences containing the same entity pair are concatenated.
Problem: an entity pair cannot have multiple relations.
E.g., both Founded(Jobs, Apple) and CEO-of(Jobs, Apple) are true.
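The automatic-annotation step can be sketched as follows. This is a minimal illustration under the simplifying assumption that any sentence containing both entities of a KB fact is labeled with that fact's relation; the toy `kb` and `sentences` extend the Edwin Hubble example above.

```python
# Toy knowledge base: (entity1, entity2) -> relation
kb = {("Edwin Hubble", "Marshfield"): "birthplace"}

sentences = [
    "Astronomer Edwin Hubble was born in Marshfield, Missouri.",
    "Edwin Hubble worked at the Mount Wilson Observatory.",
]

def annotate(sentences, kb):
    """Label every sentence that mentions both entities of a KB fact."""
    examples = []
    for s in sentences:
        for (e1, e2), rel in kb.items():
            if e1 in s and e2 in s:
                examples.append((s, e1, e2, rel))
    return examples

for ex in annotate(sentences, kb):
    print(ex)  # only the first sentence matches, labeled "birthplace"
```

Note that the second sentence mentions only one entity of the fact and is therefore left unlabeled, which is exactly why the heuristic produces noisy training data.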
4. MultiR (Hoffmann+, 11)
Introduces latent variables z_i to indicate the relation expressed by sentence x_i.
For the entity pair (Steve Jobs, Apple):
  x_1: "Steve Jobs was founder of Apple."  →  z_1 = founder
  x_2: "Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple."  →  z_2 = founder
  x_3: "Steve Jobs is CEO of Apple."  →  z_3 = CEO-of
  y_born-in = 0, y_founder = 1, y_CEO-of = 1, y_capital-of = 0
Model:
  p(y, z | x) = (1/Z_x) ∏_r Φ_join(y_r, z) ∏_i Φ_extract(z_i, x_i)
where
  x_i: a sentence containing the entity pair
  y_r ∈ {0, 1}: 1 if the knowledge base includes the pair with relation r, 0 otherwise
  z_i ∈ R: the relation expressed by sentence x_i
  Φ_extract(z_i, x_i) = exp(Σ_j θ_j φ_j(z_i, x_i))  (the same as Mintz+ 09)
  Φ_join(y_r, z) = 1(¬y_r ∨ ∃i: r = z_i)  (deterministic OR)
Φ_join ensures that a sentence x_i expressing the relation r exists if r is true, and the model allows multiple relations for the same entity pair.
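The deterministic-OR join factor is simple enough to check directly in code; here is a minimal sketch using the slide's example assignment.

```python
def phi_join(r, y_r, z):
    """Deterministic OR: 1 iff y_r = 0, or some sentence-level z_i equals r."""
    return int((not y_r) or any(z_i == r for z_i in z))

# Sentence-level relations z_1, z_2, z_3 and KB facts y from the slide.
z = ["founder", "founder", "CEO-of"]
y = {"born-in": 0, "founder": 1, "CEO-of": 1, "capital-of": 0}

print(all(phi_join(r, y_r, z) for r, y_r in y.items()))  # -> True (consistent)
print(phi_join("capital-of", 1, z))  # -> 0: y_r = 1 with no supporting sentence
```

Because the factor is 0/1-valued, any assignment where some true relation has no supporting sentence gets probability zero, which is exactly the hard constraint that the later slides relax.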
5. MultiR: Training
Hoffmann et al. (2011) Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations. ACL-2011, pages 541–550.
Loop over passes through the training data:
  Loop over entity pairs in the KB:
    Predict sentence-level and KB-level relations (ignoring the facts in the KB)
    Find an optimal assignment of sentence-level relations consistent with the facts in the KB
    Update feature weights similarly to the perceptron algorithm
We therefore need two kinds of inference.
2014-09-05 Modeling Missing Data in Distant Supervision 5
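The training loop above can be sketched as a runnable toy. This is my own simplification, not Hoffmann et al.'s exact algorithm: bag-of-words features, the two inference routines in their simplest form, and perceptron-style updates toward the KB-consistent assignment.

```python
from collections import defaultdict

RELATIONS = ["born-in", "founder", "CEO-of"]

def features(sentence, rel):
    # Toy features: (token, relation) pairs.
    return [(tok, rel) for tok in sentence.lower().split()]

def score(sentence, rel, w):
    return sum(w[f] for f in features(sentence, rel))

def predict(sentences, w):
    # Inference 1: each sentence independently takes its best relation.
    z = [max(RELATIONS, key=lambda r: score(s, r, w)) for s in sentences]
    return set(z), z

def constrained_predict(sentences, kb_facts, w):
    # Inference 2 (greedy): each KB fact claims its best free sentence,
    # then leftover sentences take their best relation among the KB facts.
    z = [None] * len(sentences)
    for r in kb_facts:
        free = [i for i, zi in enumerate(z) if zi is None]
        best = max(free, key=lambda i: score(sentences[i], r, w))
        z[best] = r
    for i, zi in enumerate(z):
        if zi is None:
            z[i] = max(kb_facts, key=lambda r: score(sentences[i], r, w))
    return z

def train(data, n_epochs=5):
    w = defaultdict(float)
    for _ in range(n_epochs):                 # passes over the training data
        for sentences, kb_facts in data:      # entity pairs in the KB
            y_hat, z_hat = predict(sentences, w)
            if y_hat == set(kb_facts):
                continue                      # already consistent with the KB
            z_star = constrained_predict(sentences, kb_facts, w)
            for s, zs, zh in zip(sentences, z_star, z_hat):
                for f in features(s, zs):
                    w[f] += 1.0               # toward the KB-consistent z
                for f in features(s, zh):
                    w[f] -= 1.0               # away from the unconstrained z
    return w

data = [(["Steve Jobs was founder of Apple .",
          "Steve Jobs is CEO of Apple ."],
         ["founder", "CEO-of"])]
w = train(data)
print(predict(data[0][0], w)[1])  # -> ['founder', 'CEO-of']
```

After a couple of passes the unconstrained prediction agrees with the KB facts, at which point the updates cancel and training stops changing the weights, mirroring the perceptron's behaviour.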
6. MultiR: Inference 1: argmax_{y,z} p(y, z | x)
For the entity pair (Steve Jobs, Apple), with sentences
  x_1: "Steve Jobs was founder of Apple."
  x_2: "Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple."
  x_3: "Steve Jobs is CEO of Apple."
all y_r and z_i are unknown. The extraction scores Φ_extract(r, x_i) over (born-in, founder, CEO-of, capital-of) are:
  x_1: (0.5, 16.0, 9.0, 0.1)
  x_2: (8.0, 11.0, 6.0, 0.1)
  x_3: (7.0, 8.0, 7.0, 0.2)
Predict a relation label for each sentence independently, then aggregate the sentence-level predictions into global-level predictions.
7. MultiR: Inference 1: argmax_{y,z} p(y, z | x) (solution)
With the same scores, each sentence independently takes its highest-scoring relation:
  z_1 = founder (16.0), z_2 = founder (11.0), z_3 = founder (8.0)
Aggregating these gives y_founder = 1 and y_born-in = y_CEO-of = y_capital-of = 0.
Very easy to find! Computational cost: O(|R| · |x|)
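The two steps of this first inference problem fit in a few lines; a minimal sketch using the scores from the slide:

```python
# Extraction scores Φ_extract(r, x_i) from the slide.
scores = {
    "x1": {"born-in": 0.5, "founder": 16.0, "CEO-of": 9.0, "capital-of": 0.1},
    "x2": {"born-in": 8.0, "founder": 11.0, "CEO-of": 6.0, "capital-of": 0.1},
    "x3": {"born-in": 7.0, "founder": 8.0, "CEO-of": 7.0, "capital-of": 0.2},
}

# Step 1: each sentence independently takes its highest-weight relation.
z = {s: max(rels, key=rels.get) for s, rels in scores.items()}
# Step 2: aggregate (deterministic OR) into the entity-pair-level prediction.
y = set(z.values())

print(z)  # -> {'x1': 'founder', 'x2': 'founder', 'x3': 'founder'}
print(y)  # -> {'founder'}
```

Each sentence is scored against each relation exactly once, which is the O(|R| · |x|) cost stated on the slide.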
8. MultiR: Inference 2: argmax_z p(z | x, y)

For the entity pair (Steve Jobs, Apple):

[Figure: the same factor graph, with the KB facts observed as y = (born-in: 0, founder: 1, CEO-of: 1, capital-of: 0) and the sentence labels z1, z2, z3 unknown. Candidate edges from the active relations to the sentences x1, x2, x3 carry the weights founder: 16, 11, 8 and CEO-of: 9, 6, 7.]

Define an edge weight: w(y_r, z_i) = Φ_extract(r, x_i)
A node with y_r = 1 must have at least one edge connecting it to some z_i.
Each node z_i must have an edge connecting it to some y_r.
Find the set of edges that maximizes the sum of the weights.
9. MultiR: Inference 2: argmax_z p(z | x, y)

For the entity pair (Steve Jobs, Apple):

[Figure: the solved assignment given y = (born-in: 0, founder: 1, CEO-of: 1, capital-of: 0): z1 = founder (weight 16), z2 = founder (11), z3 = CEO-of (7), so every active relation is connected to at least one sentence.]

Define an edge weight: w(y_r, z_i) = Φ_extract(r, x_i)
A node with y_r = 1 must have at least one edge connecting it to some z_i.
Each node z_i must have an edge connecting it to some y_r.
Find the set of edges that maximizes the sum of the weights.
An exact solution can be found in polynomial time.
In practice, an approximate solution found by greedy search (assigning a z_i to each node with y_r = 1) is sufficient.
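That greedy search can be sketched as follows; the tie-breaking and the repair step are my assumptions, not necessarily the exact procedure used in the paper.

```python
def greedy_z(weights, active):
    """Greedy inference 2. weights[i][r] scores relation r for sentence i;
    active lists the relations with y_r = 1."""
    # Label each sentence with its best active relation.
    z = [max(active, key=lambda r: w[r]) for w in weights]
    # Repair: every active relation must cover at least one sentence.
    for r in active:
        if r in z:
            continue
        # Flip the sentence where switching to r loses the least weight.
        i = min(range(len(z)), key=lambda j: weights[j][z[j]] - weights[j][r])
        z[i] = r
    return z

# Edge weights from the (Steve Jobs, Apple) example; founder and CEO-of are active.
weights = [
    {"founder": 16.0, "CEO-of": 9.0},
    {"founder": 11.0, "CEO-of": 6.0},
    {"founder": 8.0, "CEO-of": 7.0},
]
print(greedy_z(weights, ["founder", "CEO-of"]))  # ['founder', 'founder', 'CEO-of']
```

On the example, flipping x3 (loss 8 − 7 = 1) is the cheapest way to cover CEO-of, reproducing the assignment in the figure.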
10. Contribution of this work

• MultiR makes two assumptions (hard constraints):
    • If a fact is not found in the database, it cannot be mentioned in the text.
    • If a fact is in the database, it must be mentioned in at least one sentence.
• This work relaxes MultiR to handle the situations where:
    • a fact is not mentioned in the text (MIT: missing in text);
    • a fact mentioned in the text is missing from the database (MID: missing in database).
• A side effect of this relaxation:
    • It incorporates the tendency of the knowledge base to include popular entities and relations.
11. Distant Supervision with Data Not Missing at Random (DNMAR)

For the entity pair (Steve Jobs, Apple):

[Figure: the factor graph extended with a new layer t. Sentences: x1 "Steve Jobs was founder of Apple.", x2 "Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple.", x3 "Steve Jobs visited Apple store…"; sentence labels z1 = founder, z2 = founder, z3 = visit; database facts y = (born-in: 0, founder: 1, CEO-of: 1, visit: 0); text-level variables t = (born-in: 0, founder: 1, CEO-of: 0, visit: 1).]

Introduce a layer of latent variables (t_r) to handle the missing cases, and introduce a new factor:

    φ_miss(y_r, t_r) = −α_MIT   if y_r = 1 ∧ t_r = 0   (missing in text)
                       −α_MID   if y_r = 0 ∧ t_r = 1   (missing in DB)
                        0       otherwise

This relaxes the two hard constraints in MultiR into soft ones with the penalty factors −α_MIT and −α_MID.
The training algorithm is the same as the one used in MultiR.
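The new factor is a simple lookup; a direct transcription in Python (the penalty values for α_MIT and α_MID are illustrative placeholders, in practice tuned on development data):

```python
def phi_miss(y_r, t_r, alpha_mit=10.0, alpha_mid=2.0):
    """Soft penalty replacing MultiR's hard constraints.
    y_r: fact present in the DB; t_r: fact expressed in the text."""
    if y_r == 1 and t_r == 0:
        return -alpha_mit   # missing in text
    if y_r == 0 and t_r == 1:
        return -alpha_mid   # missing in DB
    return 0.0              # y and t agree

print(phi_miss(1, 0))  # -10.0
print(phi_miss(0, 1))  # -2.0
print(phi_miss(1, 1))  # 0.0
```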
12. Constrained inference: argmax_z p(z | x, y)

For the entity pair (Steve Jobs, Apple):

[Figure: the DNMAR factor graph with the database facts observed, y = (born-in: 0, founder: 1, CEO-of: 1, visit: 0), and the sentence labels z1, z2, z3 and the text-level variables t unknown.]

    z* = argmax_z Σ_{i=1..n} θ · Φ_extract(z_i, x_i)
                − Σ_r [ α_MIT · 1(y_r = 1 ∧ ∄i: z_i = r) + α_MID · 1(y_r = 0 ∧ ∃i: z_i = r) ]

The inference became more challenging:
A* search can find an exact solution, but it does not scale to many variables.
This work presents a greedy hill-climbing approach for the inference:
1. Initialize each z_i at random.
2. Obtain the neighborhood of the current solution.
3. Move to the neighbor yielding the highest score.
4. Repeat this process until no neighbor improves the score.
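The four steps can be sketched as a hill climber whose neighborhood changes one z_i at a time. The objective below is a simplified stand-in for the full model: per-sentence scores plus soft MIT/MID penalties with illustrative values.

```python
import random

def objective(z, weights, y_active, alpha_mit=10.0, alpha_mid=2.0):
    """Score of an assignment z under a simplified DNMAR-style objective."""
    total = sum(weights[i][r] for i, r in enumerate(z))
    mentioned = set(z)
    relations = set(y_active) | {r for w in weights for r in w}
    for r in relations:
        if r in y_active and r not in mentioned:
            total -= alpha_mit   # in the DB but not extracted from the text
        elif r not in y_active and r in mentioned:
            total -= alpha_mid   # extracted from the text but not in the DB
    return total

def hill_climb(weights, y_active, seed=0):
    rng = random.Random(seed)
    relations = sorted({r for w in weights for r in w})
    z = [rng.choice(relations) for _ in weights]          # 1. random initialization
    while True:
        best, best_score = z, objective(z, weights, y_active)
        for i in range(len(z)):                           # 2. neighborhood: flip one z_i
            for r in relations:
                cand = z[:i] + [r] + z[i + 1:]
                s = objective(cand, weights, y_active)
                if s > best_score:                        # 3. take the best neighbor
                    best, best_score = cand, s
        if best == z:
            return z                                      # 4. stop when nothing improves
        z = best
```

Note that hill climbing can stop in a local optimum, which is why the exact (but non-scalable) A* solution is mentioned as the alternative.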
13. Incorporating popularity in the KB

• We tune the penalty factors α_MIT and α_MID on a development set.
• We can take into account how likely each fact is to be observed in the text and in the knowledge base:
    • facts about Barack Obama are likely to exist;
    • facts about Naoaki Okazaki are unlikely to exist.
• Control the penalty factor for each entity pair:
    • popularity of entities: α_MID(e1, e2) = γ · min(c(e1), c(e2))
    • a larger penalty if the model predicts that a fact about a popular entity does not exist in the KB.
• Well-aligned relations: assign three levels of α_MIT per relation r (α_MIT^r):
    • a larger penalty if a popular relation such as contains, place_lived, or nationality does not appear in the text.
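The entity-popularity control can be sketched as below; γ and the count function c(e) are illustrative assumptions (any measure of entity frequency would do).

```python
def alpha_mid(e1, e2, counts, gamma=0.01):
    """Popularity-scaled penalty for a fact missing in the DB:
    popular entity pairs get a larger alpha_MID."""
    return gamma * min(counts.get(e1, 0), counts.get(e2, 0))

# Hypothetical entity frequencies.
counts = {"Barack Obama": 5000, "Apple": 4000, "Naoaki Okazaki": 3}
print(alpha_mid("Barack Obama", "Apple", counts))    # 40.0
print(alpha_mid("Naoaki Okazaki", "Apple", counts))  # 0.03
```

Taking the min of the two counts means the penalty is only large when both entities are popular.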
14. Experiments

• Binary relation extraction
    • The standard setting (Riedel+, 10)
    • Knowledge base: Freebase relations
    • Text corpus: 1.8 million New York Times articles
    • Two kinds of evaluation:
        • sentence-level extraction, using the dataset of (Hoffmann+, 11)
        • held-out evaluation against Freebase facts
• Unary relation extraction (NE categorization)
    • Twitter NE categorization dataset (Ritter+, 11)
    • Knowledge base: Freebase (instances and their categories)
    • Text corpus: tweets
    • Held-out evaluation
15. Results

A 17% increase in the area under the precision–recall curve.
Incorporating popularity yielded a 27% increase over the baseline.
The held-out evaluation underestimates precision, because many facts correctly extracted from the text are missing from the database.
DNMAR doubled the recall.

Ritter et al. (2013) Modeling Missing Data in Distant Supervision for Information Extraction. TACL, 1:367–378.
16. Conclusion

• Investigated the problem of missing data in distant supervision.
• Presented an extension of MultiR that handles missing data.
    • It can incorporate how likely facts are to be included in the knowledge base and the text (popularity).
• Presented a scalable inference algorithm based on greedy hill climbing.
• Demonstrated the effectiveness of the modeling.
17. References

• Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, Daniel S. Weld. (2011) Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations. ACL-2011, pages 541–550. (Slides and code available.)
• Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky. (2009) Distant Supervision for Relation Extraction without Labeled Data. ACL-2009, pages 1003–1011.