This document discusses various techniques for question answering and relation extraction in natural language processing. It provides an overview of question answering systems and approaches, including examples like START, Ask Jeeves and Siri. It also discusses using search engines for question answering, relation extraction from questions, and common evaluation metrics for question answering systems like accuracy and mean reciprocal rank.
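As a worked example of the mean reciprocal rank metric mentioned above, the following sketch (with hypothetical rank data) computes MRR as the average over questions of 1/rank of the first correct answer:

```python
def mean_reciprocal_rank(first_correct_ranks):
    """MRR = average over questions of 1/rank of the first correct answer.
    A rank of None means no correct answer was returned (contributes 0)."""
    scores = [1.0 / r if r is not None else 0.0 for r in first_correct_ranks]
    return sum(scores) / len(scores)

# Hypothetical system output: first correct answer at ranks 1, 3, and never.
print(mean_reciprocal_rank([1, 3, None]))  # (1 + 1/3 + 0) / 3 ≈ 0.444
```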
The Text Classification slides contain research results on candidate natural language processing algorithms. Specifically, they give a brief overview of the natural language processing steps, the common algorithms used to transform words into meaningful vectors/data, and the algorithms used to learn from and classify the data.
To learn more about RAX Automation Suite, visit: www.raxsuite.com
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B... (Databricks)
We all know what they say – the bigger the data, the better. But when the data gets really big, how do you mine it, and which deep learning framework should you use? This talk will survey, from a developer’s perspective, three of the most popular deep learning frameworks—TensorFlow, Keras, and PyTorch—as well as when to use their distributed implementations.
We’ll compare code samples from each framework and discuss their integration with distributed computing engines such as Apache Spark (which can handle massive amounts of data) as well as help you answer questions such as:
As a developer how do I pick the right deep learning framework?
Do I want to develop my own model or should I employ an existing one?
How do I strike a trade-off between productivity and control through low-level APIs?
What language should I choose?
In this session, we will explore how to build a deep learning application with TensorFlow, Keras, or PyTorch in under 30 minutes. After this session, you will walk away with the confidence to evaluate which framework is best for you.
Information Extraction, Named Entity Recognition, NER, text analytics, text mining, e-discovery, unstructured data, structured data, calendaring, standard evaluation per entity, standard evaluation per token, sequence classifier, sequence labeling, word shapes, semantic analysis in language technology
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori... (Simplilearn)
This Deep Learning presentation will help you understand what deep learning is, why we need it, and its applications, along with a detailed explanation of neural networks and how they work. Deep learning is inspired by the function of the human brain as modelled by artificial neural networks. These networks, which represent the decision-making process of the brain, use complex algorithms that process data in a non-linear way, learning in an unsupervised manner to make choices based on the input. This Deep Learning tutorial is ideal for professionals with beginner to intermediate levels of experience. Now, let us dive deep into this topic and understand what deep learning actually is.
Below topics are explained in this Deep Learning Presentation:
1. What is Deep Learning?
2. Why do we need Deep Learning?
3. Applications of Deep Learning
4. What is a Neural Network?
5. Activation Functions
6. Working of Neural Network
Simplilearn’s Deep Learning course will transform you into an expert in deep learning techniques using TensorFlow, the open-source software library designed to conduct machine learning & deep neural network research. With our deep learning course, you’ll master deep learning and TensorFlow concepts, learn to implement algorithms, build artificial neural networks and traverse layers of data abstraction to understand the power of data and prepare for your new role as a deep learning scientist.
Why Deep Learning?
TensorFlow is one of the most popular software platforms used for deep learning and contains powerful tools to help you build and implement artificial neural networks.
Advancements in deep learning are being seen in smartphone applications, creating efficiencies in the power grid, driving advancements in healthcare, improving agricultural yields, and helping us find solutions to climate change. With this TensorFlow course, you’ll build expertise in deep learning models, learn to operate TensorFlow to manage neural networks, and interpret the results.
You can gain in-depth knowledge of Deep Learning by taking our Deep Learning certification training course. With Simplilearn’s Deep Learning course, you will prepare for a career as a Deep Learning engineer as you master concepts and techniques including supervised and unsupervised learning, mathematical and heuristic aspects, and hands-on modeling to develop algorithms.
There is booming demand for skilled deep learning engineers across a wide range of industries, making this deep learning course with TensorFlow training well-suited for professionals at the intermediate to advanced level of experience. We recommend this deep learning online course particularly for the following professionals:
1. Software engineers
2. Data scientists
3. Data analysts
4. Statisticians with an interest in deep learning
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. In this lecture, we give an overview of the mining of data streams.
Neural Models for Information Retrieval (Bhaskar Mitra)
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models may also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks, leading to new challenges and opportunities for existing neural representation learning approaches for text.
In this talk, I will present my recent work on neural IR models. We begin with a discussion on learning good representations of text for retrieval. I will present visual intuitions about how different embedding spaces capture different relationships between items, and their usefulness to different types of IR tasks. The second part of this talk is focused on the applications of deep neural architectures to the document ranking task.
Best Data Science Ppt using Python
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, machine learning and big data.
This slide deck first introduces the sequential pattern mining problem and presents some definitions required to understand the GSP algorithm. At the end there is a brief introduction to the GSP algorithm and some practical constraints it supports.
word sense disambiguation, wsd, thesaurus-based methods, dictionary-based methods, supervised methods, lesk algorithm, michael lesk, simplified lesk, corpus lesk, graph-based methods, word similarity, word relatedness, path-based similarity, information content, surprisal, resnik method, lin method, elesk, extended lesk, semcor, collocational features, bag-of-words features, the window, lexical semantics, computational semantics, semantic analysis in language technology.
Epistemic Networks for Epistemic Commitments (Simon Knight)
The ways in which people seek and process information are fundamentally epistemic in nature. Existing epistemic cognition research has tended towards characterizing this fundamental relationship as cognitive or belief-based in nature. This paper builds on recent calls for a shift towards activity-oriented perspectives on epistemic cognition and proposes a new theory of ‘epistemic commitments’. An additional contribution of this paper comes from an analytic approach to this recast construct of epistemic commitments through the use of Epistemic Network Analysis (ENA) to explore connections between particular modes of epistemic commitment. Illustrative examples are drawn from existing research data on children’s epistemic talk when engaged in collaborative information seeking tasks. A brief description of earlier analysis of this data is given alongside a newly conducted ENA to demonstrate the potential for such an approach.
Paper at: http://oro.open.ac.uk/39254/
Answer Extraction for How and Why Questions in Question Answering Systems (ijceronline)
With the increasing amount of Arabic text on the web and in information repositories, and the demand from users for specific answers to their questions, the need for Question Answering (QA) systems became a necessity. Our Question Answering System answers two types of questions: How and Why questions. The system takes a question given in natural language, expressed in Arabic, and attempts to produce a concise answer. The system's main source of knowledge is a collection of Arabic text documents extracted from the Arabic Wikipedia. The reason for developing this system is the absence of Arabic Question Answering Systems (QASs) that deal with How and Why questions, owing to the complexity of extracting the answers that satisfy this type of question. An Information Retrieval (IR) module is used to retrieve the target document from the corpus. The IR module is coupled with Natural Language Processing (NLP) tools to process the given question and to extract the answer. The major goal of the proposed system is to extract the passage that is likely to contain the answer, based on the semantic similarity between question keywords and the sentences of the passage. We used Precision, Recall and F1 Measure to calculate the accuracy of the system.
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study (Maribel Acosta Deibe)
Summary of crowdsourcing studies to assess the quality of knowledge graphs and complete missing values. Results focus on findings over the DBpedia knowledge graph ( https://wiki.dbpedia.org/).
Related publications:
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Auer, S., & Lehmann, J. Crowdsourcing Linked Data Quality Assessment. In International Semantic Web Conference (pp. 260-276), 2013.
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Flöck, F., & Lehmann, J. Detecting Linked Data Quality issues via Crowdsourcing: A DBpedia Study. Semantic Web Journal, 9(3), 303-335, 2018.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: A hybrid SPARQL engine to enhance query answers via crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture (p. 11). 2015. Best Student Paper Award.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. Enhancing answer completeness of SPARQL queries via crowdsourcing. Journal of Web Semantics, 45, 41-62, 2017.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: An engine for enhancing answer completeness of SPARQL queries via crowdsourcing. Companion Volume of the Web Conference (pp. 501-505). 2018.
Why Watson Won: A Cognitive Perspective (James Hendler)
In this talk, we present how the Watson program, IBM's famous Jeopardy-playing computer, works (based on papers published by IBM); we look at some aspects of potential scoring approaches; and we examine how Watson compares to several well-known systems, along with some preliminary thoughts on using it in future artificial intelligence and cognitive science approaches.
Reflected Intelligence: Lucene/Solr as a Self-Learning Data System (Trey Grainger)
What if your search engine could automatically tune its own domain-specific relevancy model? What if it could learn the important phrases and topics within your domain, automatically identify alternate spellings (synonyms, acronyms, and related phrases) and disambiguate multiple meanings of those phrases, learn the conceptual relationships embedded within your documents, and even use machine-learned ranking to discover the relative importance of different features and then automatically optimize its own ranking algorithms for your domain?
In this presentation, you’ll learn how to do just that: how to evolve Lucene/Solr implementations into self-learning data systems that are able to accept user queries, deliver relevance-ranked results, and automatically learn from your users’ subsequent interactions to continually deliver a more relevant experience for each keyword, category, and group of users.
Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the relevance signals present in the collective feedback from every prior user interaction with the system. Come learn how to move beyond manual relevancy tuning and toward a closed-loop system leveraging both the embedded meaning within your content and the wisdom of the crowds to automatically generate search relevancy algorithms optimized for your domain.
This workshop was presented in Riyadh, Saudi Arabia, on 21-22 January 2019, in collaboration with the Riyadh Data Geeks group.
To learn more about the workshop please see this website:
http://bit.ly/2Ucjmm5
Relation Extraction from the Web using Distant Supervision (Isabelle Augenstein)
Slides of my presentation on "Relation Extraction from the Web using Distant Supervision" at EKAW 2014. Download link for the paper: http://staffwww.dcs.shef.ac.uk/people/I.Augenstein/EKAW2014-Relation.pdf
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i... (Marina Santini)
Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have been carried out to assess the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora and we claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora. We present a case study where we explore the effectiveness of different measures - namely the Mann-Whitney-Wilcoxon test, Kendall correlation coefficient, Kullback–Leibler divergence, log-likelihood and burstiness - to gauge domainhood. Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.
Towards a Quality Assessment of Web Corpora for Language Technology Applications (Marina Santini)
In this study, we focus on the creation and evaluation of domain-specific web corpora. To this purpose, we propose a two-step approach, namely (1) the automatic extraction and evaluation of term seeds from personas and use cases/scenarios; (2) the creation and evaluation of domain-specific web corpora bootstrapped with the term seeds automatically extracted in step 1. Results are encouraging and show that: (1) it is possible to create a fairly accurate term extractor for relatively short narratives; (2) it is straightforward to evaluate a quality such as domain-specificity of web corpora using well-established metrics.
A Web Corpus for eCare: Collection, Lay Annotation and Learning - First Results - (Marina Santini)
In this study, we put forward two claims: 1) it is possible to design a dynamic and extensible corpus without running the risk of getting into scalability problems; 2) it is possible to devise noise-resistant Language Technology applications without affecting performance. To support our claims, we describe the design, construction and limitations of a very specialized medical web corpus, called eCare_Sv_01, and we present two experiments on lay-specialized text classification. eCare_Sv_01 is a small corpus of web documents written in Swedish. The corpus contains documents about chronic diseases. The sublanguage used in each document has been labelled as "lay" or "specialized" by a lay annotator. The corpus is designed as a flexible text resource, where additional medical documents will be appended over time. Experiments show that the lay-specialized labels assigned by the lay annotator are reliably learned by standard classifiers. More specifically, Experiment 1 shows that scalability is not an issue when increasing the size of the datasets to be learned from 156 up to 801 documents. Experiment 2 shows that lay-specialized labels can be learned regardless of the large number of disturbing factors, such as machine-translated documents or low-quality texts, which are numerous in the corpus.
An Exploratory Study on Genre Classification using Readability Features (Marina Santini)
We present a preliminary study that explores whether text features used for readability assessment are reliable genre-revealing features. We empirically explore the difference between genre and domain. We carry out two sets of experiments with both supervised and unsupervised methods. Findings on the Swedish national corpus (the SUC) show that readability cues are good indicators of genre variation.
folksonomy, social tagging, tag clouds, automatic folksonomy construction, word clouds, wordle, context-preserving word cloud visualisation, CPEWCV, seam carving, inflate and push, star forest, cycle cover, quantitative metrics, realized adjacencies, distortion, area utilization, compactness, aspect ratio, running time, semantics in language technology
inferential statistics, statistical inference, language technology, interval estimation, confidence interval, standard error, confidence level, z critical value, confidence interval for proportion, confidence interval for the mean, multiplier
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio (Marina Santini)
attribute selection, constructing decision trees, decision trees, divide and conquer, entropy, gain ratio, information gain, machine learning, pruning, rules, surprisal
1. Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm
Relation Extraction
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Spring 2016
3. Question Answering systems
• Factoid questions:
  • Google
  • Wolfram
  • Ask Jeeves
  • Start
  • …
• Approaches:
  • IR-based
  • Knowledge-based
  • Hybrid
4. Katz et al. (2006)
http://start.csail.mit.edu/publications/FLAIRS0601KatzB.pdf
• START answers natural language questions by presenting components of text and multi-media information drawn from a set of information resources that are hosted locally or accessed remotely through the Internet.
• START targets high precision in its question answering.
• The START system analyzes English text and produces a knowledge base which incorporates, in the form of nested ternary expressions (= triples), the information found in the text.
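To make the idea of nested ternary expressions concrete, here is a minimal sketch (not START's actual implementation; the example sentence and predicate names are illustrative) of storing and querying text-derived facts as triples, where a triple can itself appear inside another triple:

```python
# A tiny knowledge base of ternary expressions (subject, relation, object),
# in the spirit of START's triples; nesting is allowed by letting a triple
# itself be the subject or object of another triple.
kb = set()

def assert_triple(subject, relation, obj):
    kb.add((subject, relation, obj))
    return (subject, relation, obj)

# "Bill surprised Hillary with his answer" -> nested representation
inner = assert_triple("Bill", "surprise", "Hillary")
assert_triple(inner, "with", "answer")

def query(subject=None, relation=None, obj=None):
    """Return triples matching the given components (None = wildcard)."""
    return [t for t in kb
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
            and (obj is None or t[2] == obj)]

print(query(relation="surprise"))  # [('Bill', 'surprise', 'Hillary')]
```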
5. Is it true?: http://uncyclopedia.wikia.com/wiki/Ask_Jeeves
• Ask Jeeves, more correctly known as Ask.com, is a search engine founded in 1996 in California.
• Initially it represented a stereotypical English butler who would "fetch" the answer to any question asked.
• Ask.com is now considered one of the great failures of the internet. The question and answer feature simply didn't work as well as hoped, and after trying his hand at being both a traditional search engine and a terrible kind of "artificial AI" with a bald spot,
• These days Jeeves is ranked as the 4th most successful search engine on the web, and the 4th most successful overall. This seems impressive until you consider that Google holds the top spot with 95% of the market. It has even fallen behind Bing; enough said.
7. Siri
http://en.wikipedia.org/wiki/Siri
• Siri /ˈsɪri/ is an intelligent personal assistant and knowledge navigator which works as an application for Apple Inc.'s iOS.
• The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services.
• The software, both in its original version and as an iOS application, adapts to the user's individual language usage and individual searches (preferences) with continuing use, and returns results that are individualized.
• The name Siri is Scandinavian, a short form of the Norse name Sigrid meaning "beauty" and "victory", and comes from the intended name for the original developer's first child.
8. Chatterbots
• Siri… a conversational "safety net".
• Conversational agents (chatterbots and personal assistants) → customer care, customer analytics (replacing/integrating FAQs and help desks)
Avatar: a picture of a person or animal that represents you on a computer screen, for example in some chat rooms or when you are playing games over the Internet.
10. General IR architecture for factoid questions
[Diagram] Question → Question Processing (Query Formulation, Answer Type Detection) → Document Retrieval over indexed documents → Relevant Docs → Passage Retrieval → Answer passages → Answer Processing → Answer
11. Things to extract from the question
• Answer Type Detection
• Decide the named entity type (person, place) of the answer
• Query Formulation
• Choose query keywords for the IR system
• Question Type classification
• Is this a definition question, a math question, a list question?
• Focus Detection
• Find the question words that are replaced by the answer
• Relation Extraction
• Find relations between entities in the question
12. Common Evaluation Metrics
1. Accuracy (does answer match gold-labeled answer?)
2. Mean Reciprocal Rank:
• The reciprocal rank of a query response is the inverse of the rank of the first correct answer.
• The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q:

MRR = (1/N) · Σ_{i=1}^{N} 1/rank_i
13. Common Evaluation Metrics: MRR
• The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q.
• (ex adapted from Wikipedia)
• 3 ranked answers for a query, with the first one being the one it thinks is most likely correct
• Given those 3 samples, we could calculate the mean reciprocal rank as (1/3 + 1/2 + 1)/3 ≈ 0.61.
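The MRR calculation above can be sketched in a few lines of Python; the ranks below are those of the worked example (first correct answer at rank 3, 2, and 1 for the three queries):

```python
# Mean Reciprocal Rank: average of 1/rank_i over a sample of queries.

def mean_reciprocal_rank(ranks):
    """ranks: rank of the first correct answer per query (0 = no correct answer)."""
    return sum(1.0 / r for r in ranks if r > 0) / len(ranks)

# The example above: first correct answers at ranks 3, 2, and 1.
mrr = mean_reciprocal_rank([3, 2, 1])
print(round(mrr, 2))  # 0.61
```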
14. Complex questions: "What is the 'hajj'?"
• The (bottom-up) snippet method
• Find a set of relevant documents
• Extract informative sentences from the documents (using tf-idf, MMR)
• Order and modify the sentences into an answer
• The (top-down) information extraction method
• build specific answerers for different question types:
• definition questions,
• biography questions,
• certain medical questions
16. Architecture for complex question answering: definition questions
"What is the Hajj?" (Ndocs=20, Len=8)
[Pipeline] Document Retrieval (11 Web documents, 1127 total sentences) → Predicate Identification (383 Non-Specific Definitional sentences, 9 Genus-Species Sentences) → Data-Driven Analysis (Sentence clusters, Importance ordering) → Definition Creation
Sample sentences:
The Hajj, or pilgrimage to Makkah (Mecca), is the central duty of Islam.
The Hajj is a milestone event in a Muslim's life.
The hajj is one of five pillars that make up the foundation of Islam.
...
Generated answer:
The Hajj, or pilgrimage to Makkah [Mecca], is the central duty of Islam. More than two million Muslims are expected to take the Hajj this year. Muslims must perform the hajj at least once in their lifetime if physically and financially able. The Hajj is a milestone event in a Muslim's life. The annual hajj begins in the twelfth month of the Islamic year (which is lunar, not solar, so that hajj and Ramadan fall sometimes in summer, sometimes in winter). The Hajj is a week-long pilgrimage that begins in the 12th month of the Islamic lunar calendar. Another ceremony, which was not connected with the rites of the Ka'ba before the rise of Islam, is the Hajj, the annual pilgrimage to 'Arafat, about two miles east of Mecca, toward Mina…
S. Blair-Goldensohn, K. McKeown and A. Schlaikjer. 2004. Answering Definition Questions: A Hybrid Approach.
17. State-of-the-art: ex
• Top down
• Ming Tan, Cicero dos Santos, Bing Xiang & Bowen Zhou. 2015. LSTM-Based Deep Learning Models for Non-Factoid Answer Selection.
• Di Wang and Eric Nyberg. 2015. A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering. In ACL 2015.
• Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou. 2015. Applying deep learning to answer selection: A study and an open task.
Deep Learning is a new area of Machine Learning research, said to be very promising. It is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text. It is based on neural networks.
18. Practical activity
• Start seems to be limited, but it understands natural language
• Google (presumably helped by Knowledge Graph) is more accurate, but skips natural language (uses keywords).
• Google is customized to the users' preferences (different results)
• Interesting outcomes
• Currency vs. Coin
• What's love?
• Lyric/song vs. Definition question
19. What's the meaning of life?
• Google
Presumably from Knowledge Graph…
22. Acknowledgements
Most slides borrowed or adapted from:
Dan Jurafsky and Christopher Manning, Coursera
Dan Jurafsky and James H. Martin (2015)
J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/
24. Extracting relations from text
• Company report: "International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…"
• Extracted Complex Relation: Company-Founding
Company: IBM
Location: New York
Date: June 16, 1911
Original-Name: Computing-Tabulating-Recording Co.
• But we will focus on the simpler task of extracting relation triples
Founding-year(IBM, 1911)
Founding-location(IBM, New York)
25. Extracting Relation Triples from Text
The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California… Leland Stanford… founded the university in 1891
Stanford EQ Leland Stanford Junior University
Stanford LOC-IN California
Stanford IS-A research university
Stanford LOC-NEAR Palo Alto
Stanford FOUNDED-IN 1891
Stanford FOUNDER Leland Stanford
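The triples above fit naturally into a minimal in-memory structure; a sketch in Python (the `Triple` type is illustrative, not a standard API):

```python
from collections import namedtuple

# A relation triple: (subject, relation, object) — the format shown on the slide.
Triple = namedtuple("Triple", ["subj", "rel", "obj"])

triples = [
    Triple("Stanford", "EQ", "Leland Stanford Junior University"),
    Triple("Stanford", "LOC-IN", "California"),
    Triple("Stanford", "IS-A", "research university"),
    Triple("Stanford", "LOC-NEAR", "Palo Alto"),
    Triple("Stanford", "FOUNDED-IN", "1891"),
    Triple("Stanford", "FOUNDER", "Leland Stanford"),
]

# Query the tiny knowledge base: where is Stanford located?
print([t.obj for t in triples if t.rel == "LOC-IN"])  # ['California']
```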
26. Why Relation Extraction?
• Create new structured knowledge bases, useful for any app
• Augment current knowledge bases
• Adding words to WordNet thesaurus, facts to FreeBase or DBPedia
• Support question answering
• The granddaughter of which actor starred in the movie "E.T."?
(acted-in ?x "E.T.")(is-a ?y actor)(granddaughter-of ?x ?y)
• But which relations should we extract?
27. Automated Content Extraction (ACE)
Relation Extraction Task — relation types and subtypes:
• PHYSICAL: Located, Near
• PART-WHOLE: Geographical, Subsidiary
• PERSON-SOCIAL: Business, Family, Lasting Personal
• ORG AFFILIATION: Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation
• GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
• ARTIFACT: User-Owner-Inventor-Manufacturer
Automatic Content Extraction (ACE) is a research program for developing advanced information extraction technologies. Given a text in natural language, the ACE challenge is to detect:
• entities
• relations between entities
• events
28. Automated Content Extraction (ACE)
• Physical-Located PER-GPE
He was in Tennessee
• Part-Whole-Subsidiary ORG-ORG
XYZ, the parent company of ABC
• Person-Social-Family PER-PER
John's wife Yoko
• Org-AFF-Founder PER-ORG
Steve Jobs, co-founder of Apple…
30. Extracting UMLS relations from a sentence
Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes
↓
Echocardiography, Doppler DIAGNOSES Acquired stenosis
31. Databases of Wikipedia Relations
Relations extracted from the Wikipedia Infobox:
Stanford state California
Stanford motto "Die Luft der Freiheit weht"
…
32. Relation databases that draw from Wikipedia
• Resource Description Framework (RDF) triples: subject predicate object
Golden Gate Park location San Francisco
dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco
• The DBpedia project uses the Resource Description Framework (RDF) to represent the extracted information and consists of 3 billion RDF triples, 580 million extracted from the English edition of Wikipedia and 2.46 billion from other language editions (Wikipedia, March 2016).
• Frequent Freebase relations:
people/person/nationality, location/location/contains
people/person/profession, people/person/place-of-birth
biology/organism_higher_classification
film/film/genre
DBpedia is a project aiming to extract structured content from the information created as part of the Wikipedia project. Freebase was a large collaborative knowledge base consisting of data composed mainly by its community members (cf. Semantic Web) → Knowledge Graph: https://en.wikipedia.org/wiki/Freebase
33. How to build relation extractors
1. Hand-written patterns
2. Supervised machine learning
3. Semi-supervised and unsupervised
• Bootstrapping (using seeds)
• Distant supervision
• Unsupervised learning from the web
35. Rules for extracting the IS-A relation
Early intuition from Hearst (1992)
• "Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use"
• What does Gelidium mean?
• How do you know?
37. Hearst's Patterns for extracting IS-A relations
(Hearst, 1992): Automatic Acquisition of Hyponyms
"Y such as X ((, X)* (, and|or) X)"
"such Y as X"
"X or other Y"
"X and other Y"
"Y including X"
"Y, especially X"
38. Hearst's Patterns for extracting IS-A relations
Hearst pattern | Example occurrences
X and other Y | ...temples, treasuries, and other important civic buildings.
X or other Y | Bruises, wounds, broken bones or other injuries...
Y such as X | The bow lute, such as the Bambara ndang...
Such Y as X | ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X | ...common-law countries, including Canada and England...
Y, especially X | European countries, especially France, England, and Spain...
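Two of the patterns above can be approximated with plain regular expressions; a toy sketch (real systems match over POS-tagged or parsed text, and the NP-boundary handling here is deliberately naive):

```python
import re

# Minimal sketch of two Hearst patterns: "Y such as X" and "X and other Y".
# Each pattern maps its match to a (hyponym X, hypernym Y) pair.
PATTERNS = [
    (re.compile(r"(\w[\w ]*?) such as (\w[\w ]*)"), lambda m: (m.group(2), m.group(1))),
    (re.compile(r"(\w[\w ]*?) and other (\w[\w ]*)"), lambda m: (m.group(1), m.group(2))),
]

def extract_isa(sentence):
    """Return (hyponym X, hypernym Y) pairs found by the patterns."""
    pairs = []
    for regex, to_pair in PATTERNS:
        for m in regex.finditer(sentence):
            pairs.append(to_pair(m))
    return pairs

print(extract_isa("red algae such as Gelidium"))  # [('Gelidium', 'red algae')]
```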
39. Hand-built patterns for relations
• Plus:
• Human patterns tend to be high-precision
• Can be tailored to specific domains
• Minus:
• Human patterns are often low-recall
• A lot of work to think of all possible patterns!
• Don't want to have to do this for every relation!
• We'd like better accuracy
41. Supervised machine learning for relations
• Choose a set of relations we'd like to extract
• Choose a set of relevant named entities
• Find and label data
• Choose a representative corpus
• Label the named entities in the corpus
• Hand-label the relations between these entities
• Break into training, development, and test
• Train a classifier on the training set
42. How to do classification in supervised relation extraction
1. Find all pairs of named entities (usually in same sentence)
2. Decide if the 2 entities are related
3. If yes, classify the relation
• Why the extra step?
• Faster classification training by eliminating most pairs
• Can use distinct feature-sets appropriate for each task.
43. Word Features for Relation Extraction
American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
• Headwords of M1 and M2, and combination
Airlines / Wagner / Airlines-Wagner
• Bag of words and bigrams in M1 and M2
{American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
• Words or bigrams in particular positions left and right of M1/M2
M2: -1 spokesman
M2: +1 said
• Bag of words or bigrams between the two entities
{a, AMR, of, immediately, matched, move, spokesman, the, unit}
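A hypothetical sketch of how a few of these features could be computed for the example sentence; the token spans of the two mentions are hard-coded for illustration (a real system would take them from the NER output):

```python
# Tokenized example sentence, with commas as separate tokens.
tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said").split()
m1 = (0, 2)    # token span of Mention 1: "American Airlines"
m2 = (14, 16)  # token span of Mention 2: "Tim Wagner"

features = {
    "head_m1": tokens[m1[1] - 1],                           # headword of M1
    "head_m2": tokens[m2[1] - 1],                           # headword of M2
    "head_pair": tokens[m1[1] - 1] + "-" + tokens[m2[1] - 1],
    "m2_left_1": tokens[m2[0] - 1],                         # word at M2 position -1
    "m2_right_1": tokens[m2[1]],                            # word at M2 position +1
    "bow_between": sorted(set(tokens[m1[1]:m2[0]]) - {","}),
}
print(features["head_pair"], features["m2_left_1"], features["m2_right_1"])
# Airlines-Wagner spokesman said
```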
44. Named Entity Type and Mention Level Features for Relation Extraction
American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
• Named-entity types
• M1: ORG
• M2: PERSON
• Concatenation of the two named-entity types
• ORG-PERSON
• Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)
• M1: NAME [it or he would be PRONOUN]
• M2: NAME [the company would be NOMINAL]
45. Parse Features for Relation Extraction
American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
• Base syntactic chunk sequence from one to the other
NP NP PP VP NP NP
• Constituent path through the tree from one to the other
NP ↑ NP ↑ S ↑ S ↓ NP
• Dependency path
Airlines matched Wagner said
46. [Figure: parse of the example sentence]
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
47. Classifiers for supervised methods
• Now you can use any classifier you like
• MaxEnt
• Naïve Bayes
• SVM
• ...
• Train it on the training set, tune on the dev set, test on the test set
48. Evaluation of Supervised Relation Extraction
• Compute P/R/F1 for each relation:

P = (# of correctly extracted relations) / (Total # of extracted relations)
R = (# of correctly extracted relations) / (Total # of gold relations)
F1 = 2PR / (P + R)
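The three formulas can be computed directly from sets of triples; a minimal sketch (the gold and extracted triples below are made up for illustration):

```python
# Precision, recall, and F1 for relation extraction over sets of triples.
def prf1(extracted, gold):
    """extracted, gold: sets of (subj, rel, obj) triples."""
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("IBM", "founded-in", "1911"), ("IBM", "founded-loc", "New York")}
extracted = {("IBM", "founded-in", "1911"), ("IBM", "founded-loc", "Armonk")}
print(prf1(extracted, gold))  # (0.5, 0.5, 0.5)
```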
49. Summary: Supervised Relation Extraction
+ Can get high accuracies with enough hand-labeled training data, if test similar enough to training
- Labeling a large training set is expensive
- Supervised models are brittle, don't generalize well to different genres
51. Seed-based or bootstrapping approaches to relation extraction
• No training set? Maybe you have:
• A few seed tuples or
• A few high-precision patterns
• Can you use those seeds to do something useful?
• Bootstrapping: use the seeds to directly learn to populate a relation
Roughly said: use seeds to initialize a process of annotation, then refine through iterations
52. Relation Bootstrapping (Hearst 1992)
• Gather a set of seed pairs that have relation R
• Iterate:
1. Find sentences with these pairs
2. Look at the context between or around the pair and generalize the context to create patterns
3. Use the patterns to grep for more pairs
53. Bootstrapping
• <Mark Twain, Elmira> Seed tuple
• Grep (google) for the environments of the seed tuple
"Mark Twain is buried in Elmira, NY." → X is buried in Y
"The grave of Mark Twain is in Elmira" → The grave of X is in Y
"Elmira is Mark Twain's final resting place" → Y is X's final resting place.
• Use those patterns to grep for new tuples
• Iterate
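One round of this loop can be sketched over a toy in-memory "corpus" (real systems grep the web, score the induced patterns, and constrain entity boundaries; this sketch does none of that, and the Jane Austen sentence is an illustrative addition):

```python
import re

corpus = [
    "Mark Twain is buried in Elmira, NY.",
    "The grave of Mark Twain is in Elmira",
    "Jane Austen is buried in Winchester",
]

def learn_patterns(pairs, corpus):
    """Steps 1-2: find sentences containing a known pair; keep the context between them."""
    patterns = set()
    for x, y in pairs:
        for sent in corpus:
            if x in sent and y in sent and sent.index(x) < sent.index(y):
                patterns.add(sent[sent.index(x) + len(x):sent.index(y)])
    return patterns

def grep_pairs(patterns, corpus):
    """Step 3: use the learned contexts to find new candidate pairs."""
    found = set()
    for pat in patterns:
        regex = re.compile(r"([A-Z][\w ]*?)" + re.escape(pat) + r"([A-Z]\w*)")
        for sent in corpus:
            m = regex.search(sent)
            if m:
                found.add((m.group(1), m.group(2)))
    return found

seeds = {("Mark Twain", "Elmira")}
pats = learn_patterns(seeds, corpus)   # e.g. ' is buried in '
new_pairs = grep_pairs(pats, corpus)   # includes some noisy matches too
print(("Jane Austen", "Winchester") in new_pairs)  # True
```

The noisy extra matches that `grep_pairs` also returns are exactly why real bootstrappers generalize and score patterns across iterations.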
54. DIPRE: Extract <author, book> pairs
Brin, Sergey. 1998. Extracting Patterns and Relations from the World Wide Web.
• Start with 5 seeds:
Author | Book
Isaac Asimov | The Robots of Dawn
David Brin | Startide Rising
James Gleick | Chaos: Making a New Science
Charles Dickens | Great Expectations
William Shakespeare | The Comedy of Errors
• Find instances:
The Comedy of Errors, by William Shakespeare, was
The Comedy of Errors, by William Shakespeare, is
The Comedy of Errors, one of William Shakespeare's earliest attempts
The Comedy of Errors, one of William Shakespeare's most
• Extract patterns (group by middle, take longest common prefix/suffix)
?x , by ?y ,
?x , one of ?y 's
• Now iterate, finding new seeds that match the pattern
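The "group by middle, take longest common prefix/suffix" step can be sketched as follows; the occurrence contexts come from the example above, and `generalize` is a hypothetical helper, not DIPRE's actual code:

```python
import os

def common_suffix(strings):
    # Longest common suffix = reversed common prefix of the reversed strings.
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def generalize(contexts):
    """contexts: (before, middle, after) strings around each <?x ... ?y> occurrence,
    already grouped so that all middles are identical."""
    befores = [b for b, _, _ in contexts]
    afters = [a for _, _, a in contexts]
    middle = contexts[0][1]
    return common_suffix(befores) + "?x" + middle + "?y" + os.path.commonprefix(afters)

occurrences = [
    ("", ", by ", ", was"),  # "The Comedy of Errors, by William Shakespeare, was"
    ("", ", by ", ", is"),   # "The Comedy of Errors, by William Shakespeare, is"
]
print(repr(generalize(occurrences)))  # '?x, by ?y, '
```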
55. Distant Supervision
• Combine bootstrapping with supervised learning
• Instead of 5 seeds,
• Use a large database to get a huge # of seed examples
• Create lots of features from all these examples
• Combine in a supervised classifier
Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17.
Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM 2007.
Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL09.
56. Distant supervision paradigm
• Like supervised classification:
• Uses a classifier with lots of features
• Supervised by detailed hand-created knowledge
• Doesn't require iteratively expanding patterns
• Like unsupervised classification:
• Uses very large amounts of unlabeled data
• Not sensitive to genre issues in training corpus
57. Distantly supervised learning of relation extraction patterns
1. For each relation (e.g. Born-In)
2. For each tuple in a big database: <Edwin Hubble, Marshfield>, <Albert Einstein, Ulm>
3. Find sentences in a large corpus with both entities:
Hubble was born in Marshfield
Einstein, born (1879), Ulm
Hubble's birthplace in Marshfield
4. Extract frequent features (parse, words, etc.):
PER was born in LOC
PER, born (XXXX), LOC
PER's birthplace in LOC
5. Train supervised classifier using thousands of patterns:
P(born-in | f1, f2, f3, …, f70000)
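Steps 2-3, matching database tuples against a corpus to harvest (noisy) positive training examples, can be sketched like this; the surname-only entity matching is a deliberate simplification:

```python
# Distant-supervision labeling: any sentence that mentions both entities of a
# database tuple becomes a (noisy) positive example for the relation.
database = {("Edwin Hubble", "Marshfield"), ("Albert Einstein", "Ulm")}  # born-in
corpus = [
    "Hubble was born in Marshfield",
    "Einstein, born (1879), Ulm",
    "Einstein visited Princeton",
]

def label(database, corpus):
    examples = []
    for person, place in database:
        surname = person.split()[-1]  # crude entity matching, for illustration
        for sent in corpus:
            if surname in sent and place in sent:
                examples.append((sent, "born-in"))
    return examples

training = label(database, corpus)
print(len(training))  # 2 — the third sentence mentions no birthplace
```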
58. Unsupervised relation extraction
M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. IJCAI.
• Open Information Extraction:
• extract relations from the web with no training data, no list of relations
1. Use parsed data to train a "trustworthy tuple" classifier
2. Single-pass: extract all relations between NPs, keep if trustworthy
3. Assessor ranks relations based on text redundancy
(FCI, specializes in, software development)
(Tesla, invented, coil transformer)
59. Evaluation of Semi-supervised and Unsupervised Relation Extraction
• Since it extracts totally new relations from the web
• There is no gold set of correct instances of relations!
• Can't compute precision (don't know which ones are correct)
• Can't compute recall (don't know which ones were missed)
• Instead, we can approximate precision (only)
• Draw a random sample of relations from the output, check precision manually:

P̂ = (# of correctly extracted relations in the sample) / (Total # of extracted relations in the sample)

• Can also compute precision at different levels of recall
• Precision for top 1000 new relations, top 10,000 new relations, top 100,000
• In each case taking a random sample of that set
• But no way to evaluate recall
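The sampling estimate P̂ can be sketched as follows; both the extracted relations and the "manual" correctness judgments are simulated stand-ins for a human annotation pass:

```python
import random

# Simulated output: 1000 extracted relations, plus a stand-in for the manual
# correctness judgments an annotator would produce for the sampled subset.
extracted = [f"rel_{i}" for i in range(1000)]
is_correct = {r: (i % 4 != 0) for i, r in enumerate(extracted)}  # ~75% correct

random.seed(0)
sample = random.sample(extracted, 100)  # the relations to check by hand
p_hat = sum(is_correct[r] for r in sample) / len(sample)
print(0.0 <= p_hat <= 1.0)  # True — P-hat estimates precision on the sample
```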