The Web is inundated with information in many different formats, including semi-structured and unstructured data. Machine Reading is a research area that aims to build systems that can read natural-language text, extract knowledge, and store it in knowledge bases. Machine Reading systems are thus developed to produce language-understanding technology that automatically processes text in affordable time. This tutorial explores the idea of automatically reading the Web using Machine Reading techniques. Four of the most successful Machine Reading approaches intended to Read the Web (namely the KnowItAll, YAGO, NELL and DBPedia systems) will be presented and discussed. The principles, the subtleties, and the current results of each approach will be addressed. On-line resources from each approach will be explored, and future directions for each system will be pointed out. YAGO, KnowItAll, NELL and DBPedia are not the only research efforts focusing on Reading the Web; they were selected for this tutorial because they represent four different and very relevant approaches to the problem, not because they are the only relevant ones. Besides the four aforementioned systems, some other independent contributions to the Read the Web idea will be mentioned as related work.
Machine Reading the Web: beyond Named Entity Recognition and Relation Extraction
1. Estevam R. Hruschka Jr. (Federal University of São Carlos)
Machine Reading the Web: Beyond Named Entity Recognition and Relation Extraction
2. Disclaimers
• Previous versions of this tutorial were presented at IBERAMIA2012 (http://iberamia2012.dsic.upv.es/tutorials/) and WWW2013 (http://www2013.org/program/machine-reading-the-web/). Also, a short version was presented at the ECMLPKDD2015 Summer School (http://www.ecmlpkdd2015.org/summer-school/ss-schedule).
• Feel free to e-mail me (estevam.hruschka@gmail.com) with questions about this tutorial or any feedback/suggestions/criticisms. Your feedback can help improve the quality of these slides, so it is very welcome.
• As with many tutorials' slides, these slides were prepared to be presented first and studied later. Thus, they are meant to be more self-contained than slides from a paper presentation.
3. Disclaimers
• Due to time constraints, I do not intend to cover all the algorithms and publications related to YAGO, KnowItAll, NELL and DBPedia. What I intend, instead, is to give an overview of all four projects and of the main approach to "Read the Web" used in each one.
• YAGO, KnowItAll, NELL and DBPedia are not the only research efforts focusing on "Reading the Web". They were selected for this tutorial because they represent four different and very relevant approaches to the problem; this does not mean they are the best (or the only relevant) ones.
4. Outline
• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL
  – DBPedia
5. Outline
• Machine Learning
• Machine Reading
• Reading the Web
  – YAGO
  – KnowItAll
  – NELL
  – DBPedia
25. Outline
• Machine Learning
• Machine Reading
• Reading the Web
  – DBPedia
  – YAGO
  – KnowItAll
  – NELL
26. Machine Learning
• What is Machine Learning? The field of Machine Learning seeks to answer the question: "How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?" [Mitchell, 2006]
27. Machine Learning
• What is Machine Learning? A machine learns with respect to a particular:
  – task T
  – performance metric P
  – type of experience E
  if the system reliably improves its performance P at task T, following experience E. [Mitchell, 1997]
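Mitchell's (T, P, E) framing can be made concrete with a toy sketch. The spam-filter scenario, word lists, and examples below are invented for illustration; they are not from the tutorial.

```python
def train(examples):
    """E (experience): labeled (text, is_spam) pairs; memorize words seen in spam."""
    spam_words = set()
    for text, is_spam in examples:
        if is_spam:
            spam_words.update(text.lower().split())
    return spam_words

def classify(spam_words, text):
    """T (task): label a message as spam if it shares a word with known spam."""
    return any(w in spam_words for w in text.lower().split())

def accuracy(spam_words, test_set):
    """P (performance metric): fraction of test messages labeled correctly."""
    correct = sum(classify(spam_words, t) == y for t, y in test_set)
    return correct / len(test_set)

experience = [("win money now", True), ("meeting at noon", False)]
model = train(experience)
print(accuracy(model, [("win a prize", True), ("lunch at noon", False)]))
```

The system "learns" in Mitchell's sense if P (accuracy) reliably improves as E (the labeled examples) grows.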
28. Machine Learning
• Examples of Machine Learning approaches for different tasks (T), performance metrics (P) and experiences (E):
  – data mining
  – autonomous discovery
  – database updating
  – programming by example
  – pattern recognition
62. Semi-supervised Learning (one simple anecdotal approach)
[Scatter plot: two labeled series (Series1, Series2) plus unlabeled points; both axes run 0-25.]
• What model should be chosen?
63. Outline
• Machine Learning
• Machine Reading
• Reading the Web
  – DBPedia
  – YAGO
  – KnowItAll
  – NELL
64. Machine Reading
• "The autonomous understanding of text" [Etzioni et al., 2007]
• "One of the most important methods by which human beings learn is by reading" [Clark et al., 2007]; thus, why not build machines capable of learning by reading?
65. Machine Reading
• "The problem of deciding what was implied by a written text, of reading between the lines, is the problem of inference." [Norvig, 2007]
• Typically, Machine Reading is different from Natural Language Processing alone
66. It’s about the disappearance forty years ago of Harriet Vanger, a young
scion of one of the wealthiest families in Sweden, and about her uncle,
determined to know the truth about what he believes was her murder.
Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby.
The old man draws Blomkvist in by promising solid evidence against Wennerström.
Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real
assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is
home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist
becomes acquainted with the members of the extended Vanger family, most of whom resent
his presence. He does, however, start a short lived affair with Cecilia, the niece of Henrik.
After discovering that Salander has hacked into his computer, he persuades her to assist
him with research. They eventually become lovers, but Blomkvist has trouble getting close
to Lisbeth who treats virtually everyone she meets with hostility. Ultimately the two
discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer.
A 24-year-old computer hacker sporting an assortment of tattoos and body piercings
supports herself by doing deep background investigations for Dragan Armansky, who, in
turn, worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill."
Machine Reading (slide adapted from [Hady et al., 2011])
67. Machine Reading (adapted from [Hady et al., 2011]): the same passage, with co-referent entity mentions marked "same".
68. Machine Reading (adapted from [Hady et al., 2011]): the same passage, with further co-referent mentions marked "same".
69. Machine Reading (adapted from [Hady et al., 2011]): the same passage, now annotated with relations between entities: uncleOf, owns, hires, headOf.
70. Machine Reading (adapted from [Hady et al., 2011]): the same passage, with additional relations: affairWith, enemyOf.
71. Machine Reading
• One important (initial) approach to machine reading is to extract facts from text and store them in a structured form.
• Facts can be seen as entities and their relations
• An ontology is one of the most common representations for the extracted facts
80. Machine Reading
• Named Entity Resolution/Recognition
• Relation Extraction
• Co-reference and Polysemy Resolution
• Relation Discovery
• Inference
• Knowledge Base
• Document/Sentence Understanding (Micro-Reading)
81. Machine Reading
• Named Entity Resolution/Recognition
• Relation Extraction
• Co-reference and Polysemy Resolution
• Relation Discovery
• Inference
• Knowledge Base
• Document/Sentence Understanding (Micro-Reading)
82. Machine Reading
• Named Entity Resolution/Recognition
  – Semi-structured data: the "Low-Hanging Fruit"
    • Wikipedia infoboxes & categories
    • HTML lists & tables, etc.
  – Free text
    • Hearst patterns; clustering by verbal phrases
    • Natural-language processing
    • Advanced patterns & iterative bootstrapping ("Dual Iterative Pattern Relation Extraction")
83. Named Entity Recognition
• Named Entity Recognition [Nadeau & Sekine, 2007]
  – the term "Named Entity" was coined for the Sixth Message Understanding Conference (MUC-6) (Grishman & Sundheim, 1996).
  – an important sub-task of IE is called "Named Entity Recognition and Classification (NERC)".
84. Named Entity Recognition [Nadeau & Sekine, 2007]
• recognize information units like names, including person, organization and location names, and numeric expressions including time, date, money and percent expressions.
• In Machine Reading, many other entities: product, kitchen item, sport, etc.
85. Named Entity Resolution
• Named Entity Resolution [Theobald & Weikum, 2012]
  – Which individual entities belong to which classes?
    • instanceOf (Surajit Chaudhuri, computer scientists),
    • instanceOf (Barbara Liskov, computer scientists),
    • instanceOf (Barbara Liskov, female humans), …
86. Named Entity Recognition
• Named Entity Recognition as a machine learning task.
  – Supervised Learning: text → NLP tools (POS, Parse Trees) → Feature Extraction → Classifier
87. Named Entity Recognition
• Named Entity Recognition as a Machine Learning task.
  – Supervised Learning
  – Possible features [Ratinov & Roth, 2009], [Khambhatla, 2004], [Zhou et al., 2005]
    • Words "around" and including entities
    • POS (Part-Of-Speech)
    • Prefixes and suffixes
    • Capitalization
    • Number of words
    • Number of characters
    • First word, last word
    • Gazetteer matches
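A minimal sketch of how a few of the feature types listed above could be computed for one token of a tokenized sentence. The function name and feature keys are invented for this illustration; real NER systems use much richer feature sets.

```python
def token_features(tokens, i, window=2):
    """Features for tokens[i]: surrounding words, affixes, capitalization, length."""
    tok = tokens[i]
    feats = {
        "word": tok.lower(),
        "prefix3": tok[:3].lower(),   # prefixes and suffixes
        "suffix3": tok[-3:].lower(),
        "is_capitalized": tok[:1].isupper(),
        "num_chars": len(tok),
    }
    # words "around" the candidate token, within a fixed window
    for off in range(-window, window + 1):
        if off != 0 and 0 <= i + off < len(tokens):
            feats[f"word@{off}"] = tokens[i + off].lower()
    return feats

tokens = "Barbara Liskov teaches at MIT".split()
print(token_features(tokens, 0))
```

Each token's feature dictionary would then be fed to the classifier shown in the pipeline.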
88. Named Entity Recognition
• Supervised Learning: text → NLP tools (POS, Parse Trees) → Feature Extraction → Classifier
89. Named Entity Recognition
• Supervised Learning: text → NLP tools (POS, Parse Trees) → Feature Extraction → Classifier (with Kernels)
90. Named Entity Recognition [Bach & Badaskar, 2007]
• Supervised Learning using Kernels
  – A Kernel defines similarity implicitly in a higher-dimensional space
  – Can be based on Strings, Word Sequences, Parse Trees, etc.
    • For strings, similarity ∝ number of common substrings (or subsequences)
    • Recommended reading on string kernels: [Lodhi et al., 2002]
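The intuition "similarity ∝ number of common substrings" can be sketched directly. This is only the intuition, not a real string kernel: the kernels of [Lodhi et al., 2002] operate over gapped subsequences with a decay factor, which this toy count omits.

```python
def substrings(s, max_len=3):
    """All substrings of s up to length max_len."""
    return {s[i:i + n] for n in range(1, max_len + 1)
            for i in range(len(s) - n + 1)}

def substring_similarity(a, b):
    """Count substrings (up to length 3) shared by the two strings."""
    return len(substrings(a) & substrings(b))

print(substring_similarity("Microsoft", "Microsoft Corp"))
print(substring_similarity("Microsoft", "banana"))
```

A kernelized classifier never materializes the substring space explicitly; the kernel function alone supplies the pairwise similarities.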
92. Named Entity Recognition
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and NE instances.
  [Diagram: a set of labeled pattern examples]
93. Named Entity Recognition
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and NE instances.
  [Diagram: labeled pattern examples, e.g. "X is headquartered in", "is the CEO of X"]
94. Named Entity Recognition
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and NE instances.
  [Diagram: the labeled pattern examples ("X is headquartered in", "is the CEO of X") feed an NE Instances Classifier]
95. Named Entity Recognition
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and NE instances.
  [Diagram: the NE Instances Classifier outputs a set of labeled instances]
96. Named Entity Recognition
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and NE instances.
  [Diagram: the labeled instances now include Google and Apple]
97. Named Entity Recognition
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and NE instances.
  [Diagram: the labeled instances feed an NE Pattern Classifier, closing the bootstrap loop]
98. Named Entity Recognition
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and NE instances.
  [Diagram: the full bootstrap loop; caption: What about unsupervised?]
99. Named Entity Recognition
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and NE instances.
  [Diagram: the same bootstrap loop with the example labels removed; caption: What about unsupervised?]
100. Named Entity Recognition
• Unsupervised Approaches
  – Bootstrapping can generate a large number of patterns and NE instances.
  [Diagram: the same bootstrap loop of pattern and instance classifiers]
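The bootstrap loop pictured in the preceding slides can be sketched in a few lines: seed patterns find new NE instances in text, and the new instances in turn suggest new patterns. The corpus and seed patterns below are invented for illustration; a real system would score and filter candidates rather than accept them all.

```python
import re

corpus = [
    "Google is headquartered in Mountain View",
    "Tim Cook is the CEO of Apple",
    "Apple is headquartered in Cupertino",
]
# seed patterns: X marks the slot where a company name occurs
patterns = [r"(\w+) is headquartered in", r"is the CEO of (\w+)"]

# Step 1: patterns extract NE instances
companies = set()
for sentence in corpus:
    for pat in patterns:
        m = re.search(pat, sentence)
        if m:
            companies.add(m.group(1))

# Step 2: the new instances propose their surrounding contexts as new patterns
new_patterns = set()
for sentence in corpus:
    for company in companies:
        if company in sentence:
            new_patterns.add(sentence.replace(company, "X"))

print(companies)
```

Iterating the two steps grows both the pattern set and the instance set, which is exactly what makes bootstrapping productive but also prone to semantic drift.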
101. Named Entity Recognition
• [Ratinov & Roth, 2009]
103. Machine Reading
• Named Entity Resolution/Extraction
• Relation Extraction
• Co-reference and Polysemy Resolution
• Relation Discovery
• Inference
• Knowledge Base Representation
• Document/Sentence Understanding (Micro-Reading)
104. Machine Reading
• Relation Extraction
  – Semi-structured data: the "Low-Hanging Fruit"
    • Wikipedia infoboxes & categories
    • HTML lists & tables, etc.
  – Free text
    • Hearst patterns; clustering by verbal phrases
    • Natural-language processing
    • Advanced patterns & iterative bootstrapping ("Dual Iterative Pattern Relation Extraction")
105. Machine Reading
• Relation Extraction [Theobald & Weikum, 2012]
  – Which instances (pairs of individual entities) are there for given binary relations with specific type signatures?
    • hasAdvisor (JimGray, MikeHarrison)
    • hasAdvisor (HectorGarcia-Molina, Gio Wiederhold)
    • hasAdvisor (Susan Davidson, Hector Garcia-Molina)
    • graduatedAt (JimGray, Berkeley)
    • graduatedAt (HectorGarcia-Molina, Stanford)
    • hasWonPrize (JimGray, TuringAward)
    • bornOn (JohnLennon, 9Oct1940)
    • diedOn (JohnLennon, 8Dec1980)
    • marriedTo (JohnLennon, YokoOno)
106. Relation Extraction [Bach & Badaskar, 2007]
• Extracting semantic relations between entities in text
• Relation extraction as a Machine Learning task.
  – Supervised Learning: text → NLP tools (POS, Parse Trees) → Feature Extraction → Classifier
107. Relation Extraction [Bach & Badaskar, 2007]
• Relation extraction as a Machine Learning task.
  – Supervised Learning
  – Possible features [Khambhatla, 2004], [Zhou et al., 2005]
    • Words between and including entities
    • Types of entities (person, location, etc.)
    • Number of entities between the two entities; whether both entities belong to the same chunk
    • Number of words separating the two entities
    • Path between the two entities in a parse tree
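Two of the listed features can be sketched for a candidate entity pair in a tokenized sentence: the words between the entities and their distance. The entity spans are given by hand here; a real system would take them from an NER step, and parse-tree paths would come from a parser.

```python
def pair_features(tokens, e1, e2):
    """e1, e2: (start, end) token spans of the two entities, e1 before e2."""
    between = tokens[e1[1]:e2[0]]          # words between the two entities
    return {
        "words_between": " ".join(between).lower(),
        "num_words_between": len(between), # distance feature
        "e1_head": tokens[e1[1] - 1].lower(),
        "e2_head": tokens[e2[1] - 1].lower(),
    }

tokens = "Jim Gray graduated at Berkeley".split()
print(pair_features(tokens, (0, 2), (4, 5)))
```

The resulting feature dictionary is what the classifier in the pipeline consumes for each candidate pair.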
108. Relation Extraction [Bach & Badaskar, 2007]
• Extracting semantic relations between entities in text
• Relation extraction as a classification task.
  – Supervised Learning: text → NLP tools (POS, Parse Trees, NER) → Feature Extraction → Classifier
109. Relation Extraction [Bach & Badaskar, 2007]
• Extracting semantic relations between entities in text
• Relation extraction as a classification task.
  – Supervised Learning: text → NLP tools (POS, Parse Trees, NER) → Feature Extraction → Classifier (with Kernels)
110. Relation Extraction [Bach & Badaskar, 2007]
• Supervised Learning using Kernels
  – A Kernel defines similarity implicitly in a higher-dimensional space
  – Can be based on Strings, Word Sequences, Parse Trees, etc.
    • For strings, similarity ∝ number of common substrings (or subsequences)
    • Recommended reading on string kernels: [Lodhi et al., 2002]
112. Relation Extraction
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: a set of labeled pattern examples]
113. Relation Extraction
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: labeled pattern examples, e.g. "X is headquartered in Y", "Y is the headquarter of X"]
114. Relation Extraction
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the labeled pattern examples feed a Pair-of-Instances Classifier]
115. Relation Extraction
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the Pair-of-Instances Classifier outputs a set of labeled instance pairs]
116. Relation Extraction
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the labeled pairs now include Google-Mountain View and Apple-Cupertino]
117. Relation Extraction
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the labeled pairs feed a Pattern Classifier, closing the bootstrap loop]
118. Relation Extraction
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the full bootstrap loop; caption: What about unsupervised?]
119. Relation Extraction
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the same bootstrap loop with the example labels removed; caption: What about unsupervised?]
120. Relation Extraction
• Unsupervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the same bootstrap loop of pattern and pair-of-instances classifiers]
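Pattern-based relation extraction with the two example patterns from the slides can be sketched with regular expressions. The sentences are invented for illustration, and the crude "capitalized words" heuristic stands in for proper entity recognition.

```python
import re

# The two example patterns, with named slots for the company X and the city Y.
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"  # crude proxy for an entity mention
patterns = [
    re.compile(rf"(?P<X>{NAME}) is headquartered in (?P<Y>{NAME})"),
    re.compile(rf"(?P<Y>{NAME}) is the headquarter of (?P<X>{NAME})"),
]

def extract_headquartered_in(sentences):
    """Return (company, city) pairs matched by either pattern."""
    pairs = set()
    for s in sentences:
        for pat in patterns:
            for m in pat.finditer(s):
                pairs.add((m.group("X"), m.group("Y")))
    return pairs

sentences = [
    "Google is headquartered in Mountain View, California.",
    "Cupertino is the headquarter of Apple.",
]
print(extract_headquartered_in(sentences))
```

Note that the two surface patterns map to the same relation with swapped argument order, which the named groups make explicit.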
122. Machine Reading
• Named Entity Resolution/Extraction
• Relation Extraction
• Co-reference and Polysemy Resolution
• Relation Discovery
• Inference
• Knowledge Base Representation
• Document/Sentence Understanding (Micro-Reading)
123. Co-Reference and Polysemy Resolution
• Co-reference: expressions that refer to the same entity
Example (figure) taken from: http://nlp.stanford.edu/projects/coref.shtml
124. Co-Reference and Polysemy Resolution
• Co-reference: expressions that refer to the same entity (within-document co-reference)
Example (figure) taken from: http://nlp.stanford.edu/projects/coref.shtml
125. Co-Reference and Polysemy Resolution — the same within-document co-reference example (figure from http://nlp.stanford.edu/projects/coref.shtml).
126. Co-Reference and Polysemy Resolution
• Co-reference: expressions that refer to the same entity
Example (figure) adapted from [Krishnamurthy & Mitchell, 2011]: the mentions "apple computer" and "Apple Computer"
127. Co-Reference and Polysemy Resolution
• Co-reference: expressions that refer to the same entity
Example (figure) adapted from [Krishnamurthy & Mitchell, 2011]: the mentions "apple", "apple computer" and "Apple Computer"
128. Co-Reference and Polysemy Resolution
• Co-reference: expressions that refer to the same entity (cross-document co-reference)
Example (figure) adapted from [Krishnamurthy & Mitchell, 2011]: the mentions "apple", "apple computer" and "Apple Computer"
129. Co-Reference and Polysemy Resolution
• Co-reference: expressions that refer to the same entity (cross-document co-reference)
• Which names denote which entities? [Theobald & Weikum, 2012]
  – means ("Lady Di", Diana Spencer),
  – means ("Diana Frances Mountbatten-Windsor", Diana Spencer), …
  – means ("Madonna", Madonna Louise Ciccone),
  – means ("Madonna", Madonna (painting by Edvard Munch)), …
130. Co-Reference and Polysemy Resolution
• Polysemy: the capacity for a sign (such as a word, phrase, or symbol) to have multiple meanings [Wikipedia]
131. Co-Reference and Polysemy Resolution
• Polysemy: the capacity for a sign (such as a word, phrase, or symbol) to have multiple meanings [Wikipedia]
Example (figure) adapted from [Krishnamurthy & Mitchell, 2011]: the mention "apple" may denote apple (the fruit) or Apple Computer
132. Co-Reference and Polysemy Resolution
• Co-Reference and Polysemy together
Example (figure) adapted from [Krishnamurthy & Mitchell, 2011]: the mentions "apple" and "apple computer" versus the entities apple (the fruit) and Apple Computer
133. Co-Reference and Polysemy Resolution
• Co-reference and Polysemy:
  – Supervised Learning: text → NLP tools (POS, Parse Trees) → Feature Extraction → Classifier
134. Co-Reference and Polysemy Resolution
• Co-Reference Resolution
  – Supervised Learning
  – Possible features [Bengtson & Roth, 2008]
135. Co-Reference and Polysemy Resolution
• Co-Reference Resolution
  – Supervised Learning
  – Possible features [Bengtson & Roth, 2008]
136. Co-Reference and Polysemy Resolution
• Co-reference and Polysemy:
  – Supervised Learning: text → NLP tools (POS, Parse Trees) → Feature Extraction → Classifier (with Kernels)
137. Co-Reference and Polysemy Resolution
• Supervised Learning using Kernels
  – A Kernel defines similarity implicitly in a higher-dimensional space
  – Can be based on Strings, Word Sequences, Parse Trees, etc.
    • For strings, similarity ∝ number of common substrings (or subsequences)
    • Recommended reading on string kernels: [Lodhi et al., 2002]
138. Co-Reference and Polysemy Resolution
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: a set of labeled pattern examples]
139. Co-Reference and Polysemy Resolution
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: labeled pattern examples, e.g. "X also known as Y"]
140. Co-Reference and Polysemy Resolution
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the labeled pattern examples feed a Pair-of-Instances Classifier]
141. Co-Reference and Polysemy Resolution
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the Pair-of-Instances Classifier outputs a set of labeled instance pairs]
142. Co-Reference and Polysemy Resolution
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the labeled pairs now include Apple Computer - Apple]
143. Co-Reference and Polysemy Resolution
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the labeled pairs feed a Pattern Classifier, closing the bootstrap loop]
144. Co-Reference and Polysemy Resolution
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the full bootstrap loop; caption: What about unsupervised?]
145. Co-Reference and Polysemy Resolution
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the same bootstrap loop with the example labels removed; caption: What about unsupervised?]
146. Co-Reference and Polysemy Resolution
• Semi-supervised Approaches
  – Bootstrapping can generate a large number of patterns and relation instances.
  [Diagram: the same bootstrap loop of pattern and pair-of-instances classifiers]
147. Co-Reference and Polysemy Resolution
• Co-Reference Resolution: [Singh et al., 2011], [Krishnamurthy & Mitchell, 2011], [Dutta & Weikum, 2015]
• Polysemy Resolution: [Krishnamurthy & Mitchell, 2011], [Galárraga et al., 2014]
148. Machine Reading
• Named Entity Resolution/Extraction
• Relation Extraction
• Co-reference and Synonym Resolution
• Relation Discovery
• Inference
• Knowledge Base Representation
• Document/Sentence Understanding (Micro-Reading)
149. Machine Reading
• Relation Discovery
  – Which new relations are there for a given pair of entities?
    • hasAdvisor (JimGray, MikeHarrison)
150. Machine Reading
• Relation Discovery
  – Which new relations are there for a given pair of entities?
    • hasAdvisor (JimGray, MikeHarrison)
    • hasCoAuthor (HectorGarcia-Molina, Gio Wiederhold)
151. Machine Reading
• Relation Discovery
  – Which new relations are there for a given pair of entities?
    • hasAdvisor (JimGray, MikeHarrison)
    • hasCoAuthor (HectorGarcia-Molina, Gio Wiederhold)
    • graduatedAt (JimGray, Berkeley)
152. Machine Reading
• Relation Discovery
  – Which new relations are there for a given pair of entities?
    • hasAdvisor (JimGray, MikeHarrison)
    • hasCoAuthor (HectorGarcia-Molina, Gio Wiederhold)
    • graduatedAt (JimGray, Berkeley)
    • studiedAt (HectorGarcia-Molina, Stanford)
    • bornOn (JohnLennon, 9Oct1940)
    • releasedAlbum (JohnLennon, 10Dec1965)
153. Relation Discovery
[Diagram: sets of labeled pattern examples and labeled instance pairs feed a Clustering Algorithm]
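The clustering step in the diagram can be sketched with a toy example: entity pairs that occur with similar connecting phrases are grouped together, and each group suggests a new relation. The data is invented for illustration, and grouping by the exact phrase is a trivial stand-in for real context clustering over millions of occurrences.

```python
from collections import defaultdict

# (entity pair, connecting phrase observed in text)
contexts = [
    (("JimGray", "Berkeley"), "graduated at"),
    (("HectorGarciaMolina", "Stanford"), "graduated at"),
    (("JimGray", "MikeHarrison"), "was advised by"),
]

# Trivial clustering: one cluster per distinct connecting phrase.
clusters = defaultdict(set)
for pair, phrase in contexts:
    clusters[phrase].add(pair)

print(dict(clusters))
```

Each resulting cluster ("graduated at", "was advised by") is a candidate for a new, previously unmodeled relation.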
154. Machine Reading
• Named Entity Resolution/Extraction
• Relation Extraction
• Co-reference and Synonym Resolution
• Relation Discovery
• Inference
• Knowledge Base Representation
• Document/Sentence Understanding (Micro-Reading)
155. Inference
• Inference is the act or process of deriving logical conclusions from premises known or assumed to be true [Wikipedia]
156. Inference
• Manually crafted inference rules
• Automatically learned inference rules
• Data mining the Knowledge Base
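A manually crafted inference rule applied to a small knowledge base can be sketched as follows. The rule (if X graduatedAt U and U locatedIn C, infer livedIn(X, C)) and the facts are invented for illustration; they are not rules from any of the presented systems.

```python
# Tiny KB of (relation, subject, object) triples.
kb = {
    ("graduatedAt", "JimGray", "Berkeley"),
    ("locatedIn", "Berkeley", "California"),
}

def infer_lived_in(kb):
    """Apply one hand-written rule: graduatedAt(X,U) & locatedIn(U,C) => livedIn(X,C)."""
    inferred = set()
    for rel1, x, u in kb:
        for rel2, u2, c in kb:
            if rel1 == "graduatedAt" and rel2 == "locatedIn" and u == u2:
                inferred.add(("livedIn", x, c))
    return inferred

print(infer_lived_in(kb))
```

Automatically learned rules have the same shape; the difference is that the rule bodies are mined from the KB instead of written by hand.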
157. Machine Reading
• Named Entity Resolution/Extraction
• Relation Extraction
• Co-reference and Synonym Resolution
• Relation Discovery
• Inference
• Knowledge Base Representation
• Document/Sentence Understanding (Micro-Reading)
162. Document/Sentence Understanding (MicroRead)
• "The scientist observed the butterfly with the blue circle"
• "The scientist observed the butterfly with the blue microscope"
163. Document/Sentence Understanding (MicroRead) — the same pair of sentences, highlighting the ambiguous attachment of "with the blue …".
164. Outline
• Machine Learning
• Machine Reading
• Reading the Web
  – DBPedia
  – YAGO
  – KnowItAll
  – NELL
165. Outline
• Machine Learning
• Machine Reading
• Reading the Web
  – DBPedia
  – YAGO
  – KnowItAll
  – NELL
168. DBPedia
• Mapping Wikipedia semi-structured data into RDF triples
• Semi-structured data: the "Low-Hanging Fruit"
169. DBPedia
• How to Read Wikipedia semi-structured data? [Lehmann et al., 2014]
  – Parse the Wikipedia markup language
  – Overcome the lack-of-standards problem
    • The same properties might have different names
    • "Datebirth" and "Birth_date"
    • "Birthplace" and "Birth_place"
  – Instead of "Modeling the World", try to structure the available information
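The lack-of-standards problem above can be sketched with a small attribute-normalization table. The synonym table and property names here are a tiny invented stand-in for DBpedia's community-maintained infobox mappings, not its actual mapping language.

```python
# Different raw infobox attribute names mapped to one canonical property.
ATTRIBUTE_MAP = {
    "datebirth": "birthDate",
    "birth_date": "birthDate",
    "birthplace": "birthPlace",
    "birth_place": "birthPlace",
}

def to_triples(subject, infobox):
    """Map raw infobox attributes of one article to normalized RDF-style triples."""
    return {(subject, ATTRIBUTE_MAP.get(k.lower(), k), v)
            for k, v in infobox.items()}

triples = to_triples("Leonard_Cohen", {"Birth_date": "1934-09-21",
                                       "Birthplace": "Montreal"})
print(triples)
```

Normalizing attribute names before emitting triples is what lets articles with differently named infobox fields end up with comparable RDF properties.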
174. YAGO
• Yet Another Great Ontology - YAGO
• Main Goal: building a conveniently searchable, large-scale, highly accurate knowledge base of common facts in a machine-processable representation
176. YAGO
• Turn the Web into a Knowledge Base [Weikum et al., 2009]
  – Building a comprehensive Knowledge Base of human knowledge
  – knowledge from Wikipedia and WordNet
  – the ontology checks itself for precision
177. YAGO
• The knowledge base is automatically constructed from Wikipedia
• Each article in Wikipedia becomes an entity in the KB (e.g., since Leonard Cohen has an article in Wikipedia, LeonardCohen becomes an entity in YAGO).
185. YAGO
• Certain categories are exploited to deliver type information (e.g., the article about Leonard Cohen is in the category Canadian male poets, so he becomes a Canadian poet).
188. YAGO
• For each category of a page [Hoffart et al., 2012]
– Using shallow parsing, determine the head word of the category name. In the example of Canadian poets, the head word is poets.
– If the head word is in the plural, propose the category as a class and the article entity as an instance
– Link the class to the WordNet taxonomy (most frequent sense of the head word in WordNet)
• Only countable nouns can appear in plural form
• Only countable nouns can be ontological classes
• Thematic categories (such as Canadian poetry) are different from conceptual categories
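The heuristic above can be sketched in a few lines. This is a deliberately crude stand-in: real YAGO uses shallow parsing to find the head word and WordNet plus exception lists to validate it, whereas here the head word is simply the last token and the plural test is a naive suffix check:

```python
# Hypothetical sketch of YAGO's category heuristic:
# plural head word -> the category is proposed as a class,
# and the article entity becomes an instance of it.

def head_word(category):
    """Naive stand-in for shallow parsing: take the last word as the head."""
    return category.split()[-1]

def proposes_class(category):
    """Crude plural test; YAGO instead checks WordNet and exception lists."""
    head = head_word(category)
    return head.endswith("s") and not head.endswith("ss")

print(proposes_class("Canadian male poets"))  # True  -> class; article = instance
print(proposes_class("Canadian poetry"))      # False -> thematic category, skipped
```

The suffix test correctly separates the two slide examples, but a production system needs morphological analysis (e.g., "women", "glass") and the exception lists described on the next slide.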
189. YAGO
• Head words that are not conceptual even though they appear in the plural (such as stubs in Canadian poetry stubs) are in the first list of exceptions.
• Words that do not map to their most frequent sense, but to a different sense, are in the second exception list
– The word capital, e.g., refers to the main city of a country in the majority of cases and not to the financial amount, which is the most frequent sense in WordNet.
190. YAGO
• About 100 manually defined relations
– wasBornOnDate
– locatedIn
– hasPopulation
• Categories and infoboxes are exploited to deliver facts (instances of relations).
• Manually defined patterns map categories and infobox attributes to fact templates
– Infobox attribute born=Montreal, thus wasBornIn(LeonardCohen, Montreal)
• Pattern-based extractions resulted in 2 million extracted entities and 20 million facts
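The mapping from infobox attributes to fact templates can be sketched as a lookup table plus an instantiation step. The template table below is a made-up fragment following the slide's born=Montreal example, not YAGO's actual pattern files:

```python
# Illustrative sketch of manually defined patterns that map infobox
# attributes to fact templates (relation names follow the slide's example).
TEMPLATES = {
    "born": "wasBornIn",          # born=Montreal -> wasBornIn(entity, Montreal)
    "population": "hasPopulation",
}

def attribute_to_fact(entity, attribute, value):
    """Instantiate a fact template for one infobox attribute, if mapped."""
    relation = TEMPLATES.get(attribute)
    if relation is None:
        return None   # unmapped attributes produce no fact
    return (entity, relation, value)

print(attribute_to_fact("LeonardCohen", "born", "Montreal"))
# ('LeonardCohen', 'wasBornIn', 'Montreal')
```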
191. YAGO
• Based on declarative rules (stored in text files)
• The rules take the form of subject-predicate-object triples, so they are basically additional facts
• There are different types of rules
192. YAGO
• Factual rules: definition of all relations, their domains and ranges, and the definition of the classes that make up the YAGO hierarchy of literal types.
• Implication rules: express that if certain facts appear in the knowledge base, then another fact shall be added. Horn-clause rules.
• Replacement rules: for interpreting micro-formats, cleaning up HTML tags, and normalizing numbers.
• Extraction rules: apply primarily to patterns found in the Wikipedia infoboxes, but also to Wikipedia categories, article titles, and even other regular elements in the source such as headings, links, or references.
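A replacement rule of the kind described above can be sketched with a couple of regular expressions. These regexes are hypothetical stand-ins, not YAGO's actual rule files, which are declarative text files:

```python
import re

# Sketch of what a YAGO-style "replacement rule" does before extraction:
# strip HTML tags and normalize number formats.
def apply_replacement_rules(text):
    text = re.sub(r"<[^>]+>", "", text)               # clean up HTML tags
    text = re.sub(r"(\d),(?=\d{3}\b)", r"\1", text)   # 1,704,694 -> 1704694
    return text

print(apply_replacement_rules("<b>Population:</b> 1,704,694"))
# Population: 1704694
```

Running such rules first means the downstream extraction rules can match against clean, uniform text instead of raw wiki markup.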
193. YAGO
• These rule types support the Machine Reading components of Knowledge Representation, Inference, and Information Extraction
204. YAGO
• Ontology Representation
– Entities and Relations of public interest
– Formats: TSV, RDF, XML, N3, Web Interface
– Learns
• Instances and patterns from Wikipedia;
• Taxonomy from WordNet;
• Geotag information from GeoNames.
205. YAGO
• Named Entity Resolution/Extraction [Theobald & Weikum, 2012]
– Based on rules and patterns extracted from Wikipedia
– Disambiguation is a relevant issue
– Semi-structured data: the “Low-Hanging Fruit”
• Wikipedia infoboxes & categories
• HTML lists & tables, etc.
206. YAGO
• Named Entity Resolution/Extraction, as above, relies on Natural Language Processing and Machine Learning
207. Machine Reading
It’s about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder.
Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short-lived affair with Cecilia, the niece of Henrik. After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth, who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer.
A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill."
This slide was adapted from [Hady et al., 2011]
209. YAGO
• Relation Extraction [Theobald & Weikum, 2012]
– Based on rules and patterns extracted from Wikipedia
– Semi-structured data: the “Low-Hanging Fruit”
• Wikipedia infoboxes & categories
• HTML lists & tables, etc.
210. YAGO
• Relation Extraction, as above, relies on Natural Language Processing and Machine Learning
214. Machine Reading
• The same passage, annotated: repeated mentions of each entity are marked “same”, and relations between entities are labeled uncleOf, owns, hires, and headOf
This slide was adapted from [Hady et al., 2011]
216. YAGO
• YAGO2: Exploring and Querying World Knowledge in Time, Space, Context, and Many Languages
– New relations specifically designed to cover time, space, and context
– Translated Wikipedia pages as sources for other languages
217. YAGO
• YAGO3 [Mahdisoltani, Biega & Suchanek, 2015]
– An extension of the YAGO knowledge base
– Built from the Wikipedias in multiple languages
– Fuses the multilingual information with the English WordNet
– Uses categories, infoboxes, and Wikidata to learn the meaning of infobox attributes across languages
– 10 different languages
– Precision of 95%-100% in the attribute mapping
– Enlarges YAGO by 1M new entities and 7M new facts
218. YAGO
• More on YAGO:
– Very nice tutorials:
• “Knowledge Bases for Web Content Analytics” at WWW 2015, Florence, May 2015
• “Semantic Knowledge Bases from Web Sources” at IJCAI 2011, Barcelona, July 2011
• “Harvesting Knowledge from Web Data and Text” at CIKM 2010, Toronto, October 2010
• “From Information to Knowledge: Harvesting Entities and Relationships from Web Sources” at PODS 2010, Indianapolis, June 2010
– Project Website:
• http://www.mpi-inf.mpg.de/yago-naga/
222. YAGO
• More on YAGO (http://www.mpi-inf.mpg.de/yago-naga/)
?X <hasChild> ?C
?Y <hasChild> ?C
=> ?X <isMarriedTo> ?Y
223. YAGO
• The rule above illustrates Machine Learning and Inference
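The implication rule shown above, (?X hasChild ?C) and (?Y hasChild ?C) => (?X isMarriedTo ?Y), can be forward-chained over a triple store in a few lines. The entity names in this toy knowledge base are made up for illustration:

```python
# Runnable sketch of forward-chaining one Horn-clause implication rule
# over a toy set of (subject, predicate, object) triples.
kb = {
    ("Marie", "hasChild", "Irene"),
    ("Pierre", "hasChild", "Irene"),
    ("Eve", "hasChild", "Abel"),
}

def apply_rule(kb):
    """Return every (x, isMarriedTo, y) supported by a shared child."""
    inferred = set()
    for x, p1, c1 in kb:
        for y, p2, c2 in kb:
            if p1 == p2 == "hasChild" and c1 == c2 and x != y:
                inferred.add((x, "isMarriedTo", y))
    return inferred

print(sorted(apply_rule(kb)))
# [('Marie', 'isMarriedTo', 'Pierre'), ('Pierre', 'isMarriedTo', 'Marie')]
```

Note that the rule can overgenerate (two parents of the same child need not be married), which is one reason automatically learned rules are usually weighted with confidence scores rather than applied as hard logic.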
225. Outline
• Machine Learning
• Machine Reading
• Reading the Web
– DBPedia
– YAGO
– KnowItAll
– NELL
230. KnowItAll
• Motivation: a New Paradigm for Search [Etzioni, 2008]
– The future of Web Search
– Read the Web instead of retrieving Web pages to perform Web Search
231. KnowItAll
• Information Extraction (IE) + tractable inference
– IE(sentence) = who did what?
• speaker(P. Smith, ECMLPKDD2012)
– Inference = uncover implicit information
• Will the Pittsburgh Steelers be champions again?
• Open Information Extraction [Banko et al., 2007]
232. Open Information Extraction [Banko et al., 2007]
• Open IE systems avoid specific nouns and verbs
• Extractors are unlexicalized: formulated only in terms of
– syntactic tokens (e.g., part-of-speech tags)
– closed-word classes (e.g., of, in, such as)
• Open IE extractors focus on generic ways in which relationships are expressed in English
– naturally generalizing across domains
233. Open Information Extraction [Banko et al., 2007]
• Focusing on generic ways of expressing relationships amounts to Relation Discovery
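The idea of an unlexicalized extractor can be made concrete with a small sketch. This is not the actual TextRunner/ReVerb implementation: the POS patterns and the noun-phrase heuristic are drastically simplified, and the input is assumed to be already POS-tagged:

```python
# Hypothetical sketch of an unlexicalized Open IE extractor: relation
# phrases are matched purely on part-of-speech tags (a verb optionally
# followed by a preposition), never on specific words.
def extract(tagged):
    """tagged: list of (token, pos) pairs; returns (arg1, rel, arg2) triples."""
    triples = []
    for i, (tok, pos) in enumerate(tagged):
        if pos.startswith("VB"):                         # any verb form
            rel = [tok]
            j = i + 1
            if j < len(tagged) and tagged[j][1] == "IN":  # verb + preposition
                rel.append(tagged[j][0])
                j += 1
            # crude argument heuristic: nearest nouns on either side
            left = [t for t, p in tagged[:i] if p.startswith("NN")]
            right = [t for t, p in tagged[j:] if p.startswith("NN")]
            if left and right:
                triples.append((left[-1], " ".join(rel), right[0]))
    return triples

sent = [("Cohen", "NNP"), ("was", "VBD"), ("born", "VBN"),
        ("in", "IN"), ("Montreal", "NNP")]
print(extract(sent))
```

Because the pattern mentions only POS tags and closed-class words, the same extractor applies unchanged to sentences about sports, science, or business, which is exactly the cross-domain generalization the slide describes.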
234. Open Information Extraction
• Open IE systems are traditionally based on three steps [Etzioni et al., 2011]:
– 1. Label: Sentences are automatically labeled with extractions using heuristics or distant supervision. Unsupervised Learning
235. Open Information Extraction
– 2. Learn: A relation phrase extractor is learned using a sequence-labeling graphical model (e.g., a CRF). Supervised Learning