Gianluca Demartini presented on using entities, graphs, and crowdsourcing for better web search. He discussed using crowdsourcing to perform entity linking and disambiguation on web pages through a system called ZenCrowd, which combines algorithmic and manual linking by automating the manual side via crowdsourcing tasks and assessing workers with a probabilistic reasoning framework. He also discussed using entity factor graphs for scientific literature disambiguation, modeling workers, links, clicks, and constraints within a probabilistic framework. The system was experimentally evaluated on news articles from several sources, linked to entities from knowledge bases such as Freebase and DBpedia.
The document discusses various natural language processing and machine learning techniques including sentiment analysis, automated essay scoring, content summarization, chatbots, information retrieval, cluster analysis, language neural networks, and language translation. It provides examples and links to resources on topics like word embeddings, one-hot encoding, the curse of dimensionality, neural networks, and building chatbots. Key points discussed are ensuring applications allow for imperfect accuracy from models and that without data, no machine learning is possible.
The Art of Social Media Analysis with Twitter & Python - Krishna Sankar
The document discusses analyzing social networks and Twitter data using Python. It provides an introduction to analyzing the Twitter network of the user @clouderati, including 2072 followers. The presentation will cover topics like mentions, hashtags, retweets, and constructing a social graph to analyze cliques and networks. It also provides some tips for working with Twitter APIs and building scalable social media analysis pipelines in Python.
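As a rough illustration of the kind of social-graph analysis described above, a small follower network can be loaded into networkx and mined for cliques and influential accounts. A minimal sketch, assuming an invented edge list rather than real @clouderati data:

    # Sketch only: the follower edges are invented, not real @clouderati data.
    import networkx as nx
    from collections import Counter

    G = nx.DiGraph()  # edge (a, b) means "a follows b"
    G.add_edges_from([
        ("alice", "clouderati"), ("bob", "clouderati"),
        ("alice", "bob"), ("bob", "alice"), ("carol", "alice"),
    ])

    # Mutual-follow pairs form the undirected graph used for clique analysis.
    mutual = nx.Graph((a, b) for a, b in G.edges() if G.has_edge(b, a))
    print(list(nx.find_cliques(mutual)))                 # maximal cliques
    print(Counter(dict(G.in_degree())).most_common(3))   # most-followed accounts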
Data Journalism (City Online Journalism wk8) - Paul Bradshaw
The document provides an overview of data journalism including what it is, sources for finding data, and tools for analyzing and visualizing data. It discusses scraping data from websites, using tools like Google searches, spreadsheets, and APIs to extract structured data. Ethical considerations around scraping are also mentioned. The document concludes with assigning students to group blogs and individual strategies focusing on different aspects of online journalism.
The document discusses teaching data ethics in data science education. It provides context about the eScience Institute and a data science MOOC. It then presents a vignette on teaching data ethics using the example of an alcohol study conducted in Barrow, Alaska in 1979. The study had methodological and ethical issues in how it presented results to the community. The document concludes by discussing incorporating data ethics into all of the Institute's data science programs and initiatives like automated data curation and analyzing scientific literature visuals.
The document discusses data, data science, and finding data sources. It defines data as raw facts about the world and notes that data comes from various sources like government, scientific research, citizens, and private companies. It then discusses the growth of digital data and issues around open data. The document defines data science as using analysis methods to describe facts, detect patterns, and test hypotheses. Finally, it provides tips on finding needed data, such as searching open data sources, APIs, scraping, and joining datasets.
Data science remains a high-touch activity, especially in the life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer from curation and quality problems.
Together, these problems motivate a research agenda around “human-data interaction”: understanding and optimizing how people use and share quantitative information.
I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
1) Entity-centric data management stores information at the entity level and integrates information by interlinking entities. This provides advantages over keyword-based and relational database approaches.
2) The XI Pipeline extracts mentions from text and performs named entity recognition, entity linking, and entity typing to associate entities with text.
3) Approaches like ZenCrowd and TRank leverage both algorithms and human computation through crowdsourcing to improve entity linking and fine-grained entity typing.
Ontology-Based Word Sense Disambiguation for Scientific Literature - eXascale Infolab
This document presents an approach for ontology-based word sense disambiguation for scientific literature. It leverages the structure of community-based ontologies to improve sense identification. The approach represents concepts as context vectors based on their relations in documents and ontologies. It evaluates techniques based on minimum distance between concepts in ontologies, shortest path between concepts in ontologies, and neighboring concepts in ontologies. Combining these graph-based models with context vectors achieves the best precision for word sense disambiguation on two scientific datasets.
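To make the combination concrete, here is a minimal sketch of blending context-vector similarity with ontology graph distance; the mini ontology, vectors, and equal weighting are invented for illustration, not taken from the paper:

    # Toy blend of context-vector similarity and ontology graph distance.
    import networkx as nx
    import numpy as np

    onto = nx.Graph([("cell", "biology"), ("cell", "battery"), ("battery", "energy")])
    concept_vecs = {"biology": np.array([0.9, 0.1]), "energy": np.array([0.1, 0.9])}

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def score(candidate, anchor, ctx_vec, alpha=0.5):
        # Shorter ontology paths to the anchor concept raise the score.
        d = nx.shortest_path_length(onto, candidate, anchor)
        return alpha * cosine(ctx_vec, concept_vecs[candidate]) + (1 - alpha) / (1 + d)

    ctx = np.array([0.8, 0.2])  # context suggests the biological sense
    print(max(["biology", "energy"], key=lambda c: score(c, "cell", ctx)))  # biology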
This document presents SANAPHOR, an ontology-based coreference resolution system that improves upon existing approaches by leveraging semantic information. It first links entities in document clusters to semantic types and ontologies. It then splits or merges clusters based on these semantic relationships. The system was evaluated on the CoNLL-2012 dataset, where it improved coreference resolution performance over the baseline Stanford system, particularly for noun clusters. By utilizing semantic knowledge, SANAPHOR demonstrates the benefits of enhancing syntactic coreference resolution with an additional semantic layer.
Thomas Heinis is a post-doctoral researcher in the database group at EPFL. His research focuses on scalable data management algorithms for large-scale scientific applications. Thomas is a part of the "Human Brain Project" and currently works with neuroscientists to develop the data management infrastructure necessary for scaling up brain simulations. Prior to joining EPFL, Thomas completed his Ph.D. in the Systems Group at ETH Zurich, where he pursued research in workflow execution systems as well as data provenance.
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu... - eXascale Infolab
This document proposes three methods - LEXT, REXT, and LERIXT - for disambiguating the domain and range of properties in linked data by using context information. LEXT uses the type of subject resources, REXT uses the type of object resources, and LERIXT uses both. The methods were evaluated against expert judgments and achieved up to 96.5% precision for LEXT and 91.4% for REXT. LERIXT generated too many new sub-properties.
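The underlying intuition is easy to demonstrate: a property's domain can be guessed from the most frequent type among its subject resources, and its range from its object resources. A toy sketch with invented triples and types, not the paper's actual algorithm:

    # Infer a property's domain/range from subject/object types (toy data).
    from collections import Counter

    triples = [("Rome", "capitalOf", "Italy"),
               ("Paris", "capitalOf", "France"),
               ("Bern", "capitalOf", "Switzerland")]
    types = {"Rome": "City", "Paris": "City", "Bern": "City",
             "Italy": "Country", "France": "Country", "Switzerland": "Country"}

    def infer(prop, position):  # position 0 = subject (domain), 2 = object (range)
        counts = Counter(types[t[position]] for t in triples if t[1] == prop)
        return counts.most_common(1)[0][0]

    print(infer("capitalOf", 0), infer("capitalOf", 2))  # -> City Country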
Over the last few years we have observed the emergence of hybrid human-machine information systems that are able both to scale over large amounts of data and to maintain the high-quality data processing intrinsic to human intelligence.
In this talk I will focus on the use of human intelligence at scale, by means of crowdsourcing, to deal with Big Data problems. We will look specifically at how to deal with variety in data by means of Human Computation while still operating on large data volumes.
First, I will introduce the area of micro-task crowdsourcing and provide an overview of the research challenges that need to be tackled to enable large-scale hybrid human-machine information systems. Next, I will give examples of such hybrid systems for entity linking and disambiguation using crowdsourcing and a graph of linked entities as a background corpus. I will describe how keyword query understanding can be crowdsourced to build search engines that can answer rare, complex queries. Finally, I will present new techniques that improve the quality of crowdsourced information system components by means of push crowdsourcing.
Big Data Analysis: Deciphering the haystack - Srinath Perera
A primary outcome of Big Data is deriving useful and actionable insights from large or challenging data collections. The goal is to run the transformations from data, to information, to knowledge, and finally to insights. This ranges from calculating simple analytics like mean, max, and median, to deriving an overall understanding of data by building models, and finally to deriving predictions from data. In some cases we can afford to wait to collect and process the data, while in other cases we need to know the outputs right away. MapReduce has been the de facto standard for data processing, and we will start our discussion from there. However, that is only one side of the problem. Other technologies like Apache Spark and Apache Drill are gaining ground, as are real-time processing technologies like Stream Processing and Complex Event Processing. Finally, there is a lot of work on porting decision technologies like machine learning into the big data landscape. This talk discusses big data processing in general and looks at each of these technologies, comparing and contrasting them.
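The MapReduce model mentioned above boils down to a map, a shuffle, and a reduce step, and the data flow can be mimicked in plain Python. This word-count toy is only meant to show the flow, not distributed execution:

    # Word count as map -> shuffle -> reduce, in plain Python (no cluster).
    from collections import defaultdict

    docs = ["big data is big", "data beats opinions"]

    mapped = [(w, 1) for doc in docs for w in doc.split()]   # map: emit (word, 1)

    groups = defaultdict(list)                               # shuffle: group by key
    for word, count in mapped:
        groups[word].append(count)

    counts = {word: sum(vals) for word, vals in groups.items()}  # reduce
    print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'beats': 1, 'opinions': 1}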
Scientific Software Challenges and Community Responses - Daniel S. Katz
a talk given at RTI International on 7 December 2015, discussing 12 scientific software challenges and how the scientific software community is responding to them
ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for... - eXascale Infolab
The document describes ZenCrowd, a system that leverages probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. It combines both algorithmic and manual linking approaches. ZenCrowd automates manual linking via crowdsourcing by breaking it into micro-tasks and dynamically assessing human workers with a probabilistic model. An experimental evaluation on news articles demonstrates that the combined approach of algorithmic matching and crowdsourcing achieves higher precision and recall than using either technique alone.
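The aggregation idea at the heart of such a system, weighting each worker's vote by an estimated reliability, can be sketched in a few lines. This is a much-simplified stand-in for ZenCrowd's factor-graph inference, with invented workers, votes, and reliabilities:

    # Reliability-weighted vote aggregation (simplified; invented data).
    from collections import defaultdict

    reliability = {"w1": 0.9, "w2": 0.6, "w3": 0.55}   # e.g. from gold questions
    votes = {"w1": "dbpedia:Fribourg", "w2": "dbpedia:Fribourg",
             "w3": "dbpedia:Canton_of_Fribourg"}       # candidate links chosen

    scores = defaultdict(float)
    for worker, entity in votes.items():
        scores[entity] += reliability[worker]

    best, score = max(scores.items(), key=lambda kv: kv[1])
    print(best, round(score / sum(scores.values()), 2))  # link + confidence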
This document provides an overview of offensive open-source intelligence (OSINT) techniques. It defines OSINT and discusses the differences between offensive and defensive OSINT approaches. Offensive OSINT focuses on gathering as much public information as possible to facilitate an attack against a target. The document outlines the OSINT process and details specific techniques for harvesting data from public sources, including scraping websites, using APIs, searching social media, analyzing images and metadata, and researching infrastructure components like IP addresses, domains, and software versions. The goal of offensive OSINT is to discover valuable information like employee emails, usernames, relationships, locations and technical vulnerabilities to enable attacks like phishing, social engineering, and infiltration.
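As a small taste of the harvesting step, a public page can be fetched and mined for e-mail addresses with standard libraries. The URL is a placeholder, and such scraping should only be run against targets you are authorized to assess:

    # Minimal harvesting sketch: extract e-mail addresses from a public page.
    import re
    import requests

    resp = requests.get("https://example.com/contact", timeout=10)  # placeholder URL
    emails = set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", resp.text))
    print(sorted(emails))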
(1) Amundsen is a data discovery platform developed by Lyft to help users find, understand, and use data.
(2) The platform addresses challenges around data discovery such as lack of understanding about what data exists and where to find it.
(3) Amundsen provides searchable metadata about data resources, previews of data, and usage statistics to help data scientists and others explore and understand data.
Smarter search drives value to your business. Delivering search that matches users to the right content is what you care about. But organizations often get stuck getting there. It turns out that you need quite a number of very different ingredients to deliver tremendous search. It can make your head spin! To help you think through where your team is on its road to smarter search, Pugh introduces the maturity model used by OpenSource Connections and walks you through a very concrete method to inventory needed skills and translate that into search roles for your team. He shows how to measure your capabilities in key areas of search to drive better ROI from search.
Mark Dehmlow, Head of the Library Web Department at the University of Notre Dame
At the University of Notre Dame, we recently implemented a new website in concert with rolling out a “next generation” OPAC into production for our campus. While much of the pre-launch feedback was positive, once we implemented the new systems, we started receiving a small number of intense criticisms and a small wave of problem reports. This presentation covers how to plan for big technology changes, prepare your organization, effectively manage the barrage of post-implementation technical problems, and mitigate customer concerns and criticisms. Participants are encouraged to bring brief war stories, anecdotes, and suggestions for managing technology implementations.
Slides from my talk on Personalised Access to Linked Data. Presented at the EKAW 2014 conference. The poster to this paper won the best poster award at the conference!
OpenNeuro: a free online platform for sharing and analysis of neuroimaging data - Krzysztof Gorgolewski
This document describes OpenNeuro, a free online platform for sharing and analyzing neuroimaging data. OpenNeuro allows users to upload, manage, and share brain imaging data using the Brain Imaging Data Structure. It also enables running predefined analysis pipelines called Apps on uploaded data using software containers. This facilitates reproducible analysis. The document discusses OpenNeuro's features, available analysis pipelines, open source components, and vision for facilitating data sharing and reproducibility in neuroimaging research.
Wimmics Research Team 2015 Activity Report - Fabien Gandon
Extract of the activity report of the Wimmics joint research team between Inria Sophia Antipolis - Méditerranée and I3S (CNRS and Université Nice Sophia Antipolis). Wimmics stands for web-instrumented man-machine interactions, communities and semantics. The team focuses on bridging social semantics and formal semantics on the web.
This document provides a summary of using Splunk for data science. Splunk can be used for tasks like trend forecasting, anomaly detection, sentiment analysis, and market segmentation. It integrates data from various sources and allows querying and visualizing data. Splunk complements other data science tools by executing scripts from R and Python. Effective data visualizations are also important for communicating insights from data.
LaGatta and de Garrigues - Splunk for Data Science - .conf2014 - Tom LaGatta
Splunk is well-suited for data science tasks like trend forecasting, anomaly detection, and sentiment analysis. It can integrate diverse data sources and contains algorithms for prediction and classification. Data scientists can also leverage R, Python, and other tools through Splunk apps and APIs to perform advanced analysis like predictive modeling and customized visualizations.
Data scientists utilize a variety of tools and techniques to obtain insights from data. In this session, we discuss where and how Splunk fits into the data scientist's tool belt. We highlight Splunk’s built-in statistical capabilities and integrate external statistical and graphical tools to showcase data preparation, predictive modeling and visualization.
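One common integration path is Splunk's REST search API, which external Python code can call directly. A minimal sketch; host, credentials, and the index name are placeholders:

    # Run a Splunk search from Python via the REST export endpoint.
    import requests

    resp = requests.post(
        "https://splunk.example.com:8089/services/search/jobs/export",
        auth=("admin", "changeme"),                     # placeholder credentials
        data={"search": "search index=main | head 5", "output_mode": "json"},
        verify=False,   # self-signed certs are common on the management port
        stream=True,
    )
    for line in resp.iter_lines():                      # streamed JSON results
        if line:
            print(line.decode())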
Designing a synergistic relationship between undergraduate Data Science educa... - Ciera Martinez
Biodiversity data is extremely approachable – the concept of a specimen existing in time and place is easy to grasp and interesting to a wide range of people. I exploited this inherent feature of Biodiversity data to create an educational framework for teaching undergraduate Data Science. The project utilized Discovery Learning theory, based on the belief that it is best for learners to discover facts and relationships for themselves. Students were given a choice of databases and were mentored through an entire data analysis pipeline, including gathering, cleaning, analyzing, and visualizing the data. Their work culminates in a tutorial posted online (curiositydata.org) – instilling proper documentation, open science, and data management techniques. These tutorials can then be used and remixed as documentation for the databases, curricula, and workshops detailing how to access and analyze the databases' data. Increased documentation will overcome the accessibility challenges that plague many Biodiversity databases, with the overarching aim of increasing the usage, and in turn the value, of these vital data resources. Computer Science, Statistics, and Biology undergraduates are increasingly "data literate", and if mentored properly, we can foster a symbiotic relationship between real-world Data Science education and increased usability of Biodiversity databases.
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
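Those steps compress into a familiar scikit-learn skeleton; the bundled iris dataset below stands in for real data gathering:

    # Gather -> prepare -> build -> validate, in miniature with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)                               # gather
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=0)       # prepare
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)  # build
    print(accuracy_score(y_te, model.predict(X_te)))                # validate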
Putting Linked Data to Use in a Large Higher-Education Organisation - Mathieu d'Aquin
The document discusses using linked data in a large higher education organization. It describes building a linked data platform for the Open University containing course, publication, media, and other university data. Several applications were developed using this linked data including a study tool, research evaluation support, and community/media analytics. Key lessons learned include the potential for simple yet useful applications, rapid development, and challenges of dealing with incomplete or heterogeneous data without application-specific assumptions. Overall, the experiences highlight both opportunities and common pitfalls of interacting with linked data at scale in a large organization.
"Big Data" is a term heard more and more in industry – but what does it really mean? There is a vagueness to the term reminiscent of the early days of cloud computing. This has led to a number of implications for various industries and enterprises, ranging from identifying the actual skills needed to recruit talent to articulating the requirements of a "big data" project. Secondary implications include difficulties in finding solutions that are appropriate to the problems at hand – versus solutions looking for problems. This presentation takes a look at Big Data and offers the audience some considerations they may use immediately to assess the use of analytics in solving their problems.
The talk begins with an idea of how big "Big Data" can be. This leads to an appreciation of how important "Management Questions" are to assessing analytic needs. The fields of data and analysis have become extremely important and impact nearly all facets of life and business. During the talk we will look at the two pillars of Big Data – Data Warehousing and Predictive Analytics. Then we will explore the open source tools and datasets available to NATO action officers working in this domain. Use cases relevant to NATO will be explored with the purpose of showing where analytics lies hidden within many of the day-to-day problems of enterprises. The presentation will close with a look at the future. Advances in the area of semantic technologies continue. The much-acclaimed consultants at Gartner listed Big Data and Semantic Technologies as the first- and third-ranked top technology trends to modernize information management in the coming decade. They note there is incredible value "locked inside all this ungoverned and underused information." HQ SACT can leverage this powerful analytic approach to capture requirement trends when establishing acquisition strategies, monitor Priority Shortfall Areas, prepare solicitations, and retrieve meaningful data from archives.
Similar to Entities, Graphs, and Crowdsourcing for better Web Search (20)
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction - eXascale Infolab
1) The document presents HINGE, a new method for embedding hyper-relational knowledge graphs that aims to better capture information from facts containing multiple relations and entities.
2) HINGE uses a CNN to learn representations from base triplets and their associated key-value pairs to characterize the plausibility of facts.
3) An evaluation on link prediction tasks shows HINGE outperforms baselines and demonstrates that the triplet structure encodes essential information, while other representations discard important information.
Representation Learning on Graphs with Complex Structures
Invited talk, Deep Learning for Graphs and Structured Data Embedding Workshop
WWW2019, San Francisco, May 13, 2019
A force directed approach for offline gps trajectory map - eXascale Infolab
SIGSPATIAL 2018 paper
A Force-Directed Approach for Offline GPS Trajectory Map Matching
Efstratios Rappos (University of Applied Sciences of Western Switzerland (HES-SO)),
Stephan Robert (University of Applied Sciences of Western Switzerland (HES-SO)),
Philippe Cudré-Mauroux (University of Fribourg)
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit... - eXascale Infolab
This document proposes HistoSketch, a method for sketching streaming histograms that preserves similarity and adapts to concept drift. It works by:
1) Generating weighted samples from histograms such that the probability two sketches match equals histogram similarity.
2) Incrementally updating sketches using a weight decay factor to forget older data and adapt to drift over time.
3) Evaluating HistoSketch on classification tasks involving synthetic and real-world streaming data, finding it approximates histogram similarity well using small, fixed-size sketches while adapting rapidly to drift.
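A crude illustration of these ideas: draw each sketch slot by weighted sampling (the exponential-variates trick), so that the probability two slots agree tracks histogram similarity, and decay weights to forget old data. This is a simplification, not the paper's consistent weighted sampling scheme:

    # Simplified similarity-preserving histogram sketch with decay.
    import hashlib
    import math

    def h(seed, key):
        # Deterministic hash of (seed, key) mapped into (0, 1].
        x = int(hashlib.sha1(f"{seed}:{key}".encode()).hexdigest(), 16)
        return (x % (2**32) + 1) / 2**32

    def sketch(hist, k=64):
        # Slot i samples an element with probability proportional to weight.
        return tuple(min(hist, key=lambda e: -math.log(h(i, e)) / hist[e])
                     for i in range(k))

    def decay(hist, factor=0.9):        # forget older observations
        return {e: w * factor for e, w in hist.items()}

    def similarity(s1, s2):             # fraction of agreeing slots
        return sum(a == b for a, b in zip(s1, s2)) / len(s1)

    print(similarity(sketch({"a": 5.0, "b": 1.0}), sketch(decay({"a": 4.0, "b": 2.0}))))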
This document presents SwissLink, a high-precision context-free entity linking system. It extracts unambiguous surface forms (labels) from knowledge bases like DBpedia and Wikipedia to link entity mentions without context. It catalogs the surface forms, removes ambiguous ones using ratio and percentile methods, and performs fast string matching to link mentions. Evaluation on 30 Wikipedia articles shows the percentile-ratio method achieves over 95% precision and 45% recall, balancing precision and recall.
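The unambiguous-surface-form idea can be mocked up directly; the anchor counts and the 0.9 ratio threshold below are invented:

    # Context-free linking via unambiguous surface forms (invented counts).
    counts = {
        "Barack Obama": {"dbpedia:Barack_Obama": 980, "dbpedia:Obama_(film)": 3},
        "Paris": {"dbpedia:Paris": 600, "dbpedia:Paris,_Texas": 350},
    }

    catalog = {}
    for form, ents in counts.items():
        entity, n = max(ents.items(), key=lambda kv: kv[1])
        if n / sum(ents.values()) >= 0.9:   # ratio method: drop ambiguous forms
            catalog[form] = entity

    text = "Barack Obama visited Paris yesterday."
    print({f: e for f, e in catalog.items() if f in text})  # only the safe link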
The document proposes a novel crowdsourcing system architecture and scheduling algorithm to address job starvation in multi-tenant crowd-powered systems. The architecture introduces HIT-Bundles to group heterogeneous tasks and control task serving. The Worker Conscious Fair Scheduling algorithm balances fairness and priority while minimizing worker context switching between tasks. Experiments on Amazon Mechanical Turk show the approach increases throughput over baseline schedulers and adapts to varying workforce levels and job priorities.
Efficient, Scalable, and Provenance-Aware Management of Linked Data - eXascale Infolab
The proliferation of heterogeneous Linked Data on the Web requires data management systems to constantly improve their scalability and efficiency. Despite recent advances in distributed Linked Data management, efficiently processing large amounts of Linked Data in a scalable way is still very challenging. In spite of their seemingly simple data models, Linked Data actually encode rich and complex graphs mixing both instance and schema level data. At the same time, users are increasingly interested in investigating or visualizing large collections of online data by performing complex analytic queries. The heterogeneity of Linked Data on the Web also poses new challenges to database systems. The capacity to store, track, and query provenance data is becoming a pivotal feature of Linked Data Management Systems. In this thesis, we tackle issues revolving around processing queries on big, unstructured, and heterogeneous Linked Data graphs.
This document summarizes a presentation given at SSSW 2015 on making sense of semantic data. It discusses challenges in understanding semantic web data, including a "language gap" between semantic web languages like SPARQL and natural language. It presents an approach to bridging this gap through automatically verbalizing SPARQL queries in English. Evaluation results show this helps non-experts understand queries better and faster than the SPARQL format. It also discusses the "semantic gap" caused by mismatches between a question's semantics and a knowledge graph, and presents an approach using templates to generate SPARQL queries from natural language questions.
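At its simplest, verbalization pairs a query shape with an English template. The single pattern and wording handled by this toy are invented, far short of the system described:

    # Toy template-based SPARQL verbalizer for one query shape.
    import re

    def verbalize(sparql):
        m = re.search(r"SELECT \?(\w+) WHERE \{ \?\1 :(\w+) :(\w+) \}", sparql)
        if not m:
            return "Query shape not supported by this toy verbalizer."
        var, prop, obj = m.groups()
        return f"Find every {var} whose '{prop}' is {obj}."

    print(verbalize("SELECT ?city WHERE { ?city :capitalOf :Switzerland }"))
    # -> Find every city whose 'capitalOf' is Switzerland.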
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data - eXascale Infolab
Uduvudu exploits the semantic and structured nature of Linked Data to generate the best possible representation for a human, based on a catalog of available Matchers and Templates. Matchers and Templates are designed so that they can be built through an intuitive editor interface.
Executing Provenance-Enabled Queries over Web Data - eXascale Infolab
The proliferation of heterogeneous Linked Data on the Web poses new challenges to database systems. In particular, because of this heterogeneity, the capacity to store, track, and query provenance data is becoming a pivotal feature of modern triple stores. In this paper, we tackle the problem of efficiently executing provenance-enabled queries over RDF data. We propose, implement and empirically evaluate five different query execution strategies for RDF queries that incorporate knowledge of provenance. The evaluation is conducted on Web Data obtained from two different Web crawls (The Billion Triple Challenge, and the Web Data Commons). Our evaluation shows that using an adaptive query materialization execution strategy performs best in our context. Interestingly, we find that because provenance is prevalent within Web Data and is highly selective, it can be used to improve query processing performance. This is a counterintuitive result as provenance is often associated with additional overhead.
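In RDF terms, provenance often rides along as the fourth element of a quad; rdflib's Dataset makes the pattern easy to demonstrate. The graph name and triple below are invented examples, not the paper's setup:

    # Tracking provenance as the fourth element of RDF quads with rdflib.
    from rdflib import Dataset, Literal, URIRef

    ds = Dataset()
    crawl = ds.graph(URIRef("urn:crawl:btc"))   # one named graph per source
    crawl.add((URIRef("urn:Fribourg"), URIRef("urn:population"), Literal(38000)))

    # Every query result carries the graph (i.e. the provenance) it came from.
    for s, p, o, g in ds.quads((None, None, None, None)):
        print(s, p, o, "from", g)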
Micro-task crowdsourcing is rapidly gaining popularity among research communities and businesses as a means to leverage Human Computation in their daily operations. Unlike any other service, a crowdsourcing platform is in fact a marketplace subject to human factors that affect its performance, both in terms of speed and quality. Indeed, such factors shape the dynamics of the crowdsourcing market. For example, a known behavior of such markets is that increasing the reward of a set of tasks would lead to faster results. However, it is still unclear how different dimensions interact with each other: reward, task type, market competition, requester reputation, etc.
In this paper, we adopt a data-driven approach to (A) perform a long-term analysis of a popular micro-task crowdsourcing platform and understand the evolution of its main actors (workers, requesters, and platform). (B) We leverage the main findings of our five-year log analysis to propose features used in a predictive model aimed at determining the expected performance of any batch at a specific point in time. We show that the number of tasks left in a batch and how recent the batch is are two key features of the prediction. (C) Finally, we conduct an analysis of the demand (new tasks posted by requesters) and supply (number of tasks completed by the workforce) and show how they affect task prices on the marketplace.
CIKM14: Fixing grammatical errors by preposition ranking - eXascale Infolab
The detection and correction of grammatical errors still represent very hard problems for modern error-correction systems. As an example, the top-performing systems at the preposition correction challenge CoNLL-2013 only achieved an F1 score of 17%.
In this paper, we propose and extensively evaluate a series of approaches for correcting prepositions, analyzing a large body of high-quality textual content to capture language usage. Leveraging n-gram statistics, association measures, and machine learning techniques, our system is able to learn which words or phrases govern the usage of a specific preposition. Our approach makes heavy use of n-gram statistics generated from very large textual corpora. In particular, one of our key features is the use of n-gram association measures (e.g., Pointwise Mutual Information) between words and prepositions to generate better aggregated preposition rankings for the individual n-grams.
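The association-measure feature at the heart of this ranking is easy to state: score each candidate preposition for a slot by its PMI with the governing word. The counts below are invented stand-ins for statistics from a large corpus:

    # Rank prepositions for "interested __ science" by pointwise mutual information.
    import math

    TOTAL = 1_000_000                       # total bigram observations (invented)
    pair = {("interested", "in"): 900, ("interested", "on"): 40,
            ("interested", "at"): 10}
    word = {"interested": 1_000, "in": 50_000, "on": 30_000, "at": 20_000}

    def pmi(w, prep):
        p_pair = pair[(w, prep)] / TOTAL
        return math.log2(p_pair / ((word[w] / TOTAL) * (word[prep] / TOTAL)))

    print(sorted(["in", "on", "at"], key=lambda p: pmi("interested", p),
                 reverse=True))             # -> ['in', 'on', 'at']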
We evaluate the effectiveness of our approach using cross-validation with different feature combinations and on two test collections created from a set of English language exams and StackExchange forums. We also compare against state-of-the-art supervised methods. Experimental results on the CoNLL-2013 test collection show that our approach to preposition correction achieves ~30% F1, a 13% absolute improvement over the best-performing approach at that challenge.
OLTPBenchmark is a multi-threaded load generator. The framework is designed to be able to produce variable rate, variable mixture load against any JDBC-enabled relational database. The framework also provides data collection features, e.g., per-transaction-type latency and throughput logs.
Together with the framework we provide the following OLTP/Web benchmarks:
TPC-C
Wikipedia
Synthetic Resource Stresser
Twitter
Epinions.com
TATP
AuctionMark
SEATS
YCSB
JPAB (Hibernate)
CH-benCHmark
Voter (Japanese "American Idol")
SIBench (Snapshot Isolation)
SmallBank
LinkBench
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series) - eXascale Infolab
Internet Infrastructures for Big Data
Talk given at Verisign's Distinguished Speaker Series, 2014
Prof. Philippe Cudre-Mauroux
eXascale Infolab
http://exascale.info/
This document discusses a project called MEM0R1ES that aims to automatically organize a person's digital information from various devices and online services to generate useful digital memories. The project develops techniques for entity search, typing, clustering, and elicitation to extract, integrate and expose personal information from heterogeneous graphs. It has produced several open-source software components and published results in top conferences. The document outlines current research directions and concludes that the project addresses important societal issues through stimulating collaboration between institutions.
Crowdsourcing is useful for curating information about tail entities, which are less popular entities like local restaurants, niche sports, or emerging music bands. Targeted crowdsourcing platforms like Pick-A-Crowd aim to match tasks to workers who can provide higher quality answers by considering a worker's social profile and task context. Transactive search uses the knowledge of crowds to reconstruct memories and answer questions by targeting the right people to search sources like Twitter photos or event attendees.
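The routing intuition, sending a task to the workers whose profiles best overlap with it, can be shown with simple set overlap. The profiles and task keywords are invented, and Pick-A-Crowd's actual matching is far richer:

    # Route a task to the best-matching workers by profile overlap (toy data).
    profiles = {
        "w1": {"football", "serie-a", "juventus"},
        "w2": {"indie-rock", "concerts", "vinyl"},
        "w3": {"restaurants", "street-food", "rome"},
    }
    task = {"rome", "restaurants", "trattoria"}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    ranked = sorted(profiles, key=lambda w: jaccard(task, profiles[w]), reverse=True)
    print(ranked[0])  # -> w3, the best-matching worker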
This document discusses the evolution of cluster computing and resource management. It describes how:
1) Early clusters were single-purpose and used technologies like MapReduce. General purpose cluster OSes like YARN emerged to allow multiple applications on a cluster.
2) YARN improved on Hadoop by decoupling the programming model from resource management, allowing more flexibility and better performance/availability.
3) REEF aims to further improve frameworks by factoring out common functionalities around communication, configuration, and fault tolerance.
Introduction of Cybersecurity with OSS at Code Europe 2024 - Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers - akankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Best 20 SEO Techniques To Improve Website Visibility In SERP - Pixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Dive into the realm of operating systems (OS) with Pravash Chandra Das, a seasoned Digital Forensic Analyst, as your guide. 🚀 This comprehensive presentation illuminates the core concepts, types, and evolution of OS, essential for understanding modern computing landscapes.
Beginning with the foundational definition, Das clarifies the pivotal role of OS as system software orchestrating hardware resources, software applications, and user interactions. Through succinct descriptions, he delineates the diverse types of OS, from single-user, single-task environments like early MS-DOS iterations, to multi-user, multi-tasking systems exemplified by modern Linux distributions.
Crucial components like the kernel and shell are dissected, highlighting their indispensable functions in resource management and user interface interaction. Das elucidates how the kernel acts as the central nervous system, orchestrating process scheduling, memory allocation, and device management. Meanwhile, the shell serves as the gateway for user commands, bridging the gap between human input and machine execution. 💻
The narrative then shifts to a captivating exploration of prominent desktop OSs, Windows, macOS, and Linux. Windows, with its globally ubiquitous presence and user-friendly interface, emerges as a cornerstone in personal computing history. macOS, lauded for its sleek design and seamless integration with Apple's ecosystem, stands as a beacon of stability and creativity. Linux, an open-source marvel, offers unparalleled flexibility and security, revolutionizing the computing landscape. 🖥️
Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution.
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing. 🌟
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A... - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Main news related to the CCS TSI 2023 (2023/1695) - Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf - flufftailshop
When it comes to unit testing in the .NET ecosystem, developers have a wide range of options available. Among the most popular choices are NUnit, XUnit, and MSTest. These unit testing frameworks provide essential tools and features to help ensure the quality and reliability of code. However, understanding the differences between these frameworks is crucial for selecting the most suitable one for your projects.
Skybuffer SAM4U tool for SAP license adoption - Tatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
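Since the agenda above promises an implementation guide, here is a rough, hedged sketch of what an Atlas vector search call can look like from Python. The connection string, database, collection, index name, field names, and the toy 3-dimensional query vector are all placeholder assumptions, not details from the presentation.

    # Sketch of a MongoDB Atlas $vectorSearch aggregation (names are placeholders).
    from pymongo import MongoClient

    client = MongoClient("mongodb+srv://<user>:<pass>@cluster.example.mongodb.net")
    collection = client["shop"]["products"]

    query_vector = [0.12, -0.03, 0.88]  # embedding of the user query (toy length)

    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",   # Atlas Search index with a vector field
                "path": "embedding",       # field storing document embeddings
                "queryVector": query_vector,
                "numCandidates": 100,      # ANN candidates to consider
                "limit": 5,                # results to return
            }
        },
        {"$project": {"name": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]
    for doc in collection.aggregate(pipeline):
        print(doc)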
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
A Comprehensive Guide to DeFi Development Services in 2024Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leverage that data for RAG and other GenAI use cases, and finally chart your course to production.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Entities, Graphs, and Crowdsourcing for better Web Search
1. Entities, Graphs, and Crowdsourcing for better Web Search
Gianluca Demartini
eXascale Infolab, University of Fribourg, Switzerland
2. Gianluca Demartini
• M.Sc. at University of Udine, Italy
• Ph.D. at University of Hannover, Germany – Entity Retrieval
• Worked for UC Berkeley (on Crowdsourcing), Yahoo! Research (Spain), L3S Research Center (Germany)
• Post-doc at the eXascale Infolab, Uni Fribourg, Switzerland
• Lecturer for Social Computing in Fribourg
• Tutorial on Entity Search at ECIR 2012, on Crowdsourcing at ESWC 2013 and ISWC 2013
• Research Interests – Information Retrieval, Semantic Web, Crowdsourcing
demartini@exascale.info
5. Web of Data
• Freebase
  – Acquired by Google in July 2010
  – Knowledge Graph launched in May 2012
• Schema.org
  – Driven by major search engine companies
  – Machine-readable annotations of Web pages
• Linked Open Data
  – 31 billion triples, Sept. 2011
7. I will talk about
• Entity Linking/Disambiguation
  – On the Web using crowdsourcing
  – For scientific literature using graphs
• Ad-hoc Object Retrieval (Entity Ranking)
  – Using IR and graphs
• Crowdsourced Query Understanding
8. Disclaimer
• No efficiency evaluation
  – Approaches not distributed, but designed to scale out
• No user studies
  – Goal: obtain high-quality data
  – Only TREC-like evaluation of effectiveness
10. Example: RDFa enrichment
Entities: http://dbpedia.org/resource/Facebook ; http://dbpedia.org/resource/Instagram (owl:sameAs jase:Instagram); Google; Android

Original HTML:
<p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>

After RDFa enrichment:
<p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>
11. Crowdsourcing
• Exploit human intelligence to solve tasks
  – Simple for humans, complex for machines
  – With a large number of humans (the Crowd)
  – Small problems: micro-tasks (Amazon MTurk)
• Examples
  – Wikipedia, image tagging
• Incentives
  – Financial, fun, visibility
12. ZenCrowd
• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a probabilistic reasoning framework
(Diagram labels: Crowd, Algorithms, Machines.)
13. ZenCrowd Architecture
(Architecture diagram: HTML pages go through entity extractors and algorithmic matchers, which query an LOD index built over the LOD Open Data Cloud; uncertain matches become micro matching tasks that a micro-task manager publishes on a crowdsourcing platform; workers' decisions feed a probabilistic network and decision engine, and the output is HTML+RDFa pages.)
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on World Wide Web (WWW 2012).
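To make the decision step concrete, here is a minimal Python sketch assuming a simple linear fusion of algorithmic matcher scores with reliability-weighted crowd votes. All names and the fusion weight are illustrative assumptions; the actual system uses a full probabilistic network rather than this simplification.

    # Minimal sketch of ZenCrowd-style decision making (illustrative only):
    # fuse an algorithmic matcher score with crowd votes weighted by each
    # worker's estimated reliability.
    def decide_link(candidates, crowd_votes, worker_reliability, alpha=0.5):
        """candidates: {uri: matcher_score in [0, 1]}
        crowd_votes: list of (worker_id, uri) votes
        worker_reliability: {worker_id: probability of a correct answer}"""
        vote_mass = {uri: 0.0 for uri in candidates}
        total = 0.0
        for worker, uri in crowd_votes:
            w = worker_reliability.get(worker, 0.5)  # unknown workers count as random
            if uri in vote_mass:
                vote_mass[uri] += w
                total += w
        scores = {}
        for uri, matcher_score in candidates.items():
            crowd_score = vote_mass[uri] / total if total > 0 else 0.0
            scores[uri] = alpha * matcher_score + (1 - alpha) * crowd_score
        return max(scores, key=scores.get)

    candidates = {"dbpedia:Instagram": 0.7, "dbpedia:Instagram_(song)": 0.4}
    votes = [("w1", "dbpedia:Instagram"), ("w2", "dbpedia:Instagram"),
             ("w3", "dbpedia:Instagram_(song)")]
    reliability = {"w1": 0.9, "w2": 0.8, "w3": 0.3}
    print(decide_link(candidates, votes, reliability))  # dbpedia:Instagram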
14. Algorithmic Matching
• Inverted index over LOD entities
  – DBPedia, Freebase, Geonames, NYT
• TF-IDF (IR ranking function)
• Top-ranked URIs linked to entities in docs
• Threshold on the ranking function, or top N
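As a rough illustration of this matching step (my own reconstruction with made-up entity labels, not code from the deck), one could build the label index with TF-IDF and keep the top-N candidates above a score threshold:

    # Sketch of TF-IDF candidate matching over entity labels (illustrative).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import linear_kernel

    entities = {
        "http://dbpedia.org/resource/Facebook": "Facebook social network company",
        "http://dbpedia.org/resource/Instagram": "Instagram photo sharing application",
        "http://dbpedia.org/resource/Instagram_(song)": "Instagram song single",
    }
    uris = list(entities)
    vectorizer = TfidfVectorizer()
    index = vectorizer.fit_transform([entities[u] for u in uris])  # the "inverted index"

    def match(mention, top_n=2, threshold=0.1):
        query = vectorizer.transform([mention])
        scores = linear_kernel(query, index).ravel()  # cosine on TF-IDF vectors
        ranked = sorted(zip(uris, scores), key=lambda pair: -pair[1])
        return [(u, s) for u, s in ranked[:top_n] if s >= threshold]

    print(match("Instagram photo application"))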
16. Entity Factor Graphs
• Training phase
  – Initialize worker priors with k matches on known answers
• Updating worker priors
  – Use link decisions as new observations
  – Compute new worker probabilities
• Identify (and discard) unreliable workers
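The slide only names the idea; as a hedged sketch, one simple way to maintain a worker prior is a Beta distribution initialized from k gold questions and updated as link decisions are resolved. The smoothing and the discard threshold below are my assumptions, and the paper's factor-graph machinery is richer than this:

    # Beta-prior sketch of worker reliability (a simplification of the idea).
    class WorkerPrior:
        def __init__(self, correct_on_gold, k):
            # Initialize from k known-answer ("gold") matches, with Laplace smoothing.
            self.alpha = 1 + correct_on_gold
            self.beta = 1 + (k - correct_on_gold)

        def observe(self, was_correct):
            # Each resolved link decision is a new observation.
            if was_correct:
                self.alpha += 1
            else:
                self.beta += 1

        @property
        def reliability(self):
            return self.alpha / (self.alpha + self.beta)

    w = WorkerPrior(correct_on_gold=4, k=5)
    w.observe(True)
    w.observe(False)
    print(round(w.reliability, 2))   # 0.67
    if w.reliability < 0.55:         # threshold is an assumption
        print("discard unreliable worker")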
19. Lessons Learnt
• Crowdsourcing + probabilistic reasoning works!
• But
  – Different worker communities perform differently
  – Many low-quality workers
  – Completion time may vary (based on reward)
• Need to find the right workers for your task (see WWW13 paper)
20. ZenCrowd Summary
• ZenCrowd: probabilistic reasoning over automatic and crowdsourcing methods for entity linking
• Standard crowdsourcing improves 6% over automatic approaches
• 4% to 35% improvement over standard crowdsourcing
• 14% average improvement over automatic approaches
• On-going work:
  – Also used for instance matching across datasets
  – 3-way blocking with the crowd
http://exascale.info/zencrowd/
21. Entity Disambiguation in Scientific Literature
• Using a background concept graph
Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux, Alexey Boyarsky, and Oleg Ruchayskiy. Ontology-Based Word Sense Disambiguation in the Scientific Domain. In: 35th European Conference on Information Retrieval (ECIR 2013).
http://scienceWISE.info/
23. Ad-hoc Object Retrieval
• Once entities have been identified…
• We want to rank them as answers to a query
• AOR
  – Given the description of an entity, give me back its identifier
  – Input: query q, data graph G
  – Output: ranked list of URIs from G
24. A Hybrid Approach to AOR
Alberto Tonon, Gianluca Demartini, and Philippe Cudré-Mauroux. Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval. In: 35th Annual ACM SIGIR Conference (SIGIR 2012).
(Architecture diagram: the user's keyword query is annotated and expanded using WordNet, third-party search engines, and pseudo-relevance feedback; an inverted index over the RDF store (index()/query()) returns intermediate top-k results via entity-search ranking functions; graph traversals (queries on object properties) and neighborhoods (queries on datatype properties) then produce graph-enriched results, merged by a final ranking function.)
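A toy sketch of the hybrid idea, keyword retrieval first and then graph-based enrichment; the data, the additive re-scoring, and the boost parameter are my assumptions rather than the SIGIR paper's actual ranking functions:

    # Toy hybrid AOR: inverted-index top-k, then enrich scores with graph neighbors.
    keyword_scores = {            # pretend output of the inverted index (BM25-like)
        "ex:Forrest_Gump": 3.2,
        "ex:Tom_Hanks": 1.1,
        "ex:Forrest_Gump_(novel)": 2.0,
    }
    graph = {                     # object-property edges in the RDF store
        "ex:Forrest_Gump": ["ex:Tom_Hanks", "ex:Robert_Zemeckis"],
        "ex:Forrest_Gump_(novel)": ["ex:Winston_Groom"],
    }

    def hybrid_rank(scores, graph, k=3, boost=0.5):
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        final = dict(scores)
        for uri in top_k:
            for neighbor in graph.get(uri, []):  # graph traversal step
                # Entities connected to strong keyword matches get boosted,
                # and unseen neighbors enter the result list.
                final[neighbor] = final.get(neighbor, 0.0) + boost * scores[uri]
        return sorted(final.items(), key=lambda pair: -pair[1])

    for uri, score in hybrid_rank(keyword_scores, graph):
        print(f"{score:5.2f}  {uri}")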
27. Summary
• AOR = "Given the description of an entity, give me back its identifier"
• Combining classic IR techniques + a structured database storing graph data
• Significantly better results (up to +25% MAP over a BM25 baseline)
• Overhead caused by the graph traversal part is limited
http://exascale.info/AOR/
33. Motivation
• Web search engines can answer simple factual queries directly on the result page
• Users with complex information needs are often unsatisfied
• Purely automatic techniques are not enough
• We want to solve it with crowdsourcing!
34. CrowdQ
• CrowdQ is the first system that uses crowdsourcing to
  – Understand the intended meaning of a query
  – Build a structured query template
  – Answer the query over Linked Open Data
Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael Franklin. CrowdQ: Crowdsourced Query Understanding. In: 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013).
36. CrowdQ Architecture
(Architecture diagram. Off-line complex query decomposition: queries from a query log are POS- and NER-tagged, and template generation, driven by a crowd manager and a crowdsourcing platform, stores query templates and answer types in a query template index. On-line complex query processing: a user keyword query first hits a complex-query classifier; non-complex queries go to vertical selection and unstructured search, while complex queries are matched against existing query templates, answered by structured LOD search over the LOD Open Data Cloud, and assembled by a result joiner and answer composition into the SERP.)
Off-line: query template generation with the help of the crowd.
On-line: query template matching using NLP and search over open data.
37. Hybrid Human-Machine Pipeline
Example query: Q = "birthdate of actors of forrest gump"
• Query annotation: "birthdate" and "actors" are tagged as nouns, "forrest gump" as a named entity
• Verification (crowd): Is "forrest gump" this entity in the query?
• Entity relations (crowd): Which is the relation between "actors" and "forrest gump"? → starring
• Schema element: starring maps to <dbpedia-owl:starring>
• Verification (crowd): Are the relations Indiana Jones – Harrison Ford and Back to the Future – Michael J. Fox of the same type as Forrest Gump – actors?
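To illustrate how such verification micro-tasks might be generated from the annotated query, here is a sketch; the task wording and data structures are mine, not CrowdQ's:

    # Sketch: turning query annotations into crowd verification micro-tasks.
    annotated = {
        "query": "birthdate of actors of forrest gump",
        "nouns": ["birthdate", "actors"],
        "entities": ["forrest gump"],
    }

    def make_tasks(q):
        tasks = []
        for ent in q["entities"]:
            tasks.append(f'Is "{ent}" the entity meant in the query "{q["query"]}"? (yes/no)')
        for noun in q["nouns"]:
            for ent in q["entities"]:
                tasks.append(f'Which relation holds between "{noun}" and "{ent}"?')
        return tasks

    for task in make_tasks(annotated):
        print(task)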
38. Structured query generation
SELECT ?y ?x WHERE {
  ?y <dbpedia-owl:birthdate> ?x .
  ?z <dbpedia-owl:starring> ?y .
  ?z <rdfs:label> 'Forrest Gump'
}
Results from BTC09 for Q = "birthdate of actors of forrest gump" (template entity slots typed MOVIE).
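For readers who want to try the generated query shape, here is a sketch using the SPARQLWrapper Python library against today's public DBpedia endpoint. Note the live vocabulary differs slightly from the slide (dbo:birthDate and dbo:starring rather than the template's birthdate), and endpoint availability is assumed.

    # Sketch: run a query of the same shape as the generated template against
    # the live DBpedia endpoint (vocabulary adapted to dbo:birthDate, dbo:starring).
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?actor ?birth WHERE {
            ?film rdfs:label "Forrest Gump"@en .
            ?film dbo:starring ?actor .
            ?actor dbo:birthDate ?birth .
        }
    """)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["actor"]["value"], row["birth"]["value"])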
39. Conclusions
• Structured data make Web search better
• Exploit the best out of structured and unstructured data (hybrid AOR)
• The crowd can help in understanding semantics
• Hybrid human-machine systems (ZenCrowd)
• Exploit human intelligence at scale (CrowdQ)
gianlucademartini.net demartini@exascale.info