From Text to Truth: Real World Facets for Multilingual Search

Lucene/SOLR Revolution 2013
Benson Margulies
Executive Vice President and Chief Technical Officer

Your job is to analyze reciprocal antagonism
between Christian and Islamic extremists across the
globe.
You want to find information on the Internet on
Christian extremist reaction to the killing of the U.S.
Ambassador to Libya.
Motivation

That was a lot of work.
Can text analytics help?
Help?


Filter out pages with the wrong guy?
Filter?


Add some filters (a/k/a facets)…
Filter?



Filter?
Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …

But what can we use as choices?
Filter?
Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …

Entity Extraction (Name Tagging): find names of people, places, and organizations in a document.
In-document Coreference Resolution: group names referring to the same person within a document.
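The in-document grouping step can be illustrated with a deliberately naive sketch. It assumes a name tagger has already produced the list of person mentions (the `doc_mentions` list is hypothetical), and it treats two mentions as the same person when their last tokens match. Real coreference resolvers use much richer evidence; this only shows the shape of the output.

```python
from collections import defaultdict

def group_mentions(mentions):
    """Group one document's person mentions into chains keyed by surname.

    Toy heuristic: two mentions corefer if their last tokens match,
    ignoring case. This is an assumption made for illustration only.
    """
    chains = defaultdict(list)
    for mention in mentions:
        surname = mention.split()[-1].lower()
        chains[surname].append(mention)
    return list(chains.values())

def first_mentions(mentions):
    """Return the first mention of each chain, in document order."""
    return [chain[0] for chain in group_mentions(mentions)]

# Hypothetical tagger output for one document, in order of appearance.
doc_mentions = ["J. Christopher Stevens", "George Little", "Stevens",
                "Ambassador Stevens", "Little"]
chains = group_mentions(doc_mentions)
labels = first_mentions(doc_mentions)
```

The first mention of each chain is usually the fullest form of the name, which makes it a natural candidate for a facet value.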


Choices: first way that each person was mentioned
in each document?
Filter choices?
Filter results by… Persons named: Kris Stephens, Chris Stephens, Dan Cathy, George Little, …

Choices: first name string for each person in each
document?
Filter?
Add filters… Persons named: Dan Cathy, George Little, …
Filtered by… Persons named: Chris Stephens


Problem: Ambiguity – one name, many entities
Filter?
Add filters… Persons named: Dan Cathy, George Little, …
Filtered by… Persons named: Chris Stephens

Problem: Variety – one person, many names
Filter?
Add filters… Persons named: Dan Cathy, George Little, …
Filtered by… Persons named: Chris Stephens

Problem: Variety – one person, many names
Filter?
Add filters… Persons named: Dan Cathy, George Little, …, Chris Stevens, J. Christopher Stevens, …
Filtered by… Persons named: Chris Stephens
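Both problems can be made concrete with a small sketch. The triples below are hypothetical; the third field is ground truth that a facet built on name strings never gets to see.

```python
from collections import Counter, defaultdict

# Hypothetical (doc_id, name string, true person) triples.
mentions = [
    (1, "Chris Stevens", "ambassador"),
    (2, "J. Christopher Stevens", "ambassador"),
    (3, "Chris Stephens", "pastor A"),
    (4, "Chris Stephens", "pastor B"),
    (5, "Kris Stephens", "pastor C"),
]

# Facet counts as a name-string facet would show them.
facet_counts = Counter(name for _, name, _ in mentions)

people_per_name = defaultdict(set)
names_per_person = defaultdict(set)
for _, name, person in mentions:
    people_per_name[name].add(person)
    names_per_person[person].add(name)

# Ambiguity: one name string covering several people.
ambiguous = {n for n, people in people_per_name.items() if len(people) > 1}
# Variety: one person spread over several name strings.
varied = {p for p, names in names_per_person.items() if len(names) > 1}
```

Here the "Chris Stephens" facet value silently mixes two people, while documents about the ambassador are split between two facet values.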

Magically group names by person across
documents.
Deal with ambiguity and variety?
Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …

But there’s still the problem of choices…
Labels for choices?
Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …

Use person’s name from highest ranked doc?
Still some ambiguity.
Labels for choices?
Filter results by… People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …

Entity Resolution: group and also link to a
database of known entities (e.g., Wikipedia).
Labels for choices?
Filter results by… People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …
Filter results by… People: Kris Stephens, J. Christopher Stevens, Chris Stephens, …
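A minimal sketch of the linking step, assuming a tiny in-memory knowledge base. The `KNOWLEDGE_BASE` contents, the `kb:stevens` identifier, and the `link_entity` function are all hypothetical, and the matching rule (alias match plus shared context words) is a stand-in for whatever a real resolver does.

```python
# Hypothetical mini knowledge base: entity id -> label, aliases, context words.
KNOWLEDGE_BASE = {
    "kb:stevens": {
        "label": "J. Christopher Stevens",
        "aliases": {"j. christopher stevens", "chris stevens",
                    "christopher stevens"},
        "context": {"ambassador", "libya"},
    },
}

def link_entity(name, context_words, kb=KNOWLEDGE_BASE):
    """Return the id of the best-matching KB entry, or None if nothing fits.

    A candidate must list the name as an alias and share at least one
    context word with the mention; the largest overlap wins.
    """
    best_id, best_overlap = None, 0
    for entity_id, entry in kb.items():
        if name.lower() not in entry["aliases"]:
            continue
        overlap = len(entry["context"] & set(context_words))
        if overlap > best_overlap:
            best_id, best_overlap = entity_id, overlap
    return best_id
```

A mention that returns `None` is not in the database, so it still needs a label of its own.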

Labels for choices?
For items not in the database, infer a unique label (e.g., for a hypothetical Wikipedia page).
Filter results by… People: Kris Stephens, J. Christopher Stevens, Chris Stephens, …

Filter?
Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor)
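One possible way to build such a label, shown as a sketch only: take the role word seen most often near the entity's mentions and append it in parentheses, in the style of a Wikipedia disambiguation title. The `infer_label` function and its `descriptors` input are assumptions, not the method behind the slides.

```python
from collections import Counter

def infer_label(name, descriptors):
    """Build a display label like 'Name (descriptor)' for an unlinked entity.

    `descriptors` are role words seen near the entity's mentions; the most
    frequent one becomes the qualifier. With no descriptors, the bare
    name is returned.
    """
    if not descriptors:
        return name
    qualifier, _ = Counter(descriptors).most_common(1)[0]
    return f"{name} ({qualifier})"
```

For example, `infer_label("Chris Stephens", ["pastor", "pastor", "speaker"])` gives `Chris Stephens (pastor)`. Two different people can still end up with the same label this way, so a real system would need a further tie-breaker.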

Let’s give it a try…
Filter.
Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …

Filter.
Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …
Filtered by… People: J. Christopher Stevens


On a cross-lingual index, real-world entity facets can open up results across languages, unlike search strings.
Filter.
Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …
Filtered by… People: J. Christopher Stevens
Language: English, Chinese, Arabic
Let’s pretend you’re researching the pastors
instead.
Trading off Errors
Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …

What if you think there are too many (or too few)?
Add a slider to make the filter finer (or coarser).
Trading off Errors
Add filters… People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …
Filtered by… People: Kris Stephens (pastor)

Make the filter finer.
Trading off Errors
Add filters… People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …
Filtered by… People: Kris Stephens (pastor)
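One way to think about the slider is as a similarity threshold for grouping mentions into entities. The sketch below is an assumption about how that could work, not the implementation behind the slides: it compares mentions only by the words around them (hypothetical `contexts`) and clusters greedily.

```python
def jaccard(a, b):
    """Overlap of two context-word sets, from 0.0 to 1.0."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def cluster(contexts, threshold):
    """Greedy clustering: put each mention into the first cluster that holds
    a member at least `threshold` similar, else start a new cluster.

    Returns clusters as lists of mention indices.
    """
    clusters = []
    for i, ctx in enumerate(contexts):
        for members in clusters:
            if any(jaccard(ctx, contexts[j]) >= threshold for j in members):
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical context words around three mentions of pastors named Stephens.
contexts = [
    {"pastor", "church", "sermon"},
    {"pastor", "church", "protest"},
    {"pastor", "church", "sermon", "video"},
]
coarse = cluster(contexts, threshold=0.45)  # one merged entity
fine = cluster(contexts, threshold=0.7)     # mention 1 splits off
```

Raising the threshold trades wrongly merged people for wrongly split ones, which is the error trade-off the slider exposes to the user.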

RNI Similarity Matching “Tamerlan Tsarnaev”
And the problem only gets worse with Multiple Languages

Fuzzy name search in Solr
• Facets are one way to navigate names
o  assume that you've found some interesting data with an ordinary query
o  what if you are having trouble getting started?
• Name-specific comparison search is another
• More complex algorithm than Levenshtein distance on names
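To see why plain edit distance falls short on names, here is the textbook Levenshtein distance applied to names from the earlier slides. The function is the standard dynamic-programming algorithm; treating "Chris Stephens" as a different person from the ambassador follows the example used in this talk.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,         # deletion
                               current[j - 1] + 1,      # insertion
                               previous[j - 1] + cost)) # substitution or match
        previous = current
    return previous[-1]

# Two spellings of the same person's name.
same_person = levenshtein("Chris Stevens", "J. Christopher Stevens")   # 9
# A different person with a similar-looking name.
different_person = levenshtein("Chris Stevens", "Chris Stephens")      # 2
```

The wrong person is far "closer" than the right one, and a name written in another script would share no characters at all.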

Plugging in more complex search
• Open up the 'search component pipeline'
• First component preprocesses query
o  Maps from "Fred Chopin" to a complex Lucene query that looks for possible matches across languages and scripts
• Second component rescores results
o  detailed comparison of pairs of names to derive final score.
• Sad limitation (so far): scores not normalized to ordinary Lucene values
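The two stages can be sketched outside Solr as plain functions. Everything here is illustrative: the `VARIANTS` table is hypothetical (a real preprocessor would generate variants across languages and scripts), the `name` field is an assumed field name, and `difflib.SequenceMatcher` merely stands in for a detailed name comparison.

```python
from difflib import SequenceMatcher

# Hypothetical variant table, for illustration only.
VARIANTS = {
    "fred": ["Fred", "Frédéric", "Fryderyk"],
    "chopin": ["Chopin", "Szopen"],
}

def preprocess(name, field="name"):
    """Stage 1: expand each name token into an OR-group of variants."""
    groups = []
    for token in name.split():
        options = VARIANTS.get(token.lower(), [token])
        groups.append("(" + " OR ".join(options) + ")")
    return f"{field}:(" + " AND ".join(groups) + ")"

def rescore(query_name, candidates):
    """Stage 2: compare the query with each candidate name, best first."""
    scored = [(SequenceMatcher(None, query_name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    return sorted(scored, reverse=True)

lucene_query = preprocess("Fred Chopin")
ranked = rescore("Fred Chopin", ["Fred Astaire", "Frédéric Chopin"])
```

The first stage is cheap and broad so that likely matches are retrieved at all; the second is expensive and only runs on the retrieved candidates. Scores from the stand-in comparison are on their own scale, which mirrors the normalization limitation noted above.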

And it does SolrCloud, too ...
• Preprocessor runs before fan-out to shards
• Rescoring runs out on the shards
• So the work of checking candidate matches is divided up amongst the shards.
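The division of labor can be sketched with shards modeled as plain lists of names. This is a toy model, not SolrCloud code: `name_score` is a hypothetical stand-in scorer, and the functions only illustrate where each step runs.

```python
import heapq

def name_score(query_tokens, name):
    """Stand-in scorer: fraction of query tokens found in the candidate name."""
    name_tokens = set(name.lower().split())
    return sum(t in name_tokens for t in query_tokens) / len(query_tokens)

def search_shard(query_tokens, shard_names, k):
    """Runs on each shard: rescore local candidates, return the local top k."""
    scored = [(name_score(query_tokens, n), n) for n in shard_names]
    return heapq.nlargest(k, scored)

def distributed_search(query, shards, k=2):
    # Preprocessing happens once, before the fan-out.
    query_tokens = query.lower().split()
    partials = [search_shard(query_tokens, shard, k) for shard in shards]
    # The coordinator only merges the small per-shard result lists.
    return heapq.nlargest(k, (hit for part in partials for hit in part))

shards = [["Chris Stevens", "Dan Cathy"],
          ["George Little", "Christopher Stevens"]]
top = distributed_search("chris stevens", shards, k=2)
```

Because each shard returns only its own top results, the costly pairwise comparisons scale out with the number of shards.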

Questions
•  Suggested questions:
– Doesn’t Google already do this?
– Speed? Scale?
– Multi-lingual?
– What other uses are there for entity resolution
beyond faceted search?

Doesn’t Google already do this?

Some, when searching for famous entities.

Speed/Scale
•  Future Plans include scaling experiments
•  Research version:
– tested up to 1m docs
– Sub-second per document
– Incremental updates (i.e., you see documents
published minutes ago)

Other uses for entity resolution?
•  Supporting relationship resolution by resolving
the entities participating in them.
•  Knowledge base population
•  Integrating disparate data sets
•  Alerting
•  Improving relevance of search results
•  Predictive Analytics

For more information:
Visit www.basistech.com
Write to conference@basistech.com
Call 617-386-2090
Thank you!

CONTACT
Benson Margulies
benson@basistech.com
