5. The Goal
Build a search engine where:
• Document corpus spans multiple languages
• Potentially mixed-language documents
• Queries within a language, or potentially spanning multiple languages
6. NLP Meets Search (Querying)
[Diagram: the query “clinton speaking” passes through the NLP pipeline, producing the terms clinton, speak; those terms are looked up in the inverted index (term → document IDs), where both clinton and speak list document 123.]
7. NLP Meets Search (Indexing)
[Diagram: document 123 (“Bill Clinton spoke about ...”) passes through the NLP pipeline, producing the terms bill, clinton, speak, about; each term is added to the inverted index (term → document IDs), so clinton and speak both list document 123.]
8. NLP Meets Search
[Diagram: both flows together. At index time, document 123 (“Bill Clinton spoke about ...”) runs through the NLP pipeline into the terms bill, clinton, speak, about, which are added to the inverted index. At query time, “clinton speaking” runs through the same pipeline into clinton, speak, which match document 123 in the index.]
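The flow in these diagrams can be sketched in a few lines. The toy example below is not Solr code; the lookup table standing in for the NLP pipeline is a hypothetical placeholder, but it shows why the same normalization must run at both index and query time for “speaking” in the query to match “spoke” in document 123.

```python
from collections import defaultdict

# Hypothetical stand-in for the NLP pipeline: lowercase, split on whitespace,
# and map inflected forms to a canonical form.
CANONICAL = {"spoke": "speak", "speaks": "speak", "speaking": "speak"}

def nlp_pipeline(text):
    return [CANONICAL.get(t, t) for t in text.lower().split()]

index = defaultdict(set)   # inverted index: term -> set of document IDs

def index_document(doc_id, text):
    for term in nlp_pipeline(text):
        index[term].add(doc_id)

def search(query):
    terms = nlp_pipeline(query)
    return set.intersection(*(index[t] for t in terms)) if terms else set()

index_document(123, "Bill Clinton spoke about ...")
print(search("clinton speaking"))   # {123}: both sides normalize to clinton, speak
```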
9. The NLP Pipeline
• Language Detection
• Tokenization
• Decompounding
• Word Form Normalization
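A sketch of how these stages compose (all components here are toy placeholders passed in by the caller, not the actual analyzers discussed on the following slides): language detection runs first, and its result selects the language-specific tokenizer, decompounder, and normalizer.

```python
def analyze(text, detect, tokenizers, decompounders, normalizers):
    """Run the four pipeline stages with per-language components."""
    lang = detect(text)                                # 1. language detection
    tokens = tokenizers[lang](text)                    # 2. tokenization
    tokens = [p for t in tokens
              for p in decompounders[lang](t)]         # 3. decompounding
    return [normalizers[lang](t) for t in tokens]      # 4. word form normalization

# Toy German components, just to show the flow.
print(analyze(
    "Samstagmorgen",
    detect=lambda text: "de",
    tokenizers={"de": str.split},
    decompounders={"de": lambda t: ["Samstag", "morgen"] if t == "Samstagmorgen" else [t]},
    normalizers={"de": str.lower},
))   # ['samstag', 'morgen']
```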
10. Language Detection
• Often required when indexing
• Typically not used at query time
• Lower accuracy on short strings (see the sketch below)
• Sometimes unsolvable even to humans, e.g., named entities
• End user applications often know the query language upstream of the search engine
• No readily available plugin pattern in Solr
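The short-string problem can be seen with a toy profile-based detector (hypothetical word lists, not how any particular detector is implemented): a full sentence contains enough function words to score, but a two-word named-entity query carries essentially no signal.

```python
# Hypothetical function-word profiles per language.
PROFILES = {
    "en": {"the", "of", "and", "to", "in", "about"},
    "es": {"el", "la", "de", "y", "en", "sobre"},
}

def detect(text):
    words = set(text.lower().split())
    scores = {lang: len(words & profile) for lang, profile in PROFILES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None   # None: not enough evidence

print(detect("Bill Clinton spoke about the economy"))  # 'en'
print(detect("Bill Clinton"))                           # None: no usable signal
```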
11. Tokenization
• Breaking text into words
• Particularly difficult with CJK languages
• Find the words:
帰国後ハーバード大学に入学を認められていたもの
(roughly: “... had been granted admission to Harvard University after returning home”)
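Why this is hard: Japanese (like Chinese) has no spaces between words, so the whitespace tokenization that works for most European languages returns the entire phrase as a single term.

```python
text = "帰国後ハーバード大学に入学を認められていたもの"
print(text.split())   # ['帰国後ハーバード大学に入学を認められていたもの'] -- one giant "word"
# Finding terms such as ハーバード大学 ("Harvard University") requires a
# CJK-aware tokenizer, or the bigram fallback shown later.
```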
12. Decompounding
• Breaking compound words into subcomponents
• Common in German, Dutch, Korean
• Samstagmorgen (“Saturday morning”) → Samstag, morgen
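A minimal sketch of dictionary-driven decompounding, assuming a tiny hypothetical lexicon; real decompounders also handle linking elements and ambiguous splits.

```python
VOCAB = {"samstag", "morgen", "staub", "sauger"}   # tiny hypothetical lexicon

def decompound(word, vocab=VOCAB):
    """Greedy left-to-right split of a compound into known dictionary words."""
    word = word.lower()
    parts, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):   # longest known prefix first
            if word[start:end] in vocab:
                parts.append(word[start:end])
                start = end
                break
        else:
            return [word]   # no full split found: keep the original token
    return parts

print(decompound("Samstagmorgen"))  # ['samstag', 'morgen']
print(decompound("Staubsauger"))    # ['staub', 'sauger'] ("vacuum cleaner")
```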
13. Word Form Normalization
• Reduce word form variations to a canonical representation
• Critical for recall
• Two approaches
• Stemming
• Lemmatization
15. Normalization: Lemmatization
• Map words to their dictionary form via morphological analysis
• spoke, speaks, speaking → speak
• Higher precision and recall compared to stemming
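The contrast with stemming can be made concrete: below, a crude suffix-stripping stemmer (a stand-in for algorithmic stemming, not the actual Porter rules) handles the regular forms but misses the irregular spoke, while a tiny lookup-based lemmatizer maps all three forms to speak.

```python
def crude_stem(word):
    """Toy suffix stripping; real stemmers have far more rules and exceptions."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

LEMMAS = {"spoke": "speak", "speaks": "speak", "speaking": "speak"}   # tiny dictionary

def lemmatize(word):
    return LEMMAS.get(word, word)

for w in ("spoke", "speaks", "speaking"):
    print(f"{w:9} stem: {crude_stem(w):6} lemma: {lemmatize(w)}")
# spoke     stem: spoke  lemma: speak   <- stemmer misses the irregular form
# speaks    stem: speak  lemma: speak
# speaking  stem: speak  lemma: speak
```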
16. NLP Meets Search
[Diagram: the same indexing and querying flows as before, now shown happening inside Solr: document 123 (“Bill Clinton spoke about ...”) and the query “clinton speaking” each pass through the NLP pipeline, and the resulting terms (bill, clinton, speak, about; clinton, speak) meet in the inverted index.]
17. NLP Within Solr
• Maximal precision / recall requires NLP pipeline per language
• NLP pipeline (mostly) specified within Solr field type
• Index / query strategies in Solr
• Field per language (routing sketched below)
• Core per language
• A new approach: single multilingual field
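With the field-per-language strategy, the querying application fans the query out over the per-language fields it needs. The sketch below only builds standard Solr request parameters (edismax q/qf); the field names text_en, text_es, text_de are hypothetical and would each be defined with their own language-specific analysis chain. The more languages queried, the more fields each request touches, which is what the latency comparison that follows measures.

```python
from urllib.parse import urlencode

# Hypothetical per-language fields, each with its own analysis chain in the schema.
LANG_FIELDS = {"en": "text_en", "es": "text_es", "de": "text_de"}

def build_solr_params(query, langs):
    """Build edismax parameters that search one field per requested language."""
    return urlencode({
        "defType": "edismax",
        "q": query,
        "qf": " ".join(LANG_FIELDS[lang] for lang in langs),
    })

print(build_solr_params("clinton speaking", ["en", "es"]))
# defType=edismax&q=clinton+speaking&qf=text_en+text_es
```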
26. Approach Comparison: Query Latency
Experimental Setup
• Corpus: Wikipedia across 9 languages (9 million articles)
• Queries: 1000 most frequently used terms for each language, randomized
• JMeter running 1 hour for each of 6 test runs
[Bar chart: average query latency (ms, y-axis 0–160) vs. number of languages queried (1, 4, 9), comparing the field-per-language and core-per-language strategies.]
27. An Alternative Approach
All languages in a single field
• Requires a custom meta field type that applies per-language concrete field type(s)
• Patch submitted to Solr
cf. Solr In Action / Trey Grainger
https://github.com/treygrainger/solr-in-action
28. An Alternative Approach
[Diagram: the query “[en, es]clinton speaking” arrives at the single multilingual field; Solr inspects the [en, es] prefix, applies the English and Spanish field types to “clinton speaking”, and merges the results into the terms clinton, speak, which are looked up in the inverted index (term → document IDs), matching document 123.]
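The idea can be sketched as follows (an illustration of the concept, not the actual patch): a query parser peels off the language prefix, runs the remaining text through each named language’s pipeline (here reduced to hypothetical one-line normalizers), and merges the resulting terms before the index lookup.

```python
import re

# Hypothetical per-language normalizers standing in for full analysis chains.
PIPELINES = {
    "en": lambda text: [{"speaking": "speak"}.get(t, t) for t in text.lower().split()],
    "es": lambda text: text.lower().split(),
}

def parse_multilingual_query(query):
    """Split '[en, es]clinton speaking' into language codes and query text."""
    match = re.match(r"\[([^\]]+)\](.*)", query)
    langs = [lang.strip() for lang in match.group(1).split(",")]
    return langs, match.group(2).strip()

def analyze_query(query):
    langs, text = parse_multilingual_query(query)
    terms = set()
    for lang in langs:
        terms.update(PIPELINES[lang](text))   # apply each language's field type
    return terms                              # merged terms to look up in the index

print(analyze_query("[en, es]clinton speaking"))
# {'clinton', 'speak', 'speaking'} -- each pipeline's output is merged
```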
29. An Alternative Approach
• Results scoring potentially worse than other approaches
• IDF thrown off with single field
• e.g., soy: common in Spanish, relatively rare in English
• Consider a query for “soy dessert recipe” against a corpus of English and Spanish recipes
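A worked example of the IDF skew, using purely hypothetical document counts: adding a large body of Spanish documents to the same field makes soy (“I am”) look like a near stop word, so it contributes little to ranking the English query.

```python
import math

def idf(total_docs, docs_with_term):
    """Classic inverse document frequency: log(N / df)."""
    return math.log(total_docs / docs_with_term)

# Hypothetical counts, for illustration only.
english_only = idf(total_docs=100_000, docs_with_term=500)      # 'soy' rare in English recipes
mixed_field  = idf(total_docs=200_000, docs_with_term=90_500)   # plus ~90,000 Spanish docs using 'soy'

print(f"IDF of 'soy', English-only field: {english_only:.2f}")  # ~5.30
print(f"IDF of 'soy', shared field:       {mixed_field:.2f}")   # ~0.79
```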
30. Enhancing NLP Pipeline
Limitations of NLP in Solr out of the box
• Poor precision / performance of CJK tokenization
• Poor precision / recall of stemmers (no lemmatizers)
• Poor recall due to lack of decompounding
Rosette to the rescue!
31. CJK Tokenization
ケネディはマサチューセッツ
• Rosette: ケネディ, は, マサチューセッツ
• Bigrams: ケネ, ネデ, ディ, ィは, はマ, マサ, サチ, チュ, ュー, ーセ,
セッ, ッツ
• How does this impact precision, recall, index size, speed?
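The bigram fallback is easy to reproduce: the sketch below emits every overlapping pair of characters, which is the general idea behind CJK bigram indexing (plain Python here, not the Lucene filter itself).

```python
def cjk_bigrams(text):
    """Overlapping character bigrams: one term per adjacent character pair."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(cjk_bigrams("ケネディはマサチューセッツ"))
# ['ケネ', 'ネデ', 'ディ', 'ィは', 'はマ', 'マサ', 'サチ', 'チュ', 'ュー', 'ーセ', 'セッ', 'ッツ']
```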