5. The Goal
Build a search engine where:
• Document corpus spans multiple languages
• Potentially mixed-language documents
• Queries within a language, or potentially spanning multiple languages
6. NLP Meets Search (Querying)
[Diagram: the query “clinton speaking” passes through the NLP pipeline, producing the terms clinton, speak; those terms are looked up in the inverted index (term → document IDs), where both clinton and speak list document 123.]
7. NLP Meets Search (Indexing)
[Diagram: document 123 (“Bill Clinton spoke about ...”) passes through the NLP pipeline, producing the terms bill, clinton, speak, about; each term is added to the inverted index (term → document IDs), so clinton and speak both list document 123.]
8. NLP Meets Search
[Diagram: both flows together. At index time, document 123 (“Bill Clinton spoke about ...”) runs through the NLP pipeline into the terms bill, clinton, speak, about, which are added to the inverted index. At query time, “clinton speaking” runs through the same pipeline into clinton, speak, which match document 123 in the index.]
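The flow in these diagrams can be sketched in a few lines. The toy example below is not Solr code; the lookup table standing in for the NLP pipeline is a hypothetical placeholder, but it shows why the same normalization must run at both index and query time for “speaking” in the query to match “spoke” in document 123.

```python
from collections import defaultdict

# Hypothetical stand-in for the NLP pipeline: lowercase, split on whitespace,
# and map inflected forms to a canonical form.
CANONICAL = {"spoke": "speak", "speaks": "speak", "speaking": "speak"}

def nlp_pipeline(text):
    return [CANONICAL.get(t, t) for t in text.lower().split()]

index = defaultdict(set)   # inverted index: term -> set of document IDs

def index_document(doc_id, text):
    for term in nlp_pipeline(text):
        index[term].add(doc_id)

def search(query):
    terms = nlp_pipeline(query)
    return set.intersection(*(index[t] for t in terms)) if terms else set()

index_document(123, "Bill Clinton spoke about ...")
print(search("clinton speaking"))   # {123}: both sides normalize to clinton, speak
```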
9. The NLP Pipeline
• Language Detection
• Tokenization
• Decompounding
• Word Form Normalization
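A sketch of how these stages compose (all components here are toy placeholders passed in by the caller, not the actual analyzers discussed on the following slides): language detection runs first, and its result selects the language-specific tokenizer, decompounder, and normalizer.

```python
def analyze(text, detect, tokenizers, decompounders, normalizers):
    """Run the four pipeline stages with per-language components."""
    lang = detect(text)                                # 1. language detection
    tokens = tokenizers[lang](text)                    # 2. tokenization
    tokens = [p for t in tokens
              for p in decompounders[lang](t)]         # 3. decompounding
    return [normalizers[lang](t) for t in tokens]      # 4. word form normalization

# Toy German components, just to show the flow.
print(analyze(
    "Samstagmorgen",
    detect=lambda text: "de",
    tokenizers={"de": str.split},
    decompounders={"de": lambda t: ["Samstag", "morgen"] if t == "Samstagmorgen" else [t]},
    normalizers={"de": str.lower},
))   # ['samstag', 'morgen']
```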
10. Language Detection
• Often required when indexing
• Typically not used at query time
• Lower accuracy on short strings (see the sketch below)
• Sometimes unsolvable even to humans, e.g., named entities
• End user applications often know the query language upstream of the search engine
• No readily available plugin pattern in Solr
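The short-string problem can be seen with a toy profile-based detector (hypothetical word lists, not how any particular detector is implemented): a full sentence contains enough function words to score, but a two-word named-entity query carries essentially no signal.

```python
# Hypothetical function-word profiles per language.
PROFILES = {
    "en": {"the", "of", "and", "to", "in", "about"},
    "es": {"el", "la", "de", "y", "en", "sobre"},
}

def detect(text):
    words = set(text.lower().split())
    scores = {lang: len(words & profile) for lang, profile in PROFILES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None   # None: not enough evidence

print(detect("Bill Clinton spoke about the economy"))  # 'en'
print(detect("Bill Clinton"))                           # None: no usable signal
```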
11. Tokenization
• Breaking text into words
• Particularly difficult with CJK languages
• Find the words:
帰国後ハーバード大学に入学を認められていたもの
(roughly: “... had been granted admission to Harvard University after returning home”)
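Why this is hard: Japanese (like Chinese) has no spaces between words, so the whitespace tokenization that works for most European languages returns the entire phrase as a single term.

```python
text = "帰国後ハーバード大学に入学を認められていたもの"
print(text.split())   # ['帰国後ハーバード大学に入学を認められていたもの'] -- one giant "word"
# Finding terms such as ハーバード大学 ("Harvard University") requires a
# CJK-aware tokenizer, or the bigram fallback shown later.
```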
12. Decompounding
• Breaking compound words into subcomponents
• Common in German, Dutch, Korean
• Samstagmorgen (“Saturday morning”) → Samstag, morgen
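A minimal sketch of dictionary-driven decompounding, assuming a tiny hypothetical lexicon; real decompounders also handle linking elements and ambiguous splits.

```python
VOCAB = {"samstag", "morgen", "staub", "sauger"}   # tiny hypothetical lexicon

def decompound(word, vocab=VOCAB):
    """Greedy left-to-right split of a compound into known dictionary words."""
    word = word.lower()
    parts, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):   # longest known prefix first
            if word[start:end] in vocab:
                parts.append(word[start:end])
                start = end
                break
        else:
            return [word]   # no full split found: keep the original token
    return parts

print(decompound("Samstagmorgen"))  # ['samstag', 'morgen']
print(decompound("Staubsauger"))    # ['staub', 'sauger'] ("vacuum cleaner")
```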
13. Word Form Normalization
• Reduce word form variations to a canonical representation
• Critical for recall
• Two approaches
• Stemming
• Lemmatization
15. Normalization: Lemmatization
• Map words to their dictionary form via morphological analysis
• spoke, speaks, speaking → speak
• Higher precision and recall compared to stemming
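The contrast with stemming can be made concrete: below, a crude suffix-stripping stemmer (a stand-in for algorithmic stemming, not the actual Porter rules) handles the regular forms but misses the irregular spoke, while a tiny lookup-based lemmatizer maps all three forms to speak.

```python
def crude_stem(word):
    """Toy suffix stripping; real stemmers have far more rules and exceptions."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

LEMMAS = {"spoke": "speak", "speaks": "speak", "speaking": "speak"}   # tiny dictionary

def lemmatize(word):
    return LEMMAS.get(word, word)

for w in ("spoke", "speaks", "speaking"):
    print(f"{w:9} stem: {crude_stem(w):6} lemma: {lemmatize(w)}")
# spoke     stem: spoke  lemma: speak   <- stemmer misses the irregular form
# speaks    stem: speak  lemma: speak
# speaking  stem: speak  lemma: speak
```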
16. NLP Meets Search
[Diagram: the same indexing and querying flows as before, now shown happening inside Solr: document 123 (“Bill Clinton spoke about ...”) and the query “clinton speaking” each pass through the NLP pipeline, and the resulting terms (bill, clinton, speak, about; clinton, speak) meet in the inverted index.]
17. NLP Within Solr
• Maximal precision / recall requires NLP pipeline per language
• NLP pipeline (mostly) specified within Solr field type
• Index / query strategies in Solr
• Field per language (routing sketched below)
• Core per language
• A new approach: single multilingual field
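With the field-per-language strategy, the querying application fans the query out over the per-language fields it needs. The sketch below only builds standard Solr request parameters (edismax q/qf); the field names text_en, text_es, text_de are hypothetical and would each be defined with their own language-specific analysis chain. The more languages queried, the more fields each request touches, which is what the latency comparison that follows measures.

```python
from urllib.parse import urlencode

# Hypothetical per-language fields, each with its own analysis chain in the schema.
LANG_FIELDS = {"en": "text_en", "es": "text_es", "de": "text_de"}

def build_solr_params(query, langs):
    """Build edismax parameters that search one field per requested language."""
    return urlencode({
        "defType": "edismax",
        "q": query,
        "qf": " ".join(LANG_FIELDS[lang] for lang in langs),
    })

print(build_solr_params("clinton speaking", ["en", "es"]))
# defType=edismax&q=clinton+speaking&qf=text_en+text_es
```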
26. Approach Comparison: Query Latency
Experimental Setup
• Corpus: Wikipedia across 9 languages (9 million articles)
• Queries: 1000 most frequently used terms for each language, randomized
• JMeter running 1 hour for each of 6 test runs
[Bar chart: average query latency (ms, y-axis 0–160) vs. number of languages queried (1, 4, 9), comparing the field-per-language and core-per-language strategies.]
27. An Alternative Approach
All languages in a single field
• Requires a custom meta field type that applies per-language concrete field type(s)
• Patch submitted to Solr
cf. Solr In Action / Trey Grainger
https://github.com/treygrainger/solr-in-action
28. An Alternative Approach
[Diagram: the query “[en, es]clinton speaking” arrives at the single multilingual field; Solr inspects the [en, es] prefix, applies the English and Spanish field types to “clinton speaking”, and merges the results into the terms clinton, speak, which are looked up in the inverted index (term → document IDs), matching document 123.]
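The idea can be sketched as follows (an illustration of the concept, not the actual patch): a query parser peels off the language prefix, runs the remaining text through each named language’s pipeline (here reduced to hypothetical one-line normalizers), and merges the resulting terms before the index lookup.

```python
import re

# Hypothetical per-language normalizers standing in for full analysis chains.
PIPELINES = {
    "en": lambda text: [{"speaking": "speak"}.get(t, t) for t in text.lower().split()],
    "es": lambda text: text.lower().split(),
}

def parse_multilingual_query(query):
    """Split '[en, es]clinton speaking' into language codes and query text."""
    match = re.match(r"\[([^\]]+)\](.*)", query)
    langs = [lang.strip() for lang in match.group(1).split(",")]
    return langs, match.group(2).strip()

def analyze_query(query):
    langs, text = parse_multilingual_query(query)
    terms = set()
    for lang in langs:
        terms.update(PIPELINES[lang](text))   # apply each language's field type
    return terms                              # merged terms to look up in the index

print(analyze_query("[en, es]clinton speaking"))
# {'clinton', 'speak', 'speaking'} -- each pipeline's output is merged
```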
29. An Alternative Approach
• Results scoring potentially worse than other approaches
• IDF thrown off with single field
• e.g., soy: common in Spanish, relatively rare in English
• Consider a query for “soy dessert recipe” against a corpus of English and Spanish recipes
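A worked example of the IDF skew, using purely hypothetical document counts: adding a large body of Spanish documents to the same field makes soy (“I am”) look like a near stop word, so it contributes little to ranking the English query.

```python
import math

def idf(total_docs, docs_with_term):
    """Classic inverse document frequency: log(N / df)."""
    return math.log(total_docs / docs_with_term)

# Hypothetical counts, for illustration only.
english_only = idf(total_docs=100_000, docs_with_term=500)      # 'soy' rare in English recipes
mixed_field  = idf(total_docs=200_000, docs_with_term=90_500)   # plus ~90,000 Spanish docs using 'soy'

print(f"IDF of 'soy', English-only field: {english_only:.2f}")  # ~5.30
print(f"IDF of 'soy', shared field:       {mixed_field:.2f}")   # ~0.79
```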
30. Enhancing NLP Pipeline
Limitations of NLP in Solr out of the box
• Poor precision / performance of CJK tokenization
• Poor precision / recall of stemmers (no lemmatizers)
• Poor recall due to lack of decompounding
Rosette to the rescue!
31. CJK Tokenization
ケネディはマサチューセッツ
• Rosette: ケネディ, は, マサチューセッツ
• Bigrams: ケネ, ネデ, ディ, ィは, はマ, マサ, サチ, チュ, ュー, ーセ,
セッ, ッツ
• How does this impact precision, recall, index size, speed?
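The bigram fallback is easy to reproduce: the sketch below emits every overlapping pair of characters, which is the general idea behind CJK bigram indexing (plain Python here, not the Lucene filter itself).

```python
def cjk_bigrams(text):
    """Overlapping character bigrams: one term per adjacent character pair."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(cjk_bigrams("ケネディはマサチューセッツ"))
# ['ケネ', 'ネデ', 'ディ', 'ィは', 'はマ', 'マサ', 'サチ', 'チュ', 'ュー', 'ーセ', 'セッ', 'ッツ']
```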