Optimizing Multilingual Search 
David Troiano 
Principal Software Engineer, Basis Technology 
dtroiano@basistech.com
Talk Overview 
• The problem we’re trying to solve 
• Natural language processing (NLP) 
• Approaches to multilingual search in Solr
A Multilingual Search Example
The Goal 
Build a search engine where: 
• Document corpus spans multiple languages 
• Potentially mixed language documents
• Queries within a language, or potentially spanning multiple languages
NLP Meets Search (Querying) 
query: “clinton speaking” → NLP pipeline → Terms: clinton, speak
Inverted Index
term      document IDs
...       ...
clinton   …, 123, ...
...       ...
speak     …, 123, ...
NLP Meets Search (Indexing) 
Document 123: Bill Clinton spoke about ... → NLP pipeline → Terms: bill, clinton, speak, about
Inverted Index
term      document IDs
...       ...
clinton   …, 123, ...
...       ...
speak     …, 123, ...
NLP Meets Search 
Document 123: Bill Clinton spoke about ... → NLP pipeline → Terms: bill, clinton, speak, about
query: “clinton speaking” → NLP pipeline → Terms: clinton, speak
Inverted Index
term      document IDs
...       ...
clinton   …, 123, ...
...       ...
speak     …, 123, ...
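The indexing and query flow on this slide can be sketched in a few lines of Python. This is only an illustration: `nlp_pipeline` and `TOY_LEMMAS` are hypothetical stand-ins for a real per-language pipeline, and the index is a plain in-memory dictionary rather than anything Solr-specific.

```python
from collections import defaultdict

# Hypothetical stand-in for a real NLP pipeline: lowercase, split on
# whitespace, and map a few word forms to a canonical form.
TOY_LEMMAS = {"spoke": "speak", "speaking": "speak", "speaks": "speak"}


def nlp_pipeline(text):
    """Turn raw text into normalized terms (toy version)."""
    tokens = text.lower().split()
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens]


def index_document(index, doc_id, text):
    """Add every term of the document to the inverted index."""
    for term in nlp_pipeline(text):
        index[term].add(doc_id)


def search(index, query):
    """Return IDs of documents that contain every query term."""
    terms = nlp_pipeline(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


index = defaultdict(set)
index_document(index, 123, "Bill Clinton spoke about")
print(search(index, "clinton speaking"))
```

The point of the sketch is that the same pipeline runs on both sides: "spoke" in the document and "speaking" in the query both become "speak", so the lookup finds document 123.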
The NLP Pipeline 
• Language Detection 
• Tokenization 
• Decompounding 
• Word Form Normalization
Language Detection 
• Often required when indexing 
• Typically not used at query time 
• Lower accuracy on short strings
• Sometimes unsolvable even to humans, e.g., named entities
• End user applications often know query language upstream of search engine
• No readily available plugin pattern in Solr
Tokenization 
• Breaking text into words 
• Particularly difficult with CJK languages 
• Find the words: 帰国後ハーバード大学に入学を認められていたもの
Decompounding 
• Breaking compound words into subcomponents 
• Common in German, Dutch, Korean 
• Samstagmorgen → Samstag, morgen
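A minimal dictionary-based sketch of decompounding, assuming a tiny hypothetical lexicon (`LEXICON`). Real decompounders use full dictionaries and language-specific rules, so treat this only as an illustration of the idea.

```python
# Hypothetical toy lexicon; a real decompounder would use a full dictionary
# plus language-specific rules (for example, German linking elements).
LEXICON = {"samstag", "morgen", "sonntag", "abend"}


def decompound(word, lexicon=LEXICON):
    """Split word into lexicon entries, or return [word] if no split exists."""
    word = word.lower()

    def split(rest):
        if not rest:
            return []
        # Try longer prefixes first so whole words win over fragments.
        for end in range(len(rest), 0, -1):
            prefix = rest[:end]
            if prefix in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [prefix] + tail
        return None  # no way to cover the remaining characters

    parts = split(word)
    return parts if parts else [word]


print(decompound("Samstagmorgen"))
```

Indexing the subcomponents alongside the compound is what lets a query for "Samstag" match a document that only contains "Samstagmorgen".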
Word Form Normalization 
• Reduce word form variations to a canonical representation 
• Critical for recall 
• Two approaches 
• Stemming 
• Lemmatization
Normalization: Stemming 
• Simple rules-based approach 
• “Chop off the end” 
• arsenal, arsenic → arsen
Normalization: Lemmatization 
• Map words to their dictionary form via morphological analysis 
• spoke, speaks, speaking → speak
• Higher precision and recall compared to stemming
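The contrast between the two normalization approaches can be sketched as follows. The suffix list and the lemma dictionary are hypothetical toys; real stemmers (for example, Porter-style) and real morphological analyzers are far more elaborate.

```python
# Toy suffix list and lemma dictionary, both hypothetical.
SUFFIXES = ("ing", "al", "ic", "s")
LEMMAS = {"spoke": "speak", "speaks": "speak", "speaking": "speak"}


def stem(word):
    """Chop the first matching suffix off the end of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


for w in ("arsenal", "arsenic", "spoke", "speaks", "speaking"):
    print(w, stem(w), lemmatize(w))
```

In this sketch the stemmer conflates "arsenal" and "arsenic" (hurting precision) and leaves "spoke" untouched (hurting recall), while the dictionary lookup keeps the first two apart and maps all three verb forms to "speak".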
NLP Meets Search 
[Diagram repeated: Document 123 and the query “clinton speaking” pass through NLP pipelines into the inverted index, now labeled Solr]
NLP Within Solr 
• Maximal precision / recall requires NLP pipeline per language 
• NLP pipeline (mostly) specified within Solr field type 
• Index / query strategies in Solr 
• Field per language
• Core per language
• A new approach: Single multilingual field
Field Per Language 
schema.xml 
<field name="content_cjk" type="text_cjk" indexed="true" stored="true" /> 
<field name="content_eng" type="text_eng" indexed="true" stored="true" /> 
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100"> 
<analyzer> 
<tokenizer class="solr.StandardTokenizerFactory"/> 
<filter class="solr.CJKWidthFilterFactory"/> 
<filter class="solr.CJKBigramFilterFactory"/> 
</analyzer> 
</fieldType> 
query 
http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng
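A client could assemble the field-per-language query like this. It is a sketch using only the Python standard library; the host `http://localhost:8983` and the helper name `field_per_language_url` are assumptions for illustration, while `q`, `defType`, and `qf` are the parameters shown in the URL above.

```python
from urllib.parse import quote, urlencode


def field_per_language_url(base_url, collection, user_query, fields):
    """Build an edismax query that searches every per-language field."""
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": " ".join(fields),  # one entry per language-specific field
    }
    # quote_via=quote encodes spaces as %20 (the default quote_plus uses +).
    query_string = urlencode(params, quote_via=quote)
    return f"{base_url}/solr/{collection}/select?{query_string}"


url = field_per_language_url(
    "http://localhost:8983",  # hypothetical Solr host
    "articles",
    "serie a",
    ["content_cjk", "content_eng"],
)
print(url)
```

Each language added to the corpus means one more entry in `fields`, so the query fans out over more fields as the number of languages grows.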
Core Per Language 
CJK core’s schema.xml 
<field name="content" type="text_cjk" indexed="true" stored="true" 
multiValued="true"/> 
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100"> 
<analyzer> 
<tokenizer class="solr.StandardTokenizerFactory"/> 
<filter class="solr.CJKWidthFilterFactory"/> 
<filter class="solr.CJKBigramFilterFactory"/> 
</analyzer> 
</fieldType> 
query 
http://.../select?q=content:serie%20a&shards=<url>/articles_cjk,<url>/articles_eng
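The core-per-language request differs mainly in the `shards` parameter, which lists one core per language. A sketch, assuming hypothetical shard addresses on `localhost:8983` and a hypothetical helper name:

```python
from urllib.parse import quote, urlencode


def core_per_language_url(select_url, user_query, shard_addresses):
    """Build a distributed query that fans out to one core per language."""
    params = {
        "q": f"content:{user_query}",  # every core uses the same field name
        "shards": ",".join(shard_addresses),
    }
    # Keep ':', ',' and '/' readable, as in the URL on the slide.
    return select_url + "?" + urlencode(params, safe=":,/", quote_via=quote)


shards_url = core_per_language_url(
    "http://localhost:8983/solr/articles_eng/select",  # hypothetical host
    "serie a",
    ["localhost:8983/solr/articles_cjk", "localhost:8983/solr/articles_eng"],
)
print(shards_url)
```

Because each core defines its own `content` field type, the query text stays the same and only the shard list changes with the languages being searched.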
Approach Comparison 
[Table comparing Field Per Language and Core Per Language on Simplicity and Speed, with two ✔ marks]
Approach Comparison: Query Latency 
Experimental Setup 
• Corpus: Wikipedia across 9 languages (9 million articles) 
• Queries: 1000 most frequently used terms for each language, randomized 
• JMeter running 1 hour for each of 6 test runs 
[Chart: Avg latency (ms), axis 0 to 160, vs. # languages queried (1, 4, 9); series: Field per lang, Core per lang]
An Alternative Approach 
All languages in a single field 
• Requires a custom meta field type that applies per-language concrete field type(s)
• Patch submitted to Solr 
cf. Solr In Action / Trey Grainger 
https://github.com/treygrainger/solr-in-action
An Alternative Approach 
query: “[en, es]clinton speaking”
→ Inspect [en, es], apply English and Spanish field types to “clinton speaking”, merge results
→ Terms: clinton, speak
Inverted Index
term      document IDs
...       ...
clinton   …, 123, ...
...       ...
speak     …, 123, ...
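The inspect, apply, and merge steps can be sketched in Python as below. This is not the Solr patch itself: the analyzers, the `EN_LEMMAS` table, and the function names are hypothetical stand-ins for per-language field types, and only the `[en, es]` prefix syntax comes from the slide.

```python
import re

# Toy English normalization table (hypothetical).
EN_LEMMAS = {"speaking": "speak", "spoke": "speak"}


def analyze_en(text):
    return [EN_LEMMAS.get(tok, tok) for tok in text.lower().split()]


def analyze_es(text):
    # No Spanish normalization in this sketch; tokens pass through unchanged.
    return text.lower().split()


ANALYZERS = {"en": analyze_en, "es": analyze_es}
LANG_PREFIX = re.compile(r"^\[([^\]]*)\](.*)$", re.DOTALL)


def analyze_multilingual(query, default_langs=("en",)):
    """Read the [lang, ...] prefix, run each analyzer, merge distinct terms."""
    match = LANG_PREFIX.match(query)
    if match:
        langs = [c.strip() for c in match.group(1).split(",") if c.strip()]
        text = match.group(2)
    else:
        langs, text = list(default_langs), query
    merged = []
    for lang in langs:
        for term in ANALYZERS[lang](text):  # KeyError for unknown codes
            if term not in merged:
                merged.append(term)
    return merged


print(analyze_multilingual("[en, es]clinton speaking"))
```

In this toy version the merged list also keeps "speaking", because the pass-through Spanish analyzer leaves it unchanged; what a real implementation emits depends on the concrete field types involved.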
An Alternative Approach 
• Results scoring potentially worse than other approaches 
• IDF thrown off with single field 
• e.g., soy common in Spanish, relatively rare in English
• Consider a query for “soy dessert recipe” against a corpus of English and Spanish recipes
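A small worked example of the IDF effect, using the textbook form log(N / df) rather than Lucene's exact scoring formula. The document counts are hypothetical and chosen only to illustrate the direction of the change.

```python
import math


def idf(total_docs, docs_with_term):
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(total_docs / docs_with_term)


# Hypothetical counts for a recipe corpus, chosen only for illustration.
english = {"total": 1000, "soy": 20}   # "soy" is rare in English recipes
spanish = {"total": 1000, "soy": 400}  # "soy" is common in Spanish text

# IDF of "soy" when English has its own field (or core).
idf_english_field = idf(english["total"], english["soy"])

# IDF of "soy" when both languages share a single field.
idf_single_field = idf(
    english["total"] + spanish["total"],
    english["soy"] + spanish["soy"],
)

print(round(idf_english_field, 2), round(idf_single_field, 2))
```

With a shared field, the Spanish occurrences make "soy" look like a common word, so it contributes less to the score of an English query such as “soy dessert recipe” than it would against English documents alone.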
Enhancing NLP Pipeline 
Limitations of NLP in Solr out of the box 
• Poor precision / performance of CJK tokenization 
• Poor precision / recall of stemmers (no lemmatizers) 
• Poor recall due to lack of decompounding 
Rosette to the rescue!
CJK Tokenization 
ケネディはマサチューセッツ 
• Rosette: ケネディ, は, マサチューセッツ 
• Bigrams: ケネ, ネデ, ディ, ィは, はマ, マサ, サチ, チュ, ュー, ーセ, 
セッ, ッツ 
• How does this impact precision, recall, index size, speed?
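The bigram list above can be reproduced with a plain character-level sketch. This is not Solr's CJKBigramFilterFactory, which works on the tokenizer's output and treats scripts separately; it only shows why bigramming yields many more terms than word-level tokenization.

```python
import unicodedata


def char_bigrams(text):
    """Return every pair of adjacent characters in the text."""
    # NFC keeps characters such as デ as a single code point.
    text = unicodedata.normalize("NFC", text)
    return [text[i : i + 2] for i in range(len(text) - 1)]


bigrams = char_bigrams("ケネディはマサチューセッツ")
print(len(bigrams), bigrams)
```

Twelve overlapping bigrams versus three words for the same 13 characters: the bigram index is larger, and fragments such as ィは or はマ cross word boundaries, so they can match text that has nothing to do with the original words.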
Rosette In Solr 
<fieldType name="text_zho" class="solr.TextField"> 
<analyzer type="index"> 
<tokenizer 
class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" 
rootDirectory="<rootDir>" 
language="zho" /> 
<filter 
class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" 
rootDirectory="<rootDir>" 
language="zho" /> 
</analyzer> 
</fieldType> 
cf. http://www.basistech.com/search-essentials/
Wrapping Up 
• Multilingual search is everywhere 
• Solr as your multilingual search platform 
• Search quality hinges on quality of NLP tools

The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 

Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

  • 1.
  • 2. Optimizing Multilingual Search David Troiano Principal Software Engineer, Basis Technology dtroiano@basistech.com
  • 3. Talk Overview • The problem we’re trying to solve • Natural language processing (NLP) • Approaches to multilingual search in Solr
  • 5. The Goal Build a search engine where: • Document corpus spans multiple languages • Potentially mixed language documents • Queries within a language, or potentially spanning multiple
  • 6. NLP Meets Search (Querying) Terms Inverted Index query: “clinton speaking” NLP pipeline clinton, speak term document IDs ... ... clinton …, 123, ... ... ... speak …, 123, ...
  • 7. NLP Meets Search (Indexing) Document 123 Terms Inverted Index Bill Clinton spoke about ... NLP pipeline bill, clinton, speak, about term document IDs ... ... clinton …, 123, ... ... ... speak …, 123, ...
  • 8. NLP Meets Search Terms Inverted Index Bill Clinton spoke about ... term document IDs ... ... clinton …, 123, ... ... ... speak …, 123, ... Document 123 NLP pipeline bill, clinton, speak, about query: “clinton speaking” NLP pipeline clinton, speak
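As a rough illustration of slides 6 to 8, the sketch below runs documents and queries through the same toy analysis function and looks the resulting terms up in an inverted index. The `analyze` function and its tiny lemma table are hypothetical stand-ins for a real NLP pipeline, not anything taken from the talk.

```python
from collections import defaultdict

# Hypothetical lemma table standing in for a real NLP pipeline.
LEMMAS = {"spoke": "speak", "speaking": "speak", "speaks": "speak"}

def analyze(text):
    """Lowercase, split on whitespace, and map each token to its lemma."""
    return [LEMMAS.get(tok, tok) for tok in text.lower().split()]

def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents containing every analyzed query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()

docs = {123: "Bill Clinton spoke about search", 124: "Solr indexes documents"}
index = build_index(docs)
print(search(index, "clinton speaking"))
```

Because the same `analyze` step is applied at index time and at query time, "speaking" in the query and "spoke" in the document both reduce to `speak` and meet in the index.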
  • 9. The NLP Pipeline • Language Detection • Tokenization • Decompounding • Word Form Normalization
  • 10. Language Detection • Often required when indexing • Typically not used at query time • Lower accuracy on short strings • Sometimes unsolvable even to humans, e.g., named entities • End user applications often know query language upstream of search engine • No readily available plugin pattern in Solr
  • 11. Tokenization • Breaking text into words • Particularly difficult with CJK languages • Find the words: 帰国後ハーバード大学に入学を認められていたもの
  • 12. Decompounding • Breaking compound words into subcomponents • Common in German, Dutch, Korean • Samstagmorgen → Samstag, morgen
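A minimal sketch of dictionary-based decompounding for the slide 12 example. The three-word `LEXICON` and the two-part split are assumptions made for illustration; a production decompounder relies on a full dictionary and handles linking elements, which this ignores.

```python
# Hypothetical mini-lexicon; a real decompounder uses a full dictionary.
LEXICON = {"samstag", "morgen", "abend"}

def decompound(word, lexicon=LEXICON, min_len=3):
    """Split a word into two lexicon entries if possible, else return it whole."""
    lower = word.lower()
    for i in range(min_len, len(lower) - min_len + 1):
        head, tail = lower[:i], lower[i:]
        if head in lexicon and tail in lexicon:
            return [head, tail]
    return [lower]

print(decompound("Samstagmorgen"))
```

Indexing the subcomponents alongside the compound is what lets a query for "Samstag" match a document that only contains "Samstagmorgen".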
  • 13. Word Form Normalization • Reduce word form variations to a canonical representation • Critical for recall • Two approaches • Stemming • Lemmatization
  • 14. Normalization: Stemming • Simple rules-based approach • “Chop off the end” • arsenal, arsenic → arsen
  • 15. Normalization: Lemmatization • Map words to their dictionary form via morphological analysis • spoke, speaks, speaking → speak • Higher precision and recall compared to stemming
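The contrast between slides 14 and 15 can be sketched with a toy suffix stripper and a toy lemma dictionary. Both the `SUFFIXES` rule set and `LEMMA_DICT` are hypothetical; neither is a real stemmer or morphological analyzer.

```python
SUFFIXES = ("ing", "al", "ic", "s")  # hypothetical rule set, not a real stemmer

def stem(word):
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical dictionary standing in for morphological analysis.
LEMMA_DICT = {"spoke": "speak", "speaks": "speak", "speaking": "speak"}

def lemmatize(word):
    """Look the word up in the dictionary; unknown words pass through unchanged."""
    return LEMMA_DICT.get(word, word)

for w in ["arsenal", "arsenic", "spoke", "speaks", "speaking"]:
    print(w, stem(w), lemmatize(w))
```

The suffix rules conflate "arsenal" and "arsenic" (a precision loss) and leave the irregular form "spoke" untouched (a recall loss), while the dictionary lookup keeps the first pair apart and maps all three verb forms to "speak".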
  • 16. NLP Meets Search Terms Inverted Index Bill Clinton spoke about ... term document IDs ... ... clinton …, 123, ... ... ... speak …, 123, ... Document 123 NLP pipeline bill, clinton, speak, about query: “clinton speaking” NLP pipeline clinton, speak Solr
  • 17. NLP Within Solr • Maximal precision / recall requires NLP pipeline per language • NLP pipeline (mostly) specified within Solr field type • Index / query strategies in Solr • Field per language • Core per language • A new approach: Single multilingual field
  • 18. Field Per Language schema.xml <field name="content_cjk" type="text_cjk" indexed="true" stored="true" /> <field name="content_eng" type="text_eng" indexed="true" stored="true" /> <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.CJKWidthFilterFactory"/> <filter class="solr.CJKBigramFilterFactory"/> </analyzer> </fieldType> query http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng
  • 19. Field Per Language http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng
  • 20. Field Per Language http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng
  • 21. Field Per Language http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng
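A small sketch of how a client might assemble the field-per-language query shown on slides 18 to 21. The helper name `field_per_language_url` and the `http://localhost:8983` base URL are assumptions; the slides leave the Solr URL unspecified.

```python
from urllib.parse import urlencode, quote

def field_per_language_url(base_url, collection, query, fields):
    """Build an edismax select URL that searches one field per language."""
    params = {
        "q": query,
        "defType": "edismax",
        "qf": " ".join(fields),  # edismax expects a space-separated field list
    }
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"{base_url}/solr/{collection}/select?" + urlencode(params, quote_via=quote)

# Hypothetical base URL for illustration only.
url = field_per_language_url(
    "http://localhost:8983", "articles", "serie a", ["content_cjk", "content_eng"]
)
print(url)
```

Each additional language adds one more entry to `qf`, so the query fans out across more fields as the number of languages queried grows.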
  • 22. Core Per Language CJK core’s schema.xml <field name="content" type="text_cjk" indexed="true" stored="true" multiValued="true"/> <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.CJKWidthFilterFactory"/> <filter class="solr.CJKBigramFilterFactory"/> </analyzer> </fieldType> query http://.../select?q=content:serie%20a&shards=<url>/articles_cjk,<url>/articles_eng
  • 23. Core Per Language http://.../select?q=content:serie%20a&shards=<url>/articles_cjk,<url>/articles_eng
  • 24. Core Per Language http://.../select?q=content:serie%20a&shards=<url>/articles_cjk,<url>/articles_eng
  • 25. Approach Comparison Field Per Language Core Per Language Simplicity Speed ✔ ✔
  • 26. Approach Comparison: Query Latency Experimental Setup • Corpus: Wikipedia across 9 languages (9 million articles) • Queries: 1000 most frequently used terms for each language, randomized • JMeter running 1 hour for each of 6 test runs [Chart: average latency (ms, axis 0 to 160) against number of languages queried (1, 4, 9), field per language vs. core per language]
  • 27. An Alternative Approach All languages in a single field • Requires custom meta field type that applies per-language concrete field type(s) • Patch submitted to Solr cf. Solr In Action / Trey Grainger https://github.com/treygrainger/solr-in-action
  • 28. An Alternative Approach Terms Inverted Index query: “[en, es]clinton speaking” Inspect [en, es], apply English and Spanish field types to “clinton speaking”, merge results clinton, speak term document IDs ... ... clinton …, 123, ... ... ... speak …, 123, ...
  • 29. An Alternative Approach • Results scoring potentially worse than other approaches • IDF thrown off with single field • e.g., soy common in Spanish, relatively rare in English • Consider a query for “soy dessert recipe” against a corpus of English and Spanish recipes
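The IDF concern on slide 29 can be made concrete with a toy calculation. The document counts below are invented for illustration, and `log(N / df)` is a simplified textbook form of IDF rather than the exact formula Lucene uses.

```python
import math

def idf(num_docs, doc_freq):
    """Simplified inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)

# Hypothetical counts: "soy" appears in 10 of 1000 English recipes
# and in 400 of 1000 Spanish recipes.
english_only = idf(1000, 10)        # English documents in their own field
combined = idf(1000 + 1000, 10 + 400)  # both languages in a single field

print(round(english_only, 2), round(combined, 2))
```

In a single shared field, the Spanish occurrences of "soy" inflate its document frequency, so the term carries less weight in the English query "soy dessert recipe" than it would against an English-only field.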
  • 30. Enhancing NLP Pipeline Limitations of NLP in Solr out of the box • Poor precision / performance of CJK tokenization • Poor precision / recall of stemmers (no lemmatizers) • Poor recall due to lack of decompounding Rosette to the rescue!
  • 31. CJK Tokenization ケネディはマサチューセッツ • Rosette: ケネディ, は, マサチューセッツ • Bigrams: ケネ, ネデ, ディ, ィは, はマ, マサ, サチ, チュ, ュー, ーセ, セッ, ッツ • How does this impact precision, recall, index size, speed?
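To get a feel for the index-size question on slide 31, the sketch below generates overlapping character bigrams for the example string and compares the count with the three word tokens listed for Rosette. The `bigrams` function only illustrates the idea of character bigramming; it is not how Solr's `CJKBigramFilterFactory` is implemented.

```python
def bigrams(text):
    """Return every overlapping pair of adjacent characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

text = "ケネディはマサチューセッツ"
word_tokens = ["ケネディ", "は", "マサチューセッツ"]  # segmentation from the slide

grams = bigrams(text)
print(len(word_tokens), "word tokens vs", len(grams), "bigrams")
print(grams)
```

Bigrams such as "ィは" and "はマ" straddle word boundaries, so they add index terms that correspond to no real word, which bears on both index size and precision.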
  • 32. Rosette In Solr <fieldType name="text_zho" class="solr.TextField"> <analyzer type="index"> <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" rootDirectory="<rootDir>" language="zho" /> <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" rootDirectory="<rootDir>" language="zho" /> </analyzer> </fieldType> cf. http://www.basistech.com/search-essentials/
  • 33. Wrapping Up • Multilingual search is everywhere • Solr as your multilingual search platform • Search quality hinges on quality of NLP tools
  • 34. Optimizing Multilingual Search David Troiano Principal Software Engineer, Basis Technology dtroiano@basistech.com