Language support and linguisticsin Apache Lucene™ and Apache Solr™ and the eco-systemGaute Lambertsen Christian Moengl@ati...
Christian Moen• MSc. in computer science, University of Oslo, Norway• Worked with search at FAST (now Microsoft) for 10 ye...
Gaute Lambertsen• MSc. in computer science, Ritsumeikan University• Japan Science and Technology Agency (JST)• Urban ad-ho...
About us• We are a small company based in Tokyo• Our customers are typically big companies “everywhere”• We focus on innov...
About this talk• Search engine review• Natural language processing and search• Examples from different languages• Measurin...
Hands-on 1: Working with analyzers in codeHands-on 4: Text processing using OpenNLPHands-on 3: Multi-lingual search with S...
What is a search engine?
Documents1 Sushi is very tasty in Japan2 Visiting the Tsukiji fish market is very funTwo documents (1 & 2) with English text1
Text segmentation1 Sushi is very tasty in Japan2 Visiting the Tsukiji fish market is very fun1 Sushi is very tasty in Japan...
Text segmentation1 Sushi is very tasty in Japan2 Visiting the Tsukiji fish market is very fun1 Sushi is very tasty in Japan...
Document indexing1 sushi is very tasty in japan2 visiting the tsukiji fish market is very funTokenized documents with norma...
Document indexingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 21 sushi is very tasty in...
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2queryvery tasty sushi
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvery tasty sushiparsed query
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvery tasty sushiparsed query
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvery tasty sushiparsed query
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvery tasty sushiparsed query
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 21hitsANDvery tasty sushiparsed q...
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2queryvisit fun market
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvisit fun marketparsed query
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvisit fun marketparsed query
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvisit fun marketparsed query
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvisit fun marketparsed queryv...
Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvisit fun marketparsed query ...
What’s the problem?Search engines are notmagical answering machinesThey match terms in queriesagainst terms in documents,a...
Key takeawaysText processing affects search quality inbig way because it affects matchingThe “magic” of a search engine is...
Natural language and search
日本語English Deutsch Français ‫ا&%$#"ى‬
EnglishPale ale is a beer made through warmfermentation using pale malt and is one ofthe worlds major beer styles.
EnglishHow do we want to index worlds??Pale ale is a beer made through warmfermentation using pale malt and is one ofthe w...
EnglishHow do we want to index worlds??Should a search for style match styles?And should ferment match fermentation??Pale ...
GermanDas Oktoberfest ist das größte Volksfestder Welt und es findet in der bayerischenLandeshauptstadt München.
GermanDas Oktoberfest ist das größte Volksfestder Welt und es findet in der bayerischenLandeshauptstadt München.The Oktober...
GermanDas Oktoberfest ist das größte Volksfestder Welt und es findet in der bayerischenLandeshauptstadt München.The Oktober...
GermanDas Oktoberfest ist das größte Volksfestder Welt und es findet in der bayerischenLandeshauptstadt München.The Oktober...
FrenchLe champagne est un vin pétillant françaisprotégé par une appellation doriginecontrôlée.
FrenchLe champagne est un vin pétillant françaisprotégé par une appellation doriginecontrôlée.Champagne is a French sparkl...
FrenchLe champagne est un vin pétillant françaisprotégé par une appellation doriginecontrôlée.How do we want to search é, ...
FrenchLe champagne est un vin pétillant françaisprotégé par une appellation doriginecontrôlée.How do we want to search é, ...
Arabic)ْ+,ِ. ‫َم‬%َ1‫ا‬ ‫ْز‬3ُ5ُ‫ر‬ 7ِ5 ‫ًا‬9ْ5َ‫ر‬ "َ:ْ#‫ـــــــ‬ ِ<=‫ا‬ >?#ِ$َ%َ&‫ا‬ ‫َة‬3ْAَB‫ا‬ %َC,َD,ْ&ُE.Fِ$َ%َ&‫ا‬...
Arabic)ْ+,ِ. ‫َم‬%َ1‫ا‬ ‫ْز‬3ُ5ُ‫ر‬ 7ِ5 ‫ًا‬9ْ5َ‫ر‬ "َ:ْ#‫ـــــــ‬ ِ<=‫ا‬ >?#ِ$َ%َ&‫ا‬ ‫َة‬3ْAَB‫ا‬ %َC,َD,ْ&ُE.Fِ$َ%َ&‫ا‬...
Arabic)ْ+,ِ. ‫َم‬%َ1‫ا‬ ‫ْز‬3ُ5ُ‫ر‬ 7ِ5 ‫ًا‬9ْ5َ‫ر‬ "َ:ْ#‫ـــــــ‬ ِ<=‫ا‬ >?#ِ$َ%َ&‫ا‬ ‫َة‬3ْAَB‫ا‬ %َC,َD,ْ&ُE.Fِ$َ%َ&‫ا‬...
Arabic)ْ+,ِ. ‫َم‬%َ1‫ا‬ ‫ْز‬3ُ5ُ‫ر‬ 7ِ5 ‫ًا‬9ْ5َ‫ر‬ "َ:ْ#‫ـــــــ‬ ِ<=‫ا‬ >?#ِ$َ%َ&‫ا‬ ‫َة‬3ْAَB‫ا‬ %َC,َD,ْ&ُE.Fِ$َ%َ&‫ا‬...
Arabic)ْ+,ِ. ‫َم‬%َ1‫ا‬ ‫ْز‬3ُ5ُ‫ر‬ 7ِ5 ‫ًا‬9ْ5َ‫ر‬ "َ:ْ#‫ـــــــ‬ ِ<=‫ا‬ >?#ِ$َ%َ&‫ا‬ ‫َة‬3ْAَB‫ا‬ %َC,َD,ْ&ُE.Fِ$َ%َ&‫ا‬...
Arabic)+. ‫ا1%م‬ ‫ر53ز‬ 75 ‫ر59ا‬ ":#‫ا=<ـــــــ‬ "#$%&‫ا‬ ‫3ة‬AB‫ا‬ %CD&E.F$%&‫ا‬ GH&‫ا‬ IJ ‫ا&%ب‬Original Arabian coffee ...
Arabic)ْ+,ِ. ‫َم‬%َ1‫ا‬ ‫ْز‬3ُ5ُ‫ر‬ 7ِ5 ‫ًا‬9ْ5َ‫ر‬ "َ:ْ#‫ـــــــ‬ ِ<=‫ا‬ >?#ِ$َ%َ&‫ا‬ ‫َة‬3ْAَB‫ا‬ %َC,َD,ْ&ُE.Fِ$َ%َ&‫ا‬...
JapaneseJR 新宿 駅 の 近くに ビールを飲みに行こう か?
JapaneseShall we go for a beer near JR Shinjuku station?JR 新宿 駅 の 近くに ビールを飲みに行こう か?
JapaneseShall we go for a beer near JR Shinjuku station?What are the words in this sentence?? What are the words in this s...
JapaneseShall we go for a beer near JR Shinjuku station?What are the words in this sentence??Words are implicit in Japanes...
JapaneseJR 新宿 駅 の 近くに ビールを飲みに行こう か?What are the words in this sentence??Words are implicit in Japanese - thereis no white ...
JapaneseJR 新宿 駅 の 近くに ビールを飲みに行こう か?What are the words in this sentence??Words are implicit in Japanese - thereis no white ...
JapaneseDo we want 飲む (to drink) to match 飲み??Shall we go for a beer near JR Shinjuku station?JR 新宿 駅 の 近くに ビールを飲みに行こう か?
JapaneseDo we want 飲む (to drink) to match 飲み??Do we want ビール to match ビール??Shall we go for a beer near JR Shinjuku statio...
JapaneseDo we want 飲む (to drink) to match 飲み??Do we want ビール to match ビール??Do we want (emoji) to match??Shall we go for a...
Common traits•Segmenting source text into tokens• Dealing with non-space separated languages• Handling punctuation in spac...
Key take-aways• Natural language is very complex• Each language is different with its own set of complexities• We have had...
Search quality measurements
PrecisionFraction of retrieveddocuments that are relevantprecision =| { relevant docs } ∩ { retrieved docs } || { retrieve...
Perfect precisionJust return a single relevant document!precision =| { relevant docs } ∩ { document A } || { document A } ...
Recall| { relevant docs } ∩ { retrieved docs } || { relevant docs } |recall =Fraction of relevantdocuments that are retrie...
Perfect recall| { relevant docs } ∩ { all docs docs } || { relevant docs } |recall =Just return everything!recall =| { rel...
F =Accuracy - Fprecision ・ recallA balanced measure ofprecision and recall(harmonic mean)precision + recall
Precision vs. RecallShould I optimize for precision or recall??
Precision vs. RecallShould I optimize for precision or recall??That depends on your application!
Optimizing for PrecisionOptimize for Precision if hits are plentifuland several results can meet the user’sinformation nee...
Optimizing for RecallOptimize for Recall if not missing anyrelevant hits is your top priorityReturn everything that is rel...
A lot of tuning work is in practice oftenabout improving recall without hurtingprecision!
Linguistics in Lucene and Solr
Core search & indexing library Ready-to-use search serverInterface is an API (usually Java) Interface is an HTTP serverNo ...
Linguistics in Lucene
Simplified architectureIndexdocumentor query
Indexdocumentor queryLucene analysis chain / Analyzer1. Analyzes queries or documents in a pipelined fashionbefore indexin...
What does an Analyzer do??Analyzers
What does an Analyzer do??! Analyzers take text as its input andturns it into a stream of tokensAnalyzers
What does an Analyzer do??! Analyzers take text as its input andturns it into a stream of tokensTokens are produced by a T...
What does an Analyzer do??! Analyzers take text as its input andturns it into a stream of tokensTokens are produced by a T...
Analyzer high-level conceptsTokenizerReaderTokenFilterTokenFilterTokenFilterReader• Stream to be analyzed is provided by a...
Lucene processing exampleLe champagne est protégé par uneappellation dorigine contrôlée.
Le champagne est protégé par une appellation dorigine contrôlée.FrenchAnalyzer
StandardTokenizerLe champagne est protégé par une appellation dorigine contrôlée.FrenchAnalyzer
StandardTokenizerLe champagne est protégé par une appellation dorigine contrôlée.Le champagne est protégé par une appellat...
StandardTokenizerLe champagne est protégé par une appellation dorigine contrôlée.Le champagne est protégé par une appellat...
StandardTokenizerLe champagne est protégé par une appellation dorigine contrôlée.Le champagne est protégé par une appellat...
StandardTokenizerLe champagne est protégé par une appellation dorigine contrôlée.Le champagne est protégé par une appellat...
LowerCaseFilterle champagne est protégé par une appellation origine contrôlée
LowerCaseFilterle champagne est protégé par une appellation origine contrôléeStopFilter
LowerCaseFilterle champagne est protégé par une appellation origine contrôléeStopFilterchampagne protégé appellation origi...
LowerCaseFilterle champagne est protégé par une appellation origine contrôléeStopFilterchampagne protégé appellation origi...
LowerCaseFilterle champagne est protégé par une appellation origine contrôléeStopFilterchampagne protégé appellation origi...
FrenchAnalyzerchampagn proteg apel origin controlLe champagne est protégé par une appellation dorigine contrôlée.FrenchLig...
FrenchAnalyzerStandardTokenizerVery commonly used tokenizer• Is smart and handles punctuation cleverly• Used by many space...
Analyzer processing model•Analyzers provide a TokenStream• Retrieve it by calling tokenStream(field, reader)• tokenStream()...
Hands-on: Working with analyzers in code
Synonyms
Synonyms• Synonyms are flexible and easy-to-use• Very powerful tools for improving recall• Two types of synonyms• One way/m...
Hands-on: French analysis with synonyms
Linguistics in Solr
Adding document detailsIndex<add><doc><field>∙∙∙</field></doc></add>
Index<add><doc><field>∙∙∙</field></doc></add>Adding document details
Indexid ...title ...body ...<add><doc><field>∙∙∙</field></doc></add>UpdateRequestHandler handles request1. Receives a docu...
Indexid ...title ...body ...UpdateRequestHandler handles request1. Receives a document via HTTP in XML (or JSON, CSV, ...)...
Indexid ...title ...body ...Update chain of UpdateRequestProcessors1. Processes a document at a time with operation (add)2...
Indexid ...title ...body ...Update chain of UpdateRequestProcessors1. Processes a document at a time with operation (add)2...
Indexid ...title ...body ...Update chain of UpdateRequestProcessors1. Processes a document at a time with operation (add)2...
Indexid ...title ...body ...Update chain of UpdateRequestProcessors1. Processes a document at a time with operation (add)2...
Indexid ...title ...body ...genre ...Update chain of UpdateRequestProcessors1. Update processor added a category field by a...
Indexid ...title ...body ...genre ...Update chain of UpdateRequestProcessors1. Update processor added a category fields by ...
Indexid ...title ...body ...genre ...Update chain of UpdateRequestProcessors1. Update processor added a category fields by ...
Indexid ...title ...body ...genre ...id ...title ...body ...genre ...Lucene analyzer chain1. Fields are analyzed individua...
Indexid ...title ...body ...genre ...id ...title ...body ...genre ...Lucene analyzer chain1. No analysis on idAdding docum...
Indexid ...title ...body ...genre ...title ...body ...genre ...Lucene analyzer chain1. Field title being processedid ...Ad...
Indexid ...title ...body ...genre ...title ...body ...genre ...Lucene analyzer chain1. Field title being processedid ...Ad...
Indexid ...title ...body ...genre ...title ...body ...genre ...Lucene analyzer chain1. Field title being processedid ...Ad...
Indexid ...title ...body ...genre ...title ...body ...genre ...Lucene analyzer chain1. Field title being processedid ...Ad...
Indexid ...title ...body ...genre ...title ...body ...genre ...Lucene analyzer chain1. Field body being processedid ...Add...
Indexid ...title ...body ...genre ...title ...genre ...Lucene analyzer chain1. Field body being processedid ...body ...Add...
Indexid ...title ...body ...genre ...title ...genre ...Lucene analyzer chain1. Field body being processedid ...body ...Add...
Indexid ...title ...body ...genre ...title ...genre ...Lucene analyzer chain1. Field body being processedid ...body ...Add...
Indexid ...title ...body ...genre ...title ...genre ...Lucene analyzer chain1. Field genre being processed2. User a differe...
Indexid ...title ...body ...genre ...title ...Lucene analyzer chain1. Field genre being processed2. User a different analyz...
id ...title ...body ...genre ...IndexLucene analyzer chain1. All fields analyzedAdding document details
Indexid ...title ...body ...genre ...Adding document details
IndexquerySearch details
SearchHandlerIndexquerySearch details
IndexquerySearch componentsSearch details
Analysis chainIndexquerySearch details
IndexquerySearch details
IndexquerySearch componentsSearch details
IndexresultSearchHandlerSearch details
Hands-on: Multi-lingual search with Solr
Languages out-of-the-box in Solr
Field types in schema.xml• text_ar Arabic• text_bg Bulgarian• text_ca Catalan• text_cjk CJK• text_cz Czech• text_da Danish...
Field types in schema.xmlComing soon!LUCENE-4956• text_ar Arabic• text_bg Bulgarian• text_ca Catalan• text_cjk CJK• text_c...
NLP eco-system
Basis Technology• High-end provider of text analytics software• Rosette Linguistics Platform (RLP) highlights• Language an...
Apache OpenNLP• Machine learning toolkit for NLP• Implements a range of common and best-practice algorithms• Very easy-to-...
Hands-on: Basic text processing with OpenNLP
Other eco-system options
Summary
Summary•Getting languages right is hard• Linguistics helps improves search quality•Linguistics in Lucene and Solr• A wide ...
Linguistics in the bigger picture
Linguistics in the bigger picture• Understand your content and your users’ needs• Understand your language and its issues•...
Thanks youJan Høydahl www.cominvent.comThanks for some slide materialBushra ZawaydehThanks for fun Arabic language lessons
Example code• Example code is on Github• https://github.com/atilika/lucenerevolution-2013• Get started using• git clone gi...
Q & A
ありがとうございましたThank you very much!"#$ ‫(&ا‬Vielen DankMerci beaucoup
CONFERENCE PARTYThe Tipsy Crow: 770 5th AveStarts after Stump The ChumpYour conference badge gets youin the doorTOMORROWBr...
Upcoming SlideShare
Loading in...5
×

Language support and linguistics in lucene solr & its eco system

5,082

Published on

Presented by Christian Moen, Software Engineer, Atilika Inc.

In search, language handling is often key to getting a good search experience. This talk gives an overview of language handling and linguistics functionality in Lucene/Solr and best-practices for using them to handle Western, Asian and multi-language deployments. Pointers and references within the open source and commercial eco-systems for more advanced linguistics and their applications are also discussed.

The presentation is mix of overview and hands-on best-practices the audience can benefit immediately from in their Lucene/Solr deployments. The eco-system part is meant to inspire how more advanced functionality can be developed by means of the available open source technologies within the Apache eco-system (predominantly) while also highlighting some of the commercial options available.

Published in: Education, Technology
3 Comments
16 Likes
Statistics
Notes
No Downloads
Views
Total Views
5,082
On Slideshare
0
From Embeds
0
Number of Embeds
8
Actions
Shares
0
Downloads
138
Comments
3
Likes
16
Embeds 0
No embeds

No notes for slide

Language support and linguistics in lucene solr & its eco system

  1. 1. Language support and linguisticsin Apache Lucene™ and Apache Solr™ and the eco-systemGaute Lambertsen Christian Moengl@atilika.com cm@atilika.com
  2. 2. Christian Moen• MSc. in computer science, University of Oslo, Norway• Worked with search at FAST (now Microsoft) for 10 years• 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway• 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan• Founded アティリカ株式会社 in October, 2009• We help companies innovate using new technologies and good ideas• We do information retrieval, natural language processing and big data• We are based in Tokyo, but we have clients everywhere• We are a small company, but our customers are typically very big companies• Newbie Lucene & Solr Committer• Mostly been working on Japanese language support (Kuromoji) so far• Please write me on cm@atilika.com or cm@apache.org
  3. 3. Gaute Lambertsen• MSc. in computer science, Ritsumeikan University• Japan Science and Technology Agency (JST)• Urban ad-hoc network research lead by Prof. Nobuhiko Nishio,Ritsumeikan University• Nokia Research & Development, Japan• Hardware and software prototypes for research applications.• Sony Digital Network Applications, Japan• Various software development for consumer electronics• Includes software for consumer hits Sony CyberShot, etc.• Joined Atilika in April, 2012• Please write me on gaute@atilika.com
  4. 4. About us• We are a small company based in Tokyo• Our customers are typically big companies “everywhere”• We focus on innovation - and help big companiesinnovate using new technologies and new ideas• We are software engineers, but like business stuff, too• We know a bit about search, NLP and big data• We offer consulting services and software products• We are supporters of open source software• We are hiring! Write us using careers@atilika.com
  5. 5. About this talk• Search engine review• Natural language processing and search• Examples from different languages• Measuring search quality• Linguistics in Lucene and Solr• Architecture overview• Lucene Analyzers• Adding and searching documents in Solr• Code examples• NLP eco-system• The bigger picture
  6. 6. Hands-on 1: Working with analyzers in codeHands-on 4: Text processing using OpenNLPHands-on 3: Multi-lingual search with SolrHands-on 2: French analysis with synonyms
  7. 7. What is a search engine?
  8. 8. Documents1 Sushi is very tasty in Japan2 Visiting the Tsukiji fish market is very funTwo documents (1 & 2) with English text1
  9. 9. Text segmentation1 Sushi is very tasty in Japan2 Visiting the Tsukiji fish market is very fun1 Sushi is very tasty in Japan2 Visiting the Tsukiji fish market is very funDocuments are turned into searchable terms (tokenization)Two documents (1 & 2) with English text12
  10. 10. Text segmentation1 Sushi is very tasty in Japan2 Visiting the Tsukiji fish market is very fun1 Sushi is very tasty in Japan2 Visiting the Tsukiji fish market is very fun1 sushi is very tasty in japan2 visiting the tsukiji fish market is very funDocuments are turned into searchable terms (tokenization)Two documents (1 & 2) with English textTerms/tokens are converted to lowercase form (normalization)123
  11. 11. Document indexing1 sushi is very tasty in japan2 visiting the tsukiji fish market is very funTokenized documents with normalized tokens
  12. 12. Document indexingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 21 sushi is very tasty in japan2 visiting the tsukiji fish market is very funTokenized documents with normalized tokensInverted index - tokens are mapped to the document ids that contain them
  13. 13. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2
  14. 14. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2queryvery tasty sushi
  15. 15. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvery tasty sushiparsed query
  16. 16. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvery tasty sushiparsed query
  17. 17. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvery tasty sushiparsed query
  18. 18. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvery tasty sushiparsed query
  19. 19. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 21hitsANDvery tasty sushiparsed query
  20. 20. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2
  21. 21. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2queryvisit fun market
  22. 22. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvisit fun marketparsed query
  23. 23. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvisit fun marketparsed query
  24. 24. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvisit fun marketparsed query
  25. 25. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvisit fun marketparsed queryvisit ≠ visiting
  26. 26. Searchingsushi 1is 1 2very 1 2tasty 1in 1japan 1visiting 2the 2tsukiji 2fish 2market 2fun 2ANDvisit fun marketparsed query no hits(all terms need to match)
  27. 27. What’s the problem?Search engines are notmagical answering machinesThey match terms in queriesagainst terms in documents,and order matches by rank!!
  28. 28. Key takeawaysText processing affects search quality inbig way because it affects matchingThe “magic” of a search engine is oftenprovided by high quality text processingGarbage in Garbage out!!
  29. 29. Natural language and search
  30. 30. 日本語English Deutsch Français ‫ا&%$#"ى‬
  31. 31. EnglishPale ale is a beer made through warmfermentation using pale malt and is one ofthe worlds major beer styles.
  32. 32. EnglishHow do we want to index worlds??Pale ale is a beer made through warmfermentation using pale malt and is one ofthe worlds major beer styles.
  33. 33. EnglishHow do we want to index worlds??Should a search for style match styles?And should ferment match fermentation??Pale ale is a beer made through warmfermentation using pale malt and is one ofthe worlds major beer styles.
  34. 34. GermanDas Oktoberfest ist das größte Volksfestder Welt und es findet in der bayerischenLandeshauptstadt München.
  35. 35. GermanDas Oktoberfest ist das größte Volksfestder Welt und es findet in der bayerischenLandeshauptstadt München.The Oktoberfest is the world’s largest festival and ittakes place in the Bavarian capital Munich.
  36. 36. GermanDas Oktoberfest ist das größte Volksfestder Welt und es findet in der bayerischenLandeshauptstadt München.The Oktoberfest is the world’s largest festival and ittakes place in the Bavarian capital Munich.How do we want to search ü, ö and ß??
  37. 37. GermanDas Oktoberfest ist das größte Volksfestder Welt und es findet in der bayerischenLandeshauptstadt München.The Oktoberfest is the world’s largest festival and ittakes place in the Bavarian capital Munich.How do we want to search ü, ö and ß??Do we want a search for hauptstadt tomatch Landeshauptstadt??
  38. 38. FrenchLe champagne est un vin pétillant françaisprotégé par une appellation doriginecontrôlée.
  39. 39. FrenchLe champagne est un vin pétillant françaisprotégé par une appellation doriginecontrôlée.Champagne is a French sparkling wine with a protecteddesignation of origin.
  40. 40. FrenchLe champagne est un vin pétillant françaisprotégé par une appellation doriginecontrôlée.How do we want to search é, ç and ô??Champagne is a French sparkling wine with a protecteddesignation of origin.
  41. 41. FrenchLe champagne est un vin pétillant françaisprotégé par une appellation doriginecontrôlée.How do we want to search é, ç and ô??How do we want to search dorigine??Champagne is a French sparkling wine with a protecteddesignation of origin.
  42. 42. Arabic)ْ+,ِ. ‫َم‬%َ1‫ا‬ ‫ْز‬3ُ5ُ‫ر‬ 7ِ5 ‫ًا‬9ْ5َ‫ر‬ "َ:ْ#‫ـــــــ‬ ِ<=‫ا‬ >?#ِ$َ%َ&‫ا‬ ‫َة‬3ْAَB‫ا‬ %َC,َD,ْ&ُE.Fِ$َ%َ&‫ا‬ َGHَ&‫ا‬ IِJ ْ‫َب‬%َ&‫ا‬
  43. 43. Arabic)ْ+,ِ. ‫َم‬%َ1‫ا‬ ‫ْز‬3ُ5ُ‫ر‬ 7ِ5 ‫ًا‬9ْ5َ‫ر‬ "َ:ْ#‫ـــــــ‬ ِ<=‫ا‬ >?#ِ$َ%َ&‫ا‬ ‫َة‬3ْAَB‫ا‬ %َC,َD,ْ&ُE.Fِ$َ%َ&‫ا‬ َGHَ&‫ا‬ IِJ ْ‫َب‬%َ&‫ا‬Reads from right to left
  44. 44. Arabic)ْ+,ِ. ‫َم‬%َ1‫ا‬ ‫ْز‬3ُ5ُ‫ر‬ 7ِ5 ‫ًا‬9ْ5َ‫ر‬ "َ:ْ#‫ـــــــ‬ ِ<=‫ا‬ >?#ِ$َ%َ&‫ا‬ ‫َة‬3ْAَB‫ا‬ %َC,َD,ْ&ُE.Fِ$َ%َ&‫ا‬ َGHَ&‫ا‬ IِJ ْ‫َب‬%َ&‫ا‬Original Arabian coffee is considered a symbol ofgenerosity among the Arabs in the Arab world.
  45. 45. Arabic)ْ+,ِ. ‫َم‬%َ1‫ا‬ ‫ْز‬3ُ5ُ‫ر‬ 7ِ5 ‫ًا‬9ْ5َ‫ر‬ "َ:ْ#‫ـــــــ‬ ِ<=‫ا‬ >?#ِ$َ%َ&‫ا‬ ‫َة‬3ْAَB‫ا‬ %َC,َD,ْ&ُE.Fِ$َ%َ&‫ا‬ َGHَ&‫ا‬ IِJ ْ‫َب‬%َ&‫ا‬Original Arabian coffee is considered a symbol ofgenerosity among the Arabs in the Arab world.How do we want to search "َ:ْ#‫ـــــــ‬ ِ<=‫?ا‬?
  46. 46. Arabic)ْ+,ِ. ‫َم‬%َ1‫ا‬ ‫ْز‬3ُ5ُ‫ر‬ 7ِ5 ‫ًا‬9ْ5َ‫ر‬ "َ:ْ#‫ـــــــ‬ ِ<=‫ا‬ >?#ِ$َ%َ&‫ا‬ ‫َة‬3ْAَB‫ا‬ %َC,َD,ْ&ُE.Fِ$َ%َ&‫ا‬ َGHَ&‫ا‬ IِJ ْ‫َب‬%َ&‫ا‬Original Arabian coffee is considered a symbol ofgenerosity among the Arabs in the Arab world.How do we want to search "َ:ْ#‫ـــــــ‬ ِ<=‫?ا‬?Do we want to normalize diacritics??
  47. 47. Arabic)+. ‫ا1%م‬ ‫ر53ز‬ 75 ‫ر59ا‬ ":#‫ا=<ـــــــ‬ "#$%&‫ا‬ ‫3ة‬AB‫ا‬ %CD&E.F$%&‫ا‬ GH&‫ا‬ IJ ‫ا&%ب‬Original Arabian coffee is considered a symbol ofgenerosity among the Arabs in the Arab world.How do we want to search "َ:ْ#‫ـــــــ‬ ِ<=‫?ا‬?Do we want to normalize diacritics??Diacritics normalized (removed)
  48. 48. Arabic)ْ+,ِ. ‫َم‬%َ1‫ا‬ ‫ْز‬3ُ5ُ‫ر‬ 7ِ5 ‫ًا‬9ْ5َ‫ر‬ "َ:ْ#‫ـــــــ‬ ِ<=‫ا‬ >?#ِ$َ%َ&‫ا‬ ‫َة‬3ْAَB‫ا‬ %َC,َD,ْ&ُE.Fِ$َ%َ&‫ا‬ َGHَ&‫ا‬ IِJ ْ‫َب‬%َ&‫ا‬Original Arabian coffee is considered a symbol ofgenerosity among the Arabs in the Arab world.How do we want to search "َ:ْ#‫ـــــــ‬ ِ<=‫?ا‬?Do we want to correct the commonspelling mistake for IِJ and ‫?ه‬?Do we want to normalize diacritics??
  49. 49. JapaneseJR 新宿 駅 の 近くに ビールを飲みに行こう か?
  50. 50. JapaneseShall we go for a beer near JR Shinjuku station?JR 新宿 駅 の 近くに ビールを飲みに行こう か?
  51. 51. JapaneseShall we go for a beer near JR Shinjuku station?What are the words in this sentence?? What are the words in this sentence?Which tokens do we index?JR 新宿 駅 の 近くに ビールを飲みに行こう か?
  52. 52. JapaneseShall we go for a beer near JR Shinjuku station?What are the words in this sentence??Words are implicit in Japanese - thereis no white space that separates them!What are the words in this sentence?Which tokens do we index?JR 新宿 駅 の 近くに ビールを飲みに行こう か?
  53. 53. JapaneseJR 新宿 駅 の 近くに ビールを飲みに行こう か?What are the words in this sentence??Words are implicit in Japanese - thereis no white space that separates them!What are the words in this sentence?Which tokens do we index?Shall we go for a beer near JR Shinjuku station?But how do we find the tokens??JR 新宿 駅 の 近くに ビールを飲みに行こう か?
  54. 54. JapaneseJR 新宿 駅 の 近くに ビールを飲みに行こう か?What are the words in this sentence??Words are implicit in Japanese - thereis no white space that separates them!What are the words in this sentence?Which tokens do we index?Shall we go for a beer near JR Shinjuku station?But how do we find the tokens??
  55. 55. JapaneseDo we want 飲む (to drink) to match 飲み??Shall we go for a beer near JR Shinjuku station?JR 新宿 駅 の 近くに ビールを飲みに行こう か?
  56. 56. JapaneseDo we want 飲む (to drink) to match 飲み??Do we want ビール to match ビール??Shall we go for a beer near JR Shinjuku station?Does half-width match full-width?JR 新宿 駅 の 近くに ビールを飲みに行こう か?
  57. 57. JapaneseDo we want 飲む (to drink) to match 飲み??Do we want ビール to match ビール??Do we want (emoji) to match??Shall we go for a beer near JR Shinjuku station?Does half-width match full-width?JR 新宿 駅 の 近くに ビールを飲みに行こう か?
  58. 58. Common traits•Segmenting source text into tokens• Dealing with non-space separated languages• Handling punctuation in space separated languages• Segmenting compounds into their parts• Apply relevant linguistic normalizations• Character normalization• Morphological (or grammatical) normalizations• Spelling variations• Synonyms and stopwords
  59. 59. Key take-aways• Natural language is very complex• Each language is different with its own set of complexities• We have had a high level look at languages• But there is also...• Search needs per-language processing• Many considerations to be made (often application-specific)Greek Hebrew ChineseKoreanRussian ThaiSpanish and many more...JapaneseEnglish German French Arabic
  60. 60. Search quality measurements
  61. 61. PrecisionFraction of retrieveddocuments that are relevantprecision =| { relevant docs } ∩ { retrieved docs } || { retrieved docs } |
  62. 62. Perfect precisionJust return a single relevant document!precision =| { relevant docs } ∩ { document A } || { document A } |precision =| { document A } || { document A } |= 1
  63. 63. Recall| { relevant docs } ∩ { retrieved docs } || { relevant docs } |recall =Fraction of relevantdocuments that are retrieved
  64. 64. Perfect recall| { relevant docs } ∩ { all docs docs } || { relevant docs } |recall =Just return everything!recall =| { relevant docs docs } || { relevant docs docs } |= 1
  65. 65. F =Accuracy - Fprecision ・ recallA balanced measure ofprecision and recall(harmonic mean)precision + recall
  66. 66. Precision vs. RecallShould I optimize for precision or recall??
  67. 67. Precision vs. RecallShould I optimize for precision or recall??That depends on your application!
  68. 68. Optimizing for PrecisionOptimize for Precision if hits are plentifuland several results can meet the user’sinformation needs.Make sure people can find a few goodrelevant hits rather than overwhelmingthem with likely irrelevant results!!Web search, knowledge management!
  69. 69. Optimizing for RecallOptimize for Recall if not missing anyrelevant hits is your top priorityReturn everything that is relevant, butprobably also return a lot of documentsthat are not!!Compliance, patent search, digitalforensics, etc.!
  70. 70. A lot of tuning work is in practice oftenabout improving recall without hurtingprecision!
  71. 71. Linguistics in Lucene and Solr
  72. 72. Core search & indexing library Ready-to-use search serverInterface is an API (usually Java) Interface is an HTTP serverNo configuration files Configuration filesNo schema Has a schemaRuns on a single server Runs on multiple servers in ascalable and fault-tolerant fashionHas linguistics support for arange of languagesAdds functionality for facets, spell-checking, highlighting, etc.Great for embedding search Uses Lucene for search & indexing
  73. 73. Linguistics in Lucene
  74. 74. Simplified architectureIndexdocumentor query
  75. 75. Indexdocumentor queryLucene analysis chain / Analyzer1. Analyzes queries or documents in a pipelined fashionbefore indexing or search2. Analysis itself is done by an analyzer on a per field basis3. Key plug-in point for linguistics in LuceneSimplified architecture
  76. 76. What does an Analyzer do??Analyzers
  77. 77. What does an Analyzer do??! Analyzers take text as its input andturns it into a stream of tokensAnalyzers
  78. 78. What does an Analyzer do??! Analyzers take text as its input andturns it into a stream of tokensTokens are produced by a Tokenizer!Analyzers
  79. 79. What does an Analyzer do??! Analyzers take text as its input andturns it into a stream of tokensTokens are produced by a TokenizerTokens can be processed further by achain of TokenFilters downstream!!Analyzers
  80. 80. Analyzer high-level conceptsTokenizerReaderTokenFilterTokenFilterTokenFilterReader• Stream to be analyzed is provided by a Reader (from java.io)• Can have chain of associated CharFilters (not discussed)Tokenizer• Segments text provider by reader into tokens• Most interesting things happen in incrementToken() methodTokenFilter• Updates, mutates or enriches tokens• Most interesting things happen in incrementToken() methodTokenFilter...TokenFilter...
  81. 81. Lucene processing exampleLe champagne est protégé par uneappellation dorigine contrôlée.
  82. 82. Le champagne est protégé par une appellation dorigine contrôlée.FrenchAnalyzer
  83. 83. StandardTokenizerLe champagne est protégé par une appellation dorigine contrôlée.FrenchAnalyzer
  84. 84. StandardTokenizerLe champagne est protégé par une appellation dorigine contrôlée.Le champagne est protégé par une appellation dorigine contrôléeFrenchAnalyzer
  85. 85. StandardTokenizerLe champagne est protégé par une appellation dorigine contrôlée.Le champagne est protégé par une appellation dorigine contrôléeElisionFilterFrenchAnalyzer
  86. 86. StandardTokenizerLe champagne est protégé par une appellation dorigine contrôlée.Le champagne est protégé par une appellation dorigine contrôléeElisionFilterLe champagne est protégé par une appellation origine contrôléeFrenchAnalyzer
  87. 87. StandardTokenizerLe champagne est protégé par une appellation dorigine contrôlée.Le champagne est protégé par une appellation dorigine contrôléeElisionFilterLe champagne est protégé par une appellation origine contrôléeFrenchAnalyzerLowerCaseFilter
  88. 88. LowerCaseFilterle champagne est protégé par une appellation origine contrôlée
  89. 89. LowerCaseFilterle champagne est protégé par une appellation origine contrôléeStopFilter
  90. 90. LowerCaseFilterle champagne est protégé par une appellation origine contrôléeStopFilterchampagne protégé appellation origine contrôlée
  91. 91. LowerCaseFilterle champagne est protégé par une appellation origine contrôléeStopFilterchampagne protégé appellation origine contrôléeFrenchLightStemFilter
  92. 92. LowerCaseFilterle champagne est protégé par une appellation origine contrôléeStopFilterchampagne protégé appellation origine contrôléechampagn proteg apel origin controlFrenchLightStemFilter
  93. 93. FrenchAnalyzerchampagn proteg apel origin controlLe champagne est protégé par une appellation dorigine contrôlée.FrenchLightStemFilterStandardTokenizerElisionFilterLowerCaseFilterStopFilter
  94. 94. FrenchAnalyzerStandardTokenizerVery commonly used tokenizer• Is smart and handles punctuation cleverly• Used by many space-based languages• Based on a JFlex grammerElisionFilterRemoves elisions from tokensSee solr/collection1/conf/lang/contractions_fr.txtLowerCaseFilter Lowercases tokensStopFilterRemoved very common wordsSee solr/collection1/conf/lang/stopwords_fr.txtFrenchLightStemmerStems French wordsAlso does accent normalization (removal)
  95. 95. Analyzer processing model•Analyzers provide a TokenStream• Retrieve it by calling tokenStream(field, reader)• tokenStream() bundles together tokenizers andany additional filters necessary for analysis•Input is advanced by incrementToken()• Information about the token itself is provided byso-called TokenAttributes attached to the stream• Attribute for term text, offset, token type, etc.• TokenAttributes are updated on incrementToken()
  96. 96. Hands-on: Working with analyzers in code
  97. 97. Synonyms
  98. 98. Synonyms• Synonyms are flexible and easy-to-use• Very powerful tools for improving recall• Two types of synonyms• One way/mapping “sparkling wine => champagne”• Two way/equivalence “aoc, appellation dorigine contrôlée”• Can be applied index-time or query-time• Apply synonyms on one side - not both• Best practice is to apply synonyms query-side• Allows for updating synonyms without reindexing• Allows for turning synonyms on and off easily
  99. 99. Hands-on: French analysis with synonyms
  100. 100. Linguistics in Solr
  101. 101. Adding document detailsIndex<add><doc><field>∙∙∙</field></doc></add>
  102. 102. Index<add><doc><field>∙∙∙</field></doc></add>Adding document details
  103. 103. Indexid ...title ...body ...<add><doc><field>∙∙∙</field></doc></add>UpdateRequestHandler handles request1. Receives a document via HTTP in XML (or JSON, CSV, ...)2. Converts document to a SolrInputDocument3. Activates the update chainAdding document details
  104. 104. Indexid ...title ...body ...UpdateRequestHandler handles request1. Receives a document via HTTP in XML (or JSON, CSV, ...)2. Converts document to a SolrInputDocument3. Activates the update chain<add><doc><field>∙∙∙</field></doc></add>Adding document details
  105. 105. Indexid ...title ...body ...Update chain of UpdateRequestProcessors1. Processes a document at a time with operation (add)2. Plugin logic can mutate SolrInputDocument, i.e. add fieldsor do other processing as desiredAdding document details
  106. 106. Indexid ...title ...body ...Update chain of UpdateRequestProcessors1. Processes a document at a time with operation (add)2. Plugin logic can mutate SolrInputDocument, i.e. add fieldsor do other processing as desiredAdding document details
  107. 107. Indexid ...title ...body ...Update chain of UpdateRequestProcessors1. Processes a document at a time with operation (add)2. Plugin logic can mutate SolrInputDocument, i.e. add fieldsor do other processing as desiredAdding document details
  108. 108. Indexid ...title ...body ...Update chain of UpdateRequestProcessors1. Processes a document at a time with operation (add)2. Plugin logic can mutate SolrInputDocument, i.e. add fieldsor do other processing as desiredAdding document details
  109. 109. Indexid ...title ...body ...genre ...Update chain of UpdateRequestProcessors1. Update processor added a category field by analyzing body2. Finish by calling RunUpdateProcessor (usually)Adding document details
  110. 110. Indexid ...title ...body ...genre ...Update chain of UpdateRequestProcessors1. Update processor added a category fields by analyzing body2. Finish by calling RunUpdateProcessor (usually)Adding document details
  111. 111. Indexid ...title ...body ...genre ...Update chain of UpdateRequestProcessors1. Update processor added a category fields by analyzing body2. Finish by calling RunUpdateProcessor (usually)id ...title ...body ...genre ...Adding document details
  112. 112. Indexid ...title ...body ...genre ...id ...title ...body ...genre ...Lucene analyzer chain1. Fields are analyzed individuallyAdding document details
  113. 113. Indexid ...title ...body ...genre ...id ...title ...body ...genre ...Lucene analyzer chain1. No analysis on idAdding document details
  114. 114. Indexid ...title ...body ...genre ...title ...body ...genre ...Lucene analyzer chain1. Field title being processedid ...Adding document details
  115. 115. Indexid ...title ...body ...genre ...title ...body ...genre ...Lucene analyzer chain1. Field title being processedid ...Adding document details
  116. 116. Indexid ...title ...body ...genre ...title ...body ...genre ...Lucene analyzer chain1. Field title being processedid ...Adding document details
  117. 117. Indexid ...title ...body ...genre ...title ...body ...genre ...Lucene analyzer chain1. Field title being processedid ...Adding document details
  118. 118. Indexid ...title ...body ...genre ...title ...body ...genre ...Lucene analyzer chain1. Field body being processedid ...Adding document details
  119. 119. Indexid ...title ...body ...genre ...title ...genre ...Lucene analyzer chain1. Field body being processedid ...body ...Adding document details
  120. 120. Indexid ...title ...body ...genre ...title ...genre ...Lucene analyzer chain1. Field body being processedid ...body ...Adding document details
  121. 121. Indexid ...title ...body ...genre ...title ...genre ...Lucene analyzer chain1. Field body being processedid ...body ...Adding document details
  122. 122. Indexid ...title ...body ...genre ...title ...genre ...Lucene analyzer chain1. Field genre being processed2. User a different analyzer chainid ...body ...Adding document details
  123. 123. Indexid ...title ...body ...genre ...title ...Lucene analyzer chain1. Field genre being processed2. User a different analyzer chainid ...body ...genre ...Adding document details
  124. 124. id ...title ...body ...genre ...IndexLucene analyzer chain1. All fields analyzedAdding document details
  125. 125. Indexid ...title ...body ...genre ...Adding document details
  126. 126. IndexquerySearch details
  127. 127. SearchHandlerIndexquerySearch details
  128. 128. IndexquerySearch componentsSearch details
  129. 129. Analysis chainIndexquerySearch details
  130. 130. IndexquerySearch details
  131. 131. IndexquerySearch componentsSearch details
  132. 132. IndexresultSearchHandlerSearch details
  133. 133. Hands-on: Multi-lingual search with Solr
  134. 134. Languages out-of-the-box in Solr
  135. 135. Field types in schema.xml• text_ar Arabic• text_bg Bulgarian• text_ca Catalan• text_cjk CJK• text_cz Czech• text_da Danish• text_de German• text_el Greek• text_es Spanish• text_eu Basque• text_fa Farsi• text_fi Finnish• text_fr French• text_ga Irish• text_gl Galician• text_hi Hindi• text_hu Hungarian• text_hy Armenian• text_id Indonedian• text_it Italian• text_lv Latvian• text_nl Dutch• text_no Norwegian• text_pt Portuguese• text_ro Romanian• text_ru Russian• text_sv Swedish• text_th Thai• text_fr Turkish
  136. 136. Field types in schema.xmlComing soon!LUCENE-4956• text_ar Arabic• text_bg Bulgarian• text_ca Catalan• text_cjk CJK• text_cz Czech• text_da Danish• text_de German• text_el Greek• text_es Spanish• text_eu Basque• text_fa Farsi• text_fi Finnish• text_fr French• text_ga Irish• text_gl Galician• text_hi Hindi• text_hu Hungarian• text_hy Armenian• text_id Indonedian• text_it Italian• text_lv Latvian• text_nl Dutch• text_no Norwegian• text_pt Portuguese• text_ro Romanian• text_ru Russian• text_sv Swedish• text_th Thai• text_fr Turkish• text_kr Korean
  137. 137. NLP eco-system
  138. 138. Basis Technology• High-end provider of text analytics software• Rosette Linguistics Platform (RLP) highlights• Language and encoding identification(55 languages and 45 encodings)• Segmentation for Chinese, Japanese and Korean• De-compounding for German, Dutch, Korean, etc.• Lemmatization for a range of languages• Part-of-speech tagging for a range of language• Sentence boundary detection• Named entity extraction• Name indexing, transliteration and matching• Integrates well with Lucene/Solr
  139. 139. Apache OpenNLP• Machine learning toolkit for NLP• Implements a range of common and best-practice algorithms• Very easy-to-use tools and APIs• Features and applications• Tokenization• Sentence segmentation• Part-of-speech tagging• Named entity recognition• Chunking• Licensing terms• Code itself has an Apache License 2.0• Some models are available, but licensing terms and F-scores are unclear...• See LUCENE-2899 for OpenNLP a Lucene Analyzer (work-in-progress)
  140. 140. Hands-on: Basic text processing with OpenNLP
  141. 141. Other eco-system options
  142. 142. Summary
  143. 143. Summary•Getting languages right is hard• Linguistics helps improves search quality•Linguistics in Lucene and Solr• A wide range of languages supported out-of-the-box• Considerations to be made on indexing and query side• Lucene Analyzers work on a per-field level• Solr UpdateRequestProcessors work on the document level• Solr has functionality for automatically detecting language•Linguistics options also available in the eco-system
  144. 144. Linguistics in the bigger picture
  145. 145. Linguistics in the bigger picture• Understand your content and your users’ needs• Understand your language and its issues• Understand what users want from search• Do you have issues with recall?• Consider synonyms, stemming• Consider WordDelimiterFilter, phonetic matching• Do you have issues with precision?• Consider using ANDs instead of ORs• Improve content quality? Search fewer fields?• Is some content more important than other?• Consider boosting content with a boost query
  146. 146. Thanks youJan Høydahl www.cominvent.comThanks for some slide materialBushra ZawaydehThanks for fun Arabic language lessons
  147. 147. Example code• Example code is on Github• https://github.com/atilika/lucenerevolution-2013• Get started using• git clone git://github.com/atilika/lucenerevolution-2013.git• less lucenerevolution-2013/README.md• Contact us if you have questions• hello@atilika.com•
  148. 148. Q & A
  149. 149. ありがとうございましたThank you very much!"#$ ‫(&ا‬Vielen DankMerci beaucoup
  150. 150. CONFERENCE PARTYThe Tipsy Crow: 770 5th AveStarts after Stump The ChumpYour conference badge gets youin the doorTOMORROWBreakfast starts at 7:30Keynotes start at 8:30CONTACT (optional)Name (optional)email address (optional)

×