Multilingual Search and Text Analytics with Solr - Open Source Search Conference

This talk will explore the challenges of multilingual search, including language-specific issues such as N-gram segmentation vs. morphological analysis, stemming vs. lemmatization, and language identification, as well as the various approaches to configuring your Solr schema. We will also discuss the integration strategies for common text analytics capabilities and the impact of multilingual content on application design.

Solr is a powerful search engine that has rapidly gained acceptance as an alternative to commercial search solutions for many applications. Organizations require many features to serve their diverse communities; among these is the ability to deliver search excellence in foreign languages. Delivering quality multilingual search involves careful design of schemas and selection of the best linguistic approach for each supported language.

Multilingual Search and Text Analytics with Solr - Open Source Search Conference Presentation Transcript

  • 1. Multilingual Search and Text Analytics with Solr
      Steve Kearns, Director of Product Management, Basis Technology
      Basis Technology – Open Source Search Conference 2012
  • 2. Agenda
      • Why is Language Important?
      • Approaches for language-aware search
      • Solr Configuration Options
  • 3. Language is Important
  • 4. Why is language important?
      • Content is produced and consumed in the native language
      • Document collections often contain more than one language
      • Each language is unique, and presents different challenges to the search engine
  • 5. Language is Complex
      • Tokenization
          • Some languages do not use spaces
          • Compound words combine two or more words
          • Conjunctions
      • Inflection
          • In grammar, inflection is the modification of a word to express different grammatical categories such as tense, grammatical mood, grammatical voice, aspect, person, number, gender and case.
  • 6. Language is Complex
      Image source: http://en.wikipedia.org/wiki/File:Flexi%C3%B3nGato.png
  • 7. Language is Complex!
      • The Spanish word “pasaportar” has more than 50 inflected forms:
          pasaportando, pasaportareis, pasaportarán, pasaportes, pasaportaron, pasaporte, pasaportada, pasaportase, pasaportan, pasaportaba,
          pasaportemos, pasaporta, pasaportarían, pasaportaría, pasaportaste, pasaportarais, pasaportara, pasaportad, pasaportasen, pasaportasteis,
          pasaportéis, pasaportaren, pasaportáramos, pasaportadas, pasaportado, pasaportaban, pasaporté, pasaportaremos, pasaportásemos, pasaportados,
          pasaportábamos, pasaportamos, pasaportaré, pasaportases, pasaporten, pasaportare, pasaportaríais, pasaportaréis, pasaportará, pasaportaran,
          pasaportabas, pasaportó, pasaportarías, pasaportaríamos, pasaportabais, pasaportaras, pasaportáremos, pasaportaseis, pasaportarás, pasaporto, …
      Source: http://education.yahoo.com/reference/dict_en_es/spanish/pasaportar
  • 8. Language Examples
      • English:
          spoke (Noun – wheel part) → spoke
          spoke (Verb, past tense) → speak
      • French:
          été (summer) → été (summer)
          été (was) → être (to be)
      • German:
          Robbe (seal) → Robbe (seal)
          robbe (I crawl) → robben (to crawl)
          Samstagmorgen (Saturday Morning) → Samstag, Morgen (compound)
      • Japanese:
          • 首脳会談後、オバマ大統領は記者団の質問に答える予定
              – Where are the words??
  • 9. Language-Aware Search Technology
      • Rosette Linguistic Platform
          • Language Identification
          • Tokenization
              » Morphological
          • Token processing
              » Lemmatization
          • Higher level analytics
              » Entity Extraction
              » Relationship Extraction
          • Entity Translation and Entity Search
  • 10. Language Identification
      • Find a single dominant language in a document
      • Find multiple languages in a single document
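
The two identification tasks on this slide can be illustrated with a deliberately naive sketch. It assumes a stopword-counting heuristic and three tiny hand-picked word lists, all of which are hypothetical and not from the talk; production language identifiers use statistical models instead.

```python
import re
from collections import Counter

# Tiny, hand-picked stopword lists (hypothetical, far too small for real use).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "at", "have"},
    "de": {"der", "die", "das", "und", "ich", "nach", "am", "ist"},
    "es": {"el", "la", "los", "y", "es", "de", "en", "que"},
}

def score_languages(text):
    """Count how many tokens of `text` appear in each stopword list."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter({lang: sum(t in words for t in tokens)
                    for lang, words in STOPWORDS.items()})

def dominant_language(text):
    """Return the best-scoring language, or None if no stopword matched."""
    lang, hits = score_languages(text).most_common(1)[0]
    return lang if hits > 0 else None

def languages_by_sentence(text):
    """Identify a language per sentence to surface mixed-language documents."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [(s, dominant_language(s)) for s in sentences]
```

Calling `dominant_language` on a whole document covers the first bullet; `languages_by_sentence` is one simple way to approach the second, at the cost of less evidence per decision.
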
  • 11. Tokenization
      • Morphological Analysis vs. N-gram
      • Search Term: 東京 ルパン上映時間
      • N-gram:
      • Morphological Analysis:
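
As a rough illustration of the N-gram side of this comparison, the sketch below produces overlapping character bigrams for the search term on the slide. The function name and the choice of n = 2 are assumptions made for the example. Morphological analysis is not shown, because it requires a dictionary-backed analyzer.

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams for each whitespace-separated chunk."""
    grams = []
    for chunk in text.split():
        if len(chunk) < n:
            grams.append(chunk)  # keep chunks shorter than n whole
            continue
        grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams

query = "東京 ルパン上映時間"
print(char_ngrams(query))
```

Note that some of the resulting bigrams (for example `ン上`) straddle word boundaries and are not words, which is the kind of noise a morphological tokenizer is meant to avoid.
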
  • 12. Token Processing
      • Stemming vs. Lemmatization
      • English: “I have spoken at several conferences”
      • Stemming:
      • Lemmatization:
  • 13. Stemming vs. Lemmatization
      • Two words with the same spelling, but different meanings create the same stem.

        Stemming (INCORRECT)                  Lemmatization (CORRECT)
        prensa (media) → prens                prensa (media) → prensa (media)
        prensa (he/she presses) → prens       prensa (he/she presses) → prensar (to press)
  • 14. Stemming vs. Lemmatization
      • Two different words create the same stem.

        Stemming (INCORRECT)                      Lemmatization (CORRECT)
        publicaciones (publications) → public     publicaciones (publications) → publicación
        publico (public) → public                 publico (public) → public (public)
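
The contrast on the last two slides can be sketched in a few lines. The suffix list and the lemma dictionary below are toy assumptions built only from the examples on the slides; they are not a real Spanish stemmer or lemmatizer.

```python
# Toy suffix list, longest first; a real Spanish stemmer has many more rules.
SUFFIXES = ("aciones", "a", "o")

def toy_stem(word):
    """Strip the first matching suffix, ignoring part of speech and meaning."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

# Hypothetical lemma dictionary keyed by (surface form, part of speech).
LEMMAS = {
    ("prensa", "NOUN"): "prensa",
    ("prensa", "VERB"): "prensar",
    ("publicaciones", "NOUN"): "publicación",
}

def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the lowercased word."""
    w = word.lower()
    return LEMMAS.get((w, pos), w)

print(toy_stem("publicaciones"), toy_stem("publico"))
print(toy_lemmatize("prensa", "NOUN"), toy_lemmatize("prensa", "VERB"))
```

The stemmer collapses unrelated words onto one stem because it sees only characters, while the lemmatizer can keep them apart because it is given the part of speech.
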
  • 15. Token Processing
      German: “Am Samstagmorgen fliege ich zurueck nach Boston.”
      • Stemming:
      • Lemmatization (and decompounding!):
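
Decompounding can be illustrated with a minimal dictionary-based splitter. The three-word lexicon and the two-part limit are assumptions made for this sketch; a real decompounder needs a full lexicon and handles linking elements and longer compounds.

```python
# Hypothetical mini-lexicon; a real decompounder needs a full dictionary.
LEXICON = {"samstag", "morgen", "boston"}

def decompound(word, lexicon=LEXICON, min_len=3):
    """Split `word` into two lexicon entries if possible, else return it whole.

    The result is lowercased, so German noun capitalization is lost.
    """
    w = word.lower()
    for i in range(min_len, len(w) - min_len + 1):
        head, tail = w[:i], w[i:]
        if head in lexicon and tail in lexicon:
            return [head, tail]
    return [w]

print(decompound("Samstagmorgen"))
print(decompound("Boston"))
```
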
  • 16. How to Configure Solr
      • Challenges
          • Multiple languages in the data set
      • Goals:
          1. Language Identification
          2. Language-aware Search:
              • Tokenization
              • Token Processing
  • 17. How to Configure Solr
      • What tools does Solr have to work with?
          • UpdateRequestProcessor
          • Analyzer/CharFilter/Tokenizer/TokenFilter
          • Solr Cores
      • Pre-process data before Solr?
  • 18. Solr UpdateRequestProcessor
      • Runs Before Analyzers
      • Full Access to Document
      • Two options:
          • Run the analysis directly in Solr
              • Good for Lightweight Analysis
          • Call out to external analysis services
              • Web Services/UIMA. Increases Complexity
      • Limitations:
          • Think through your indexing strategy
  • 19. Solr Analyzer/Tokenizer
      • Good for:
          • Segmentation of Asian languages
          • Linguistics - Lemmatization
      • Limitations:
          • No access to document object
      • Schema.xml
          • FieldType
          • Analyzer
              – CharFilter
              – Tokenizer
              – TokenFilter
  • 20. Goal 1: Language ID
      • UpdateRequestProcessor
          • Runs before field-level analysis takes place
          • Basic Language Identifier URP to be included in Solr
      • Outside Solr
      What do you do with the language information??
  • 21. Goal 2: Multi-Lingual Support in Solr
      • Three main approaches:
          1. One Solr field for each language
          2. One Solr Core per language
          3. All Languages in a Single Field
      Informed by Trey Grainger @ Careerbuilder: http://www.lucidimagination.com/sites/default/files/Grainger%20Trey%20-%20Extending%20Solr,%20Building%20a%20Cloud-Like%20Knowledge%20Discovery%20Platform%20-%20rev.pdf
  • 22. Multiple Languages: Method 1
      • One field for each language
          • Pro:
              • Simple approach and implementation
              • Guarantees that queries are processed the same way as index
          • Con:
              • Increased query-time complexity (mitigate with Dismax)
              • Decreased query speed as additional fields are queried
              • May require storing multiple copies of text
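
A client-side sketch of Method 1 might look like the following. The field names (`text_en`, `text_de`, `text_ja`, `text_general`) and the `language` field are hypothetical and would need matching definitions in the schema; `defType` and `qf` are the standard DisMax request parameters. Nothing is sent to Solr here; the code only builds the document and the query string.

```python
from urllib.parse import urlencode

# Hypothetical field naming scheme: one analyzed field per language.
LANGUAGE_FIELDS = {"en": "text_en", "de": "text_de", "ja": "text_ja"}

def build_document(doc_id, text, language):
    """Place the text in the field whose analyzer matches its language."""
    field = LANGUAGE_FIELDS.get(language, "text_general")  # assumed fallback field
    return {"id": doc_id, "language": language, field: text}

def build_query_params(user_query):
    """Search every language field at once via the DisMax `qf` parameter."""
    return {
        "q": user_query,
        "defType": "dismax",
        "qf": " ".join(sorted(LANGUAGE_FIELDS.values())),
    }

doc = build_document("1", "Am Samstagmorgen fliege ich zurueck nach Boston.", "de")
params = build_query_params("Boston")
print(doc)
print(urlencode(params))
```

Every additional language adds another entry to `qf`, which is where the query-time cost listed under "Con" comes from.
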
  • 23. Multiple Languages: Method 2
      • One Solr core per language
          Each Core has the same field, with a language-specific Analyzer/Tokenizer
          • Pros:
              • No query-time performance overhead
              • Guarantees that queries are processed the same way as index
          • Cons:
              • Significant complexity in managing multiple cores
              • Must implement custom sharding
              • Does not support multilingual documents
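
The routing that Method 2 pushes onto the client can be sketched as below. The base URL and the core names are assumptions for illustration; only the `/solr/<core>/update` path shape is standard. The code builds target URLs and batches but makes no network calls.

```python
# Hypothetical base URL and core names; adjust to your deployment.
SOLR_BASE = "http://localhost:8983/solr"
CORES = {"en": "docs_en", "de": "docs_de", "ja": "docs_ja"}

def update_url(language):
    """Return the update endpoint of the core that handles `language`."""
    try:
        core = CORES[language]
    except KeyError:
        raise ValueError(f"no core configured for language {language!r}")
    return f"{SOLR_BASE}/{core}/update"

def group_by_core(docs):
    """Bucket documents by target URL so each core receives one batch."""
    batches = {}
    for doc in docs:
        batches.setdefault(update_url(doc["language"]), []).append(doc)
    return batches

docs = [
    {"id": "1", "language": "en", "text": "I have spoken at several conferences"},
    {"id": "2", "language": "de", "text": "Am Samstagmorgen fliege ich zurueck nach Boston."},
    {"id": "3", "language": "en", "text": "Language is important"},
]
for url, batch in group_by_core(docs).items():
    print(url, [d["id"] for d in batch])
```

Because each document carries exactly one `language` value, a document that mixes languages has no natural home, which matches the last item under "Cons".
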
  • 24. Multiple Languages: Method 3
      • All Languages in one field
          • Pros:
              • Single field makes queries and indexing easy
              • Same schema/core as more languages added
          • Cons:
              • Requires complex custom Tokenizer/Analyzer
              • Must pass in language information for queries and indexing
              • Does not guarantee queries are processed the same as the index
              • Potential TF/IDF confusion
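
One possible way to pass language information along with the text is a leading tag on the field value, with a dispatcher that picks the analysis per tag. The `[xx]` tag convention and the two toy analyzers below are assumptions made for this sketch, not something taken from the slides; in Solr the equivalent logic would live in a custom Tokenizer/Analyzer.

```python
import re

# Hypothetical per-language analyzers; real ones would be Lucene analysis chains.
def analyze_whitespace(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def analyze_bigrams(text):
    """Character bigrams over the text with whitespace removed."""
    compact = "".join(text.split())
    if len(compact) < 2:
        return [compact] if compact else []
    return [compact[i:i + 2] for i in range(len(compact) - 1)]

ANALYZERS = {"en": analyze_whitespace, "de": analyze_whitespace, "ja": analyze_bigrams}
TAG = re.compile(r"^\[([a-z]{2})\]")

def analyze_tagged(value, default="en"):
    """Strip a leading "[xx]" language tag and dispatch to that analyzer."""
    match = TAG.match(value)
    language = match.group(1) if match else default
    text = value[match.end():] if match else value
    return ANALYZERS.get(language, analyze_whitespace)(text)

print(analyze_tagged("[ja]上映時間"))
print(analyze_tagged("[en]Spoken at conferences"))
```

If a query arrives without a tag, or with a different tag than the one used at index time, it is analyzed differently from the indexed text, which is the consistency risk listed under "Cons".
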
  • 25. Language is Important
      • Use language information at index and query time
      • Increase recall, maintain precision
      • Better search results for your users
  • 26. My Contact Info
      • Steve Kearns
          • skearns@basistech.com
          • http://www.basistech.com