Analyze This! Tips and Tricks on Getting the Lucene/Solr Analyzer to Index and Search Your Content Right

Learn how the Lucene/Solr analyzer can grab and index text and field data, overcome grammatical and semantic variations, and how a little careful preparation and tuning lets you unleash the full power of Lucene/Solr Open Source Search.

Published in: Technology

  1. Analyze This! Tom Hill. Lucid Imagination Webinar, 1/28/2010. Lucid Imagination, Inc.
  2. Analyze This! Analysis Basics, Tips and Tools. © 2010 Lucid Imagination, Inc.
  3. Overview. We’ll be covering: what analysis is, and why you care; some common problems with analysis; tools for troubleshooting (the Analysis Tool, the Schema Browser, and Luke); existing Analyzers, Filters and Tokenizers.
  4. What is Analysis? Converting your text into terms. Solr does NOT search your text; Solr searches the set of terms created by analysis. Problems happen when the terms are not what you think they are.
  5. Examples. Don’t => dont; iPhone => i phone iphon; τα πρώτα δείγματα => πρωτα δειγματα; The quick brown fox jumps => The quick brown fox jumps.
  6. Different Effects of Analysis. There are many ways to analyze a run of text: break on whitespace, punctuation, caseChanges, numb3rs; stemming (shoes -> shoe); removing or replacing unwanted words or symbols; combining words; adding new words (synonyms); and many more.
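
The point that Solr searches terms rather than text can be sketched with a toy analysis chain. This is only an illustration in plain Python, not Solr's implementation; the function name `toy_analyze` and its rules (drop apostrophes, lowercase, split on non-word characters) are assumptions chosen to reproduce the "Don’t => dont" example.

```python
import re

def toy_analyze(text):
    """Hypothetical analysis chain: drop apostrophes, lowercase, split on non-word characters."""
    cleaned = text.replace("'", "").replace("\u2019", "").lower()
    return [token for token in re.split(r"\W+", cleaned) if token]

# The index holds the analyzed terms, not the original text.
index_terms = toy_analyze("Don't panic")

# A raw, unanalyzed query string does not match any indexed term...
raw_hit = "Don't" in index_terms
# ...but the same query passed through the same analysis does.
analyzed_hit = toy_analyze("Don't")[0] in index_terms
```

Running the query through the same chain as the indexed text is what makes the match succeed, which is why mismatched index-time and query-time analysis is a common cause of "missing" results.
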
  7. Copy Fields 1. It’s common to want to index data more than one way. You might store an analyzed version of a field for searching, and an unanalyzed version for faceting or sorting. You might store a stemmed and a non-stemmed version of a field, to boost precise matches.
  8. Copy Fields 2. It’s also common to copy several fields to a common destination field, for example “alltext”. Note that this copies from the SOURCE of the copied field, not the analyzed version of the copied field: <copyField source="cat" dest="text"/> <copyField source="name" dest="text"/> <copyField source="manu" dest="text"/>
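
The "copies from the source" behavior can be sketched as follows. This is a plain-Python model, not Solr code; `apply_copy_fields`, `as_list`, and the sample document values are hypothetical, and only the three source/dest pairs come from the slide.

```python
def as_list(value):
    """Normalize a single value or a list of values to a list."""
    return list(value) if isinstance(value, list) else [value]

def apply_copy_fields(doc, copy_rules):
    """Copy raw source values into destination fields (toy model of copyField)."""
    result = {name: as_list(value) for name, value in doc.items()}
    for source, dest in copy_rules:
        if source in doc:
            # Read from the ORIGINAL document, never from analyzed or already-copied output.
            result.setdefault(dest, []).extend(as_list(doc[source]))
    return result

# The three rules from the slide; the document itself is made up for illustration.
COPY_RULES = [("cat", "text"), ("name", "text"), ("manu", "text")]
doc = {"cat": "electronics", "name": "iPod Shuffle", "manu": "Apple"}
expanded = apply_copy_fields(doc, COPY_RULES)
```

Because the destination receives the raw values, the "text" field then applies its own analyzer to them, independent of how "cat", "name", or "manu" are analyzed.
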
  9. What could go wrong? Lots of things: you can’t find things; you find too much; poor query or indexing performance.
  10. Common Scenario #1. Someone sets up Solr for the first time, adds some data, then posts to the mailing list and says “why can’t I find my data?” The problem’s basic, but it’s useful to know how to identify it.
  11. “When I Search For ‘fox’…”
  12. “…I Find Nothing”
  13. “But, If I look at the index”
  14. “It’s right there”
  15. Analysis Tool. Your first stop for figuring out analysis problems.
  16. Analysis Tool
  17. Analysis Tool Demo
  18. Stored vs. Indexed. Solr can store both analyzed and un-analyzed content. But you knew that: it’s “stored” vs. “indexed” in the field definition. How can you see what is actually indexed, that is, the terms you can search for?
  19. Schema Browser. The Schema Browser lets you examine the fields and how they are configured. It also allows you to examine the terms in the index.
  20. Schema Browser
  21. Schema Browser
  22. Schema Browser Demo
  23. How Many of You Just Copied the Example Schema? Just because it works for one person’s data doesn’t mean it works for yours. Take the time to look at the output.
  24. Luke. Lucene Index Exploration Tool. Allows you to look at (and modify) the contents of an index.
  25. Luke Main Screen
  26. Luke Document “Reconstruction”
  27. Luke Document “Reconstruction”
  28. Close-up from last slide: solr null_1 enterpris search server null_100 apach softwar foundat null_100 softwar null_100 search null_100 advanc full fulltext|text search capabl use lucen null_100 optim null_1 high …
  29. Position Increment Gap. The null_xxx entries are how Luke represents the position increment between instances of multi-valued fields. The example had <field name="text">Solr, the Enterprise Search Server</field> <field name="text">Apache Software Foundation</field>. Using a position increment gap prevents phrase queries from matching across different values of a field. Without the gap, “Server Apache” would be a valid phrase.
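
The effect of the gap can be sketched with a toy position assigner and a two-word phrase check. This is a simplified model in plain Python, not Lucene's indexing code; `assign_positions` and `phrase_matches` are hypothetical helpers, the gap of 100 mirrors the null_100 entries in the close-up, and stemming and stop words are ignored.

```python
def assign_positions(values, gap):
    """Give each token a position; add `gap` extra positions between field values."""
    postings = []
    next_pos = 0
    for i, value in enumerate(values):
        if i > 0:
            next_pos += gap  # the position increment gap between values
        for token in value.lower().replace(",", "").split():
            postings.append((token, next_pos))
            next_pos += 1
    return postings

def phrase_matches(postings, first, second):
    """True if `second` occurs at the position immediately after `first`."""
    positions = {}
    for token, pos in postings:
        positions.setdefault(token, set()).add(pos)
    return any(p + 1 in positions.get(second, set())
               for p in positions.get(first, set()))

values = ["Solr, the Enterprise Search Server", "Apache Software Foundation"]
no_gap = assign_positions(values, gap=0)
with_gap = assign_positions(values, gap=100)

across_without_gap = phrase_matches(no_gap, "server", "apache")
across_with_gap = phrase_matches(with_gap, "server", "apache")
within_value = phrase_matches(with_gap, "search", "server")
```

With no gap, "server" and "apache" land on adjacent positions and the phrase matches across the value boundary; with the gap they are far apart, while phrases inside a single value still match.
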
  30. Analysis Can Affect Performance. Analysis doesn’t just produce success or failure on a search. It can affect the query processing speed, too.
  31. Slow Searches. They index 500,000 books, with multiple languages in one field, so they can’t do stemming or stop words. Their worst-case query was “The lives and literature of the beat generation”. It took 2 minutes to run. The query requires checking every doc containing “the” and “and”, and the position info for each occurrence.
  32. Bi-grams. Bi-grams combine adjacent terms: “The lives and literature” becomes “The lives”, “lives and”, “and literature”. You only have to check documents that contain the pair adjacent to each other, and only have to look at position information for the pair. But this can triple the size of the index: each word is indexed by itself, and also with its preceding term and with its following term.
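
Forming bi-grams from a token stream can be sketched in a few lines. This is a minimal plain-Python illustration, assuming the text has already been lowercased and split into tokens; the function name `bigrams` is hypothetical.

```python
def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return [f"{left} {right}" for left, right in zip(tokens, tokens[1:])]

tokens = "the lives and literature".split()
pairs = bigrams(tokens)
```

Each pair is a single term in the index, so a phrase query on two adjacent words becomes a lookup of one (much rarer) term instead of an intersection of two long posting lists plus a position check.
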
  33. Common Grams. Form bi-grams only for common terms. “The” occurs 2 billion times; “The lives” occurs 360k times. They used only the 32 most common terms. Average response time went from 460 ms to 68 ms.
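
The common-grams idea can be sketched as a variation on plain bi-grams: keep every single term, and add a pair only when at least one side is a common word. This is a toy model in plain Python, not the Solr filter itself; the `common_grams` function, the two-word common list, and the underscore separator are assumptions for illustration.

```python
def common_grams(tokens, common_words):
    """Emit every token, plus a bi-gram wherever either neighbor is a common word."""
    output = []
    for i, token in enumerate(tokens):
        output.append(token)
        if i + 1 < len(tokens):
            following = tokens[i + 1]
            if token in common_words or following in common_words:
                output.append(f"{token}_{following}")
    return output

COMMON = {"the", "and"}  # stand-in for a list of the most common terms
grams = common_grams("the lives and literature".split(), COMMON)
rare_only = common_grams("beat generation".split(), COMMON)
```

Pairs made only of rare words are never formed, which keeps the index growth far below that of full bi-gram indexing while still removing the expensive lookups on words like "the".
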
  34. Implied Phrase Queries. Another example involved a query with “L’art”. With the default config, this turns into a phrase query, “L art”: PhraseQuery(text:"l art"). Turning it into the single token ‘L art’ is much more efficient: it occurs in far fewer documents than “L”, and it is a term query, not a phrase query.
  35. Multiple Languages. Generally, we suggest keeping different languages in their own fields. This lets you have an analyzer for each language (stemming, stop words, etc.). If you don’t know the total number of languages, you can use dynamic fields. That allows you to accept them, but not to dynamically stem, etc.
  36. Analysis And Query Parsing. What happens when parsing a query in Solr? You may have many fields, with different analyzers. Which Analyzer gets used?
  37. Analysis And Query Parsing. The QueryParser splits the query. It understands quotes, parens and whitespace, and gives the resulting pieces to the correct analyzer (explicit or default).
  38. Analysis And Query Parsing. To see what happens to your query, use the “Full Interface” section of the admin interface and check ‘debug: enable’, or just add “&debugQuery=on” to the end of your query string. We’re using the Lucene Query Parser; Dismax does different things.
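
Adding the debug parameter to a query string can be sketched like this. The base URL below is an assumption (a local Solr instance with a `/solr/select` handler) and is not given in the slides; only the `debugQuery=on` parameter comes from the talk.

```python
from urllib.parse import urlencode

def debug_query_url(base_url, query):
    """Build a select URL that asks Solr to include query-parsing debug output."""
    params = {"q": query, "debugQuery": "on"}
    return base_url + "?" + urlencode(params)

# Hypothetical host and path; adjust to your own installation.
SOLR_SELECT = "http://localhost:8983/solr/select"
url = debug_query_url(SOLR_SELECT, "title:foo bar")
```

Using `urlencode` takes care of escaping the colon and the space in the query, which is easy to get wrong when pasting parameters together by hand.
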
  39. Seeing the results of query parsing
  40. Seeing the results of query parsing
  41. Query Examples. title:foo bar becomes +title:foo +text:bar; “foo” goes to the title field’s analyzer, “bar” to the default field’s analyzer. manu:"foo_bar baz" becomes manu:"foo bar baz"; note that the _ got removed, and the whole string goes to manu’s analyzer as a phrase query. title:(foo bar) becomes title:foo title:bar; foo and bar are passed separately to title’s analyzer.
  42. Components of an Analyzer
  43. Components of an Analyzer. CharFilters, Tokenizers, TokenFilters.
  44. CharFilters. Used to clean up or regularize characters before passing them to the Tokenizer. Remove accents, etc. (MappingCharFilter). They can also do complex things; we’ll look at HTMLStripCharFilter later.
  45. Tokenizers. Convert text to tokens (terms). Only one per analyzer. Many options: WhitespaceTokenizer, StandardTokenizer, PatternTokenizer, and more.
  46. TokenFilters. Process the tokens produced by the Tokenizer. There can be many of them per field.
  47. Some example TokenFilters that come with Solr/Lucene. There are way too many to list them all. We’re just going to go through a few of them.
  48. Reversing Filter. Why? Leading wildcards require traversing the whole index. Reverse the order, and leading wildcards become trailing: *cats => stac*. You only have to check terms that start with stac, instead of the whole index.
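
The reversal trick can be sketched with a sorted list of reversed terms and a prefix scan. This is a plain-Python illustration, not how Lucene stores its term dictionary; `build_reversed_index`, `leading_wildcard`, and the sample terms are hypothetical.

```python
import bisect

def build_reversed_index(terms):
    """Store each term reversed, in sorted order, so suffixes become prefixes."""
    return sorted(term[::-1] for term in terms)

def leading_wildcard(reversed_index, pattern):
    """Answer a '*suffix' query by scanning only the matching prefix range."""
    if not pattern.startswith("*"):
        raise ValueError("expected a leading-wildcard pattern such as '*cats'")
    prefix = pattern[1:][::-1]  # '*cats' -> 'stac'
    start = bisect.bisect_left(reversed_index, prefix)
    matches = []
    for reversed_term in reversed_index[start:]:
        if not reversed_term.startswith(prefix):
            break  # sorted order: nothing later can match
        matches.append(reversed_term[::-1])
    return matches

index = build_reversed_index(["cats", "bobcats", "dogs", "cast"])
found = leading_wildcard(index, "*cats")
```

The binary search jumps straight to the first candidate and the scan stops at the first non-match, so the cost depends on the number of matching terms rather than on the size of the whole term list.
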
  49. Phonetic Analysis. Creates a phonetic representation of the text, for “sounds like” matching. PhoneticFilterFactory uses one of: Metaphone, Double Metaphone, Soundex, Refined Soundex.
  50. Synonyms. The synonym filter allows you to include alternate words that the user can use when searching, for example theater and theatre. Useful for movie titles, where words are deliberately mis-spelled. Don’t over-use synonyms: it helps recall, but lowers precision. It produces tokens at the same token position: in “local theater company”, theatre is added at the position of theater.
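
Placing a synonym at the same token position can be sketched as follows. This is a toy model in plain Python, not the Solr synonym filter; `expand_synonyms` and the `SYNONYMS` table are hypothetical, with the theater/theatre pair taken from the slide.

```python
SYNONYMS = {"theater": ["theatre"], "theatre": ["theater"]}

def expand_synonyms(tokens, synonyms):
    """Emit (term, position) pairs, adding each synonym at its source token's position."""
    output = []
    for position, token in enumerate(tokens):
        output.append((token, position))
        for alternate in synonyms.get(token, []):
            output.append((alternate, position))  # same position, not a new slot
    return output

expanded_tokens = expand_synonyms("local theater company".split(), SYNONYMS)
```

Because "theatre" shares position 1 with "theater", a phrase query for "local theatre company" still sees three consecutive positions and matches the original text.
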
  51. HTML text extraction. Removes HTML tags, attributes, comments, and XML processing instructions. Removes <script> and <style> contents. Replaces entities. HTMLStripCharFilterFactory.
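
The steps on this slide can be approximated with Python's standard `html.parser` module. This is only a sketch of the same idea, not the Solr char filter; the `TextExtractor` class and `strip_html` function are hypothetical names, and joining text chunks with a space is a simplification.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping tags, comments, and <script>/<style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities such as &amp; become characters
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(markup):
    """Return the visible text of `markup` with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

text = strip_html("<p>Fish &amp; chips<!-- note --></p>"
                  "<script>var x = 1;</script><b>today</b>")
```

Only the cleaned text would then be handed to the tokenizer, so markup never ends up as searchable terms.
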
  52. Spell Checking. The spell checker starts by analyzing the source terms into n-grams. From the Lucene Wiki:
  53. Spell Checking. You don’t actually have to know that to use the spell checker, but I think it’s kind of cool. Use Luke to explore the index generated by the spell checker.
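
Breaking a word into character n-grams, and using shared n-grams to relate a misspelling to a dictionary word, can be sketched like this. It is a simplified plain-Python illustration rather than the Lucene spell checker's actual index layout or scoring; `char_ngrams`, `shared_grams`, and the gram size of 3 are assumptions.

```python
def char_ngrams(word, n):
    """Return every run of n consecutive characters in `word`."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def shared_grams(a, b, n=3):
    """Count the distinct n-grams two words have in common."""
    return len(set(char_ngrams(a, n)) & set(char_ngrams(b, n)))

grams_for_lucene = char_ngrams("lucene", 3)
close = shared_grams("lucen", "lucene")
far = shared_grams("lucen", "solr")
```

A misspelled word shares most of its n-grams with the intended word and few or none with unrelated words, which is what lets an n-gram index retrieve plausible corrections with an ordinary search.
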
  54. And many more. Regular expression Tokenizer. Stemmers for many languages: Persian, Hindi, Chinese, Japanese, etc. Third party/commercial stemmers are available, too. SnowballPorterFilter.
  55. Recap. If you can’t find it, and you are sure it’s there, it’s likely an analysis problem. Three main tools for troubleshooting analysis: the Analysis tool, the Schema browser, and Luke. Look at your index, documents and the output of your analyzers periodically.
  56. Additional Resources. Lucid Imagination Solr Reference Guide: LucidImagination.com/downloads. Lucene in Action, Second Edition: this isn’t published yet, but you can get the early access version from manning.com/hatcher3. http://www.hathitrust.org/blog. Solr wiki on Analysis: Wiki.apache.org/solr/AnalyzersTokenizersTokenFilters. Luke: http://code.google.com/p/luke/
  57. Questions. If we have time, we’ll take some questions.
  58. Thanks! Tom Hill, LucidImagination.com
