Analyze This! Tips and Tricks on Getting the Lucene/Solr Analyzer to Index and Search Your Content Right

Learn how the Lucene/Solr analyzer can grab and index text and field data, overcome grammatical and semantic variations, and how a little careful preparation and tuning lets you unleash the full power of Lucene/Solr Open Source Search.

  1. Analyze This!
     Tom Hill, Lucid Imagination
     Webinar, 1/28/2010
  2. Analyze This!
     Analysis Basics, Tips and Tools
     © 2010 Lucid Imagination, Inc.
  3. Overview
     We’ll be covering:
       • What is analysis, and why do you care?
       • Some common problems with analysis
       • Tools for troubleshooting
           Analysis Tool
           Schema Browser
           Luke
       • Existing Analyzers, Filters and Tokenizers
  4. What is Analysis?
     • Converting your text into terms
         Solr does NOT search your text
         Solr searches the set of terms created by analysis
         Problems happen when the terms are not what you think they are
  5. Examples
       Don’t => dont
       iPhone => i phone iphon
       τα πρώτα δείγματα => πρωτα δειγματα
       The quick brown fox jumps => The quick brown fox jumps
  6. Different Effects of Analysis
     There are many ways to analyze a run of text:
       • Break on whitespace, punctuation, caseChanges, numb3rs
       • Stemming (shoes -> shoe)
       • Removing/replacing unwanted words/symbols
       • Combining words
       • Adding new words (synonyms)
       • And many more
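
To make the effects on the previous slide concrete, here is a minimal sketch of an analysis chain in Python. It is not Solr code: the `analyze` function, its regular expressions and its one-rule "stemmer" are assumptions invented for illustration, and it only handles ASCII text.

```python
import re

def analyze(text):
    """Toy analyzer (illustrative only, not Solr's implementation)."""
    terms = []
    # Break on whitespace and punctuation (anything that is not a letter or digit).
    for chunk in re.split(r"[^A-Za-z0-9]+", text):
        # Break each chunk on case changes and letter/digit boundaries.
        for part in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk):
            term = part.lower()
            # Very naive "stemming": drop a single plural s (shoes -> shoe).
            if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
                term = term[:-1]
            terms.append(term)
    return terms

print(analyze("iPhone shoes"))
print(analyze("caseChanges numb3rs"))
```

The point is the one the slides make: the terms that come out are what gets searched, and they may not be what you expect.
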
  7. Copy Fields 1
     It’s common to want to index data more than one way
       • You might store an analyzed version of a field for searching
           And store an unanalyzed version for faceting or sorting
       • You might store a stemmed and non-stemmed version of a field
           To boost precise matches
  8. Copy Fields 2
     It’s also common to copy to a common destination field
       • For example: “alltext”
       • Note this copies from the SOURCE of the copied field
           Not the analyzed version of the copied field
       <copyField source="cat" dest="text"/>
       <copyField source="name" dest="text"/>
       <copyField source="manu" dest="text"/>
  9. What could go wrong?
     • Lots of things
         You can’t find things
         You find too much
         Poor query or indexing performance
  10. Common Scenario #1
      • Someone sets up Solr for the first time
      • Adds some data
      • Then posts to the mailing list, and says “why can’t I find my data?”
      • The problem’s basic, but it’s useful to know how to identify it.
  11. “When I Search For ‘fox’…”
  12. “…I Find Nothing”
  13. “But, If I look at the index”
  14. “It’s right there”
  15. Analysis Tool
      Your first stop for figuring out analysis problems
  16. Analysis Tool
  17. Analysis Tool Demo
  18. Stored vs. Indexed
      • Solr can store both analyzed and un-analyzed content
          But you knew that … “stored” vs. “indexed” in the field definition
      • How can you see what is actually indexed?
          …that is, the terms you can search for
  19. Schema Browser
      • Schema Browser lets you examine the fields and how they are configured.
      • It also allows you to examine the terms in the index
  20. Schema Browser
  21. Schema Browser
  22. Schema Browser Demo
  23. How Many of You Just Copied the Example Schema?
      • Just because it works for one person’s data doesn’t mean it works for yours.
      • Take the time to look at the output
  24. Luke
      • Lucene Index Exploration Tool
      • Allows you to look at (and modify) the contents of an index
  25. Luke Main Screen
  26. Luke Document “Reconstruction”
  27. Luke Document “Reconstruction”
  28. Close-up from last slide
      solr null_1 enterpris search server
      null_100 apach softwar foundat
      null_100 softwar
      null_100 search
      null_100 advanc full fulltext|text search capabl use lucen
      null_100 optim null_1 high …
  29. Position Increment Gap
      • The null_xxx entries are how Luke represents the position increment between instances of multi-valued fields.
      • The example had
          <field name="text">Solr, the Enterprise Search Server</field>
          <field name="text">Apache Software Foundation</field>
      • Using a position increment gap prevents phrase queries from matching across different values of a field
      • Without the gap, “Server Apache” would be a valid phrase.
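
The effect of the gap can be sketched with a simplified position model in Python. This is an illustration only, not Lucene's indexing code: `index_positions` and `phrase_matches` are hypothetical helpers, and the model assumes each new value of a multi-valued field starts `gap` extra positions after the previous value ends.

```python
def index_positions(values, gap):
    """Map each term to the positions it occupies across all values of a field."""
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # leave a hole between consecutive values
        for term in value.lower().replace(",", "").split():
            pos += 1
            positions.setdefault(term, []).append(pos)
    return positions

def phrase_matches(positions, first, second):
    """True if `second` occurs exactly one position after `first`."""
    return any(p + 1 in positions.get(second, []) for p in positions.get(first, []))

values = ["Solr, the Enterprise Search Server", "Apache Software Foundation"]
print(phrase_matches(index_positions(values, 0), "server", "apache"))    # no gap
print(phrase_matches(index_positions(values, 100), "server", "apache"))  # gap of 100
```

With no gap the last term of one value sits right next to the first term of the next, so the two-word phrase matches across values. With a gap of 100 it no longer does, while phrases inside a single value are unaffected.
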
  30. Analysis Can Affect Performance
      • Analysis doesn’t just produce success/failure on a search
      • It can affect the query processing speed, too.
  31. Slow Searches
      • They index 500,000 books
      • Multiple languages in one field
          So they can’t do stemming or stop words
      • Their worst case query was: “The lives and literature of the beat generation”
          It took 2 minutes to run.
      • The query requires checking every doc containing “the” & “and”
          And the position info for each occurrence
  32. Bi-grams
      • Bi-grams combine adjacent terms
          “The lives and literature” becomes
          “The lives” “lives and” “and literature”
      • Only have to check documents that contain the pair adjacent to each other.
      • Only have to look at position information for the pair
      • But can triple the size of the index
          Word indexed by itself
          Indexed both with preceding term, and following term
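
As a small illustration of the bi-gram idea, the following Python sketch pairs each term with its successor. The `bigrams` helper is a hypothetical name, not a Solr or Lucene API.

```python
def bigrams(terms):
    """Combine each term with the term that follows it."""
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

print(bigrams("the lives and literature".split()))
```

A phrase query can then look up a term such as “the lives” directly, instead of intersecting the very long posting lists for “the” and “lives” and comparing positions.
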
  33. Common Grams
      • Form bi-grams only for common terms
      • “The” occurs 2 billion times. “The lives” occurs 360k.
      • Used only the 32 most common terms
      • Average response went from 460 ms to 68 ms.
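
A rough sketch of the common-grams idea in Python: keep every single term, and add a bi-gram only where at least one of the two terms is common. The word list, the `common_grams` name and the `_` separator are assumptions made for this sketch; it is not the Lucene filter itself.

```python
COMMON = {"the", "and", "of"}  # stand-in for the list of most common terms

def common_grams(terms, common=COMMON):
    """Emit each term, plus a bi-gram whenever either neighbour is a common term."""
    out = []
    for i, term in enumerate(terms):
        out.append(term)
        if i + 1 < len(terms) and (term in common or terms[i + 1] in common):
            out.append(f"{term}_{terms[i + 1]}")
    return out

print(common_grams(["the", "lives", "and", "literature"]))
print(common_grams(["beat", "generation"]))
```

Pairs of uncommon terms such as “beat generation” produce no extra tokens, which is why the index grows far less than with full bi-grams.
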
  34. Implied Phrase Queries
      • Another example involved a query with “L’art”
      • This turns into a phrase query, “L art”, with the default config.
          PhraseQuery(text:"l art")
      • Turning it into the single token ‘L art’ is much more efficient.
          Occurs in far fewer documents than “L”
          Is a term query, not a phrase query.
  35. Multiple Languages
      • Generally, we suggest keeping different languages in their own fields
      • This lets you have an analyzer for each language
          Stemming, stop words, etc.
      • If you don’t know the total number of languages, you can use dynamic fields.
          That allows you to accept them, but not to dynamically stem, etc.
  36. Analysis And Query Parsing
      • What happens when parsing a query in Solr?
          You may have many fields, with different analyzers
          Which Analyzer gets used?
  37. Analysis And Query Parsing
      • QueryParser splits the query
          Understands quotes, parens and whitespace
      • Gives the resulting pieces to the correct analyzer
          Explicit or Default
  38. Analysis And Query Parsing
      • To see what happens to your query
          Use the “Full Interface” section of the admin interface
          Check ‘debug: enable’
          Or just add “&debugQuery=on” to the end of your query string
      • We’re using the Lucene Query Parser
          Dismax does different things.
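
If you build query URLs from a script, the debug flag can be appended as in this Python sketch. The base URL is an assumption (a local example install). Only the `debugQuery=on` parameter comes from the slide.

```python
from urllib.parse import urlencode

def debug_url(base, query):
    """Build a select URL for `query` with debugQuery=on appended."""
    return f"{base}?{urlencode({'q': query, 'debugQuery': 'on'})}"

# Hypothetical local Solr instance; adjust host, port and path to your setup.
print(debug_url("http://localhost:8983/solr/select", "title:foo bar"))
```

The response then includes a debug section showing the parsed query, which is where you can see which field each piece of the query was sent to.
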
  39. Seeing the results of query parsing
  40. Seeing the results of query parsing
  41. Query Examples
      • title:foo bar
          Becomes: +title:foo +text:bar
          “foo” goes to the title field analyzer, “bar” to the default field analyzer
      • manu:"foo_bar baz"
          Becomes: manu:"foo bar baz"
          Note _ got removed.
          The whole string goes to the manu analyzer
          Phrase query
      • title:(foo bar)
          Becomes: title:foo title:bar
          foo and bar passed separately to title’s analyzer
  42. Components of an Analyzer
  43. Components of an Analyzer
      • CharFilters
      • Tokenizers
      • TokenFilters
  44. CharFilters
      • Used to clean up/regularize characters before passing to the Tokenizer
          Remove accents, etc.
          MappingCharFilter
      • They can also do complex things; we’ll look at HTMLStripCharFilter later.
  45. Tokenizers
      • Convert text to tokens (terms)
      • Only one per analyzer
      • Many Options
          WhitespaceTokenizer
          StandardTokenizer
          PatternTokenizer
          More…
  46. TokenFilters
      • Process the tokens produced by the Tokenizer
      • Can be many of them per field
  47. Some example TokenFilters that come with Solr/Lucene
      • There are way too many to list them all
      • We’re just going to go through a few of them
  48. Reversing Filter
      • Why?
          Leading wildcards require traversing the whole index
          Reverse the order, and leading wildcards become trailing
      • *cats => stac*
          Only have to check terms that start with stac, instead of the whole index.
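
The trick on the reversing slide can be sketched in Python with a sorted list standing in for the term dictionary. The function names are hypothetical and this is not how Solr implements it internally. It only shows why a reversed term turns a leading wildcard into a cheap prefix lookup.

```python
import bisect

def build_reversed_index(terms):
    """Store every term reversed, in sorted order (a stand-in for the term dictionary)."""
    return sorted(term[::-1] for term in terms)

def leading_wildcard(reversed_index, pattern):
    """Answer a pattern like '*cats' by scanning only reversed terms starting with 'stac'."""
    if not pattern.startswith("*"):
        raise ValueError("expected a leading-wildcard pattern such as '*cats'")
    prefix = pattern[1:][::-1]
    start = bisect.bisect_left(reversed_index, prefix)  # jump straight to the prefix
    matches = []
    for reversed_term in reversed_index[start:]:
        if not reversed_term.startswith(prefix):
            break  # sorted order: no further matches are possible
        matches.append(reversed_term[::-1])
    return matches

index = build_reversed_index(["cats", "bobcats", "dogs", "cast", "wildcats"])
print(leading_wildcard(index, "*cats"))
```
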
  49. Phonetic Analysis
      • Creates a phonetic representation of the text, for “sounds like” matching
      • PhoneticFilterFactory. Uses one of
          Metaphone
          Double Metaphone
          Soundex
          Refined Soundex
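
To give a feel for what a phonetic representation looks like, here is a compact Python sketch of classic American Soundex, one of the encoders named on the slide. It is written from the standard rules rather than taken from the encoder library Solr uses, so treat edge-case behaviour as an assumption.

```python
# Consonant groups that share a Soundex digit; vowels, h, w and y get no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex(word):
    """Return the 4-character Soundex code of `word` ('' for empty input)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        if ch not in "hw":  # h and w do not separate letters with the same code
            previous = code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))
```

Two names that sound alike collapse to the same code, so indexing the code as a term lets a search for one find the other.
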
  50. Synonyms
      • Synonym filter allows you to include alternate words that the user can use when searching
          For example, theater, theatre
          Useful for movie titles, where words are deliberately misspelled
      • Don’t over-use synonyms
          It helps recall, but lowers precision
      • Produces tokens at the same token position
          “local theater company”
                 theatre
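
“Tokens at the same token position” can be pictured with this small Python sketch. The synonym table and the `with_synonyms` helper are made up for illustration and do not reflect the synonym filter's configuration format.

```python
SYNONYMS = {"theater": ["theatre"], "theatre": ["theater"]}  # illustrative table

def with_synonyms(terms, synonyms=SYNONYMS):
    """Return (position, term) pairs; each synonym shares the original term's position."""
    out = []
    for position, term in enumerate(terms):
        out.append((position, term))
        for alternate in synonyms.get(term, []):
            out.append((position, alternate))
    return out

print(with_synonyms(["local", "theater", "company"]))
```

Because “theatre” occupies the same position as “theater”, the phrase “local theatre company” still lines up term by term.
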
  51. HTML text extraction
      • Removes HTML tags, attributes, comments
      • XML processing directives
      • Removes <script> and <style> contents
      • Replaces entities
      • HTMLStripCharFilterFactory
  52. Spell Checking
      • Spell checker starts by analyzing the source terms into n-grams
      • From the Lucene Wiki:
  53. Spell Checking
      • You don’t actually have to know that to use the spell checker
      • But I think it’s kind of cool
      • Use Luke to explore the index generated by the spell checker.
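
For readers who do want to know, breaking a word into n-grams looks roughly like this in Python. This is a generic sketch: the helper names are hypothetical, and it does not reproduce the fields or scoring of the Lucene spell checker.

```python
def ngrams(word, n):
    """All substrings of length n, in order of appearance."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def shared_grams(a, b, n=3):
    """n-grams that two words have in common."""
    return sorted(set(ngrams(a, n)) & set(ngrams(b, n)))

print(ngrams("lucene", 3))
print(shared_grams("lucene", "lucine"))
```

A misspelled word still shares some n-grams with the intended word, so searching an index of n-grams finds candidate corrections.
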
  54. And many more
      • Regular expression Tokenizer
      • Stemmers for many languages
          Persian, Hindi, Chinese, Japanese, etc.
          Third party/commercial stemmers available, too.
          SnowballPorterFilter
  55. Recap
      • If you can’t find it, and you are sure it’s there:
          It’s likely an analysis problem
      • Three main tools for troubleshooting analysis
          Analysis tool
          Schema browser
          Luke
      • Look at your index, documents and the output of your analyzers periodically.
  56. Additional Resources
      • Lucid Imagination Solr Reference Guide
          LucidImagination.com/downloads
      • Lucene in Action, Second Edition
          This isn’t published yet, but you can get the early access version from manning.com/hatcher3
      • http://www.hathitrust.org/blog
      • Solr wiki on Analysis
          Wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
      • Luke - http://code.google.com/p/luke/
  57. Questions
      If we have time, we’ll take some questions
  58. Thanks!
      Tom Hill
      LucidImagination.com
