Practical Search with Solr: Beyond just Looking it Up

Before Google, before search, heck, even before SQL, search and retrieve meant one thing: the library. And you think you have a lot of noisy data in crusty formats to search? Even if you don't have 100 million books in your catalog, Solr applications for library data offer practical, general purpose solutions to some of the knottiest search problems.

Published in: Technology, Education


  1. Practical Search with Solr: Beyond just Looking it Up. 29 April 2010. Bess Sadler, Stanford University Library; Naomi Dushay, Stanford University Library; Tom Burton-West, the Hathi Trust Project.
  2. Agenda (slides posted at the end of this presentation; full replay available within ~48 hours of live webcast). Introductions. What, the data’s dirty? (Bess Sadler): Clean data is easy to search and browse. However, you probably don’t have clean data. Queries are not obvious (Naomi Dushay): Browsing ordered lists; Dismax; When simple search is not enough. Big, Bigger, Biggest (Tom Burton-West): Large scale issues: phrase queries and common words; OCR. Q&A. Lucid Imagination, Inc. – http://www.lucidimagination.com
  3. About the Presenters. Bess Sadler, Stanford University Library: Senior Software Engineer at Stanford University Library, and co-founder of Blacklight (http://projectblacklight.org), formerly the Chief Architect for the Online Library Environment at the U-VA. www-sul.stanford.edu. Naomi Dushay, Stanford University Library: Senior Software Engineer at Stanford University Library, expert in digital library research; formerly a member of the core infrastructure team of the National Science Digital Library. www-sul.stanford.edu. Tom Burton-West, the Hathi Trust Project: Information Retrieval Programmer in the University of Michigan’s Digital Library Production Service; works on the Hathi Trust Large Scale Search project and blogs about it at www.hathitrust.org/blogs.
  4. What, the data’s dirty? Clean data is easy to search and browse. However, you probably don’t have clean data. Bess Sadler, Stanford University Library.
  5. Before we begin, you should know. Some basics around Solr that we won’t cover: gets and posts; a search index is not a DBMS; XML; strong data typing. We’ll refer to these in the talk; if you’re unfamiliar with them, see bit.ly/practical-solr for some quick definitions of these terms.
  6. Mapping Library Data Types: Your data is not as different as you think it is.
       Library: Books; Personal Name; Publication; Combined Facets: Book, Video, Journal, Newspaper, Physical Artifacts, Digital artifacts
       Engineering: Specs; Concept; Formal Documentations; Test results, analog data files, Media files, data sets, rich documents, SKUs
       Health Care: Research papers; Disease Types; Journals; Test results, analog data files, Media files, patient records
       Intellectual Property: Patents; Mechanisms; Filing and Disclosure Docs; Authors, titles, prior art, assignees, claims, descriptions, figures
       Legal: Contracts; Parties; Rulings and court documents; Exhibits, photos, criminal evidence, emails/e-discovery
       Other domains (pharmaceutical, manufacturing, etc.) are similar in the diversity of document types and data types within the documents.
  7. Data is weird. Not Normal: The data is not always in the fields or places you expect, even when you have a detailed spec. Local practices differ. Practices change over time. Sometimes stuff is just wrong (but remember: it’s better to be consistent than right). Be prepared for cleanup: indexing your data is going to uncover a lot of problems you never knew about before. Formats are not necessarily optimized for discovery. For example: PDFs are optimized for presentation, not discovery; putting them into a discovery system presents its own challenges.
  8. Using Solr Cell (aka Extracting Request Handler)/PDFBox. Good news examples: Please, God, just some metadata! When we got lucky, we had another source of the metadata. Bad news examples: typography; text inset boxes; it’s only a little easier than OCR … Advanced PDFBox options only work when there is a lot of consistency.
  9. Search vs. Browse. Search: More focused; the user is looking for a known item, or has a specific question to be answered (e.g., a citation, a part number, a specific judicial ruling, “that book by Steinbeck”). Browse: The user has a generalized, nebulous information need that they will refine as they interact with a collection of resources (e.g., finding a good book to read, shopping for accessories, keeping current in one’s field).
  10. Search Challenges. Relevancy: indexing the full text isn’t good enough. Fielded search: context is meaningful (“Cook” example). Fielded search: will data be where you expect it to be? Users don’t speak your jargon: “indian cooking” is “Cookery Indic”. Stemming: Nature/Naturalism. How do you know you have your relevancy rankings right? You ask!
  11. Why Browsing Is Important. Search is not enough. What is a facet? Here is how it works in Solr. Here is why your users will like you for doing it. More challenges related to browsing coming up…
  12. Queries are not obvious: Browsing Ordered Lists, A Little About Dismax, and When Simple Search Is Not Enough. Naomi Dushay, Stanford University Library.
  13. Candidates for Browsing. One strategy for data that is not normalized is browsing ordered lists. Names (Employees, Customers, Students, Authors). Part Numbers. When spelling is unclear: uighur, uyghur, uyghar, uigher. Strings of both letters and digits, such as SKUs, Part Numbers, Invoice Numbers, Transaction Record Numbers. Addresses in Sequence. Titles (Books, …).
  14. Some Values are Easily Ordered. Numeric Values. Dates (if normalized). Some Letter Tokens (e.g. categories).
  15. Values Difficult to Sort Lexically. Digits in non-numeric context. Lexical sort of numbers: 1, 111, 20, 222, 8 … Other examples: “A715C74”; “The Princess and the Pea”; “Sir Isaac Newton”; “Die Fledermaus”; piña vs. pina.
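
The lexical ordering of numbers on slide 15 can be reproduced in a few lines. This is a minimal sketch; the zero-padding width is an assumption chosen to fit the sample values, not something the slides prescribe.

```python
values = ["8", "20", "1", "222", "111"]

# Plain string comparison goes character by character, so "111" sorts before "20".
lexical = sorted(values)

# Left-padding with zeros to a fixed width makes string order match numeric order.
width = max(len(v) for v in values)
padded = sorted(v.zfill(width) for v in values)

print(lexical)  # ['1', '111', '20', '222', '8']
print(padded)   # ['001', '008', '020', '111', '222']
```

The same idea (pad to a fixed width before indexing) is what the normalization strategies on the following slides build on.
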
  16. Call Numbers are Difficult to Sort Lexically (applies to SKUs, part numbers, non-uniform serial numbers across domains, etc.). Letters combined with digits. Some digits are decimals, some are integers. Inconsistent punctuation. Suffixes to be ignored for sorting purposes. Examples:
       A7 .L3 .V2
       A7 .L3 V2
       A7 .L3 V.2
       A7 .L3 1902 V.2
       M5 .L3 2000 .K2 1880
       M5 .L3 .K451 V.5
       M5 .L3 K2 D MAJ 1880
       M5 .L3 K2 OP.7:NO.6 1880
       A7 .L3 1902 V2 TANEYTOWN
       M5 .L3 K2 .Q2 MD:CRAP0*DMA 1981
  17. Normalization for Sorting is a Process. It might not need to be perfect. [Diagram labels: Programmatically Normalize Data; Assess Sorted Output (Humans, Automated test); Find Dirty Data; Clean up data]
  18. Basic Sorting Normalization Strategies. Normalize Letter Case (e.g. all lowercase). Leading Spaces (can use zeros for digits; space works). Trailing Spaces. Skip Ignored Characters (“The Fly”, “Ms. Jane Doe”). Numbers sorted as an Integer (leading spaces/zeros), vs. as a Decimal (trailing spaces/zeros)? Normalization should accommodate dirty data whenever practical.
  19. Call number and its normalized sort key:
       A7 .B3     a+++0007.000000 b0.300000
       A7 B33     a+++0007.000000 b0.330000
       A17 .B4    a+++0017.000000 b0.400000
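
The three mappings on slide 19 can be reproduced with a small normalizer. This is only a sketch under stated assumptions: the regular expression and the function name `shelfkey` are hypothetical, and it handles just the simple "class letters, class number, one cutter" shape shown on the slide, not the messier call numbers from slide 16.

```python
import re

# Hypothetical pattern: class letters, class number, then one cutter (letter + digits).
CALL_NUMBER = re.compile(r"^([A-Za-z]+)\s*(\d+(?:\.\d+)?)\s*\.?\s*([A-Za-z])(\d+)$")


def shelfkey(call_number: str) -> str:
    """Build a lexically sortable key in the style shown on slide 19."""
    match = CALL_NUMBER.match(call_number.strip())
    if match is None:
        raise ValueError(f"unsupported call number: {call_number!r}")
    letters, number, cutter_letter, cutter_digits = match.groups()
    # Lowercase the class letters and pad on the right with '+' to four characters.
    class_part = letters.lower().ljust(4, "+")
    # Class number sorts as an integer: leading zeros, six decimal places.
    number_part = f"{float(number):011.6f}"
    # Cutter digits sort as a decimal: trailing zeros.
    cutter_part = f"{cutter_letter.lower()}0.{cutter_digits.ljust(6, '0')}"
    return f"{class_part}{number_part} {cutter_part}"


call_numbers = ["A17 .B4", "A7 B33", "A7 .B3"]
print(sorted(call_numbers))                # raw strings: 'A17 .B4' comes first
print(sorted(call_numbers, key=shelfkey))  # normalized: 'A7 .B3', 'A7 B33', 'A17 .B4'
```

A real normalizer would also need a fallback for values that do not match the pattern, such as the ones listed on the next slide.
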
  20. Weird Values Happen: ZDVD 4971; MFILM 24 REEL 5; Shelved by title; XX(123457); call # varies; no call number.
  21. Solr Performance Issue: Query Time Sorting. q=sortfield:["666" TO *]&rows=10 will sort ALL of the sortfield values at query time. Response time is abysmal for sortfields with huge numbers of values. Try this: Terms Component.
  22. Solrconfig.xml. QUERY LOOKS LIKE: http://host:port/solr/alphaTerms?terms.fl=&terms.lower=&per_page=
  23. /solr/alphaTerms?terms.fl=shelfkey&terms.lower=lc+hc++0337.000000+f0.500000+f0.512000&per_page=10. TermsComponent queries the part of the index that is already lexically sorted for each field.
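
A request like the one on slide 23 can be assembled with the standard library so that the term value is URL-encoded correctly. This is a sketch with assumptions: the base URL is a placeholder, the function name is hypothetical, and it uses `terms.limit`, the standard TermsComponent parameter for the number of terms returned, where the slide shows `per_page` (which may be a name specific to the presenters' setup).

```python
from urllib.parse import urlencode


def terms_url(base: str, field: str, lower: str, limit: int = 10) -> str:
    """Build a TermsComponent request starting at `lower` in index order."""
    params = {
        "terms.fl": field,     # field whose indexed terms are listed
        "terms.lower": lower,  # first term to return (inclusive by default)
        "terms.limit": limit,  # how many terms to return
    }
    # "/alphaTerms" is the handler name used on the slides.
    return f"{base}/alphaTerms?{urlencode(params)}"


# Placeholder host; the term value is the sort key from slide 19.
url = terms_url("http://localhost:8983/solr", "shelfkey", "a+++0007.000000 b0.300000")
print(url)
```

Note that `urlencode` turns a literal `+` into `%2B` and a space into `+`, which matters for keys that use `+` as a padding character.
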
  24. Now that I Have Terms, How do I get the Documents? /solr/select?q=sortkey:("a++67+mn++4" OR "a++67+mp+85")&qt=standard (URL encode if you need to).
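
The follow-up query from slide 24 can be generated from the list of terms returned by the previous step. A minimal sketch, assuming a placeholder base URL and hypothetical function names; the quoting and backslash escaping are there so that terms containing spaces or quotes stay single phrase values.

```python
from urllib.parse import urlencode


def select_query(field: str, terms: list[str]) -> str:
    """Return a Lucene-syntax query matching any of the given exact terms."""
    quoted = []
    for term in terms:
        # Escape backslashes and double quotes so each term stays one quoted value.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    return f"{field}:({' OR '.join(quoted)})"


def select_url(base: str, field: str, terms: list[str]) -> str:
    """Build the /select request for the documents behind a page of terms."""
    params = {"q": select_query(field, terms), "qt": "standard"}
    return f"{base}/select?{urlencode(params)}"


query = select_query("sortkey", ["a++67+mn++4", "a++67+mp+85"])
print(query)
print(select_url("http://localhost:8983/solr", "sortkey", ["a++67+mn++4", "a++67+mp+85"]))
```
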
  25. Sortfield Value : Document is NOT always 1:1. 1:Many (one sortfield value, multiple documents): one product, multiple generations of user manuals; one court case, multiple briefing and disclosure documents. Many:1 (one document, multiple sortfield values): which value are you going to pick for the browsing list? Allow the user to select in the UI, if possible.
  26. What About Browsing Before the Known Sort Value? [Diagram: n Before, n After. Image: http://hayward-ca.gov/refreshyourlife/wp-content/uploads/2009/07/fiction-spines.jpg]
  27. Create Reverse Sortkey. Use a simple character mapping to reverse the sort order:
       IF sortkey HAS    reversekey GETS
       0                 Z
       1                 Y
       …                 …
       9                 Q
       A                 P
       …                 …
       Z                 0
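
The character mapping on slide 27 amounts to reversing a 36-character alphabet. The sketch below assumes keys made only of digits and letters and of equal length; the names `FORWARD` and `reverse_key` are hypothetical, and characters outside the table (spaces, `+`, `.`) pass through unchanged, which a real implementation would need to handle.

```python
FORWARD = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Map each character to its mirror image: 0->Z, 1->Y, ..., 9->Q, A->P, ..., Z->0.
REVERSE_TABLE = str.maketrans(FORWARD, FORWARD[::-1])


def reverse_key(sortkey: str) -> str:
    """Return a key whose ascending order is the descending order of `sortkey`."""
    return sortkey.upper().translate(REVERSE_TABLE)


keys = ["A07B3", "A17B4", "M05L3"]
print(sorted(keys, key=reverse_key))  # same as sorting `keys` in descending order
```

Indexing the reversed key in its own field means "the n values before X" becomes an ordinary forward terms lookup starting at `reverse_key(X)`.
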
  29. A Little About Dismax
  30. Solr QueryParsing Strategies FEATURE LUCENE DISMAX Boolean √ Each Text Box -> Groups of Index Fields √ √ Each Text Box -> Complex Boosting Equation yuck √ Multiple Text Boxes yuck √ Multiple Query Words Match Across Fields √ Boosting Matches Simple √ “Author” “Title” “Subject” Searches
  31. Dismax (disjunction max) Query Parser: Some of My Favorite Things. Assign boost values for field matching at query time, BUT complex boosting formulae can reside in solrconfig.xml. Index can be neutral; assign query time boosting to fields for different types of queries. Easy to boost exact phrase matches higher than query terms scattered across the document. Tune how many query words MUST match, and what the other matching thresholds/parameters might be. http://wiki.apache.org/solr/DisMaxRequestHandler
  32. Example Dismax Request Handler
       <!-- author search request handler -->
       <requestHandler name="search_author" class="solr.SearchHandler" >
         <lst name="defaults">
           <str name="defType">dismax</str>
           <!-- require 4 or more terms to match … -->
           <str name="mm">4&lt;-1 4&lt;90%</str>
           <!-- boost formula -->
           <str name="qf">author_unstem^10 author native_script_author</str>
           <!-- boost phrase matches -->
           <str name="pf">author_unstem^100 author^10 native_script_author^10</str>
           …
       http://wiki.apache.org/solr/DisMaxRequestHandler
  33. Sometimes, Simple Search + Facets is Not Enough
  34. WHEN isn’t it enough? Pay attention to user feedback. Study search logs: queries without results.
  35. Our Users Also Asked for: Boolean. Targeting a particular (group of) fields: “… combined searching feature so that I can specify the author and title.” Example: (author) Mozart (title) sonata 21, not a book about Mozart’s sonatas. “I often search publisher AND year, or publisher AND place of publication, and occasionally need all three terms in combination.” Example: (publisher) “Little, Brown & Co”, not “The Little Brown Jug”. Plaintiff, Defendant, Attorney: all?
  36. Search Form has More Than One Text Box. Want features of Dismax. Need a way to boost appropriately for each text box. Need a way to combine text boxes.
  37. Local Params. LocalParams allow additional, localized instructions to be sent as part of the query: ways to parse query terms; send in non-default values for variables; use variables declared in the request handler that don’t map to QueryParser arguments. http://wiki.apache.org/solr/LocalParams
  38. Solrconfig.xml
       <requestHandler name="multi_box" class="solr.SearchHandler" >
         <lst name="defaults">
           <str name="defType">lucene</str>
           <str name="q.op">AND</str>
           <!-- author box -->
           <str name="qf_author">author fields boost formula</str>
           <str name="pf_author">author phrase boosts</str>
           <!-- title box -->
           <str name="qf_title">title fields boost formula</str>
           <str name="pf_title">title phrase boosts</str>
           …
  39. Using LocalParams Variables
       Text boxes combined with AND:
         _query_:"{!dismax qf=$qf_title pf=$pf_title}title terms" AND _query_:"{!dismax qf=$qf_author pf=$pf_author}author terms"
       Text boxes combined with OR:
         _query_:"{!dismax qf=$qf_title pf=$pf_title}title terms" OR _query_:"{!dismax qf=$qf_author pf=$pf_author}author terms"
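
The nested-query strings on slide 39 are easy to get wrong by hand, so an application would normally generate them from the form fields. The sketch below assumes the `qf_<box>` / `pf_<box>` naming from the handler on slide 38; the function names are hypothetical, and the escaping of quotes inside the user's text is an addition that the slides do not show.

```python
def box_clause(box: str, terms: str) -> str:
    """Wrap one text box's terms in a nested dismax query using $qf_<box>/$pf_<box>."""
    # Escape backslashes and double quotes so the nested query stays one quoted string.
    escaped = terms.replace("\\", "\\\\").replace('"', '\\"')
    return f'_query_:"{{!dismax qf=$qf_{box} pf=$pf_{box}}}{escaped}"'


def multi_box_query(boxes: dict[str, str], operator: str = "AND") -> str:
    """Combine the non-empty text boxes with AND or OR."""
    clauses = [box_clause(box, terms) for box, terms in boxes.items() if terms.strip()]
    return f" {operator} ".join(clauses)


q = multi_box_query({"title": "sonata 21", "author": "Mozart"})
print(q)
```

The resulting string goes in the `q` parameter of a request to the `multi_box` handler, whose `defType` of `lucene` is what lets the outer AND/OR be parsed.
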
  40. Note: DISMAX doesn’t do Boolean within the text boxes; there are workarounds … edismax (Solr 1.5); faking it: http://www.stanford.edu/people/~ndushay/code4lib2010/advSearchSolrQueries.pdf
  41. My Favorite Places To Find Information. LucidImagination Search: http://www.lucidimagination.com/search/ (NOT a coerced statement!). Solr wikis: http://wiki.apache.org/solr/FrontPage
  42. Big, Bigger, Biggest. Large scale issues: phrase queries and common words; OCR. Tom Burton-West, Hathi Trust Project.
  43. Hathi Trust Large Scale Search Challenges. Goal: Design a system for full-text search that will scale to 5 million to 20 million volumes (at a reasonable cost). Challenges: must scale to 20 million full-text volumes; very long documents compared to most large-scale search applications; multilingual collection; OCR quality varies.
  44. Index Size, Caching, and Memory. Our documents average about 300 pages, which is about 700KB of OCR. Our 5 million document index is between 2 and 3 terabytes (about 300 GB per million documents). Large index means disk I/O is the bottleneck. Tradeoff: JVM vs. OS memory. Solr uses OS memory (disk I/O caching) for caching of postings. Memory available for disk I/O caching has the most impact on response time (assuming adequate cache warming). Fitting the entire index in memory is not feasible with a terabyte-size index.
  45. Response time varies with query. Average: 673; Median: 91; 90th: 328; 99th: 7,504.
  46. Slowest 5% of queries. The slowest 5% of queries took about 1 second or longer. The slowest 1% of queries took between 10 seconds and 2 minutes. The slowest 0.5% of queries took between 30 seconds and 2 minutes. These queries affect the response time of other queries: cache pollution; contention for resources. Slowest queries are phrase queries containing common words. [Chart: Response Time, 95th percentile (seconds), by query number, 940 to 1,000]
  47. Query processing. Phrase queries use the position index (Boolean queries do not). The position index accounts for 85% of index size. The position list for common words such as “the” can be many GB in size. This causes lots of disk I/O. Solr depends on the operating system’s disk cache to reduce disk I/O requirements for words that occur in more than one query. I/O from phrase queries containing common words pollutes the cache.
  48. Slow Queries. Slowest test query: “the lives and literature of the beat generation” took 2 minutes. 4MB data read for Boolean query. 9,000+ MB read for Phrase query.
       WORD         NUMBER OF   POSTINGS LIST   TOTAL TERM OCCURRENCES   POSITION LIST
                    DOCUMENTS   (SIZE MB)       (MILLIONS)               (SIZE MB)
       the          800,000     0.8             4,351                    4,351
       of           892,000     0.89            2,795                    2,795
       and          769,000     0.77            1,870                    1,870
       literature   435,000     0.44            9                        9
       generation   414,000     0.41            5                        5
       lives        432,000     0.43            5                        5
       beat         278,000     0.28            1                        1
       TOTAL                    4.02                                     9,036
  49. Why not use Stop Words? The word “the” occurs more than 4 billion times in our 1 million document index. Removing “stop” words (“the”, “of” etc.) is not desirable for our use cases. Couldn’t search for many phrases: “to be or not to be”; “the who”; “man in the moon” vs. “man on the moon”. Stop words in one language are content words in another language: German stop words “war” and “die” are content words in English; English stop words “is” and “by” are content words (“ice” and “village”) in Swedish.
  50. “CommonGrams”. Ported the Nutch “CommonGrams” algorithm to Solr. Create bi-grams selectively for any two-word sequence containing common terms. Slowest query: “The lives and literature of the beat generation” becomes “the-lives” “lives-and” “and-literature” “literature-of” “of-the” “the-beat” “generation”.
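
The query-side token sequence on slide 50 can be reproduced with a short function. This is a simplified sketch of the idea, not the Nutch or Solr implementation: the common-word list is an assumption limited to the three words the example needs, the function name is hypothetical, and the `-` separator follows the slide.

```python
COMMON_WORDS = {"the", "and", "of"}  # assumed list; a real one is much longer


def common_grams_query(tokens: list[str], common: set[str] = COMMON_WORDS) -> list[str]:
    """Emit a bigram wherever either word of a pair is common, else the bare word."""
    output = []
    covered = False  # True if the current token was already used in the previous bigram
    for i, token in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        forms_bigram = nxt is not None and (token in common or nxt in common)
        if forms_bigram:
            output.append(f"{token}-{nxt}")
        elif not covered:
            # Keep a single word only when no bigram already accounts for it.
            output.append(token)
        covered = forms_bigram
    return output


query = "the lives and literature of the beat generation".split()
print(common_grams_query(query))
```

For the slide's query this yields the-lives, lives-and, and-literature, literature-of, of-the, the-beat, generation: "beat" is dropped as a single word because "the-beat" already covers it, while "generation" has no common neighbour and stays on its own.
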
  51. Standard index vs. CommonGrams
       Standard Index
       WORD             TOTAL OCCURRENCES IN CORPUS (MILLIONS)   NUMBER OF DOCS (THOUSANDS)
       the              2,013                                    386
       of               1,299                                    440
       and              855                                      376
       literature       4                                        210
       lives            2                                        194
       generation       2                                        199
       beat             0.6                                      130
       TOTAL            4,176
       Common Grams
       TERM             TOTAL OCCURRENCES IN CORPUS (MILLIONS)   NUMBER OF DOCS (THOUSANDS)
       of-the           446                                      396
       generation       2.42                                     262
       the-lives        0.36                                     128
       literature-of    0.35                                     103
       lives-and        0.25                                     115
       and-literature   0.24                                     77
       the-beat         0.06                                     26
       TOTAL            450
  52. Comparison of Response time (ms)
                        AVERAGE   MEDIAN   90th   99th    SLOWEST QUERY
       Standard Index   459       32       146    6,784   120,595
       Common Grams     68        3        71     2,226   7,800
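
The improvement in the table on slide 52 is easier to compare as ratios. The figures below are copied from the slide; the dictionary and variable names are just for illustration.

```python
# Response times in milliseconds, as reported on slide 52.
standard = {"average": 459, "median": 32, "90th": 146, "99th": 6784, "slowest": 120595}
common_grams = {"average": 68, "median": 3, "90th": 71, "99th": 2226, "slowest": 7800}

# Ratio of standard-index time to CommonGrams time for each statistic.
speedup = {name: standard[name] / common_grams[name] for name in standard}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x faster with CommonGrams")
```

The largest gain is on the slowest query, which is consistent with the earlier observation that the worst cases are phrase queries containing common words.
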
  53. Other issues: Analyze your slowest queries. We analyzed the slowest queries from our query logs and discovered additional “common words” to be added to our list. We used the Solr Admin panel to run our slowest queries from our logs with the “debug” flag checked. We discovered that words such as “l’art” were being split into two-token phrase queries. We used the Solr Admin Analysis tool and determined that the analyzer we were using was the culprit.
  54. Other issues: We broke Solr … temporarily. Dirty OCR in combination with over 200 languages creates indexes with over 2.4 billion unique terms. Solr/Lucene index size was limited to 2.1 billion unique terms. Patched: now it’s 274 billion. Dirty OCR is difficult to remove without removing “good” words. Because the Solr/Lucene tii/tis index uses pointers into the frequency and position files, we suspect that the performance impact is minimal compared to disk I/O demands, but we will be testing soon.
  55. Q&A. Download these slides at http://bit.ly/practical-solr. On demand replay is available within 24-48 hours of the live webcast.
