Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.

http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=8cdfd955-2cd4-44a2-ad08-5353e079685a

Transcript

  • 1. Search Engine-Building with Lucene and Solr Part 2 Kai Chan SoCal Code Camp, November 2013
  • 2. Overview ● indexing process ● searching process ● advanced features ● scaling/redundancy ● resources ● demo ● questions/answers
  • 3. Indexing Process ● request handler ○ data are read to create documents ● update request processor chain ○ optional document-wide processing ○ fields can be added, changed, removed ○ analysis ○ creation of indexed and stored fields ● update handler ○ the index is updated
  • 4. Update Request Processor Chain ● de-duplication ○ creates a signature (hash) for each document to be added ○ replaces (deletes) existing documents with the same signature ○ MD5Signature ■ exact hashing ○ Lookup3Signature ■ faster calculation and smaller hash than MD5 ○ TextProfileSignature ■ fuzzy hashing, near-duplicate detection
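The de-duplication step can be pictured with a minimal sketch. It is not Solr's implementation: the in-memory `index` dict, the `signature` and `add_with_dedup` helpers and the sample documents are all hypothetical, and only exact hashing (the MD5 case) is shown.

```python
import hashlib


def signature(doc, fields):
    """Exact-match signature: an MD5 hash over the chosen field values."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator, so ("ab", "c") and ("a", "bc") differ
    return h.hexdigest()


def add_with_dedup(index, doc, fields):
    """Store doc under its signature; an earlier doc with the same signature is replaced."""
    sig = signature(doc, fields)
    replaced = sig in index
    index[sig] = doc
    return replaced


# Hypothetical in-memory "index", keyed by signature.
index = {}
fields = ["title", "body"]
first = add_with_dedup(index, {"id": 1, "title": "Solr", "body": "search server"}, fields)
second = add_with_dedup(index, {"id": 2, "title": "Solr", "body": "search server"}, fields)
```

Only the second document remains in `index`, because both documents hash to the same signature. A fuzzy signature would instead map near-identical texts to the same key.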
  • 5. Update Request Processor Chain ● language detection ○ detects the language used in field(s) ○ adds a language field to the document ○ TikaLanguageIdentifierUpdateProcessorFactory ■ uses Apache Tika ○ LangDetectLanguageIdentifierUpdateProcessorFactory ■ uses language-detection library ○ external programs ■ e.g. Chromium Compact Language Detector See Also: Language detection with Google's Compact Language Detector <http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html>
  • 6. Analysis ● analyzed ○ tokenization, i.e. breaking down the content to be searched into smaller units (“tokens”) ○ manipulation of tokens ● not analyzed ○ the whole content treated as 1 unit for searching ● analyzed vs. not analyzed ○ are individual tokens meaningful on their own? ○ are individual tokens used in queries?
  • 7. Example 1: book title ● Lucene in Action, Second Edition: Covers Apache Lucene 3.0 ○ not tokenized: search for “Lucene”: no match ○ makes more sense to tokenize Example 2: ISBN ● 1-933-98817-7 ○ tokenized (1 933 98817 7): search for “933”: match ○ makes more sense to not tokenize
  • 8. Analysis analyzed: ● text not analyzed: ● number ● serial number ● GUID ● checksum How about URL?
  • 9. Analysis ● character filter(s) ○ character replacement ○ e.g. replacing accented characters with their base forms: café → cafe, jalapeño → jalapeno ● tokenizer ● token filter(s)
  • 10. Analysis ● character filter(s) ● tokenizer ○ create tokens (“words”) from characters ○ sometimes straightforward ○ many unusual cases: e-mail address, URL, code, etc. ● token filter(s)
  • 11. Analysis ● character filter(s) ● tokenizer ● token filter(s) ○ token replacement ■ change case, remove apostrophe ■ remove stop words (a, and, the, for) ■ split/join words (ice-cream, ice cream, icecream) ■ stemming (importing, imported → import) ■ synonym (nation → country)
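The three analysis stages can be sketched as a toy pipeline. This is an illustration, not Lucene's analyzers: the regular-expression tokenizer, the stop-word set and the synonym map are assumptions chosen to mirror the slide's examples, and stemming and word splitting/joining are left out.

```python
import re
import unicodedata

# Hypothetical configuration mirroring the examples on the slides.
STOP_WORDS = {"a", "and", "the", "for"}
SYNONYMS = {"nation": "country"}


def char_filter(text):
    """Character filter: replace accented characters with their base forms."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def tokenize(text):
    """Tokenizer: a rough split into runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def token_filters(tokens):
    """Token filters: lowercase, drop apostrophes, remove stop words, map synonyms."""
    out = []
    for tok in tokens:
        tok = tok.lower().replace("'", "")
        if not tok or tok in STOP_WORDS:
            continue
        out.append(SYNONYMS.get(tok, tok))
    return out


def analyze(text):
    """Run the stages in order: character filter, tokenizer, token filters."""
    return token_filters(tokenize(char_filter(text)))


tokens = analyze("A nation and the café")
```

The same `analyze` function would have to be applied to both indexed content and queries, so that a search for "cafe" can match a document containing "café".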
  • 12. Field value: Let's sign up for the amazing So-Cal Code Camp® at http://bit.ly/oZiZsu. Free WiFi! Tokens (text_general): let's sign up for the amazing so cal code camp at http bit.ly oZiZsu free wi fi Tokens (text_en): let sign up amaz so cal code camp http bit.li ozizsu free wi fi Tokens (text_en_splitting): let sign up amaz so cal code camp socal http bit ly o zi zsu free wi fi httpbitlyozizsu wifi
  • 13. Searching Process ● query parsing ● analysis ● scoring ● sorting ● loading of stored fields ● optional search components ○ faceting ○ term vector ○ More Like This ○ highlighting
  • 14. Scoring ● for a given query, each document not filtered out gets a score (float) ● higher score: higher in the results ● scoring algorithms ○ default: TF-IDF ○ other: Okapi BM25, etc. ○ very customizable See Also: Lucene/Solr Revolution 2013 presentation “Beyond TF-IDF: Why, What and How”
  • 15. Scoring - TF-IDF ● term frequency (TF) ○ how many times does this term appear in this document? ● inverse document frequency (IDF) ○ how many documents contain this term? ○ score proportional to the inverse of document frequency
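A small sketch can make the two factors concrete. The damping functions (square root for TF, 1 + ln for IDF) are assumptions modeled on common TF-IDF practice, the three-document corpus is made up, and coord, norms and boosts are ignored, so this is not Lucene's full scoring formula.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the count, but sub-linearly."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency factor: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(query_terms, doc_tokens, corpus):
    """Sum tf * idf over the query terms that occur in the document."""
    num_docs = len(corpus)
    total = 0.0
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue
        doc_freq = sum(1 for d in corpus if term in d)
        total += tf(freq) * idf(doc_freq, num_docs)
    return total


# Hypothetical corpus of already-analyzed documents.
corpus = [
    ["lucene", "in", "action"],
    ["solr", "in", "action"],
    ["lucene", "lucene", "search"],
]
scores = [score(["lucene"], doc, corpus) for doc in corpus]
```

The third document mentions "lucene" twice and therefore outscores the first, while the second document does not contain the term and scores zero.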
  • 16. Scoring - Other Factors ● coordination factor (coord) ○ documents that contain all or most query terms get higher scores ● normalizing factor (norm) ○ adjust for field length and query complexity
  • 17. Scoring - Boost ● manual override: ask Lucene/Solr to give a higher score to some particular thing(s) ● index-time ○ per document ○ per field (of a particular document) ● search-time ○ per query
  • 18. More Like This ● finds documents similar in content (of one field) to those matched ● constructs a query based on the highest scoring terms in a document ● requires the field to: ○ have stored term vectors (recommended), or ○ be stored Credit: How MoreLikeThis Works in Lucene <http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/>
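The idea of building a query from a document's highest-scoring terms can be sketched as follows. The TF-IDF style weighting, the `interesting_terms` and `mlt_query` helpers, the `text` field name and the tiny corpus are all assumptions for illustration; the actual MoreLikeThis component has more parameters and filters than shown here.

```python
import math
from collections import Counter


def interesting_terms(doc_tokens, corpus, max_terms=3):
    """Rank the document's terms by a TF-IDF style weight and keep the top few."""
    counts = Counter(doc_tokens)
    num_docs = len(corpus)

    def weight(term):
        doc_freq = sum(1 for d in corpus if term in d)
        return counts[term] * (1.0 + math.log(num_docs / (doc_freq + 1)))

    # Sort by descending weight; ties are broken alphabetically for stable output.
    return sorted(counts, key=lambda t: (-weight(t), t))[:max_terms]


def mlt_query(field, terms):
    """Build a simple OR query over the selected terms."""
    return " OR ".join(f"{field}:{t}" for t in terms)


# Hypothetical corpus of already-analyzed documents.
corpus = [
    ["solr", "search", "server", "search"],
    ["the", "search"],
    ["the", "server"],
    ["the", "cat"],
]
terms = interesting_terms(corpus[0], corpus, max_terms=2)
query = mlt_query("text", terms)
```

Running the resulting query against the same field returns documents that share the source document's most distinctive terms.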
  • 19. Spell Checking ● typos in queries happen ● returns spell-checking suggestions (if any) within the same response ● can also be used for auto-complete ○ treating a prefix as a spelling mistake ○ returning full words as suggestions
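One way to picture suggestion lookup is with edit distance against a word list. This is a stand-in, not Solr's spell checker (which draws its candidates from the index or a dictionary file): the `DICTIONARY` set, the distance cutoff and the helper names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = cur
    return prev[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance edits, closest first."""
    if word in dictionary:
        return []
    scored = sorted((edit_distance(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance]


def complete(prefix, dictionary):
    """Auto-complete: return the full words that start with the prefix."""
    return sorted(w for w in dictionary if w.startswith(prefix))


# Hypothetical dictionary; a real one would come from indexed terms.
DICTIONARY = {"bus", "business", "busy", "communication"}
```

With this dictionary, `suggest("busness", DICTIONARY)` yields `business`, and `complete("bus", DICTIONARY)` lists every word beginning with that prefix.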
  • 20. /select?q=text:"busness comunication"&spellcheck=true&wt=xml <lst name="spellcheck"> <lst name="suggestions"> <lst name="busness"> <int name="numFound">1</int> <int name="startOffset">6</int> <int name="endOffset">13</int> <arr name="suggestion"> <str>business</str> </arr> </lst> <lst name="comunication"> <int name="numFound">1</int> <int name="startOffset" >14</int> <int name="endOffset">26</int> <arr name="suggestion"> <str>communication</str> </arr> </lst> </lst> </lst>
  • 21. Query Elevation ● a.k.a. “sponsored search” ● make sure certain documents appear at the top of the results for a certain query
  • 22. Credit: Google Web Search <http://www.google.com/>
  • 23. Query Elevation ● configure the elevator search component in solrconfig.xml ● in elevate.xml, specify the queries and the list of documents (by id) to elevate or exclude ● enable query elevation: enableElevation=true ● (optional) override the sort parameter: forceElevation=true
  • 24. Function Query ● like formulas in Excel ● apply functions to field values for filtering and scoring
  • 25. Function Query ● query: q={!func} cos(angle) ● query (range): q={!frange l=0.5 u=1} cos(angle) ● field: fl=angle,cos(angle) ● sort: sort=cos(angle) desc
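Characters such as `{`, `!`, spaces and parentheses need URL encoding when these parameters are sent over HTTP. A minimal sketch of building such a request URL follows; the base URL and the core name `collection1` are assumptions, and `angle` is the example field from the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed location of a local Solr core; adjust to the actual installation.
BASE = "http://localhost:8983/solr/collection1"


def select_url(base, **params):
    """Build a /select URL with properly encoded query parameters."""
    return base + "/select?" + urlencode(params)


url = select_url(
    BASE,
    q="{!frange l=0.5 u=1}cos(angle)",  # keep documents with 0.5 <= cos(angle) <= 1
    fl="angle,cos(angle)",              # return the field and the computed value
    sort="cos(angle) desc",             # order by the computed value
)

# Decoding the query string recovers the original parameter values.
decoded = parse_qs(urlsplit(url).query)
```

The encoded URL can then be passed to any HTTP client; `decoded` shows that the parameter values survive the round trip unchanged.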
  • 26. Spatial Search ● data: contains locations (longitudes, latitudes) ○ e.g. merchants with store locations ● search: filter and/or sort by location
  • 27. Credit: Google Maps <http://maps.google.com/>
  • 28. Spatial Search ● geofilt ○ circle centered at a given point ○ distance from a given point ○ fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5 ● bbox ○ square (“bounding box”) centered at a given point ○ distance from a given point + corners ○ fq={!bbox sfield=store}&pt=45.15,-93.85&d=5 Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
  • 29. [Figure: geofilt (circle) and bbox (square), each covering a distance of 5 km around (45.15, -93.85)] Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
  • 30. [Figure: the same geofilt and bbox areas around (45.15, -93.85), with sample points marked o and x] Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
  • 31. Spatial Search ● geodist ○ returns the distance between the location given in a field and a certain coordinate ○ e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score: q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
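The difference between the circle, the bounding box and the plain distance can be sketched with the haversine formula. This is a geometric illustration, not Solr's implementation: the Earth radius constant, the helper names and the test points are assumptions, and the box is approximated with simple latitude/longitude ranges.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_geofilt(point, center, d_km):
    """Circle test: the point lies within d_km of the center."""
    return haversine_km(*point, *center) <= d_km


def in_bbox(point, center, d_km):
    """Box test: the point lies inside the square that encloses the circle."""
    lat, lon = point
    clat, clon = center
    dlat = math.degrees(d_km / EARTH_RADIUS_KM)
    dlon = dlat / math.cos(math.radians(clat))  # longitude degrees shrink with latitude
    return abs(lat - clat) <= dlat and abs(lon - clon) <= dlon


center = (45.15, -93.85)
near_corner = (45.19, -93.795)  # hypothetical point close to a corner of the box
far_away = (46.0, -93.85)       # hypothetical point well outside both areas
```

A point near a corner of the box passes the bbox test but fails the circle test, which is why bbox can return results that lie farther away than the requested distance.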
  • 32. Scaling/Redundancy - Problems ● collection too large for a single machine ● too many requests for a single machine ● a machine can go down
  • 33. Scaling/Redundancy - Solutions ● collection too large for a single machine ○ distribution ■ spread the collection across multiple machines ● too many requests for a single machine ○ distribution ■ spread the requests across multiple machines ● a machine can go down ○ replication ■ copy data and configuration across multiple machines ■ make sure no single point of failure
  • 34. SolrCloud ● Solr instances ● ZooKeeper instances
  • 35. SolrCloud ● Solr instances ○ collection (logical index) divided into one or more partial collections (“shards”) ○ for each shard, one or more Solr instances keep copies of the data ■ one as leader - handles reads and writes ■ others as replicas - handle reads ● ZooKeeper instances
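The split into shards, each with a leader and replicas, can be sketched as follows. Everything here is a simplified stand-in: SolrCloud's own router uses a different hash function and hash ranges, and the node names, the `Shard` class and the MD5-modulo routing are hypothetical.

```python
import hashlib
import random


def shard_for(doc_id, num_shards):
    """Pick a shard for a document by hashing its id (MD5 modulo, as a stand-in)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


class Shard:
    """One partial collection: the first node is the leader, the rest are replicas."""

    def __init__(self, nodes):
        self.leader = nodes[0]
        self.replicas = nodes[1:]

    def node_for_write(self):
        """Writes go through the leader."""
        return self.leader

    def node_for_read(self):
        """Reads can be served by the leader or any replica."""
        return random.choice([self.leader] + self.replicas)


# Hypothetical cluster: 3 shards, each with 1 leader and 2 replicas.
shards = [Shard([f"node{s}{r}" for r in "abc"]) for s in range(3)]
target = shards[shard_for("doc-42", len(shards))]
```

Because the shard is derived from the document id, the same document always lands on the same shard, while reads for that shard can be spread across all of its nodes.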
  • 36. SolrCloud ● Solr instances ● ZooKeeper instances ○ management of Solr instances ○ leader election ○ node discovery
  • 37. [Diagram: collection (i.e. logical index) split into shard 1, shard 2 and shard 3, each ⅓ of the collection; nodes shown: leader, replica, replica, leader, replica, replica, leader, replica, replica, replica]
  • 38. [Diagram: collection (i.e. logical index) split into shard 1, shard 2 and shard 3, each ⅓ of the collection; nodes shown: leader, replica, replica, replica, leader, replica, replica, replica, leader, replica, replica]
  • 39. [Diagram: collection (i.e. logical index) split into shard 1, shard 2 and shard 3, each ⅓ of the collection; nodes shown: leader, replica, replica, replica (offline), leader, replica, replica, leader, replica, replica]
  • 40. [Diagram: collection (i.e. logical index) split into shard 1, shard 2 and shard 3, each ⅓ of the collection; nodes shown: leader, replica, replica, replica, replica, leader, replica, replica, leader, replica, replica]
  • 41. Resources - Books ● Lucene in Action ○ written by 3 committers and PMC members ○ somewhat outdated (2010; covers Lucene 3.0) ○ http://www.manning.com/hatcher3/ ● Solr in Action ○ early access; coming out later this year ○ http://www.manning.com/grainger/ ● Apache Solr 4 Cookbook ○ common problems and useful tips ○ http://www.packtpub.com/apache-solr-4-cookbook/book
  • 42. Resources - Books ● Introduction to Information Retrieval ○ not specific to Lucene/Solr, but about IR concepts ○ free e-book ○ http://nlp.stanford.edu/IR-book/ ● Managing Gigabytes ○ indexing, compression and other topics ○ accompanied by MG4J, full-text search software ○ http://mg4j.di.unimi.it/
  • 43. Resources - Web ● official websites ○ Lucene Core - http://lucene.apache.org/core/ ○ Solr - http://lucene.apache.org/solr/ ● mailing lists ● Wiki sites ○ Lucene Core - http://wiki.apache.org/lucene-java/ ○ Solr - http://wiki.apache.org/solr/ ● reference guides ○ API Documentation for Lucene and Solr ○ Apache Solr Reference Guide
  • 44. Getting Started ● download Solr ○ requires Java 6 or newer to run ● Solr comes bundled/configured with Jetty ○ <Solr directory>/example/start.jar ● "exampledocs" directory contains sample documents ○ <Solr directory>/example/exampledocs/post.jar ○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml ● use the Solr admin interface ○ http://localhost:8983/solr/