Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.

http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=8cdfd955-2cd4-44a2-ad08-5353e079685a

Usage Rights: CC Attribution-ShareAlike License

Presentation Transcript

  • Search Engine-Building with Lucene and Solr Part 2 Kai Chan SoCal Code Camp, November 2013
  • Overview ● indexing process ● searching process ● advanced features ● scaling/redundancy ● resources ● demo ● questions/answers
  • Indexing Process ● request handler ○ data are read to create documents ● update request processor chain ○ optional document-wide processing ○ fields can be added, changed, removed ○ analysis ○ creation of indexed and stored fields ● update handler ○ the index is updated
  • Update Request Processor Chain ● de-duplication ○ creates a signature (hash) for each document to be added ○ replaces (deletes) existing documents with the same signature ○ MD5Signature ■ exact hashing ○ Lookup3Signature ■ faster calculation and smaller hash than MD5 ○ TextProfileSignature ■ fuzzy hashing, near-duplicate detection
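A minimal sketch of the de-duplication idea on this slide, in plain Python rather than Solr's own processor. A signature is computed over selected fields, and a document whose signature already exists replaces the earlier one. The `signature` and `add_document` helpers, the field names, and the dict-based index are hypothetical; only the exact-hash (MD5) variant is shown.

```python
import hashlib


def signature(doc, fields):
    """Exact-hash signature: MD5 over the chosen field values (assumed fields)."""
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def add_document(index, doc, fields=("title", "body")):
    """Store doc under its signature; return True if an older doc was replaced."""
    sig = signature(doc, fields)
    replaced = sig in index
    index[sig] = doc  # same signature: the existing document is overwritten
    return replaced


# Two documents with identical title/body collapse into one entry.
index = {}
add_document(index, {"id": 1, "title": "Lucene in Action", "body": "search"})
add_document(index, {"id": 2, "title": "Lucene in Action", "body": "search"})
```

Fuzzy signatures such as TextProfileSignature would need a different hashing step; this sketch does not attempt that.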
  • Update Request Processor Chain ● language detection ○ detects the language used in field(s) ○ adds a language field to the document ○ TikaLanguageIdentifierUpdateProcessorFactory ■ uses Apache Tika ○ LangDetectLanguageIdentifierUpdateProcessorFactory ■ uses the language-detection library ○ external programs ■ e.g. Chromium Compact Language Detector See Also: Language detection with Google's Compact Language Detector <http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html>
  • Analysis ● analyzed ○ tokenization, i.e. breaking down the content to be searched into smaller units (“tokens”) ○ manipulation of tokens ● not analyzed ○ the whole content treated as 1 unit for searching ● analyzed vs. not analyzed ○ are individual tokens meaningful on their own? ○ are individual tokens used in queries?
  • Example 1: book title ● not tokenized: [Lucene in Action, Second Edition: Covers Apache Lucene 3.0] ○ search for “Lucene”: no match ● tokenized: [Lucene] [in] [Action] [Second] [Edition] [Covers] [Apache] [Lucene] [3.0] ○ makes more sense to tokenize Example 2: ISBN ● not tokenized: [1-933-98817-7] ○ makes more sense to not tokenize ● tokenized: [1] [933] [98817] [7] ○ search for “933”: match
  • Analysis ● analyzed: ○ text ● not analyzed: ○ number ○ serial number ○ GUID ○ checksum ● How about URL?
  • Analysis ● character filter(s) ○ character replacement ○ e.g. replacing accented characters with their base forms: café → cafe, jalapeño → jalapeno ● tokenizer ● token filter(s)
  • Analysis ● character filter(s) ● tokenizer ○ create tokens (“words”) from characters ○ sometimes straightforward ○ many unusual cases: e-mail address, URL, code, etc. ● token filter(s)
  • Analysis ● character filter(s) ● tokenizer ● token filter(s) ○ token replacement ■ change case, remove apostrophe ■ remove stop words (a, and, the, for) ■ split/join words (ice-cream, ice cream, icecream) ■ stemming (importing, imported → import) ■ synonym (nation → country)
  • Field value: Let's sign up for the amazing So-Cal Code Camp® at http://bit.ly/oZiZsu. Free WiFi! ● Tokens (text_general), with positions: let's (1) sign (2) up (3) for (4) the (5) amazing (6) so (7) cal (8) code (9) camp (10) at (11) http (12) bit.ly (13) oZiZsu (14) free (15) wi (16) fi (17) ● Tokens (text_en), with positions: let (1) sign (2) up (3) amaz (6) so (7) cal (8) code (9) camp (10) http (12) bit.li (13) ozizsu (14) free (15) wi (16) fi (17) ● Tokens (text_en_splitting), with positions: let (1) sign (2) up (3) amaz (6) so (7) cal (8) code (9) camp (10) http (12) bit (13) ly (14) o (15) zi (16) zsu (17) free (18) wi (19) fi (20), plus the joined tokens socal, httpbitlyozizsu, wifi
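The three analysis stages (character filter, tokenizer, token filters) can be sketched in a few lines of Python. This is an illustration of the pipeline shape only, not how Lucene's analyzers are implemented: the function names, the regex tokenizer, and the stop-word set (taken from the slide's examples) are assumptions, and stemming, splitting and synonyms are left out.

```python
import re
import unicodedata

# Stop words from the slide's example list (assumed, not Solr's default list).
STOP_WORDS = {"a", "and", "the", "for"}


def char_filter(text):
    """Character filter: replace accented characters with their base forms."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def tokenize(text):
    """Tokenizer: a naive regex split into runs of letters, digits, apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def token_filters(tokens):
    """Token filters: lowercase every token, then drop stop words."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in STOP_WORDS]


def analyze(text):
    """Run the stages in order: char filter -> tokenizer -> token filters."""
    return token_filters(tokenize(char_filter(text)))


tokens = analyze("The café and the jalapeño")
```

The same `analyze` function would be applied to both the indexed content and the query text, so that both sides produce comparable tokens.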
  • Searching Process ● query parsing ● analysis ● scoring ● sorting ● loading of stored fields ● optional search components ○ faceting ○ term vector ○ More Like This ○ highlighting
  • Scoring ● for a given query, each document not filtered out gets a score (float) ● higher score: higher in the results ● scoring algorithms ○ default: TF-IDF ○ other: Okapi BM25, etc. ○ very customizable See Also: Lucene/Solr Revolution 2013 presentation “Beyond TF-IDF: Why, What and How”
  • Scoring - TF-IDF ● term frequency (TF) ○ how many times does this term appear in this document? ● inverse document frequency (IDF) ○ how many documents contain this term? ○ score proportional to the inverse of document frequency
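A minimal sketch of the TF-IDF idea on this slide, using the textbook form tf × log(N / df). This is not Lucene's exact similarity formula, which applies its own damping and normalization; the `tf_idf` helper and the toy documents are hypothetical.

```python
import math


def tf_idf(term, doc_tokens, all_docs):
    """Textbook TF-IDF: term count in this doc times log(N / document frequency)."""
    tf = doc_tokens.count(term)                      # term frequency
    df = sum(1 for d in all_docs if term in d)       # document frequency
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)         # rarer terms weigh more


docs = [
    ["lucene", "solr", "lucene"],
    ["solr"],
    ["search"],
]
lucene_score = tf_idf("lucene", docs[0], docs)  # frequent here, rare elsewhere
solr_score = tf_idf("solr", docs[0], docs)      # appears in two of three docs
```

In this toy collection "lucene" outscores "solr" for the first document, because it occurs more often there and in fewer documents overall.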
  • Scoring - Other Factors ● coordination factor (coord) ○ documents that contain all or most query terms get higher scores ● normalizing factor (norm) ○ adjusts for field length and query complexity
  • Scoring - Boost ● manual override: ask Lucene/Solr to give a higher score to some particular thing(s) ● index-time ○ per document ○ per field (of a particular document) ● search-time ○ per query
  • More Like This ● finds documents similar in content (of one field) to those matched ● constructs a query based on the highest scoring terms in a document ● requires the field to: ○ have stored term vectors (recommended), or ○ be stored Credit: How MoreLikeThis Works in Lucene <http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/>
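The "constructs a query based on the highest scoring terms" step can be sketched as follows. This is a simplified illustration, not Lucene's MoreLikeThis code: the `mlt_terms` helper, the smoothed TF-IDF weighting, and the `max_terms` parameter are assumptions.

```python
import math
from collections import Counter


def mlt_terms(doc_tokens, all_docs, max_terms=3):
    """Pick the highest-weighted terms of one document to use as a query."""
    counts = Counter(doc_tokens)
    n = len(all_docs)

    def weight(term):
        df = sum(1 for d in all_docs if term in d)
        # Smoothed IDF so a term missing from all_docs cannot divide by zero.
        return counts[term] * math.log((n + 1) / (df + 1))

    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(counts, key=lambda t: (-weight(t), t))
    return ranked[:max_terms]


docs = [
    ["solr", "search", "solr", "engine"],
    ["search", "engine"],
    ["search", "index"],
]
query_terms = mlt_terms(docs[0], docs, max_terms=2)
```

The selected terms would then be OR-ed into a query against the same field to find similar documents.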
  • Spell Checking ● typos in queries happen ● returns spell checking suggestions (if any) within the same result ● can also be used for auto-complete ○ treating a prefix as a spelling mistake ○ returning full words as suggestions
  • /select?q=text:"busness comunication"&spellcheck=true&wt=xml
    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="busness">
          <int name="numFound">1</int>
          <int name="startOffset">6</int>
          <int name="endOffset">13</int>
          <arr name="suggestion">
            <str>business</str>
          </arr>
        </lst>
        <lst name="comunication">
          <int name="numFound">1</int>
          <int name="startOffset">14</int>
          <int name="endOffset">26</int>
          <arr name="suggestion">
            <str>communication</str>
          </arr>
        </lst>
      </lst>
    </lst>
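A client could pull the suggestions out of a response like the one on this slide with Python's standard XML parser. The sketch below parses only the `spellcheck` fragment shown on the slide (a full Solr response wraps it in further elements), and the `suggestions` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# The spellcheck fragment from the slide, used here as sample input.
RESPONSE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="busness">
      <int name="numFound">1</int>
      <int name="startOffset">6</int>
      <int name="endOffset">13</int>
      <arr name="suggestion"><str>business</str></arr>
    </lst>
    <lst name="comunication">
      <int name="numFound">1</int>
      <int name="startOffset">14</int>
      <int name="endOffset">26</int>
      <arr name="suggestion"><str>communication</str></arr>
    </lst>
  </lst>
</lst>
"""


def suggestions(xml_text):
    """Map each misspelled word to its list of suggested replacements."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for entry in root.findall("./lst[@name='suggestions']/lst"):
        words = [s.text for s in entry.findall("./arr[@name='suggestion']/str")]
        result[entry.get("name")] = words
    return result


found = suggestions(RESPONSE)
```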
  • Query Elevation ● a.k.a. “sponsored search” ● make sure certain documents appear at the top of the results for a certain query
  • Credit: Google Web Search <http://www.google.com/>
  • Query Elevation ● configure the elevator search component in solrconfig.xml ● in elevate.xml, specify the queries and the list of documents (by id) to elevate or exclude ● enable query elevation: enableElevation=true ● (optional) override the sort parameter: forceElevation=true
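The effect of elevation on a result list can be sketched in plain Python. This only mimics the outcome described on the slide (listed documents first, excluded documents dropped); it is not Solr's component, the `elevate` helper is hypothetical, and placing elevated ids at the top even when they were not in the original results is an assumption of this sketch.

```python
def elevate(results, elevated_ids, excluded_ids=()):
    """Return results with elevated ids first and excluded ids removed."""
    excluded = set(excluded_ids)
    # Elevated documents keep the order given in the configuration.
    top = [doc_id for doc_id in elevated_ids if doc_id not in excluded]
    top_set = set(top)
    # Everything else keeps its original (score-based) order.
    rest = [d for d in results if d not in excluded and d not in top_set]
    return top + rest


ranked = ["a", "b", "c", "d"]           # ids as ordered by the normal scoring
final = elevate(ranked, elevated_ids=["c"], excluded_ids=["b"])
```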
  • Function Query ● like formulas in Excel ● apply functions to field values for filtering and scoring
  • Function Query ● query: q={!func} cos(angle) ● query (range): q={!frange l=0.5 u=1} cos(angle) ● field: fl=angle,cos(angle) ● sort: sort=cos(angle) desc
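What the `frange` and `sort` examples on this slide compute can be imitated locally. The sketch below is not Solr code: the `frange` helper, the sample documents, inclusive bounds, and angles in radians are all assumptions made for illustration.

```python
import math


def frange(docs, func, lower, upper):
    """Keep documents whose function value lies in [lower, upper] (inclusive)."""
    return [d for d in docs if lower <= func(d) <= upper]


def cos_angle(doc):
    """The function applied to each document's 'angle' field (radians assumed)."""
    return math.cos(doc["angle"])


docs = [
    {"id": 1, "angle": 0.0},
    {"id": 2, "angle": 1.0},
    {"id": 3, "angle": 2.0},
]

# Like q={!frange l=0.5 u=1} cos(angle): filter by the function's value.
matched = frange(docs, cos_angle, 0.5, 1.0)

# Like sort=cos(angle) desc: order by the function's value.
ranked = sorted(docs, key=cos_angle, reverse=True)
```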
  • Spatial Search ● data: contains locations (longitudes, latitudes) ○ e.g. merchants with store locations ● search: filter and/or sort by location
  • Credit: Google Maps <http://maps.google.com/>
  • Spatial Search ● geofilt ○ circle centered at a given point ○ distance from a given point ○ fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5 ● bbox ○ square (“bounding box”) centered at a given point ○ distance from a given point + corners ○ fq={!bbox sfield=store}&pt=45.15,-93.85&d=5 Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
  • [Figure: geofilt (a circle reaching 5 km from (45.15, -93.85)) beside bbox (a square bounding box around the same point and distance)] Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
  • [Figure: the same geofilt circle and bbox square, with sample points marked o or x] Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
  • Spatial Search ● geodist ○ returns the distance between the location given in a field and a certain coordinate ○ e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score: q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
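The difference between the circle and the bounding box, and the distance they are built on, can be sketched with the haversine formula. This is an illustration only, not Solr's spatial code: the helper names, the mean Earth radius of 6371 km, and the sample points are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; Solr's constant may differ


def geodist(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_geofilt(point, center, d_km):
    """Circle: keep points within d_km of the center."""
    return geodist(*point, *center) <= d_km


def in_bbox(point, center, d_km):
    """Square: keep points inside the box that encloses the circle."""
    (lat, lon), (clat, clon) = point, center
    dlat = math.degrees(d_km / EARTH_RADIUS_KM)
    dlon = dlat / math.cos(math.radians(clat))  # longitude degrees shrink with latitude
    return abs(lat - clat) <= dlat and abs(lon - clon) <= dlon


center = (45.15, -93.85)
corner = (45.19, -93.79)  # near a corner of the box, outside the circle
```

A point near a corner of the box passes `in_bbox` but fails `in_geofilt`, which is why bbox can return a few more documents than geofilt for the same `pt` and `d`.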
  • Scaling/Redundancy - Problems ● collection too large for a single machine ● too many requests for a single machine ● a machine can go down
  • Scaling/Redundancy - Solutions ● collection too large for a single machine ○ distribution ■ spread the collection across multiple machines ● too many requests for a single machine ○ distribution ■ spread the requests across multiple machines ● a machine can go down ○ replication ■ copy data and configuration across multiple machines ■ make sure there is no single point of failure
  • SolrCloud ● Solr instances ● ZooKeeper instances
  • SolrCloud ● Solr instances ○ collection (logical index) divided into one or more partial collections (“shards”) ○ for each shard, one or more Solr instances keep copies of the data ■ one as leader - handles reads and writes ■ others as replicas - handle reads ● ZooKeeper instances
  • SolrCloud ● Solr instances ● ZooKeeper instances ○ management of Solr instances ○ leader election ○ node discovery
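The roles described on these slides (shards, a leader per shard, replicas, leader election) can be sketched as a toy model. Nothing here is SolrCloud's actual implementation: the MD5-based routing in `shard_for`, the `Shard` class, the node names, and promoting the first replica as a stand-in for ZooKeeper-coordinated election are all assumptions.

```python
import hashlib
import random


def shard_for(doc_id, num_shards):
    """Map a document id to a shard index (hypothetical hash routing)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


class Shard:
    """One partial collection: a leader plus zero or more replicas."""

    def __init__(self, nodes):
        self.leader = nodes[0]
        self.replicas = list(nodes[1:])

    def node_for_write(self):
        return self.leader                      # writes go through the leader

    def node_for_read(self):
        return random.choice([self.leader] + self.replicas)  # any copy can serve reads

    def leader_down(self):
        # Stand-in for leader election: promote the first remaining replica.
        self.leader = self.replicas.pop(0)


# Three shards, each with three nodes: node0a is shard 0's initial leader, etc.
shards = [Shard([f"node{s}{r}" for r in "abc"]) for s in range(3)]
target = shards[shard_for("doc-42", len(shards))]
```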
  • [Diagram: a collection (i.e. logical index) split into shard 1, shard 2 and shard 3, each holding ⅓ of the collection; each shard has one leader and several replicas]
  • [Diagram: the same collection with one replica marked offline; the remaining leader and replicas of that shard keep serving it]
  • Resources - Books ● Lucene in Action ○ written by 3 committers and PMC members ○ somewhat outdated (2010; covers Lucene 3.0) ○ http://www.manning.com/hatcher3/ ● Solr in Action ○ early access; coming out later this year ○ http://www.manning.com/grainger/ ● Apache Solr 4 Cookbook ○ common problems and useful tips ○ http://www.packtpub.com/apache-solr-4-cookbook/book
  • Resources - Books ● Introduction to Information Retrieval ○ not specific to Lucene/Solr, but about IR concepts ○ free e-book ○ http://nlp.stanford.edu/IR-book/ ● Managing Gigabytes ○ indexing, compression and other topics ○ accompanied by MG4J, a full-text search package ○ http://mg4j.di.unimi.it/
  • Resources - Web ● official websites ○ Lucene Core - http://lucene.apache.org/core/ ○ Solr - http://lucene.apache.org/solr/ ● mailing lists ● Wiki sites ○ Lucene Core - http://wiki.apache.org/lucene-java/ ○ Solr - http://wiki.apache.org/solr/ ● reference guides ○ API Documentation for Lucene and Solr ○ Apache Solr Reference Guide
  • Getting Started ● download Solr ○ requires Java 6 or newer to run ● Solr comes bundled/configured with Jetty ○ <Solr directory>/example/start.jar ● "exampledocs" directory contains sample documents ○ <Solr directory>/example/exampledocs/post.jar ○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml ● use the Solr admin interface ○ http://localhost:8983/solr/