Search Engine-Building
with Lucene and Solr
Part 2
Kai Chan
SoCal Code Camp, November 2013
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)


These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.

http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=8cdfd955-2cd4-44a2-ad08-5353e079685a

Transcript of "Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)"

  1. Search Engine-Building with Lucene and Solr, Part 2
     Kai Chan
     SoCal Code Camp, November 2013
  2. Overview
     ● indexing process
     ● searching process
     ● advanced features
     ● scaling/redundancy
     ● resources
     ● demo
     ● questions/answers
  3. Indexing Process
     ● request handler
       ○ data are read to create documents
     ● update request processor chain
       ○ optional document-wide processing
       ○ fields can be added, changed, removed
       ○ analysis
       ○ creation of indexed and stored fields
     ● update handler
       ○ the index is updated
  4. Update Request Processor Chain
     ● de-duplication
       ○ creates a signature (hash) for each document to be added
       ○ replaces (deletes) existing documents with the same signature
       ○ MD5Signature
         ■ exact hashing
       ○ Lookup3Signature
         ■ faster calculation and smaller hash than MD5
       ○ TextProfileSignature
         ■ fuzzy hashing, near-duplicate detection
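The replace-on-same-signature behaviour described on this slide can be sketched in a few lines of Python. This is only an illustration of exact hashing with MD5 over chosen fields; the `signature` and `add_with_dedup` helpers and the in-memory `index` dict are hypothetical and are not Solr's implementation.

```python
import hashlib

def signature(doc, fields):
    """Return an MD5 hex digest over the chosen field values (exact hashing)."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()

def add_with_dedup(index, doc, fields):
    """Store doc under its signature, replacing any earlier doc with the same one."""
    index[signature(doc, fields)] = doc
    return index

# Hypothetical toy index keyed by signature.
index = {}
add_with_dedup(index, {"id": "1", "title": "Lucene in Action"}, ["title"])
add_with_dedup(index, {"id": "2", "title": "Lucene in Action"}, ["title"])  # replaces id 1
add_with_dedup(index, {"id": "3", "title": "Solr in Action"}, ["title"])
remaining_ids = sorted(d["id"] for d in index.values())
```

A fuzzy signature such as TextProfileSignature would instead map near-identical texts to the same key; that is not shown here.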
  5. Update Request Processor Chain
     ● language detection
       ○ detects the language used in field(s)
       ○ adds a language field to the document
       ○ TikaLanguageIdentifierUpdateProcessorFactory
         ■ uses Apache Tika
       ○ LangDetectLanguageIdentifierUpdateProcessorFactory
         ■ uses language-detection library
       ○ external programs
         ■ e.g. Chromium Compact Language Detector
     See Also: Language detection with Google's Compact Language Detector
     <http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html>
  6. Analysis
     ● analyzed
       ○ tokenization, i.e. breaking down the content to be searched into smaller units (“tokens”)
       ○ manipulation of tokens
     ● not analyzed
       ○ the whole content treated as 1 unit for searching
     ● analyzed vs. not analyzed
       ○ are individual tokens meaningful on their own?
       ○ are individual tokens used in queries?
  7. Example 1: book title
     Lucene in Action, Second Edition: Covers Apache Lucene 3.0
     ● not tokenized: search for “Lucene”: no match
     ● makes more sense to tokenize
     Example 2: ISBN
     1-933-98817-7
     ● tokenized (1 933 98817 7): search for “933”: match
     ● makes more sense to not tokenize
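The two examples on this slide can be reproduced with a toy matcher. The sketch below assumes a very simple tokenizer (split on non-alphanumeric characters, lowercase); the `tokenize` and `matches` helpers are hypothetical and much cruder than a real Lucene analyzer.

```python
import re

def tokenize(value):
    """Split on runs of non-alphanumeric characters and lowercase each token."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", value) if t]

def matches(query, value, analyzed):
    """Analyzed: some token equals the query. Not analyzed: the whole value must equal it."""
    if analyzed:
        return query.lower() in tokenize(value)
    return query == value

title = "Lucene in Action, Second Edition: Covers Apache Lucene 3.0"
isbn = "1-933-98817-7"

title_hit_untokenized = matches("Lucene", title, analyzed=False)  # whole title != "Lucene"
title_hit_tokenized = matches("Lucene", title, analyzed=True)
isbn_hit_tokenized = matches("933", isbn, analyzed=True)          # partial ISBN matches
isbn_hit_untokenized = matches("933", isbn, analyzed=False)
```

Tokenizing the title makes the single-word search work, while tokenizing the ISBN lets a meaningless fragment match, which is why the slide recommends leaving it unanalyzed.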
  8. Analysis
     analyzed:
     ● text
     How about URL?
     not analyzed:
     ● number
     ● serial number
     ● GUID
     ● checksum
  9. Analysis
     ● character filter(s)
       ○ character replacement
       ○ e.g. accent marks with their base forms
         café → cafe
         jalapeño → jalapeno
     ● tokenizer
     ● token filter(s)
  10. Analysis
      ● character filter(s)
      ● tokenizer
        ○ create tokens (“words”) from characters
        ○ sometimes straightforward
        ○ many unusual cases: e-mail address, URL, code, etc.
      ● token filter(s)
  11. Analysis
      ● character filter(s)
      ● tokenizer
      ● token filter(s)
        ○ token replacement
          ■ change case, remove apostrophe
          ■ remove stop words (a, and, the, for)
          ■ split/join words (ice-cream, ice cream, icecream)
          ■ stemming (importing, imported → import)
          ■ synonym (nation → country)
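The three analysis stages (character filter, tokenizer, token filters) can be sketched as a small pipeline. The function names, the stop-word set and the synonym map below are assumptions made for illustration; stemming and word splitting/joining are left out, and none of this mirrors Lucene's actual classes.

```python
import re
import unicodedata

STOP_WORDS = {"a", "and", "the", "for"}   # taken from the slide's example
SYNONYMS = {"nation": "country"}          # taken from the slide's example

def char_filter(text):
    """Replace accented characters with their base forms (café -> cafe)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def tokenizer(text):
    """Create tokens from runs of letters, digits and apostrophes."""
    return re.findall(r"[0-9A-Za-z']+", text)

def token_filters(tokens):
    """Lowercase, strip a trailing 's, drop stop words, map synonyms."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok.endswith("'s"):
            tok = tok[:-2]
        tok = tok.strip("'")
        if not tok or tok in STOP_WORDS:
            continue
        out.append(SYNONYMS.get(tok, tok))
    return out

def analyze(text):
    """Run the three stages in order."""
    return token_filters(tokenizer(char_filter(text)))

tokens = analyze("The café's jalapeño for a Nation")
```

The same chain is normally applied to both the indexed text and the query, so that "Café" in a query and "cafe" in a document end up as the same token.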
  12. Field value:
      Let's sign up for the amazing So-Cal Code Camp® at http://bit.ly/oZiZsu. Free WiFi!
      [Table: token positions and tokens produced for this value by three field types]
      ● Tokens (text_general): let's sign up for the amazing so cal code camp at http bit.ly oZiZsu ...
      ● Tokens (text_en) and Tokens (text_en_splitting): let sign up ... amaz ... so cal code camp ... http ... bit.li ozizsu ... free wi fi; the splitting variant also shows socal, bit, ly, o, zi, zsu, wifi and httpbitlyozizsu
  13. Searching Process
      ● query parsing
      ● analysis
      ● scoring
      ● sorting
      ● loading of stored fields
      ● optional search components
        ○ faceting
        ○ term vector
        ○ More Like This
        ○ highlighting
  14. Scoring
      ● for a given query, each document not filtered out gets a score (float)
      ● higher score: higher in the results
      ● scoring algorithms
        ○ default: TF-IDF
        ○ other: Okapi BM25, etc.
        ○ very customizable
      See Also: Lucene/Solr Revolution 2013 presentation “Beyond TF-IDF: Why, What and How”
  15. Scoring - TF-IDF
      ● term frequency (TF)
        ○ how many times does this term appear in this document?
      ● inverse document frequency (IDF)
        ○ how many documents contain this term?
        ○ score proportional to the inverse of document frequency
  16. Scoring - Other Factors
      ● coordination factor (coord)
        ○ documents that contain all or most query terms get higher scores
      ● normalizing factor (norm)
        ○ adjust for field length and query complexity
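How TF, IDF and the coordination factor combine can be sketched as follows. This is a simplified illustration, not Lucene's exact scoring formula: it assumes a square-root TF, a logarithmic IDF and a plain matched-terms ratio for coord, and it leaves out the norm factor and boosts. The function name `tf_idf_scores` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Score each document (a list of tokens) against the query terms."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))          # count each term once per document
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        matched = 0
        total = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue
            matched += 1
            tf = math.sqrt(counts[term])                         # more occurrences, higher score
            idf = 1.0 + math.log(n_docs / (doc_freq[term] + 1))  # rarer term, higher score
            total += tf * idf
        coord = matched / len(query_terms)    # reward documents matching more query terms
        scores.append(total * coord)
    return scores

docs = [
    ["lucene", "in", "action"],
    ["solr", "in", "action"],
    ["lucene", "lucene", "search"],
]
scores = tf_idf_scores(["lucene", "search"], docs)
```

The third document ranks first because it contains both query terms and repeats one of them; the second document matches nothing and scores zero.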
  17. Scoring - Boost
      ● manual override: ask Lucene/Solr to give a higher score to some particular thing(s)
      ● index-time
        ○ per document
        ○ per field (of a particular document)
      ● search-time
        ○ per query
  18. More Like This
      ● finds documents similar in content (of one field) to those matched
      ● constructs a query based on the highest scoring terms in a document
      ● requires the field to:
        ○ have stored term vectors (recommended), or
        ○ be stored
      Credit: How MoreLikeThis Works in Lucene
      <http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/>
  19. Spell Checking
      ● typos in queries happen
      ● returns spell checking suggestions (if any) within the same result
      ● can also be used for auto-complete
        ○ treating a prefix as a spelling mistake
        ○ returning full words as suggestions
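The idea of suggesting a known word for a misspelled query term can be sketched with Python's standard `difflib` module. The word list and the `suggest` helper are assumptions for illustration; Solr builds its suggestions from the index or a dictionary file and uses its own distance measures.

```python
import difflib

# Hypothetical dictionary; in Solr the candidate words come from the index.
DICTIONARY = ["business", "communication", "community", "busy", "search"]

def suggest(word, dictionary=DICTIONARY, max_suggestions=1):
    """Return close dictionary matches for a word that is not in the dictionary."""
    if word in dictionary:
        return []
    return difflib.get_close_matches(word, dictionary, n=max_suggestions, cutoff=0.7)

# Check each term of the query separately, as the spellcheck response does.
suggestions = {term: suggest(term) for term in ["busness", "comunication", "search"]}
```

Correctly spelled terms get an empty suggestion list, which corresponds to the "if any" on the slide.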
  20. /select?q=text:"busness comunication"&spellcheck=true&wt=xml

      <lst name="spellcheck">
        <lst name="suggestions">
          <lst name="busness">
            <int name="numFound">1</int>
            <int name="startOffset">6</int>
            <int name="endOffset">13</int>
            <arr name="suggestion">
              <str>business</str>
            </arr>
          </lst>
          <lst name="comunication">
            <int name="numFound">1</int>
            <int name="startOffset">14</int>
            <int name="endOffset">26</int>
            <arr name="suggestion">
              <str>communication</str>
            </arr>
          </lst>
        </lst>
      </lst>
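When such a request is sent from a program, the quotes, colon and space in the query need URL encoding. A minimal sketch using only the Python standard library follows; the base URL assumes a local Solr at the default port, and the `select_url` helper is hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def select_url(query, base="http://localhost:8983/solr", **params):
    """Build a /select URL with properly encoded parameters."""
    all_params = {"q": query, "wt": "xml"}
    all_params.update(params)
    return base + "/select?" + urlencode(all_params)

url = select_url('text:"busness comunication"', spellcheck="true")

# Decode the query string again to confirm the parameters survive encoding.
decoded = parse_qs(urlsplit(url).query)
```

The resulting URL can be opened in a browser or fetched with any HTTP client once Solr is running.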
  21. Query Elevation
      ● a.k.a. “sponsored search”
      ● make sure certain documents appear at the top of the results for a certain query
  22. Credit: Google Web Search <http://www.google.com/>
  23. Query Elevation
      ● configure the elevator search component in solrconfig.xml
      ● in elevate.xml, specify the queries and the list of documents (by id) to elevate or exclude
      ● enable query elevation: enableElevation=true
      ● (optional) override the sort parameter: forceElevation=true
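The effect of elevation rules on a result list can be sketched as below. The rule format, document ids and `apply_elevation` helper are hypothetical stand-ins for what elevate.xml expresses; the sketch also assumes that elevated documents are shown even when the original query did not return them.

```python
def apply_elevation(query, results, rules):
    """Reorder result ids for a query: elevated ids first, excluded ids removed.

    rules maps a query string to {"elevate": [...], "exclude": [...]}.
    """
    rule = rules.get(query)
    if rule is None:
        return list(results)                  # no rule for this query: unchanged
    excluded = set(rule.get("exclude", []))
    elevated = [d for d in rule.get("elevate", []) if d not in excluded]
    rest = [d for d in results if d not in excluded and d not in elevated]
    return elevated + rest

# Hypothetical rules and results.
RULES = {"laptop": {"elevate": ["doc3", "doc9"], "exclude": ["doc2"]}}
organic = ["doc1", "doc2", "doc3"]

elevated_results = apply_elevation("laptop", organic, RULES)
other_results = apply_elevation("tablet", organic, RULES)
```

Queries without a rule keep their normal ranking, so elevation only affects the specific queries that were configured.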
  24. Function Query
      ● like formulas in Excel
      ● apply functions to field values for filtering and scoring
  25. Function Query
      ● query:
        q={!func} cos(angle)
      ● query (range):
        q={!frange l=0.5 u=1} cos(angle)
      ● field:
        fl=angle,cos(angle)
      ● sort:
        sort=cos(angle) desc
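What the range and sort variants compute can be imitated on a small in-memory list. The documents, the `frange` helper and the assumption that `angle` is stored in radians are all illustrative; in Solr the function is evaluated inside the engine against indexed field values.

```python
import math

# Hypothetical documents with an "angle" field, assumed to be in radians.
docs = [
    {"id": "a", "angle": 0.0},
    {"id": "b", "angle": 1.0},
    {"id": "c", "angle": 2.0},
]

def cos_angle(doc):
    """The function applied to each document's field value."""
    return math.cos(doc["angle"])

def frange(docs, func, lower, upper):
    """Keep documents whose function value lies in [lower, upper]."""
    return [d for d in docs if lower <= func(d) <= upper]

# Like q={!frange l=0.5 u=1} cos(angle)
filtered_ids = [d["id"] for d in frange(docs, cos_angle, 0.5, 1.0)]

# Like sort=cos(angle) desc
ranked_ids = [d["id"] for d in sorted(docs, key=cos_angle, reverse=True)]
```

Document c drops out of the range query because cos(2.0) is negative.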
  26. Spatial Search
      ● data: contains locations (longitudes, latitudes)
        ○ e.g. merchants with store locations
      ● search: filter and/or sort by location
  27. Credit: Google Maps <http://maps.google.com/>
  28. Spatial Search
      ● geofilt
        ○ circle centered at a given point
        ○ distance from a given point
        ○ fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
      ● bbox
        ○ square (“bounding box”) centered at a given point
        ○ distance from a given point + corners
        ○ fq={!bbox sfield=store}&pt=45.15,-93.85&d=5
      Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
  29. [Figure: geofilt and bbox, each drawn with a distance of 5 km around the point (45.15, -93.85)]
      Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
  30. [Figure: the same geofilt and bbox regions (5 km around (45.15, -93.85)) with sample points marked o and x]
      Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
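The difference between the two filters shows up for points near a corner of the box. The sketch below uses the haversine formula and a simple latitude/longitude box; the helper names, the Earth radius constant and the sample points are assumptions, and Solr's own distance and box calculations may differ in detail.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def in_geofilt(point, center, d_km):
    """Circle: the point is within d_km of the center."""
    return haversine_km(point[0], point[1], center[0], center[1]) <= d_km

def in_bbox(point, center, d_km):
    """Approximate box: within d_km of the center in latitude and in longitude."""
    dlat = math.degrees(d_km / EARTH_RADIUS_KM)
    dlon = math.degrees(d_km / (EARTH_RADIUS_KM * math.cos(math.radians(center[0]))))
    return abs(point[0] - center[0]) <= dlat and abs(point[1] - center[1]) <= dlon

center = (45.15, -93.85)
near = (45.16, -93.85)     # roughly 1 km north of the center
corner = (45.19, -93.79)   # inside the box, but more than 5 km away
far = (45.30, -93.85)      # outside both regions
```

Every point accepted by the circle is also accepted by the box, but not the other way round, which matches the extra matches near the corners in the figure.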
  31. Spatial Search
      ● geodist
        ○ returns the distance between the location given in a field and a certain coordinate
        ○ e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score:
          q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc
      Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
  32. Scaling/Redundancy - Problems
      ● collection too large for a single machine
      ● too many requests for a single machine
      ● a machine can go down
  33. Scaling/Redundancy - Solutions
      ● collection too large for a single machine
        ○ distribution
          ■ spread the collection across multiple machines
      ● too many requests for a single machine
        ○ distribution
          ■ spread the requests across multiple machines
      ● a machine can go down
        ○ replication
          ■ copy data and configuration across multiple machines
          ■ make sure there is no single point of failure
  34. SolrCloud
      ● Solr instances
      ● ZooKeeper instances
  35. SolrCloud
      ● Solr instances
        ○ collection (logical index) divided into one or more partial collections (“shards”)
        ○ for each shard, one or more Solr instances keep copies of the data
          ■ one as leader - handles reads and writes
          ■ others as replicas - handle reads
      ● ZooKeeper instances
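The shard/leader/replica layout can be modelled in a few lines. This is a deliberately simplified sketch: the `Shard` class, the node names and the MD5-modulo routing are assumptions for illustration. Real SolrCloud uses its own hash-range routing and relies on ZooKeeper for leader election.

```python
import hashlib

class Shard:
    """One partial collection, held by a leader and zero or more replicas."""

    def __init__(self, name, nodes):
        self.name = name
        self.nodes = list(nodes)   # in this sketch the first live node is the leader
        self.docs = {}

    @property
    def leader(self):
        return self.nodes[0]

    def fail(self, node):
        """Take a node offline; if it was the leader, the next node takes over."""
        self.nodes.remove(node)

def route(doc_id, shards):
    """Pick a shard from a hash of the document id (simplified routing)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:4], "big") % len(shards)]

# Three shards, each with one leader and two replicas (hypothetical node names).
shards = [
    Shard("shard%d" % i, ["node%d-%d" % (i, j) for j in range(3)])
    for i in range(1, 4)
]

# Writes go to the shard chosen by the document id.
for n in range(30):
    doc_id = "doc%d" % n
    route(doc_id, shards).docs[doc_id] = {"id": doc_id}

total_docs = sum(len(s.docs) for s in shards)

# Losing a leader leaves the shard available through another node.
old_leader = shards[0].leader
shards[0].fail(old_leader)
new_leader = shards[0].leader
```

Because each document id always hashes to the same shard, updates and deletes for a document reach the shard that holds it.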
  36. SolrCloud
      ● Solr instances
      ● ZooKeeper instances
        ○ management of Solr instances
        ○ leader election
        ○ node discovery
  37. [Diagram] collection (i.e. logical index)
      ● shard 1: ⅓ of the collection
      ● shard 2: ⅓ of the collection
      ● shard 3: ⅓ of the collection
      nodes: leader, replica, replica, leader, replica, replica, leader, replica, replica, replica
  38. [Diagram] collection (i.e. logical index)
      ● shard 1: ⅓ of the collection
      ● shard 2: ⅓ of the collection
      ● shard 3: ⅓ of the collection
      nodes: leader, replica, replica, replica, leader, replica, replica, replica, leader, replica, replica
  39. [Diagram] collection (i.e. logical index)
      ● shard 1: ⅓ of the collection
      ● shard 2: ⅓ of the collection
      ● shard 3: ⅓ of the collection
      nodes: leader, replica, replica, replica, (offline), leader, replica, replica, leader, replica, replica
  40. [Diagram] collection (i.e. logical index)
      ● shard 1: ⅓ of the collection
      ● shard 2: ⅓ of the collection
      ● shard 3: ⅓ of the collection
      nodes: leader, replica, replica, replica, replica, leader, replica, replica, leader, replica, replica
  41. Resources - Books
      ● Lucene in Action
        ○ written by 3 committers and PMC members
        ○ somewhat outdated (2010; covers Lucene 3.0)
        ○ http://www.manning.com/hatcher3/
      ● Solr in Action
        ○ early access; coming out later this year
        ○ http://www.manning.com/grainger/
      ● Apache Solr 4 Cookbook
        ○ common problems and useful tips
        ○ http://www.packtpub.com/apache-solr-4-cookbook/book
  42. Resources - Books
      ● Introduction to Information Retrieval
        ○ not specific to Lucene/Solr, but about IR concepts
        ○ free e-book
        ○ http://nlp.stanford.edu/IR-book/
      ● Managing Gigabytes
        ○ indexing, compression and other topics
        ○ accompanied by MG4J - full-text search software
        ○ http://mg4j.di.unimi.it/
  43. Resources - Web
      ● official websites
        ○ Lucene Core - http://lucene.apache.org/core/
        ○ Solr - http://lucene.apache.org/solr/
      ● mailing lists
      ● Wiki sites
        ○ Lucene Core - http://wiki.apache.org/lucene-java/
        ○ Solr - http://wiki.apache.org/solr/
      ● reference guides
        ○ API Documentation for Lucene and Solr
        ○ Apache Solr Reference Guide
  44. Getting Started
      ● download Solr
        ○ requires Java 6 or newer to run
      ● Solr comes bundled/configured with Jetty
        ○ <Solr directory>/example/start.jar
      ● "exampledocs" directory contains sample documents
        ○ <Solr directory>/example/exampledocs/post.jar
        ○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml
      ● use the Solr admin interface
        ○ http://localhost:8983/solr/