Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)

These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.

http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=cc1e6803-b0ec-4832-b8df-e15ea7bd7694

Published in: Technology

  1. Search Engine-Building with Lucene and Solr, Part 1
     Kai Chan
     SoCal Code Camp, November 2013
  2. Overview
     ● why Lucene/Solr?
     ● what are Lucene and Solr?
     ● how to use Lucene and Solr
       ○ setup
       ○ indexing
       ○ searching
     ● resources
     ● demo
     ● questions/answers
  3. How to Make Your Data Searchable
     ● pay someone to do it
     ● use some solution someone else has written
     ● write some solution yourself
  4. How to Search - One Approach
       for each document d {
         if (query is a substring of d's content) {
           add d to the list of results
         }
       }
       sort the results (or not)
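The pseudocode above can be sketched in Python roughly as follows. This is only an illustration of the linear-scan approach; the function name `naive_search` and the sample documents are assumptions made for the example, not part of the original slides.

```python
def naive_search(documents, query):
    """Return the IDs of documents whose content contains the query as a substring."""
    results = []
    for doc_id, content in enumerate(documents):
        # Every search reads every document, which is why this approach scales poorly.
        if query in content:
            results.append(doc_id)
    return results


# Hypothetical sample data for illustration.
documents = ["it is what it is", "what is it", "it is a banana"]
print(naive_search(documents, "what"))    # [0, 1]
print(naive_search(documents, "banana"))  # [2]
```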
  5. How to Search - Problems
     ● slow
       ○ reads the whole dataset for each search
     ● not scalable
       ○ if your dataset grows by 10x, your search slows down by 10x
     ● how to show the most relevant documents first?
       ○ list of results can be quite long
       ○ users have limited time and patience
  6. Inverted Index - Introduction
     ● like the "index" at the end of books
     ● a map of one of the following types
       ○ term → document list
       ○ term → <document, position> list
  7. documents:
       T[0] = "it is what it is"
       T[1] = "what is it"
       T[2] = "it is a banana"
     inverted index (without positions):
       "a": {2}
       "banana": {2}
       "is": {0, 1, 2}
       "it": {0, 1, 2}
       "what": {0, 1}
     inverted index (with positions):
       "a": {(2, 2)}
       "banana": {(2, 3)}
       "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
       "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
       "what": {(0, 2), (1, 0)}
     Credit: Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)
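As a rough sketch, both kinds of inverted index in the example above can be built with a few lines of Python. It assumes whitespace tokenization and 0-based positions (matching the example); the function name `build_index` is made up for illustration and is not how Lucene stores its index.

```python
from collections import defaultdict


def build_index(documents):
    """Build term -> document set and term -> (document, position) list maps."""
    doc_index = defaultdict(set)
    pos_index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # Assumption: terms are separated by whitespace; no stemming or lowercasing.
        for position, term in enumerate(text.split()):
            doc_index[term].add(doc_id)
            pos_index[term].append((doc_id, position))
    return doc_index, pos_index


documents = ["it is what it is", "what is it", "it is a banana"]
doc_index, pos_index = build_index(documents)
print(sorted(doc_index["what"]))  # [0, 1]
print(pos_index["is"])            # [(0, 1), (0, 4), (1, 1), (2, 1)]
```

A lookup now touches only the entry for the query term instead of scanning every document.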
  8. Inverted Index - Speed
     ● term list
       ○ typically very small
       ○ grows slowly
     ● term lookup
       ○ O(1) to O(log(number of terms))
     ● for a particular term
       ○ document lists: very small
       ○ document + position lists: still small
     ● few terms per query
  9. Inverted Index - Relevance
     ● information in the index enables:
       ○ determination (scoring) of relevance of each document to the query
       ○ comparison of relevance among documents
       ○ sorting by (decreasing) relevance
         ■ i.e. the most relevant document first
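To make the idea of scoring concrete, here is a minimal sketch that ranks documents with a plain TF-IDF weight (term frequency times the log of inverse document frequency). This is an assumption chosen for illustration; it is not Lucene's actual scoring formula, and the function name `rank` is hypothetical.

```python
import math
from collections import Counter


def rank(documents, query):
    """Return IDs of matching documents, most relevant first (simple TF-IDF)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = query.split()
    scored = []
    for doc_id, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        matched = [t for t in query_terms if t in term_freq]
        if not matched:
            continue  # document contains none of the query terms
        score = sum(term_freq[t] * math.log(n_docs / doc_freq[t]) for t in matched)
        scored.append((score, doc_id))
    # Sort by decreasing score; break ties by document ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


documents = ["it is what it is", "what is it", "it is a banana"]
print(rank(documents, "what banana"))  # [2, 0, 1]
```

Document 2 ranks first because "banana" occurs in only one document and therefore carries more weight than "what", which occurs in two.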
  10. Lucene vs. Solr - Lucene
      ● full-text search library
      ● creates, updates and reads from the index
      ● takes queries and produces search results
      ● your application creates objects and calls methods in the Lucene API
      ● provides building blocks for custom features
  11. Lucene vs. Solr - Solr
      ● full-text search platform
      ● uses Lucene for indexing and search
      ● REST-like API over HTTP
      ● different output formats (e.g. XML, JSON)
      ● provides some features not built into Lucene
  12. [Diagram]
      Lucene:
        machine running Java VM
          your application (Lucene code, libraries)
          index
      Solr:
        client ↔ HTTP ↔ machine running Java VM
          servlet container (e.g. Tomcat, Jetty)
            Solr (Solr code, Lucene code, libraries)
          index
  13. Workflow
      Setup → Indexing → Search
  14. Workflow - Setup
      ● servlet configuration
        ○ e.g. port number, max POST size
        ○ you can usually use the default settings
      ● Solr configuration
        ○ e.g. data directory, deduplication, language identification, highlighting
        ○ you can usually use the default settings
      ● schema definition
        ○ defines fields in your documents
        ○ you can use the default settings if you name your fields in a certain way
  15. How Data Are Organized
      [Diagram] a collection contains documents; each document contains fields
  16. [Diagram] a field has:
      ● name (e.g. "title" or "price")
      ● content (e.g. "please read" or 30)
      ● type
      ● options
  17. [Diagram] a collection of three documents drawn from the fields subject, date, from, reply-to and text; not every document has every field
  18. [Diagram] a collection of three documents with different fields:
      ● document 1: subject, date, from, text
      ● document 2: title, SKU, price, description
      ● document 3: first name, last name, phone, address
  19. Solr Field Definition
      ● field
        ○ name (e.g. "subject")
        ○ type (e.g. "text_general")
        ○ options (e.g. indexed="true" stored="true")
      ● field type
        ○ text: "string", "text_general"
        ○ numeric: "int", "long", "float", "double"
      ● options
        ○ indexed: content can be searched
        ○ stored: content can be returned at search-time
        ○ multivalued: multiple values per field & document
  20. Solr Dynamic Field
      ● define field by naming convention
      ● "amount_i": int, indexed, stored
      ● "tag_ss": string, indexed, stored, multivalued

      name    type          indexed  stored  multiValued
      *_i     int           true     true    false
      *_l     long          true     true    false
      *_f     float         true     true    false
      *_d     double        true     true    false
      *_s     string        true     true    false
      *_ss    string        true     true    true
      *_t     text_general  true     true    false
      *_txt   text_general  true     true    true
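The naming convention can be pictured as a suffix lookup. The sketch below only mimics the table above in Python; it is not how Solr resolves dynamic fields internally, and the names `DYNAMIC_FIELDS` and `resolve_field` are hypothetical.

```python
# Suffix -> (field type, multiValued), copied from the dynamic field table.
DYNAMIC_FIELDS = {
    "_i": ("int", False),
    "_l": ("long", False),
    "_f": ("float", False),
    "_d": ("double", False),
    "_s": ("string", False),
    "_ss": ("string", True),
    "_t": ("text_general", False),
    "_txt": ("text_general", True),
}


def resolve_field(field_name):
    """Return (type, multiValued) for a dynamic field name, or None if no pattern matches."""
    for suffix, definition in DYNAMIC_FIELDS.items():
        if field_name.endswith(suffix):
            return definition
    return None


print(resolve_field("amount_i"))  # ('int', False)
print(resolve_field("tag_ss"))    # ('string', True)
print(resolve_field("title"))     # None
```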
  21. Solr Copy Field
      ● copy one or more fields into another field
      ● can be used to define a catch-all field
        ○ source: "title", "author", "description"
        ○ destination: "text"
        ○ searching the "text" field has the effect of searching all three source fields
  22. Indexing - UpdateRequestHandler
      ● upload (POST) content or file to http://host:port/solr/update
      ● formats: XML, JSON, CSV
  23. XML:
        <add>
          <doc>
            <field name="id">apple</field>
            <field name="compName">Apple</field>
            <field name="address">1 Infinite Way, Cupertino CA</field>
          </doc>
          <doc>
            <field name="id">asus</field>
            <field name="compName">ASUS Computer</field>
            <field name="address">800 Corporate Way Fremont, CA 94539</field>
          </doc>
        </add>
      JSON:
        [
          {"id":"apple","compName_s":"Apple","address_s":"1 Infinite Way, Cupertino CA"},
          {"id":"asus","compName_s":"Asus Computer","address_s":"800 Corporate Way Fremont, CA 94539"}
        ]
      CSV:
        id,compName_s,address_s
        apple,Apple,"1 Infinite Way, Cupertino CA"
        asus,Asus Computer,"800 Corporate Way Fremont, CA 94539"
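As an illustration, the JSON documents above could be posted to the update handler from Python's standard library along these lines. This is a sketch under assumptions: the host and port (`localhost:8983`), the `commit=true` parameter, and the helper name `build_update_request` are choices made for the example. The request is only built here, not sent, since sending requires a running Solr instance.

```python
import json
import urllib.request


def build_update_request(docs, base_url="http://localhost:8983/solr"):
    """Build (but do not send) a POST request carrying the documents as JSON."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/update?commit=true",  # assumed: commit right after adding
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = [
    {"id": "apple", "compName_s": "Apple",
     "address_s": "1 Infinite Way, Cupertino CA"},
    {"id": "asus", "compName_s": "Asus Computer",
     "address_s": "800 Corporate Way Fremont, CA 94539"},
]
request = build_update_request(docs)
print(request.get_method(), request.full_url)

# To actually send it (requires a running Solr instance):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```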
  24. Indexing - DataImportHandler
      ● has its own config file (data-config.xml)
      ● imports data from various sources
        ○ RDBMS (JDBC)
        ○ e-mail (IMAP)
        ○ XML data locally (file) or remotely (HTTP)
      ● transformers
        ○ extract data (RegEx, XPath)
        ○ manipulate data (strip HTML tags)
  25. Indexing - ExtractingRequestHandler
      ● allows indexing of different formats
        ○ e.g. PDF, MS Word, XML
      ● uses Apache Tika to extract text and metadata
        ○ Tika: a framework for different file format parsers (e.g. PDFBox for PDF, Apache POI for MS Word)
      ● maps extracted text to the "content" field
      ● maps metadata (e.g. MIME type) to different fields
  26. Searching - Basics
      ● send request to http://host:port/solr/select
      ● parameters
        ○ q - main query
        ○ fq - filter query
        ○ defType - query parser (e.g. lucene, edismax)
        ○ fl - fields to return
        ○ sort - sort criteria
        ○ wt - response writer (e.g. xml, json)
        ○ indent - set to true for pretty-printing
  27. http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json
      ● http://localhost:8983/solr/select - search handler's URL
      ● q=title:tablet - main query
      ● fl=title,price,inStock - fields to return
      ● sort=price+asc - sort criteria
      ● wt=json - response writer
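A search URL like the one above can be assembled with Python's standard library, which also takes care of URL-encoding the parameter values. The helper name `build_search_url` and the default base URL are assumptions for this sketch.

```python
from urllib.parse import urlencode


def build_search_url(params, base_url="http://localhost:8983/solr/select"):
    """Append URL-encoded search parameters to the search handler's URL."""
    return base_url + "?" + urlencode(params)


url = build_search_url({
    "q": "title:tablet",           # main query
    "fl": "title,price,inStock",   # fields to return
    "sort": "price asc",           # sort criteria
    "wt": "json",                  # response writer
})
print(url)
# http://localhost:8983/solr/select?q=title%3Atablet&fl=title%2Cprice%2CinStock&sort=price+asc&wt=json
```

The encoded form (`%3A` for the colon, `%2C` for commas, `+` for the space) is equivalent to the more readable URL shown on the slide.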
  28. Searching - Query Syntax - Field
      ● search a specific field
        ○ field_name:value
      ● if field omitted, Solr uses default field:
        ○ df parameter in URL
        ○ defaultSearchField setting in schema.xml
        ○ "text"
  29. Searching - Query Syntax - Term
      ● a term by itself: matches documents that contain that term
        ○ e.g. tablet
  30. Searching - Query Syntax - Boolean
      ● "conventional" boolean operators supported
        ○ AND &&
        ○ OR ||
        ○ NOT !
      ● e.g. a AND b
        ○ all of a, b must occur
      ● e.g. a OR b
        ○ at least one of a, b must occur
      ● e.g. a AND NOT b
        ○ a must occur and b must not occur
  31. Searching - Query Syntax - Boolean
      ● Lucene/Solr's boolean operators are not true boolean operators
      ● e.g. a OR b OR c does not behave like (a OR b) OR c
        ○ instead, a OR b OR c means at least one of a, b, c must occur
      ● parentheses are supported
  32. Searching - Query Syntax - Boolean
      ● "+" prefix means "must"
      ● "-" prefix means "must not"
      ● no prefix means "at least one must" (by default)
        ○ e.g. a b c
          ■ at least one of a, b, c must occur
      ● operators can mix
        ○ e.g. +a b c d -e
          ■ a must occur
          ■ at least one of b, c, d must occur
          ■ e must not occur
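The prefix semantics described above can be sketched as a simple matcher over a document's terms. It follows the rules exactly as stated on the slide; Lucene's real handling of unprefixed clauses that appear next to required ones can differ depending on settings such as minimum-should-match. The function name `matches` and its parameters are hypothetical.

```python
def matches(doc_terms, must=(), should=(), must_not=()):
    """Check a document against "+" (must), unprefixed (should) and "-" (must not) terms."""
    terms = set(doc_terms)
    if any(term not in terms for term in must):
        return False  # a "+" term is missing
    if any(term in terms for term in must_not):
        return False  # a "-" term is present
    if should and not any(term in terms for term in should):
        return False  # none of the unprefixed terms occurs
    return True


# Query: +a b c d -e
query = {"must": ["a"], "should": ["b", "c", "d"], "must_not": ["e"]}
print(matches("a b".split(), **query))    # True
print(matches("a b e".split(), **query))  # False ("e" occurs)
print(matches("b c".split(), **query))    # False ("a" is missing)
```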
  33. Searching - Query Syntax - Phrase
      ● phrases are enclosed by double-quotes
      ● e.g. +"the phrase"
        ○ the phrase must occur
      ● e.g. -"the phrase"
        ○ the phrase must not occur
  34. Searching - Query Syntax - Boost
      ● manually assign different weights to clauses
      ● gives more weight to a field
        ○ e.g. title:a^10 body:a
      ● gives more weight to a word
        ○ e.g. title:a title:b^10
      ● gives phrases more weight than words
        ○ e.g. title:(+a +b) title:"a b"^10
  35. Searching - Query Syntax - Range
      ● matches field values within a range
        ○ inclusive range - denoted by square brackets
        ○ exclusive range - denoted by curly brackets
      ● e.g. age:[10 TO 20]
        ○ matches the field "age" with the value in 10..20
      ● string or numeric comparison, depending on the field's type
      ● open-ended range supported
      ● e.g. age:[10 TO *]
        ○ matches the field "age" with the value 10 or larger
  36. Searching - Query Syntax - EDisMax
      ● suitable for user-generated queries
        ○ does not complain about the syntax
        ○ searches for individual words across several fields ("disjunction")
        ○ uses max score of a word in all fields for scoring ("max")
      ● configurable (in solrconfig.xml)
        ○ what fields to search the words in
        ○ boosting of these fields
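The "disjunction" and "max" ideas can be illustrated with a toy calculation: for each query word, take its best boosted score across the configured fields, then add the per-word results. This is a simplified sketch, not Solr's EDisMax implementation (which has further parameters); the per-field scores, the boosts, and the name `dismax_score` are made-up assumptions.

```python
def dismax_score(query_words, field_scores, boosts):
    """Sum, over the query words, of each word's maximum boosted score across fields."""
    total = 0.0
    for word in query_words:
        per_field = [
            field_scores.get(field, {}).get(word, 0.0) * boost
            for field, boost in boosts.items()
        ]
        total += max(per_field, default=0.0)
    return total


# Hypothetical per-field scores of each word for one document.
field_scores = {
    "title": {"tablet": 2.0},
    "description": {"tablet": 1.0, "cheap": 0.5},
}
boosts = {"title": 10.0, "description": 1.0}
print(dismax_score(["tablet", "cheap"], field_scores, boosts))  # 20.5
```

Here "tablet" contributes 20.0 (its boosted title score beats its description score) and "cheap" contributes 0.5.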
  37. Sorting
      ● default: sorting by decreasing score
      ● custom sorting rules: use the sort parameter
        ○ syntax: fieldName (asc|desc)
        ○ e.g. sort by ascending price (i.e. lowest price first): price asc
        ○ e.g. sort by descending date (i.e. newest date first): date desc
  38. Sorting
      ● special field names
        ○ use score for score and _docid_ for document ID
        ○ e.g. sort by ascending score: score asc
        ○ e.g. sort by descending document ID: _docid_ desc
  39. Sorting
      ● multiple fields and orders: separate by commas
        ○ e.g. sort by descending starRating and ascending price:
        ○ starRating desc, price asc
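For intuition, the ordering produced by `starRating desc, price asc` corresponds to sorting on a compound key: the first field decides, and the second only breaks ties. The sample products below are invented for illustration.

```python
# Hypothetical documents with the two fields used in the sort example.
products = [
    {"name": "A", "starRating": 4, "price": 30.0},
    {"name": "B", "starRating": 5, "price": 50.0},
    {"name": "C", "starRating": 5, "price": 20.0},
]

# Negating starRating gives descending order; price stays ascending.
ordered = sorted(products, key=lambda p: (-p["starRating"], p["price"]))
print([p["name"] for p in ordered])  # ['C', 'B', 'A']
```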
  40. Sorting
      ● cannot use multivalued fields
      ● overrides the default sorting behavior
  41. Faceted Search
      ● facet values: (distinct) values or (generally non-overlapping) ranges of a field
      ● displaying facets
        ○ show possible values
        ○ let users narrow down their searches easily
  42. [Screenshot: a facet and its facet values (5 of them)]
  43. Faceted Search
      ● set facet parameter to true - enables faceting
      ● other parameters
        ○ facet.field - use the field's values as facets
          ■ return <value, count> pairs
        ○ facet.query - use the given queries as facets
          ■ return <query, count> pairs
        ○ facet.sort - set the ordering of the facets
          ■ can be "count" or "index"
        ○ facet.offset and facet.limit - used for pagination of facets
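What `facet.field` returns can be pictured as counting how many matching documents carry each value of the field. The sketch below imitates the `<value, count>` pairs along with the sort, offset and limit options in plain Python; it is not Solr's implementation, and the `facet_field` function and the sample data are assumptions.

```python
from collections import Counter


def facet_field(docs, field, sort="count", offset=0, limit=100):
    """Return (value, count) pairs for a field, ordered by count or by value."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    if sort == "count":
        # Highest count first; ties broken by value.
        pairs = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    else:
        # "index" order: sorted by the value itself.
        pairs = sorted(counts.items())
    return pairs[offset:offset + limit]


# Hypothetical search results with a "category" field.
docs = [
    {"category": "tablet"}, {"category": "phone"}, {"category": "tablet"},
    {"category": "laptop"}, {"category": "phone"}, {"category": "tablet"},
]
print(facet_field(docs, "category"))
# [('tablet', 3), ('phone', 2), ('laptop', 1)]
print(facet_field(docs, "category", sort="index"))
# [('laptop', 1), ('phone', 2), ('tablet', 3)]
```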
  44. Resources - Books
      ● Lucene in Action
        ○ written by 3 committers and PMC members
        ○ somewhat outdated (2010; covers Lucene 3.0)
        ○ http://www.manning.com/hatcher3/
      ● Solr in Action
        ○ early access; coming out later this year
        ○ http://www.manning.com/grainger/
      ● Apache Solr 4 Cookbook
        ○ common problems and useful tips
        ○ http://www.packtpub.com/apache-solr-4cookbook/book
  45. Resources - Books
      ● Introduction to Information Retrieval
        ○ not specific to Lucene/Solr, but about IR concepts
        ○ free e-book
        ○ http://nlp.stanford.edu/IR-book/
      ● Managing Gigabytes
        ○ indexing, compression and other topics
        ○ accompanied by MG4J - a full-text search software
        ○ http://mg4j.di.unimi.it/
  46. Resources - Web
      ● official websites
        ○ Lucene Core - http://lucene.apache.org/core/
        ○ Solr - http://lucene.apache.org/solr/
      ● mailing lists
      ● Wiki sites
        ○ Lucene Core - http://wiki.apache.org/lucene-java/
        ○ Solr - http://wiki.apache.org/solr/
      ● reference guides
        ○ API Documentation for Lucene and Solr
        ○ Apache Solr Reference Guide
  47. Getting Started
      ● download Solr
        ○ requires Java 6 or newer to run
      ● Solr comes bundled/configured with Jetty
        ○ <Solr directory>/example/start.jar
      ● "exampledocs" directory contains sample documents
        ○ <Solr directory>/example/exampledocs/post.jar
        ○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml
      ● use the Solr admin interface
        ○ http://localhost:8983/solr/
