Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)

Slides for my presentation at SoCal Code Camp, June 29, 2014
(http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6337660f-37de-4d6e-a5bc-46ba54478e5e)

  • challenges with searching
    * size of all data can be huge (GB, even TBs)
    ** searching by going through all data can take too long and might not work
    * often, a user is only going to see a tiny fraction of the matched data (limited time and attention)
    ** essential to show the most relevant result first/at the top

    black box
    * benefits: speed, relevance
    * cost: pre-processing (time, space) - indexing
  • * collections - all data you have
    * a collection can have many documents
    * a document can have many fields
  • a field can have
    * name
    * content
    * type and options (will talk about them later)
  • * each field is optional, i.e. a particular document doesn’t have to have every field
  • * in fact, a collection can contain different kinds of documents, with different fields among them
    * e-mail
    * product
    * contact
  • * these are just examples
    * Solr documentation has the full list
  • * Solr’s documentation has the exact formats required
  • * part before colon is the field name, part after colon is the field value
    * search for phrase: quote the phrase with double-quotes
    * separate two or more clauses by spaces: a document must match at least one of the clauses to be in the result set
    * “+” before a clause: a document must match the clause, for the document to be in the result set
  • * parentheses: group clauses
    * “-” before a clause: a document must NOT match the clause, for the document to be in the result set
    * to match a range, surround the lower bound and upper bound with square brackets
    * boost a clause by adding “^” and a number (>1: more emphasis, <1: less emphasis)
  • things to configure in solrconfig.xml:
    * what fields to search the words in
    * boosting of these fields
  • special field names:
    * “score”: document score
    * _docid_: document ID
  • e.g. merchants with store locations
  • distribution
    * spread the collection across multiple machines

    distribution
    * spread the requests across multiple machines

    replication
    * copy data and configuration across multiple machines
    * make sure no single point of failure

    1. 1. Search Engine-Building with Lucene and Solr Kai Chan SoCal Code Camp, June 2014 http://bit.ly/sdcodecamp2014solr
    2. 2. [Diagram: nested sets] all data ⊃ matched data ⊃ data that a user actually sees
    3. 3. Lucene ● full-text search library ● creates, updates and reads from the index ● takes queries and produces search results ● your application creates objects and calls methods in the Lucene API ● provides building blocks for custom features
    4. 4. Solr ● full-text search platform ● uses Lucene for indexing and search ● REST-like API over HTTP ● different output formats (e.g. XML, JSON) ● provides some features not built into Lucene
    5. 5. [Diagram] Lucene: a machine running a Java VM hosts your application (Lucene code, libraries) and the index ● Solr: a client talks over HTTP to a machine running a Java VM, where a servlet container (e.g. Tomcat, Jetty) hosts Solr (Solr code, Lucene code, libraries) and the index
    6. 6. How Data Are Organized collection document document document field field field field field field field field field
    7. 7. field content (e.g. "please read" or 30) name (e.g. "title" or "price") type options
    8. 8. collection document document document subject date from subject date from date from text text reply-to text reply-to
    9. 9. collection document document document subject date from title SKU price last name phone text description first name address
    10. 10. Solr Field Definition ● field o name (e.g. "subject") o type (e.g. "text_general") o options (e.g. indexed="true" stored="true") ● field type o text: "string", "text_general" o numeric: "int", "long", "float", "double" ● options o indexed: content can be searched o stored: content can be returned at search-time o multivalued: multiple values per field & document
    11. 11. Solr Dynamic Field ● define field by naming convention ● "amount_i": int, indexed, stored ● "tag_ss": string, indexed, stored, multivalued
    12. 12. Solr Copy Field ● copy one or more fields into another field ● can be used to define a catch-all field o source: "title", "author", "content" o destination: "text" o searching the "text" field has the effect of searching all the other three fields
    13. 13. Indexing - UpdateRequestHandler ● upload (POST) content or file to http://host:port/solr/update ● formats: XML, JSON, CSV
    14. 14. Indexing - DataImportHandler ● has its own config file (data-config.xml) ● import data from various sources o RDBMS (JDBC) o e-mail (IMAP) o XML data locally (file) or remotely (HTTP) ● transformers o extract data (RegEx, XPath) o manipulate data (strip HTML tags)
    15. 15. Indexing - ExtractingRequestHandler ● allows indexing of different formats o e.g. PDF, MS Word, XML ● extract text and metadata ● maps extracted text to the “content” field ● maps metadata to different fields
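As a rough illustration of the UpdateRequestHandler upload described above, the following sketch builds (but does not send) a JSON POST request. The host, port, `commit=true` parameter, and the sample document are assumptions made for the example; the field names `amount_i` and `tag_ss` follow the dynamic-field naming convention from the slides, and `id` and `title` are assumed to exist in the schema.

```python
import json
import urllib.request

# Hypothetical endpoint; adjust host, port and path to your own Solr setup.
UPDATE_URL = "http://localhost:8983/solr/update?commit=true"

# A made-up document. "amount_i" and "tag_ss" rely on dynamic fields;
# "id" and "title" are assumed to be defined in the schema.
docs = [
    {
        "id": "doc-1",
        "title": "please read",
        "amount_i": 30,
        "tag_ss": ["inbox", "urgent"],  # multivalued field
    },
]


def build_update_request(url, documents):
    """Return a POST request that carries the documents as a JSON array."""
    body = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request(UPDATE_URL, docs)

# To actually send it (requires a running Solr instance):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before anything is indexed.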
    16. 16. Searching - Basics ● send request to http://host:port/solr/select ● parameters o q - main query o fq - filter query o defType - query parser (e.g. lucene, edismax) o fl - fields to return o sort - sort criteria o wt - response writer (e.g. xml, json) o indent - set to true for pretty-printing
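    17. 17. http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json ● search handler's URL: http://localhost:8983/solr/select ● main query: q=title:tablet ● fields to return: fl=title,price,inStock ● sort criteria: sort=price+asc ● response writer: wt=json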
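A search URL like the one on this slide can be assembled programmatically so that special characters are escaped correctly. This is a minimal sketch using only the Python standard library; the base URL and field names are taken from the slide, and the explicit `asc` direction on the sort is an assumption added so the sort criterion is complete.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, params):
    """Append URL-encoded query parameters to the search handler's URL."""
    return base_url + "?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr/select",
    {
        "q": "title:tablet",           # main query
        "fl": "title,price,inStock",   # fields to return
        "sort": "price asc",           # sort criteria (direction assumed)
        "wt": "json",                  # response writer
    },
)
print(url)
```

`urlencode` turns the space in `price asc` into `+`, which is why such URLs are usually written as `sort=price+asc`.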
    18. 18. Searching - Query Syntax ● name:tablet ● name:"galaxy tab" ● name:tablet category:tablet ● +name:tablet +category:tablet
    19. 19. Searching - Query Syntax (cont.) ● +name:tablet +(manu:apple manu:samsung) ● +name:tablet -manu:apple ● +name:tablet +range:[300 TO 500] ● +name:tablet manu:apple^5
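The query syntax on the two slides above can be composed from small string helpers. The helper names below (`term`, `must`, `must_not`, `any_of`, `in_range`, `boost`) are hypothetical and exist only for this sketch, and the `price` field in the range example is an assumption. The helpers do no escaping of special characters, so they are not suitable for untrusted input.

```python
def term(field, value):
    """field:value, quoting the value as a phrase if it contains whitespace."""
    if any(ch.isspace() for ch in value):
        value = '"' + value + '"'
    return f"{field}:{value}"


def must(clause):
    """'+' prefix: a document must match the clause."""
    return "+" + clause


def must_not(clause):
    """'-' prefix: a document must NOT match the clause."""
    return "-" + clause


def any_of(*clauses):
    """Parentheses group clauses; space-separated clauses are optional matches."""
    return "(" + " ".join(clauses) + ")"


def in_range(field, lower, upper):
    """Inclusive range, written with square brackets."""
    return f"{field}:[{lower} TO {upper}]"


def boost(clause, factor):
    """'^' plus a number changes the emphasis of a clause."""
    return f"{clause}^{factor}"


phrase_query = term("name", "galaxy tab")
grouped_query = " ".join(
    [must(term("name", "tablet")),
     must(any_of(term("manu", "apple"), term("manu", "samsung")))]
)
excluded_query = " ".join(
    [must(term("name", "tablet")), must_not(term("manu", "apple"))]
)
range_query = " ".join(
    [must(term("name", "tablet")), must(in_range("price", 300, 500))]
)
boosted_query = " ".join(
    [must(term("name", "tablet")), boost(term("manu", "apple"), 5)]
)
```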
    20. 20. EDisMax Parser ● suitable for user-generated queries o does not complain about the syntax o does not require field name in query o searches across several fields ● configurable
    21. 21. Sorting ● default: sorting by decreasing score ● custom sorting rules: use the sort parameter o syntax: fieldName (asc|desc) o e.g. sort by ascending price (i.e. lowest price first): price asc o e.g. sort by descending date (i.e. newest date first): date desc
    22. 22. Sorting ● multiple fields and orders: separate by commas o e.g. sort by descending starRating and ascending price: o starRating desc, price asc
    23. 23. Sorting ● cannot use multivalued fields ● overrides the default sorting behavior
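The sort syntax from the slides above (a field name followed by `asc` or `desc`, with multiple criteria separated by commas) can be generated and validated with a small helper. `build_sort` is a hypothetical name used only for this sketch.

```python
def build_sort(criteria):
    """Turn [(field, direction), ...] into the value of the sort parameter."""
    parts = []
    for field, direction in criteria:
        if direction not in ("asc", "desc"):
            raise ValueError(f"direction must be 'asc' or 'desc', got {direction!r}")
        parts.append(f"{field} {direction}")
    return ",".join(parts)


# Best-rated first; among equal ratings, cheapest first.
sort_value = build_sort([("starRating", "desc"), ("price", "asc")])
print(sort_value)
```

Rejecting unknown directions early avoids sending a request that the server would refuse to parse.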
    24. 24. Faceted Search ● facet values: (distinct) values or (generally non-overlapping) ranges of a field ● displaying facets o show possible values o let users narrow down their searches easily
    25. 25. facet facet values (5 of them)
    26. 26. Faceted Search ● set facet parameter to true - enables faceting ● other parameters o facet.field - use the field's values as facets; returns <value, count> pairs o facet.query - use the given queries as facets; returns <query, count> pairs o facet.sort - set the ordering of the facets; can be "count" or "index" o facet.offset and facet.limit - used for pagination of facets
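The facet parameters above can be sketched as follows. The field names `manu` and `category` and the sample response are made up for illustration. The pairing helper assumes the JSON response lists each facet field as a flat array of alternating values and counts, which is how Solr's JSON writer formats them by default as far as I know; check your own responses before relying on it.

```python
from urllib.parse import urlencode

# A list of tuples lets the same parameter (facet.field) appear more than once.
facet_params = [
    ("q", "name:tablet"),
    ("facet", "true"),
    ("facet.field", "manu"),
    ("facet.field", "category"),
    ("facet.sort", "count"),
    ("facet.offset", "0"),
    ("facet.limit", "5"),
    ("wt", "json"),
]
query_string = urlencode(facet_params)


def pair_facet_counts(flat):
    """Convert [value, count, value, count, ...] into [(value, count), ...]."""
    return list(zip(flat[::2], flat[1::2]))


# Made-up response fragment, shaped like the assumed JSON output.
sample_response = {
    "facet_counts": {
        "facet_fields": {"manu": ["samsung", 12, "apple", 9, "asus", 4]}
    }
}
manu_facets = pair_facet_counts(
    sample_response["facet_counts"]["facet_fields"]["manu"]
)
```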
    27. 27. Spatial Search ● data: locations (longitudes, latitudes) ● search: filter and/or sort by location
    28. 28. Filter by Location ● geofilt o circle centered at a given point o distance from a given point o fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5 ● bbox o square (“bounding box”) centered at a given point o distance from a given point + corners o fq={!bbox sfield=store}&pt=45.15,-93.85&d=5 Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
    29. 29. geofilt bbox 5 km 5 km (45.15, -93.85) (45.15, -93.85) Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
    30. 30. geofilt bbox 5 km 5 km (45.15, -93.85) (45.15, -93.85) x o o x x x o o o o x o Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
    31. 31. Sort by Location ● geodist o returns the distance between the location given in a field and a certain coordinate o e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score: q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
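The difference between the circle (geofilt) and the square (bbox) filters shown above can be illustrated with plain geometry. This sketch is not how Solr implements the filters; it uses the haversine formula with an assumed mean Earth radius of 6371 km, and the two test points are made up. It shows why a point near a corner of the box passes bbox but not geofilt.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_circle(center, point, d_km):
    """Circle test: within d_km of the center in any direction."""
    return haversine_km(center[0], center[1], point[0], point[1]) <= d_km


def in_box(center, point, d_km):
    """Box test: within d_km north-south AND within d_km east-west."""
    north_south = haversine_km(center[0], center[1], point[0], center[1])
    east_west = haversine_km(center[0], center[1], center[0], point[1])
    return north_south <= d_km and east_west <= d_km


center = (45.15, -93.85)
near_point = (45.17, -93.85)      # a short way north of the center
corner_point = (45.186, -93.799)  # roughly 4 km north and 4 km east

for label, point in (("near", near_point), ("corner", corner_point)):
    print(label, in_circle(center, point, 5), in_box(center, point, 5))
```

The corner point lies inside the square but outside the circle, which is the kind of point a bbox filter returns and a geofilt filter drops.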
    32. 32. Scaling/Redundancy ● problem: collection too large for a single machine; solution: distribution ● problem: too many requests for a single machine; solution: distribution ● problem: a machine can go down; solution: replication
    33. 33. SolrCloud ● Solr instances o collection (logical index) divided into one or more partial collections (“shards”) o for each shard, one or more Solr instances keep copies of the data  one as leader - handles reads and writes  others as replicas - handle reads ● ZooKeeper instances
    34. 34. SolrCloud ● Solr instances ● ZooKeeper instances o management of Solr instances o leader election o node discovery
    35. 35. leader replica replica leader replica leader replica shard 1: ⅓ of the collection shard 2: ⅓ of the collection shard 3: ⅓ of the collection collection (i.e. logical index) replica replica replica
    36. 36. leader replica replica leader replica leader replica shard 1: ⅓ of the collection shard 2: ⅓ of the collection shard 3: ⅓ of the collection collection (i.e. logical index) replica replica replica replica
    37. 37. leader replica replica (offline) leader leader replica shard 1: ⅓ of the collection shard 2: ⅓ of the collection shard 3: ⅓ of the collection collection (i.e. logical index) replica replica replica replica
    38. 38. leader replica replica replica leader leader replica shard 1: ⅓ of the collection shard 2: ⅓ of the collection shard 3: ⅓ of the collection collection (i.e. logical index) replica replica replica replica
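The shard diagrams above (a leader and replicas per shard, with a replica taking over when a leader goes offline) can be mimicked with a toy model. This is only a sketch of the idea: the `Shard` class and node names are hypothetical, and the "first remaining node becomes leader" rule is an assumption standing in for the real leader election, which SolrCloud coordinates through ZooKeeper.

```python
class Shard:
    """Toy model of one shard: a leader plus replicas holding the same data."""

    def __init__(self, name, nodes):
        self.name = name
        self.online = list(nodes)
        self.leader = self.online[0] if self.online else None

    def mark_offline(self, node):
        """Remove a node; if it was the leader, promote a remaining node."""
        self.online.remove(node)
        if node == self.leader:
            # Simplified stand-in for leader election.
            self.leader = self.online[0] if self.online else None

    def add_replica(self, node):
        """Bring a new or recovered node online as a replica."""
        self.online.append(node)
        if self.leader is None:
            self.leader = node

    def write_target(self):
        """Writes go to the leader."""
        return self.leader

    def read_targets(self):
        """Reads can be served by the leader or any replica."""
        return list(self.online)


shard1 = Shard("shard 1", ["node-a", "node-b", "node-c"])
shard1.mark_offline("node-a")   # the leader goes down
shard1.add_replica("node-d")    # a replacement replica joins
print(shard1.write_target(), shard1.read_targets())
```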
    39. 39. Resources - Books ● Solr in Action o just released, up-to-date o http://www.manning.com/grainger/ ● Apache Solr 4 Cookbook o common problems and useful tips o http://www.packtpub.com/apache-solr-4-cookbook/book ● Lucene in Action o written by 3 committers and PMC members o somewhat outdated (2010; covers Lucene 3.0) o http://www.manning.com/hatcher3/
    40. 40. Resources - Books ● Introduction to Information Retrieval o not specific to Lucene/Solr, but about IR concepts o free e-book o http://nlp.stanford.edu/IR-book/ ● Managing Gigabytes o indexing, compression and other topics o accompanied by MG4J - a full-text search software o http://mg4j.di.unimi.it/
    41. 41. Resources - Web ● official website o http://lucene.apache.org/ o Wiki o reference guide o mailing list ● StackOverflow o http://stackoverflow.com/ o “Lucene” and “Solr” tags
    42. 42. Getting Started ● download Solr o requires Java 7 or newer to run ● Solr comes bundled/configured with Jetty o <Solr directory>/example/start.jar ● "exampledocs" directory contains sample documents o <Solr directory>/example/exampledocs/post.jar o java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml ● use the Solr admin interface o http://localhost:8983/solr/
    43. 43. Thanks for Coming! ● Java Performance Tips @ 10:15, same room ● slides available o http://bit.ly/sdcodecamp2014solr ● please vote for my conference session o http://bit.ly/tvnews2014 ● questions/feedback o kai@ssc.ucla.edu ● questions?
