Search Engine-Building with Lucene and Solr
These are the slides for the session I presented at SoCal Code Camp San Diego on July 27, 2013.

http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6b28337d-6eae-4003-a664-5ed719f43533

Presentation Transcript

  • Search Engine-Building with Lucene and Solr Kai Chan SoCal Code Camp, July 2013
  • How to Search - One Approach
      for each document d {
        if (query is a substring of d's content) {
          add d to the list of results
        }
      }
      sort the results (or not)
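The loop above can be sketched in Python roughly as follows. This is a minimal illustration, not code from the talk: the function name `naive_search` and the three sample strings are assumptions made for the example.

```python
def naive_search(documents, query):
    """Return the ids of the documents whose content contains the query."""
    results = []
    for doc_id, content in enumerate(documents):
        # The substring test has to scan the text of every document.
        if query in content:
            results.append(doc_id)
    return results


# Hypothetical sample data for illustration.
documents = ["it is what it is", "what is it", "it is a banana"]
print(naive_search(documents, "what"))
```

Every call reads every document, so the cost of one search grows with the size of the whole dataset.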
  • How to Search - Problems ● slow ○ reads the whole dataset for each search ● not scalable ○ if your dataset grows by 10x, your search slows down by 10x ● how to show the most relevant documents first? ○ list of results can be quite long ○ users have limited time and patience
  • Inverted Index - Introduction ● like the "index" at the end of books ● a map of one of the following types ○ term → document list ○ term → <document, position> list
  • documents:
      T[0] = "it is what it is"
      T[1] = "what is it"
      T[2] = "it is a banana"
    inverted index (without positions):
      "a": {2}
      "banana": {2}
      "is": {0, 1, 2}
      "it": {0, 1, 2}
      "what": {0, 1}
    inverted index (with positions):
      "a": {(2, 2)}
      "banana": {(2, 3)}
      "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
      "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
      "what": {(0, 2), (1, 0)}
    Credit: Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)
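Both index variants on this slide can be built in a single pass over the documents. The sketch below assumes whitespace tokenization with no normalization, which is far simpler than what Lucene's analyzers do; the name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Build term -> document ids and term -> (document id, position) maps."""
    doc_index = defaultdict(set)   # index without positions
    pos_index = defaultdict(list)  # index with positions
    for doc_id, text in enumerate(documents):
        # Assumption: terms are separated by whitespace only.
        for position, term in enumerate(text.split()):
            doc_index[term].add(doc_id)
            pos_index[term].append((doc_id, position))
    return dict(doc_index), dict(pos_index)


documents = ["it is what it is", "what is it", "it is a banana"]
doc_index, pos_index = build_inverted_index(documents)
print(doc_index["what"])
print(pos_index["is"])
```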
  • Inverted Index - Speed ● term list ○ typically very small ○ grows slowly ● term lookup ○ O(1) to O(log(number of terms)) ● for a particular term ○ document lists: very small ○ document + position lists: still small ● few terms per query
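To see why lookups stay cheap, here is a rough sketch of answering an "all terms must occur" query from a term → document list index. The index literal copies the small three-document example; `search_all_terms` is a hypothetical helper, and a Python dict stands in for the term list (giving the O(1) end of the lookup range).

```python
def search_all_terms(index, query):
    """Return the sorted ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in query.split()]
    if not postings:
        return []
    # Intersect starting from the shortest document list to keep work small.
    postings.sort(key=len)
    result = set(postings[0])
    for doc_ids in postings[1:]:
        result &= doc_ids
    return sorted(result)


index = {
    "a": {2},
    "banana": {2},
    "is": {0, 1, 2},
    "it": {0, 1, 2},
    "what": {0, 1},
}
print(search_all_terms(index, "what is"))
```

Only the document lists of the query's terms are touched; the documents themselves are never re-read.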
  • Inverted Index - Relevance ● information in the index enables: ○ determination (scoring) of relevance of each document to the query ○ comparison of relevance among documents ○ sorting by (decreasing) relevance ■ i.e. the most relevant document first
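As an illustration of scoring from index statistics, the sketch below ranks documents with a simple TF-IDF-style sum. This is an assumed stand-in chosen for brevity, not Lucene's actual scoring formula; the function name `rank` and the exact IDF expression are illustrative only.

```python
import math
from collections import Counter


def rank(documents, query):
    """Return (doc_id, score) pairs, most relevant first."""
    tokenized = [doc.split() for doc in documents]
    n = len(documents)
    scored = []
    for doc_id, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in query.split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in tokenized if term in other)
            if df == 0:
                continue
            # Rare terms weigh more; frequent occurrences weigh more.
            score += counts[term] * math.log(1 + n / df)
        if score > 0:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


documents = ["it is what it is", "what is it", "it is a banana"]
for doc_id, score in rank(documents, "banana is"):
    print(doc_id, round(score, 3))
```

Term counts and document frequencies are exactly the kind of information an inverted index already holds, which is what makes ranking at query time practical.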
  • Lucene vs. Solr - Lucene ● full-text search library ● creates, updates and reads from the index ● takes queries and produces search results ● your application creates objects and calls methods in the Lucene API ● provides building blocks for custom features
  • Lucene vs. Solr - Solr ● full-text search server ● uses Lucene for indexing and search ● REST-like API over HTTP ● different output formats (e.g. XML, JSON) ● provides some features not built into Lucene
  • [Diagram] ● Lucene: one machine running a Java VM hosts your application, the Lucene code (libraries) and the index ● Solr: a client talks over HTTP to a machine running a Java VM, where a servlet container (e.g. Tomcat, Jetty) hosts Solr; Solr contains the Solr code, the Lucene code (libraries) and the index
  • Workflow Setup Indexing Search
  • Workflow - Setup ● servlet configuration ○ e.g. port number, max POST size ○ you can usually use the default settings ● Solr configuration ○ e.g. data directory, deduplication, language identification, highlighting ○ you can usually use the default settings ● schema definition ○ defines fields in your documents ○ you can use the default settings if you name your fields in a certain way
  • How Data Are Organized ● a collection contains documents ● a document contains fields
  • field ● name (e.g. "title" or "price") ● content (e.g. "please read" or 30) ● type ● options
  • [Diagram] an index whose documents share similar fields: subject, date, from, text, reply-to (not every document has every field)
  • [Diagram] an index whose documents have different sets of fields: subject, date, from, text; title, SKU, price, description; first name, last name, phone, address
  • Solr Field Definition ● field ○ name (e.g. "subject") ○ type (e.g. "text_general") ○ options (e.g. indexed="true" stored="true") ● field type ○ text: "string", "text_general" ○ numeric: "int", "long", "float", "double" ● options ○ indexed: content can be searched ○ stored: content can be returned at search-time ○ multiValued: a document can hold multiple values for the field
  • Solr Dynamic Field ● define field by naming convention ● "amount_i": int, indexed, stored ● "tag_ss": string, indexed, stored, multivalued
      name    type          indexed  stored  multiValued
      *_i     int           true     true    false
      *_l     long          true     true    false
      *_f     float         true     true    false
      *_d     double        true     true    false
      *_s     string        true     true    false
      *_ss    string        true     true    true
      *_t     text_general  true     true    false
      *_txt   text_general  true     true    true
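The naming convention can be pictured as a lookup from a field name's suffix to a field definition. The sketch below only mirrors the table on this slide; it is not how Solr resolves dynamic fields internally, and `FieldSpec` and `resolve_dynamic_field` are hypothetical names.

```python
from typing import NamedTuple, Optional


class FieldSpec(NamedTuple):
    type: str
    indexed: bool
    stored: bool
    multi_valued: bool


# Suffix (the part after the last underscore) -> definition, per the table.
DYNAMIC_FIELDS = {
    "i": FieldSpec("int", True, True, False),
    "l": FieldSpec("long", True, True, False),
    "f": FieldSpec("float", True, True, False),
    "d": FieldSpec("double", True, True, False),
    "s": FieldSpec("string", True, True, False),
    "ss": FieldSpec("string", True, True, True),
    "t": FieldSpec("text_general", True, True, False),
    "txt": FieldSpec("text_general", True, True, True),
}


def resolve_dynamic_field(name: str) -> Optional[FieldSpec]:
    """Return the definition implied by the field name, or None if no match."""
    _, separator, suffix = name.rpartition("_")
    if not separator:
        return None  # no underscore, so no dynamic-field pattern applies
    return DYNAMIC_FIELDS.get(suffix)


print(resolve_dynamic_field("amount_i"))
print(resolve_dynamic_field("tag_ss"))
```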
  • Solr Copy Field ● copy one or more fields into another field ● can be used to define a catch-all field ○ source: "title", "author", "description" ○ destination: "text" ○ searching the "text" field has the effect of searching all the other three fields
  • Workflow Setup Indexing Search
  • Indexing - UpdateRequestHandler ● upload content or file to http://host:port/solr/update ● formats: XML, JSON, CSV
  • XML:
      <add>
        <doc>
          <field name="id">apple</field>
          <field name="compName_s">Apple</field>
          <field name="address_s">1 Infinite Way, Cupertino CA</field>
        </doc>
        <doc>
          <field name="id">asus</field>
          <field name="compName_s">ASUS Computer</field>
          <field name="address_s">800 Corporate Way Fremont, CA 94539</field>
        </doc>
      </add>
    CSV:
      id,compName_s,address_s
      apple,Apple,"1 Infinite Way, Cupertino CA"
      asus,Asus Computer,"800 Corporate Way Fremont, CA 94539"
    JSON:
      [
        {"id":"apple","compName_s":"Apple","address_s":"1 Infinite Way, Cupertino CA"},
        {"id":"asus","compName_s":"Asus Computer","address_s":"800 Corporate Way Fremont, CA 94539"}
      ]
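A client could prepare the JSON variant of this upload along the following lines. This is a sketch under assumptions: the base URL `http://localhost:8983/solr` and the helper name `build_update_request` are illustrative, and `commit=true` is added so that the documents become searchable right away. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_update_request(docs, base_url="http://localhost:8983/solr"):
    """Build (but do not send) a POST request that uploads docs as JSON."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = [
    {"id": "apple", "compName_s": "Apple",
     "address_s": "1 Infinite Way, Cupertino CA"},
    {"id": "asus", "compName_s": "Asus Computer",
     "address_s": "800 Corporate Way Fremont, CA 94539"},
]
request = build_update_request(docs)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would perform the upload against a running Solr instance.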
  • Indexing - DataImportHandler ● has its own config file (data-config.xml) ● import data from various sources ○ RDBMS (JDBC) ○ e-mail (IMAP) ○ XML data locally (file) or remotely (HTTP) ● transformers ○ extract data (RegEx, XPath) ○ manipulate data (strip HTML tags)
  • Workflow Setup Indexing Search
  • Searching - Basics ● send request to http://host:port/solr/select ● parameters ○ q - main query ○ fq - filter query ○ defType - query parser (e.g. lucene, edismax) ○ fl - fields to return ○ sort - sort criteria ○ wt - response writer (e.g. xml, json) ○ indent - set to true for pretty-printing
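  • http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json ● /solr/select - search handler's URL ● q=title:tablet - main query ● fl=title,price,inStock - fields to return ● sort=price+asc - sort criteria (field name and order) ● wt=json - response writer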
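Building such a URL by hand is error-prone because parameter values need URL encoding. A minimal sketch using only the standard library is shown below; the helper name `build_select_url` is an assumption, and the parameter values are examples.

```python
from urllib.parse import urlencode


def build_select_url(base_url, **params):
    """Return a search URL with all parameter values URL-encoded."""
    return base_url + "/select?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr",
    q="title:tablet",            # main query
    fl="title,price,inStock",    # fields to return
    sort="price asc",            # sort criteria: field name and order
    wt="json",                   # response writer
)
print(url)
```

`urlencode` escapes the colon, the commas and the space, so the printed URL looks less readable than the one on the slide but carries the same parameters.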
  • Searching - Query Syntax - Field ● search a specific field ○ field_name:value ● if field omitted, Solr uses default field: ○ df parameter in URL ○ defaultSearchField setting in schema.xml ○ "text"
  • Searching - Query Syntax - Term ● a term by itself: matches documents that contain that term ○ e.g. tablet
  • Searching - Query Syntax - Boolean ● boolean operators are supported ○ AND && ○ OR || ○ NOT ! ● e.g. a AND b ○ all of a, b must occur ● e.g. a OR b ○ at least one of a, b must occur ● e.g. a AND NOT b ○ a must occur and b must not occur
  • Searching - Query Syntax - Boolean ● Lucene/Solr's boolean operators are not true boolean operators ● e.g. a OR b OR c does not behave like (a OR b) OR c ○ instead, a OR b OR c means at least one of a, b, c must occur ● parentheses are supported
  • Searching - Query Syntax - Boolean ● "+" prefix means "must" ● "-" prefix means "must not" ● no prefix means "at least one must" (by default) ○ e.g. a b c ■ at least one of a, b, c must occur ● operators can mix ○ e.g. +a b c d -e ■ a must occur ■ b, c, d are optional once a "+" clause is present; documents that contain them score higher ■ e must not occur
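The prefix semantics can be sketched as a matching function over a document's set of terms. The sketch assumes Lucene's BooleanQuery behaviour, in which un-prefixed clauses are required only when the query has no "+" clause (otherwise they only influence the score). It ignores scoring, phrases, fields and parentheses, and `matches` is a hypothetical name.

```python
def matches(query, doc_terms):
    """Return True if a document with the given terms matches the query."""
    must, must_not, should = [], [], []
    for clause in query.split():
        if clause.startswith("+"):
            must.append(clause[1:])
        elif clause.startswith("-"):
            must_not.append(clause[1:])
        else:
            should.append(clause)
    if any(term in doc_terms for term in must_not):
        return False
    if not all(term in doc_terms for term in must):
        return False
    # Assumption: with no "+" clause, at least one un-prefixed clause
    # must occur; with a "+" clause, un-prefixed clauses are optional.
    if should and not must:
        return any(term in doc_terms for term in should)
    return True


print(matches("+a b c d -e", {"a", "c"}))
print(matches("+a b c d -e", {"a", "e"}))
print(matches("a b c", {"z"}))
```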
  • Searching - Query Syntax - Phrase ● phrases are enclosed by double-quotes ● e.g. +"the phrase" ○ the phrase must occur ● e.g. -"the phrase" ○ the phrase must not occur
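Phrase matching is where a term → <document, position> index earns its keep: a phrase occurs when its words appear at consecutive positions in the same document. The sketch below uses the positional entries of the small three-document example ("it is what it is", "what is it", "it is a banana"); `phrase_match` is a hypothetical helper, not a Lucene API.

```python
def phrase_match(pos_index, phrase):
    """Return the ids of documents that contain the words consecutively."""
    terms = phrase.split()
    if not terms:
        return set()
    # Candidate start points: every occurrence of the first word.
    starts = set(pos_index.get(terms[0], []))
    for offset, term in enumerate(terms[1:], start=1):
        positions = set(pos_index.get(term, []))
        # Keep a start only if the next word sits `offset` places after it.
        starts = {(doc, pos) for doc, pos in starts
                  if (doc, pos + offset) in positions}
    return {doc for doc, _ in starts}


pos_index = {
    "a": [(2, 2)],
    "banana": [(2, 3)],
    "is": [(0, 1), (0, 4), (1, 1), (2, 1)],
    "it": [(0, 0), (0, 3), (1, 2), (2, 0)],
    "what": [(0, 2), (1, 0)],
}
print(phrase_match(pos_index, "what is"))
```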
  • Searching - Query Syntax - Boost ● manually assign different weights to clauses ● gives more weight to a field ○ e.g. title:a^10 body:a ● gives more weight to a word ○ e.g. title:a title:b^10 ● gives phrases more weight than words ○ e.g. title:(+a +b) title:"a b"^10
  • Searching - Query Syntax - Range ● matches field values within a range ○ inclusive range - denoted by square brackets ○ exclusive range - denoted by curly brackets ● e.g. age:[10 TO 20] ○ matches the field "age" with the value in 10..20 ● string or numeric comparison, depending on the field's type
  • Searching - Query Syntax - EDisMax ● suitable for user-generated queries ○ supports a subset of Lucene QP's syntax ○ does not complain about the syntax ○ searches for individual words across several fields ("disjunction") ○ uses max score of a word in all fields for scoring ("max") ● configurable (in solrconfig.xml) ○ what fields to search the words in ○ weighting of these fields
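The "disjunction" and "max" ideas can be illustrated with a toy scoring function: each query word is looked up in every configured field, and only its best weighted field score counts. This is a simplified sketch, not the EDisMax implementation (it ignores the tie-breaker, phrase boosts and minimum-match settings); the per-field scores and weights below are made-up numbers.

```python
def dismax_score(query_words, field_scores, field_weights):
    """Sum, over the query words, the best weighted score across fields."""
    total = 0.0
    for word in query_words:
        total += max(
            (weight * field_scores.get(field, {}).get(word, 0.0)
             for field, weight in field_weights.items()),
            default=0.0,
        )
    return total


# Hypothetical per-field scores of one document for each word.
field_scores = {
    "title": {"tablet": 1.0},
    "body": {"tablet": 0.5, "cheap": 0.4},
}
# Hypothetical field weighting: title matches count ten times as much.
field_weights = {"title": 10.0, "body": 1.0}

print(dismax_score(["cheap", "tablet"], field_scores, field_weights))
```

Taking the maximum rather than the sum keeps a word that appears in many fields from outweighing a document that matches more of the query's words.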
  • Sorting ● default: sorting by decreasing score ● sorting by field: using the sort parameter ○ specify field name and order ■ price asc - sort by "price" field, ascending ■ price desc - sort by "price" field, descending ○ multiple fields and orders by comma ■ starRating desc, price asc - sort by "starRating" field, descending, and then by "price" field, ascending ○ cannot use multivalued fields ○ overrides sorting by decreasing relevance
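The effect of a multi-field sort specification can be reproduced on plain dictionaries, as a way to check one's understanding of the ordering. The sketch below parses a specification such as "starRating desc, price asc"; the function names and sample documents are assumptions, and Solr performs this sorting server-side.

```python
def parse_sort(spec):
    """Turn "field order, field order" into (field, descending) pairs."""
    criteria = []
    for part in spec.split(","):
        field, order = part.split()
        if order not in ("asc", "desc"):
            raise ValueError("order must be 'asc' or 'desc': " + order)
        criteria.append((field, order == "desc"))
    return criteria


def sort_documents(docs, spec):
    """Return docs ordered by the given sort specification."""
    result = list(docs)
    # Python's sort is stable, so applying the criteria from last to
    # first makes the first criterion the most significant one.
    for field, descending in reversed(parse_sort(spec)):
        result.sort(key=lambda doc: doc[field], reverse=descending)
    return result


docs = [
    {"name": "a", "starRating": 4, "price": 300},
    {"name": "b", "starRating": 5, "price": 500},
    {"name": "c", "starRating": 5, "price": 200},
]
ordered = sort_documents(docs, "starRating desc, price asc")
print([doc["name"] for doc in ordered])
```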
  • Faceted Search ● facet values: (distinct) values or (generally non-overlapping) ranges of a field ● displaying facets ○ show possible values ○ let users narrow down their searches easily
  • [Screenshot] a facet and its facet values (5 of them)
  • Faceted Search ● set facet parameter to true - enables faceting ● other parameters ○ facet.field - use the field's values as facets ■ return <value, count> pairs ○ facet.query - use the given queries as facets ■ return <query, count> pairs ○ facet.sort - set the ordering of the facets ■ can be "count" or "index" ○ facet.offset and facet.limit - used for pagination of facets
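What facet.field returns can be pictured as counting a field's values over the matching documents. The sketch below imitates the <value, count> pairs together with the "count"/"index" ordering and offset/limit pagination; it is an in-memory illustration with hypothetical names and data, not Solr's implementation.

```python
from collections import Counter


def facet_field(docs, field, sort="count", offset=0, limit=100):
    """Return (value, count) pairs for one field of the given documents."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue
        # A multivalued field contributes each of its values.
        counts.update(value if isinstance(value, list) else [value])
    if sort == "count":
        # Highest count first; ties broken by value.
        pairs = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    elif sort == "index":
        pairs = sorted(counts.items())  # ordered by value
    else:
        raise ValueError("sort must be 'count' or 'index'")
    return pairs[offset:offset + limit]


docs = [
    {"brand": "Apple"},
    {"brand": "Asus"},
    {"brand": "Apple"},
    {"brand": "Dell"},
]
print(facet_field(docs, "brand"))
```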
  • Resources - Books ● Lucene in Action ○ written by 3 committers and PMC members ○ somewhat outdated (2010; covers Lucene 3.0) ○ http://www.manning.com/hatcher3/ ● Solr in Action ○ early access; coming out later this year ○ http://www.manning.com/grainger/ ● Apache Solr 4 Cookbook ○ common problems and useful tips ○ http://www.packtpub.com/apache-solr-4-cookbook/book
  • Resources - Books ● Introduction to Information Retrieval ○ not specific to Lucene/Solr, but about IR concepts ○ free e-book ○ http://nlp.stanford.edu/IR-book/ ● Managing Gigabytes ○ indexing, compression and other topics ○ accompanied by MG4J - a full-text search engine ○ http://mg4j.di.unimi.it/
  • Resources - Web ● official websites ○ Lucene Core - http://lucene.apache.org/core/ ○ Solr - http://lucene.apache.org/solr/ ● mailing lists ● Wiki sites ○ Lucene Core - http://wiki.apache.org/lucene-java/ ○ Solr - http://wiki.apache.org/solr/ ● reference guides ○ API Documentation for Lucene and Solr ○ Apache Solr Reference Guide (LucidWorks) - http://lucene.apache.org/solr/tutorial.html
  • Getting Started ● download Solr ○ requires Java 6 or newer to run ● Solr comes bundled and configured with Jetty ○ <Solr directory>/example/start.jar ● "exampledocs" directory contains sample documents ○ <Solr directory>/example/exampledocs/post.jar ● use the Solr admin interface ○ http://localhost:8983/solr/