Introduction to Search Engine-Building with Lucene. Kai Chan, SoCal Code Camp, June 2012


These are the slides for the session I presented at SoCal Code Camp San Diego on June 24, 2012.

http://www.socalcodecamp.com/session.aspx?sid=f9e83f56-3c56-4aa1-9cff-154c6537ccbe


Introduction to search engine-building with Lucene

  1. Introduction to Search Engine-Building with Lucene
     Kai Chan, SoCal Code Camp, June 2012
  2. How to Search
     • One (common) approach to searching all your documents:
         for each document d {
           if (query is a substring of d’s content) {
             add d to the list of results
           }
         }
         sort the results
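The pseudocode on this slide can be sketched in plain Java. This is only an illustration of the linear scan, not Lucene code: the class name `NaiveSearchSketch` and the sample documents (borrowed from the later inverted index slides) are assumptions, and the final sort is left out because a substring test gives no relevance score to sort by.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the naive approach: scan every document on every search.
final class NaiveSearchSketch {

    /** Returns the IDs (list indexes) of all documents whose content contains the query. */
    static List<Integer> search(List<String> documents, String query) {
        List<Integer> results = new ArrayList<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            // Substring test: reads the whole document content each time.
            if (documents.get(docId).contains(query)) {
                results.add(docId);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        System.out.println(search(docs, "what")); // prints [0, 1]
    }
}
```

Every call touches every document, which is the cost the following slides set out to avoid.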
  3. How to Search
     • Problems
       – Slow: reads the whole database for each search
       – Not scalable: if your database grows by 10x, your search slows down by 10x
       – How to show the most relevant documents first?
  4. Inverted Index
     • (term -> document list) map
     Documents:
       T0 = "it is what it is"
       T1 = "what is it"
       T2 = "it is a banana"
     Inverted index:
       "a": {2}
       "banana": {2}
       "is": {0, 1, 2}
       "it": {0, 1, 2}
       "what": {0, 1}
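As a rough sketch of how such a map could be built, the Java below splits each document on whitespace and records which documents contain each term. The class name `InvertedIndexSketch` and the whitespace tokenization are assumptions made for illustration; Lucene's real index is built through analyzers and stored very differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of a (term -> document list) map.
final class InvertedIndexSketch {

    /** Maps each term to the sorted set of document IDs that contain it. */
    static Map<String, SortedSet<Integer>> build(List<String> documents) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            // Assumption: terms are separated by whitespace only.
            for (String term : documents.get(docId).split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        // prints {a=[2], banana=[2], is=[0, 1, 2], it=[0, 1, 2], what=[0, 1]}
        System.out.println(build(docs));
    }
}
```

A search for a single term then becomes one map lookup instead of a scan over all documents.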
  5. Inverted Index
     • (term -> <document, position> list) map
     Term positions in each document:
       T0 = "it is what it is"   positions 0 1 2 3 4
       T1 = "what is it"         positions 0 1 2
       T2 = "it is a banana"     positions 0 1 2 3
  6. Inverted Index
     • (term -> <document, position> list) map
     Documents:
       T0 = "it is what it is"
       T1 = "what is it"
       T2 = "it is a banana"
     Inverted index:
       "a": {(2, 2)}
       "banana": {(2, 3)}
       "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
       "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
       "what": {(0, 2), (1, 0)}
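The positional variant can be sketched the same way, storing a (document, position) pair for every occurrence of a term. The class name `PositionalIndexSketch`, the whitespace tokenization, and the use of two-element lists for the pairs are illustrative assumptions, not Lucene's actual data structures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a (term -> <document, position> list) map.
final class PositionalIndexSketch {

    /** Maps each term to its occurrences, each stored as [docId, position]. */
    static Map<String, List<List<Integer>>> build(List<String> documents) {
        Map<String, List<List<Integer>>> index = new TreeMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            // Assumption: terms are separated by whitespace; positions count from 0.
            String[] terms = documents.get(docId).split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                index.computeIfAbsent(terms[pos], t -> new ArrayList<>())
                     .add(Arrays.asList(docId, pos));
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        System.out.println(build(docs).get("is")); // prints [[0, 1], [0, 4], [1, 1], [2, 1]]
    }
}
```

Keeping positions is what later makes phrase and span matching possible without rereading the documents.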
  7. Inverted Index
     • Speed
       – Term list
         • Very small compared to documents’ content
         • Tends to grow at a slower speed than documents (after a certain level)
       – Term lookup: O(1) to O(log of number of terms)
       – Document lists are very small
       – Document + position lists still small
  8. Inverted Index
     • Relevance
       – Extra information in the index
         • Stored in an easily accessible way
         • Determines the relevance of each document to the query
       – Enables sorting by (decreasing) relevance
  9. Determining Relevancy
     • Two models used in the searching process
       – Boolean model
         • AND, OR, NOT, etc.
         • Either a document matches a query, or not
       – Vector space model
         • How often a query term appears in a document vs. how often the term appears in all documents
         • Scoring and sorting by relevancy possible
  10. Determining Relevancy
     • Lucene uses both models
     [Figure: all documents -> filtering (Boolean Model) -> some documents (unsorted) -> scoring (Vector Space Model) -> some documents (sorted by score)]
  11. Vector Space Model
     [Figure: vectors for document 1, the query, and document 2, plotted on axes f(frequency of term A) and f(frequency of term B)]
  12. Scoring
     • Term frequency (TF)
       – How many times does this term (t) appear in this document (d)?
       – Score proportional to TF
     • Document frequency (DF)
       – How many documents have this term (t)?
       – Score proportional to the inverse of DF (IDF)
  13. Scoring
     • Coordination factor (coord)
       – Documents that contain all or most query terms get higher scores
     • Normalizing factor (norm)
       – Adjusts for field length and query complexity
  14. Scoring
     • Boost
       – “Manual override”: ask Lucene to give a higher score to some particular thing
       – Index-time
         • Document
         • Field (of a particular document)
       – Search-time
         • Query
  15. Scoring
     $$\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \right)$$
     • coord(q, d): coordination factor
     • queryNorm(q): query normalizing factor
     • tf(t in d): term frequency
     • idf(t): inverse document frequency
     • boost(t): term boost
     • norm(t, d): document boost, field boost, length normalizing factor
     http://lucene.apache.org/core/3_6_0/scoring.html
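To make the formula more concrete, here is a simplified Java sketch that scores the three sample documents from the inverted index slides. It is not Lucene's implementation. It assumes tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1)) and a length norm of 1 / sqrt(number of terms), which is how I understand Lucene 3.x's default similarity to define them; it also assumes all boosts are 1 and leaves out queryNorm(q), which is the same for every document and so does not change their order. The class name `TfIdfScoreSketch` and the whitespace tokenization are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified illustration of the scoring formula (boosts = 1, queryNorm omitted).
final class TfIdfScoreSketch {

    static double tf(int freq) { return Math.sqrt(freq); }

    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    static double lengthNorm(int numTerms) { return 1.0 / Math.sqrt(numTerms); }

    /** Counts how many times a term occurs in an array of terms. */
    static int count(String[] terms, String term) {
        int n = 0;
        for (String t : terms) {
            if (t.equals(term)) n++;
        }
        return n;
    }

    /** coord(q, d) * sum over query terms of tf * idf^2 * norm, for one document. */
    static double score(List<String> queryTerms, int docId, List<String> documents) {
        String[] docTerms = documents.get(docId).split("\\s+");
        int matched = 0;
        double sum = 0.0;
        for (String term : queryTerms) {
            int freq = count(docTerms, term);
            if (freq == 0) continue; // term absent: contributes nothing
            matched++;
            int docFreq = 0;
            for (String other : documents) {
                if (count(other.split("\\s+"), term) > 0) docFreq++;
            }
            double idf = idf(docFreq, documents.size());
            sum += tf(freq) * idf * idf * lengthNorm(docTerms.length);
        }
        // Coordination factor: fraction of query terms found in this document.
        double coord = (double) matched / queryTerms.size();
        return coord * sum;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        List<String> query = Arrays.asList("what", "banana");
        for (int docId = 0; docId < docs.size(); docId++) {
            System.out.println("T" + docId + ": " + score(query, docId, docs));
        }
    }
}
```

In this sketch a rare term such as "banana" contributes more than a common one such as "what", and a document matching only half of the query terms has its sum halved by the coordination factor.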
  16. Work Flow
     • Indexing
       – Index: storage of inverted index + documents
       – Add fields to a document
       – Add the document to the index
       – Repeat for every document
     • Searching
       – Generate a query
       – Search with this query
       – Get back a sorted document list (top N docs)
  17. Adding Field to Document
     • Store?
     • Index?
       – Analyzed (split text into multiple terms)
       – Not analyzed (treat the whole text as ONE term)
       – Not indexed (this field will not be searchable)
       – Store norms?
  18. Analyzed vs. Not Analyzed
     Text: “the quick brown fox”
     • Analyzed: 4 terms
       1. the
       2. quick
       3. brown
       4. fox
     • Not analyzed: 1 term
       1. the quick brown fox
  19. Index-time Analysis
     • Analyzer
       – Determines which TokenStream classes to use
     • TokenStream
       – Does the actual hard work
       – Tokenizer: text to tokens
       – Token filter: tokens to tokens
  20. Text:
       San Francisco, the Bay Area’s city-county http://www.ci.sf.ca.us controller@sfgov.org
     WhitespaceAnalyzer:
       [San] [Francisco,] [the] [Bay] [Area’s] [city-county] [http://www.ci.sf.ca.us/] [controller@sfgov.org]
     StopAnalyzer:
       [san] [francisco] [bay] [area] [s] [city] [county] [http] [www] [ci] [sf] [ca] [us] [controller] [sfgov] [org]
     StandardAnalyzer:
       [san] [francisco] [bay] [areas] [city] [county] [http] [www.ci.sf.ca.us] [controller] [sfgov.org]
  21. Notable TokenStream Classes
     • ASCIIFoldingFilter
       – Converts alphabetic characters into basic forms
     • PorterStemFilter
       – Reduces tokens into their stems
     • SynonymTokenFilter
       – Converts words to their synonyms
     • ShingleFilter
       – Creates shingles (n-grams)
  22. Tokens
     • Information about a token
       – Field
       – Text
       – Start offset, end offset
       – Position increment
  23. Attributes
     • Past versions of Lucene: Token object
     • Recent versions of Lucene: attributes
       – Efficiency, flexibility
       – Ask for attributes you want
       – Receive attribute objects
       – Use these objects for information about tokens
  24. Reading tokens through attributes:
       import org.apache.lucene.analysis.TokenStream;
       import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
       import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
       import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

       // create token stream
       TokenStream tokenStream = analyzer.reusableTokenStream(fieldName, reader);
       tokenStream.reset();

       // obtain each attribute you want to know
       CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
       OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class);
       PositionIncrementAttribute posInc = tokenStream.addAttribute(PositionIncrementAttribute.class);

       // go to the next token
       while (tokenStream.incrementToken()) {
         // use information about the current token
         doSomething(term.toString(), offset.startOffset(), offset.endOffset(),
                     posInc.getPositionIncrement());
       }

       // close token stream
       tokenStream.end();
       tokenStream.close();
  25. Query-time Analysis
     • Text in a query is analyzed like fields
     • Use the same analyzer that analyzed the particular field
     Example query: +field1:"quick brown fox" +(field2:"lazy dog" field2:"cozy cat")
     Text that gets analyzed: "quick brown fox", "lazy dog", "cozy cat"
  26. Query Formation
     • Query parsing
       – A query parser in core code
       – Additional query parsers in contributed code
     • Or build query from the Lucene query classes
  27. Term Query
     • Matches documents with a particular term
       – Field
       – Text
  28. Term Range Query
     • Matches documents with any of the terms in a particular range
       – Field
       – Lowest term text
       – Highest term text
       – Include lowest term text?
       – Include highest term text?
  29. Prefix Query
     • Matches documents with any of the terms with a particular prefix
       – Field
       – Prefix
  30. Wildcard/Regex Query
     • Matches documents with any of the terms that match a particular pattern
       – Field
       – Pattern
         • Wildcard: * for 0+ characters, ? for exactly 1 character
         • Regular expression
         • Pattern matching on individual terms only
  31. Fuzzy Query
     • Matches documents with any of the terms that are “similar” to a particular term
       – Levenshtein distance (“edit distance”): number of character insertions, deletions or substitutions needed to transform one string into another
         • e.g. kitten -> sitten -> sittin -> sitting (3 edits)
       – Field
       – Text
       – Minimum similarity score
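The edit distance in the kitten/sitting example can be computed with the standard dynamic-programming recurrence. The Java below is a textbook sketch under the class name `EditDistanceSketch` (a hypothetical name); it is not how Lucene evaluates fuzzy queries internally, and it does not show how the distance is turned into a similarity score.

```java
// Hypothetical illustration: classic Levenshtein distance using two rolling rows.
final class EditDistanceSketch {

    /** Minimum number of single-character insertions, deletions or substitutions. */
    static int distance(String a, String b) {
        // previous[j] = distance between the first i-1 chars of a and the first j chars of b
        int[] previous = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // building b's prefix from an empty string takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] current = new int[b.length() + 1];
            current[0] = i; // reducing a's prefix to an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = previous[j - 1] + cost;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            previous = current;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```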
  32. Phrase Query
     • Matches documents with all the given words present and being “near” each other
       – Field
       – Terms
       – Slop
         • Number of “moves of words” permitted
         • Slop = 0 means exact phrase match required
  33. Boolean Query
     • Conceptually similar to boolean operators (“AND”, “OR”, “NOT”), but not identical
     • Why Not AND, OR, And NOT?
       – http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/
       – In short, boolean operators do not handle > 2 clauses well
  34. Boolean Query
     • Three types of clauses
       – Must
       – Should
       – Must not
     • For a boolean query to match a document
       – All “must” clauses must match
       – All “must not” clauses must not match
       – At least one “must” or “should” clause must match
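The three matching rules can be written down as a small predicate check. This Java sketch only mirrors the rules as stated on the slide; the class `BooleanMatchSketch`, the helper `hasTerm`, and the idea of treating a document as a whitespace-separated string are assumptions for illustration and are not part of Lucene's BooleanQuery API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of the must / should / must-not matching rules.
final class BooleanMatchSketch {

    /** Clause that matches when the document contains the given term (whitespace tokens). */
    static Predicate<String> hasTerm(String term) {
        return doc -> Arrays.asList(doc.split("\\s+")).contains(term);
    }

    static boolean matches(String doc,
                           List<Predicate<String>> must,
                           List<Predicate<String>> should,
                           List<Predicate<String>> mustNot) {
        // Rule: all "must not" clauses must not match.
        for (Predicate<String> clause : mustNot) {
            if (clause.test(doc)) return false;
        }
        // Rule: all "must" clauses must match.
        for (Predicate<String> clause : must) {
            if (!clause.test(doc)) return false;
        }
        // Rule: at least one "must" or "should" clause must match.
        if (!must.isEmpty()) return true;
        for (Predicate<String> clause : should) {
            if (clause.test(doc)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Predicate<String>> must = Collections.singletonList(hasTerm("it"));
        List<Predicate<String>> should = Collections.emptyList();
        List<Predicate<String>> mustNot = Collections.singletonList(hasTerm("banana"));
        System.out.println(matches("what is it", must, should, mustNot));     // prints true
        System.out.println(matches("it is a banana", must, should, mustNot)); // prints false
    }
}
```

Note that a query made only of "must not" clauses matches nothing under these rules, since no "must" or "should" clause can match.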
  35. Span Query
     • Similar to other queries, but matches spans
     • Span
       – particular place/part of a particular document
       – <document ID, start position, end position> tuple
  36. Term positions in each document:
       T0 = "it is what it is"   positions 0 1 2 3 4
       T1 = "what is it"         positions 0 1 2
       T2 = "it is a banana"     positions 0 1 2 3
     Spans as <doc ID, start pos., end pos.>:
       “it is”: <0, 0, 2> <0, 3, 5> <2, 0, 2>
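The span tuples listed for “it is” can be reproduced with a short scan over term positions. The Java below is a sketch under assumptions: the class name `PhraseSpanSketch` is hypothetical, tokens are split on whitespace, only exact in-order phrases are matched, and the end position is exclusive, which is what the tuples on the slide suggest. Lucene's span queries work from the index rather than by scanning documents like this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: find <doc ID, start, end> spans of an exact phrase.
final class PhraseSpanSketch {

    /** Returns one [docId, start, end) triple per occurrence of the phrase. */
    static List<List<Integer>> spans(List<String> documents, String phrase) {
        String[] wanted = phrase.split("\\s+");
        List<List<Integer>> result = new ArrayList<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            String[] terms = documents.get(docId).split("\\s+");
            for (int start = 0; start + wanted.length <= terms.length; start++) {
                int end = start + wanted.length; // exclusive end position
                if (Arrays.equals(Arrays.copyOfRange(terms, start, end), wanted)) {
                    result.add(Arrays.asList(docId, start, end));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        System.out.println(spans(docs, "it is")); // prints [[0, 0, 2], [0, 3, 5], [2, 0, 2]]
    }
}
```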
  37. Span Query
     • SpanTermQuery
       – Same as TermQuery, except you can build other span queries with it
     • SpanOrQuery
       – Matches spans that are matched by any of some span queries
     • SpanNotQuery
       – Matches spans that are matched by one span query but not the other span query
  38. [Figure: spans matched in the token sequence "apple orange apple orange" by spanTerm(apple), spanTerm(orange), spanOr([apple, orange]) and spanNot(apple, orange)]
  39. Span Query
     • SpanNearQuery
       – Matches spans that are within a certain “slop” of each other
       – Slop: max number of positions between spans
       – Can specify whether order matters
  40. Text: the quick brown fox (the original figure labels the gaps 2, 1, 0)
       1. spanNear([brown, fox, the, quick], slop = 4, inOrder = false) ✔
       2. spanNear([brown, fox, the, quick], slop = 3, inOrder = false) ✔
       3. spanNear([brown, fox, the, quick], slop = 2, inOrder = false) ✖
       4. spanNear([brown, fox, the, quick], slop = 3, inOrder = true) ✖
       5. spanNear([the, quick, brown, fox], slop = 3, inOrder = true) ✔
  41. Filtering
     • A Filter narrows down the search result
       – Creates a set of document IDs
       – Decides what documents get processed further
       – Does not affect scoring, i.e. does not score/rank documents that pass the filter
       – Can be cached easily
       – Useful for access control, presets, etc.
  42. Notable Filter classes
     • TermsFilter
       – Allows documents with any of the given terms
     • TermRangeFilter
       – Filter version of TermRangeQuery
     • PrefixFilter
       – Filter version of PrefixQuery
     • QueryWrapperFilter
       – “Adapts” a query into a filter
     • CachingWrapperFilter
       – Caches the result of the wrapped filter
  43. Sorting
     • Score (default)
     • Index order
     • Field
       – Requires the field be indexed & not analyzed
       – Specify type (string, int, etc.)
       – Normal or reverse order
       – Single or multiple fields
  44. Interfacing Lucene with “Outside”
     • Embedding directly
     • Language bridge
       – E.g. PHP/Java Bridge
     • Web service
       – E.g. Jetty + your own request handler
     • Solr
       – Lucene + Jetty + lots of useful functionality
  45. Books
     • Lucene in Action, 2nd Edition
       – Written by 3 committers and PMC members
       – http://www.manning.com/hatcher3/
     • Introduction to Information Retrieval
       – Not specific to Lucene, but about IR concepts
       – Free e-book
       – http://nlp.stanford.edu/IR-book/
  46. Web Resources
     • Official Website
       – http://lucene.apache.org/
     • StackOverflow
       – http://stackoverflow.com/questions/tagged/lucene
     • Mailing lists
       – http://lucene.apache.org/core/discussion.html
     • Blogs
       – http://www.lucidimagination.com/blog/
       – http://blog.mikemccandless.com/
       – http://lucene.grantingersoll.com/
  47. Getting Started
     • Getting started
       – Download lucene-3.6.0.zip (or .tgz)
       – Add lucene-core-3.6.0.jar to your classpath
       – Consider using an IDE (e.g. Eclipse)
       – Luke (Lucene Index Toolbox): http://code.google.com/p/luke/