These are the slides for the session I presented at SoCal Code Camp Los Angeles on October 14, 2012.
http://www.socalcodecamp.com/session.aspx?sid=a4774b3c-7a2d-45db-8721-f54c5a314e17
Introduction to search engine-building with Lucene
Kai Chan
1. Introduction to Search Engine-Building with Lucene
Kai Chan
SoCal Code Camp, October 2012
2. How to Search
• One (common) approach to searching all your documents:
for each document d {
    if (query is a substring of d's content) {
        add d to the list of results
    }
}
sort the results (or not)
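The scan above can be sketched as runnable code. This is a minimal Java illustration of the naive approach; the class and method names are invented for this sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Naive search: scan every document and test for a substring match.
// Cost is proportional to the total size of the collection on every
// query, which is exactly the scalability problem described next.
public class NaiveSearch {
    public static List<Integer> search(List<String> docs, String query) {
        List<Integer> results = new ArrayList<>();
        for (int d = 0; d < docs.size(); d++) {
            if (docs.get(d).contains(query)) {
                results.add(d); // add d to the list of results
            }
        }
        return results; // unsorted: no notion of relevance
    }
}
```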
3. How to Search
• Problems
– Slow: reads the whole database for each search
– Not scalable: if your database grows by 10x, your search slows down by 10x
– How to show the most relevant documents first?
4. Inverted Index
• (term -> document list) map
Documents:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
Inverted index:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
5. Inverted Index
• (term -> <document, position> list) map
T0 = "it is what it is" (positions 0 1 2 3 4)
T1 = "what is it" (positions 0 1 2)
T2 = "it is a banana" (positions 0 1 2 3)
6. Inverted Index
• (term -> <document, position> list) map
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
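A positional inverted index like the one above can be built with a simple map. The following is a hedged Java sketch, assuming whitespace tokenization only; the class and method names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Build a positional inverted index: term -> list of (docId, position)
// pairs, matching the T0/T1/T2 example above.
public class PositionalIndex {
    public static Map<String, List<int[]>> build(List<String> docs) {
        Map<String, List<int[]>> index = new LinkedHashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                // append this (docId, pos) occurrence to the term's posting list
                index.computeIfAbsent(terms[pos], t -> new ArrayList<>())
                     .add(new int[] { docId, pos });
            }
        }
        return index;
    }
}
```

With the three example documents, `build` yields "banana" -> {(2, 3)} and "is" -> {(0, 1), (0, 4), (1, 1), (2, 1)}, as on the slide.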
7. Inverted Index
• Speed
– Term list
• Very small compared to documents' content
• Tends to grow at a slower speed than documents (after a certain level)
– Term lookup
• O(1) to O(log of the number of terms)
– For a particular term:
• Document lists: very small
• Document + position lists: still small
– Few terms per query
8. Inverted Index
• Relevance
– Extra information in the index
• Stored in an easily accessible way
• Used to determine the relevance of each document to the query
– Enables sorting by (decreasing) relevance
9. Determining Relevancy
• Two models used in the searching process
– Boolean model
• AND, OR, NOT, etc.
• Either a document matches a query, or not
– Vector space model
• How often a query term appears in a document vs. how often the term appears in all documents
• Scoring and sorting by relevancy possible
10. Determining Relevancy
• Lucene uses both models:
all documents
→ filtering (Boolean Model)
→ some documents (unsorted)
→ scoring (Vector Space Model)
→ some documents (sorted by score)
12. Scoring
• Term frequency (TF)
– How many times does this term (t) appear in this document (d)?
– Score proportional to TF
• Document frequency (DF)
– How many documents have this term (t)?
– Score proportional to the inverse of DF (IDF)
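To make TF and IDF concrete, here is a toy scoring sketch. It is deliberately simplified (score = tf · idf, with idf = log(N / DF)) and is not Lucene's actual practical scoring function; the names are illustrative:

```java
import java.util.List;

// Toy TF-IDF: a term contributes more when it is frequent in this
// document (TF) and rare across the whole collection (IDF).
public class TfIdf {
    // tf(t, d): number of times term t appears in document d
    public static long tf(String term, String doc) {
        return List.of(doc.split("\\s+")).stream().filter(term::equals).count();
    }

    // idf(t) = log(N / df(t)); rarer terms get a higher weight
    public static double idf(String term, List<String> docs) {
        long df = docs.stream()
                      .filter(d -> List.of(d.split("\\s+")).contains(term))
                      .count();
        return Math.log((double) docs.size() / df);
    }

    public static double score(String term, String doc, List<String> docs) {
        return tf(term, doc) * idf(term, docs);
    }
}
```

Note that a term appearing in every document (like "is" in the T0/T1/T2 example) gets idf = log(1) = 0, so it contributes nothing to the score.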
13. Scoring
• Coordination factor (coord)
– Documents that contain all or most query terms get higher scores
• Normalizing factor (norm)
– Adjusts for field length and query complexity
14. Scoring
• Boost
– "Manual override": ask Lucene to give a higher score to some particular thing
– Index-time
• Document
• Field (of a particular document)
– Search-time
• Query
15. Scoring
score(q, d) = coord(q, d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)² · boost(t) · norm(t, d) )
– coord(q, d): coordination factor
– queryNorm(q): query normalizing factor
– tf(t in d): term frequency
– idf(t): inverse document frequency
– boost(t): term boost
– norm(t, d): document boost, field boost, and length normalizing factor
http://lucene.apache.org/core/3_6_0/scoring.html
16. Work Flow
• Indexing
– Index: storage of inverted index + documents
– Add fields to a document
– Add the document to the index
– Repeat for every document
• Searching
– Generate a query
– Search with this query
– Get back a sorted document list (top N docs)
17. Adding Field to Document
• Store?
• Index?
– Analyzed (split text into multiple terms)
– Not analyzed (treat the whole text as ONE term)
– Not indexed (this field will not be searchable)
– Store norms?
18. Analyzed vs. Not Analyzed
Text: "the quick brown fox"
Analyzed: 4 terms
1. the
2. quick
3. brown
4. fox
Not analyzed: 1 term
1. the quick brown fox
19. Index-time Analysis
• Analyzer
– Determine which TokenStream classes to use
• TokenStream
– Does the actual hard work
– Tokenizer: text to tokens
– Token filter: tokens to tokens
23. Attributes
• Past versions of Lucene: Token object
• Recent versions of Lucene: attributes
– Efficiency, flexibility
– Ask for the attributes you want
– Receive attribute objects
– Use these objects for information about tokens
24.
// create token stream
TokenStream tokenStream =
    analyzer.reusableTokenStream(fieldName, reader);
tokenStream.reset();

// obtain each attribute you want to know about
CharTermAttribute term =
    tokenStream.addAttribute(CharTermAttribute.class);
OffsetAttribute offset =
    tokenStream.addAttribute(OffsetAttribute.class);
PositionIncrementAttribute posInc =
    tokenStream.addAttribute(PositionIncrementAttribute.class);

// go to the next token
while (tokenStream.incrementToken()) {
    // use information about the current token
    doSomething(term.toString(),
                offset.startOffset(),
                offset.endOffset(),
                posInc.getPositionIncrement());
}

// end and close token stream
tokenStream.end();
tokenStream.close();
25. Query-time Analysis
• Text in a query is analyzed like fields
• Use the same analyzer that analyzed the particular field
+field1:"quick brown fox" +(field2:"lazy dog" field2:"cozy cat")
Analyzed terms: quick brown fox / lazy dog / cozy cat
26. Query Formation
• Query parsing
– A query parser in core code
– Additional query parsers in contributed code
• Or build query from the Lucene query classes
28. Term Range Query
• Matches documents with any of the terms in a particular range
– Field
– Lowest term text
– Highest term text
– Include lowest term text?
– Include highest term text?
29. Prefix Query
• Matches documents with any of the terms with a particular prefix
– Field
– Prefix
30. Wildcard/Regex Query
• Matches documents with any of the terms that match a particular pattern
– Field
– Pattern
• Wildcard: * for 0 or more characters, ? for exactly 1 character
• Regular expression
• Pattern matching on individual terms only
31. Fuzzy Query
• Matches documents with any of the terms that are "similar" to a particular term
– Levenshtein distance ("edit distance"): number of character insertions, deletions, or substitutions needed to transform one string into another
• e.g. kitten -> sitten -> sittin -> sitting (3 edits)
– Field
– Text
– Minimum similarity score
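Levenshtein distance itself is straightforward to compute with dynamic programming. A self-contained sketch (not Lucene's optimized implementation):

```java
// Levenshtein ("edit") distance: minimum number of single-character
// insertions, deletions, or substitutions to turn one string into another.
public class Levenshtein {
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,    // deletion
                                            d[i][j - 1] + 1),   // insertion
                                   d[i - 1][j - 1] + sub);      // substitution
            }
        }
        return d[a.length()][b.length()];
    }
}
```

For the slide's example, distance("kitten", "sitting") is 3.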
32. Phrase Query
• Matches documents with all the given words present and being "near" each other
– Field
– Terms
– Slop
• Number of "moves of words" permitted
• Slop = 0 means exact phrase match required
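With positions available, an exact phrase match (slop = 0) amounts to finding the phrase terms at consecutive positions. A simplified sketch over tokenized text, assuming whitespace tokenization and invented names:

```java
// Exact phrase match (slop = 0): every phrase term must appear in the
// document, at consecutive positions. A positional inverted index lets
// a search engine perform this check without rescanning document text;
// here we scan the tokens directly to keep the sketch self-contained.
public class PhraseMatch {
    public static boolean matches(String doc, String... phrase) {
        String[] terms = doc.split("\\s+");
        outer:
        for (int start = 0; start + phrase.length <= terms.length; start++) {
            for (int k = 0; k < phrase.length; k++) {
                if (!terms[start + k].equals(phrase[k])) continue outer;
            }
            return true; // found the phrase at consecutive positions
        }
        return false;
    }
}
```

A nonzero slop would relax the "consecutive positions" condition, allowing a bounded number of position moves between the terms.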
33. Boolean Query
• Conceptually similar to boolean operators ("AND", "OR", "NOT"), but not identical
• Why not AND, OR, and NOT?
– http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/
– In short, boolean operators do not handle > 2 clauses well
34. Boolean Query
• Three types of clauses
– Must
– Should
– Must not
• For a boolean query to match a document
– All "must" clauses must match
– All "must not" clauses must not match
– At least one "must" or "should" clause must match
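The matching rules above can be expressed directly. A minimal sketch over sets of document terms; the names are invented for illustration and this is not the Lucene BooleanQuery API:

```java
import java.util.List;
import java.util.Set;

// Boolean clause semantics: all MUST clauses match, no MUST_NOT clause
// matches, and at least one MUST or SHOULD clause matches.
public class BooleanMatch {
    public static boolean matches(Set<String> docTerms,
                                  List<String> must,
                                  List<String> should,
                                  List<String> mustNot) {
        if (!docTerms.containsAll(must)) return false;       // every MUST matches
        for (String t : mustNot) {
            if (docTerms.contains(t)) return false;          // no MUST_NOT matches
        }
        if (!must.isEmpty()) return true;                    // a MUST clause matched
        return should.stream().anyMatch(docTerms::contains); // else need one SHOULD
    }
}
```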
35. Span Query
• Asks Lucene not only what documents the query matches, but also where it matches ("spans")
• Span
– Particular parts or places in a document
– <document ID, start position, end position> tuple
36.
T0 = "it is what it is" (positions 0 1 2 3 4)
T1 = "what is it" (positions 0 1 2)
T2 = "it is a banana" (positions 0 1 2 3)
<doc ID, start pos., end pos.>
"it is": <0, 0, 2>, <0, 3, 5>, <2, 0, 2>
37. Span Query
• SpanTermQuery
– Same as TermQuery, except you can build other span queries with it
• SpanOrQuery
– Matches spans that are matched by any of some span queries
• SpanNotQuery
– Matches spans that are matched by one span query but not the other span query
38.
[Diagram: the spans over the text "apple orange" matched by spanTerm(apple), spanTerm(orange), spanOr([apple, orange]), and spanNot(apple, orange)]
39. Span Query
• SpanNearQuery
– Matches spans that are within a certain distance ("slop") of each other
– Slop: max number of positions between spans
– Can specify whether order matters
41. Filtering
• A Filter narrows down the search result
– Creates a set of document IDs
– Decides what documents get processed further
– Does not affect scoring, i.e. does not score/rank documents that pass the filter
– Can be cached easily
– Useful for access control, presets, etc.
42. Notable Filter classes
• TermsFilter
– Allows documents with any of the given terms
• TermRangeFilter
– Filter version of TermRangeQuery
• PrefixFilter
– Filter version of PrefixQuery
• QueryWrapperFilter
– “Adapts” a query into a filter
• CachingWrapperFilter
– Cache the result of the wrapped filter
43. Sorting
• Score (default)
• Index order
• Field
– Requires the field be indexed & not analyzed
– Specify type (string, int, etc.)
– Normal or reverse order
– Single or multiple fields
44. Interfacing Lucene with “Outside”
• Embedding directly
• Language bridge
– E.g. PHP/Java Bridge
• Web service
– E.g. Jetty + your own request handler
• Solr
– Lucene + Jetty + lots of useful functionality
45. Books
• Lucene in Action, 2nd Edition
– Written by 3 committers and PMC members
– http://www.manning.com/hatcher3/
• Introduction to Information Retrieval
– Not specific to Lucene, but about IR concepts
– Free e-book
– http://nlp.stanford.edu/IR-book/
47. Getting Started
– Download lucene-3.6.1.zip (or .tgz)
– Add lucene-core-3.6.1.jar to your classpath
– Consider using an IDE (e.g. Eclipse)
– Luke (Lucene Index Toolbox): http://code.google.com/p/luke/
I bet this is exactly how many systems are handling search right now.Perhaps many systems do not think about how to sort the result and just throws back the result list to the user, without considering what should go first.
Imagine the slowdown if your website goes from “nobody besides our employees and friends uses it” to being “the next Facebook”. People lose interest in your application easily if the first few things your search results present do not look exactly like what they are trying to find.
Expand on the inverted index we just saw. Positions start at zero.
There are only so many words that people commonly use. You can hash the terms, organize them as a prefix tree, sort them and use binary search, and so on. For the purpose of deciding which documents match, you only need to store document IDs (integers).
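A toy version of the idea above: a hash map from term to a sorted postings list of document IDs, with an AND query implemented as a postings intersection. The class and method names are illustrative, not Lucene's:

```java
import java.util.*;

// A toy inverted index: term -> sorted list of document IDs.
// Hashing the terms gives O(1) term lookup, as the notes suggest.
public class TinyInvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId); // IDs stay sorted because docs are added in order
            }
        }
    }

    // AND query: documents containing every given term.
    public Set<Integer> search(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            List<Integer> docs =
                postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
            if (result == null) result = new LinkedHashSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.add(0, "apple orange");
        idx.add(1, "apple banana");
        idx.add(2, "orange banana");
        System.out.println(idx.search("apple", "orange")); // [0]
        System.out.println(idx.search("banana"));          // [1, 2]
    }
}
```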
Extra info: determine how good a match a document is for a query. Put the best matches near the top of the search result list.
The highest-scored (most relevant) document is the first in the result list.
In VSM, documents and queries are represented as vectors in an n-dimensional space, where n is the total number of unique terms in the document collection and each dimension corresponds to a separate term. A vector's value in a particular dimension is non-zero if the document or the query contains that term. Document vector closer to query vector = document more relevant to the query.
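The "closer vector = more relevant" idea is usually measured with cosine similarity. A minimal sketch using raw term-frequency vectors (illustrative only; Lucene's actual scoring formula also folds in idf, length norms, and boosts):

```java
import java.util.*;

// Cosine similarity between two bag-of-words term-frequency vectors.
public class CosineDemo {
    static Map<String, Integer> tf(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+"))
            v.merge(t, 1, Integer::sum);
        return v;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        double na = 0, nb = 0;
        for (int x : a.values()) na += (double) x * x;
        for (int x : b.values()) nb += (double) x * x;
        return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Integer> query = tf("quick fox");
        double s1 = cosine(query, tf("the quick brown fox"));
        double s2 = cosine(query, tf("lazy dog"));
        // The document sharing terms with the query scores higher.
        System.out.println(s1 > s2); // true
    }
}
```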
The term might be a common word that appears everywhere.
The existence of an index helps with searching, but the index must be created in the first place before we can search with it.
Storing the field means that the original text is stored in the index and can be retrieved at search time. Indexing the field means that the field is made searchable.
Some fields (e.g. serial numbers) should not be analyzed, as they contain information that cannot be logically broken into pieces.
Token = term, at index time, with start/end position information, and not tied to a document already in the index.
Case sensitivity, punctuation, apostrophes, how to break up URLs and e-mail addresses; what needs to be kept in one piece or broken down, and where.
WhitespaceAnalyzer: whitespace as separators; punctuation is part of tokens.
StopAnalyzer: non-letters as separators; makes everything lowercase; removes common stop words like “the”.
StandardAnalyzer: sophisticated rules to handle punctuation, hyphens, etc.; recognizes (and avoids breaking up) e-mail addresses and internet hostnames.
Character folding: turns an “a” with an accent mark above it into an “a” without the accent mark.
Stemming: the words “consistent” and “consistency” have the same stem, which is “consist”.
Synonyms: like “country” and “nation”.
Shingles: “the quick”, “quick brown”, “brown fox”; useful for searching text in Asian languages like Chinese and Japanese; reduces the number of unique terms in an index and reduces overhead.
Offsets: character offsets of this token from the beginning of the field's text.
Position increment: position of this token relative to the previous token; usually 1.
This query has clauses about 3 fields, so you analyze 3 pieces of text and get back 3 sets of tokens. A good practice is to use the same analyzer that analyzed the particular field you are searching.
Examples of ranges: January 1st to December 31st of 2012 (inclusive); 1 to 10 (excluding 10).
Your pattern describes a term, not a document, so you cannot put a phrase or a sentence in a pattern and expect the query to match that phrase or sentence.
Minimum similarity score is based on the edit distance.
It takes two moves to swap two words in a phrase.
Lucene does not have the standard boolean operators.
Lucene has these instead (of the “standard” boolean operators).
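A sketch of those clause-occurrence flags in the Lucene 3.x API. The field name "body" and the terms are illustrative:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// MUST / SHOULD / MUST_NOT instead of the "standard" AND / OR / NOT:
BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("body", "apple")),  BooleanClause.Occur.MUST);     // required
q.add(new TermQuery(new Term("body", "orange")), BooleanClause.Occur.SHOULD);   // optional; boosts score
q.add(new TermQuery(new Term("body", "banana")), BooleanClause.Occur.MUST_NOT); // excluded
```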
End position is actually one plus the position of the last term in the span.
This "slop" is different from the "slop" in Phrase Query.
Total number of positions between spans = 2 + 1 + 0 = 3. The first two queries match this document because their slops are at least 3. The third query does not match because its slop is less than 3. The fourth query does not match because, even though the required slop is large enough, the query requires all the spans to be in the given order, and the spans in this document are not. The fifth query matches because the given order matches the order of the spans in the document.
CachingWrapperFilter is good for filters that don't change a lot, e.g. access restrictions.
Index order = order in which docs are added to the index.
Indexed and not analyzed = whole field as one token/term.
Embedding directly: good when the rest of your application is also in Java. In most use cases, you would be dealing with Solr rather than Lucene directly. But you would still be indirectly using Lucene, and you can still benefit from understanding many of the things discussed in this session.
Eclipse has many useful features, such as setting up the classpath and compiling your code for you. The website has both Lucene 3 and 4. Lucene 4 is still in beta. The book and most resources out there cover Lucene 3.
It shows you what your index looks like and what fields and terms it has. You can look at individual documents, run queries, and try out different analyzers.