Introduction to Search Engine-Building with Lucene
              Kai Chan
   SoCal Code Camp, October 2012
How to Search
• One (common) approach to searching all your
  documents:

for each document d {
  if (query is a substring of d’s content) {
    add d to the list of results
  }
}
sort the results (or not)

How to Search
• Problems
  – Slow: Reads the whole database for each search
  – Not scalable: If your database grows by 10x, your
    search slows down by 10x
  – How to show the most relevant documents first?




Inverted Index
• (term -> document list) map
Documents:   T0 = "it is what it is"
             T1 = "what is it"
             T2 = "it is a banana"

Inverted     "a":        {2}
index:       "banana":   {2}
             "is":       {0, 1, 2}
             "it":       {0, 1, 2}
             "what":     {0, 1}
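The (term -> document list) map above can be sketched in plain Java. This is an illustration only, not Lucene's implementation: the class name InvertedIndexSketch is hypothetical, terms are split on whitespace, and a document's ID is assumed to be its position in the input list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
class InvertedIndexSketch {
    // Builds a term -> sorted set of document IDs map.
    // A document's ID is its position in the input list.
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).trim().split("\\s+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "it is what it is", "what is it", "it is a banana");
        // Prints one "term: [doc IDs]" line per term, in term order.
        build(docs).forEach((term, ids) -> System.out.println(term + ": " + ids));
    }
}
```

Looking up a term is then a single map access, which is why a search no longer has to read every document.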
Inverted Index
• (term -> <document, position> list) map

 T0 = "it is what it is"
       0  1  2    3  4

 T1 = "what is it"
       0    1  2

 T2 = "it is a banana"
       0  1  2 3

Inverted Index
• (term -> <document, position> list) map

 T0 = "it is what it is"
 T1 = "what is it"
 T2 = "it is a banana"

 "a":        {(2,   2)}
 "banana":   {(2,   3)}
 "is":       {(0,   1), (0, 4), (1, 1), (2, 1)}
 "it":       {(0,   0), (0, 3), (1, 2), (2, 0)}
 "what":     {(0,   2), (1, 0)}

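A positional index of this shape can be sketched as follows. The class PositionalIndexSketch and its format helper are hypothetical names used only for illustration; positions are assumed to be zero-based word offsets after splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Lucene.
class PositionalIndexSketch {
    // Builds term -> list of {docId, position} pairs,
    // ordered by document ID and then by position.
    static Map<String, List<int[]>> build(List<String> docs) {
        Map<String, List<int[]>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).trim().split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                index.computeIfAbsent(terms[pos], k -> new ArrayList<>())
                     .add(new int[] {docId, pos});
            }
        }
        return index;
    }

    // Renders postings in the slide's notation, e.g. {(0, 2), (1, 0)}.
    static String format(List<int[]> postings) {
        StringJoiner joiner = new StringJoiner(", ", "{", "}");
        for (int[] p : postings) {
            joiner.add("(" + p[0] + ", " + p[1] + ")");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "it is what it is", "what is it", "it is a banana");
        build(docs).forEach((term, postings) ->
                System.out.println("\"" + term + "\": " + format(postings)));
    }
}
```

Keeping positions is what later makes phrase and span queries possible without re-reading the documents.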
Inverted Index
• Speed
  – Term list
     • Very small compared to documents’ content
     • Tends to grow more slowly than the document
       collection (after a certain level)
  – Term lookup
     • O(1) to O(log of the number of terms)
  – For a particular term:
     • Document lists: very small
     • Document + position lists: still small
  – Few terms per query

Inverted Index
• Relevance
  – Extra information in the index
     • Stored in an easily accessible way
     • Determine relevance of each document to the query
  – Enables sorting by (decreasing) relevance




Determining Relevancy
• Two models used in the searching process
  – Boolean model
     • AND, OR, NOT, etc.
     • Either a document matches a query, or not
  – Vector space model
     • How often a query term appears in a document vs.
       how often the term appears in all documents
     • Scoring and sorting by relevancy possible



Determining Relevancy
Lucene uses both models

all documents
     |
     |  filtering (Boolean model)
     v
some documents (unsorted)
     |
     |  scoring (vector space model)
     v
some documents (sorted by score)
Vector Space Model
[Figure: document 1, document 2, and the query drawn as vectors; horizontal axis: f(frequency of term A), vertical axis: f(frequency of term B)]
Scoring
• Term frequency (TF)
  – How many times does this term (t) appear in this
    document (d)?
  – Score proportional to TF
• Document frequency (DF)
  – How many documents have this term (t)?
  – Score proportional to the inverse of DF (IDF)
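To make the two factors concrete, here is a minimal sketch. It assumes a square-root TF and a logarithmic IDF, the shapes commonly attributed to the default similarity in Lucene 3.x; treat the exact formulas as an assumption and check the scoring documentation for your version. The class name TfIdfSketch is hypothetical.

```java
// Hypothetical illustration class; not part of Lucene.
class TfIdfSketch {
    // Term frequency factor: grows with the number of occurrences,
    // but more slowly than linearly (assumed square root).
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency factor: terms found in fewer
    // documents get a larger weight (assumed 1 + ln(N / (df + 1))).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Per-term contribution tf * idf^2, ignoring boosts and norms.
    static double termWeight(int freq, int docFreq, int numDocs) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf;
    }

    public static void main(String[] args) {
        // A rare term (in 1 of 1000 documents) versus a common one (in 900).
        System.out.println("rare:   " + termWeight(2, 1, 1000));
        System.out.println("common: " + termWeight(2, 900, 1000));
    }
}
```

With the same term frequency, the rare term contributes far more to the score than the common one, which is the intent of IDF.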



Scoring
• Coordination factor (coord)
  – Documents that contain all or most query terms
    get higher scores
• Normalizing factor (norm)
  – Adjusts for field length and query complexity




Scoring
• Boost
  – “Manual override”: ask Lucene to give a higher
    score to some particular thing
  – Index-time
     • Document
     • Field (of a particular document)
  – Search-time
     • Query



Scoring
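$$
\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \bigl( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \bigr)
$$

  coord(q, d):   coordination factor
  queryNorm(q):  query normalizing factor
  tf(t in d):    term frequency
  idf(t):        inverse document frequency
  boost(t):      term boost
  norm(t, d):    document boost, field boost, length normalizing factor

http://lucene.apache.org/core/3_6_0/scoring.html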
Work Flow
• Indexing
  – Index: storage of inverted index + documents
  – Add fields to a document
  – Add the document to the index
  – Repeat for every document
• Searching
  – Generate a query
  – Search with this query
  – Get back a sorted document list (top N docs)

Adding Field to Document
• Store?
• Index?
  – Analyzed (split text into multiple terms)
  – Not analyzed (treat the whole text as ONE term)
  – Not indexed (this field will not be searchable)
  – Store norms?




Analyzed vs. Not Analyzed
             Text: “the quick brown fox”


Analyzed: 4 terms                Not analyzed: 1 term
1. the                           1. the quick brown fox
2. quick
3. brown
4. fox




Index-time Analysis
• Analyzer
  – Determines which TokenStream classes to use
• TokenStream
  – Does the actual hard work
  – Tokenizer: text to tokens
  – Token filter: tokens to tokens
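The tokenizer-then-filters pipeline can be imitated in plain Java to show the data flow. This sketch does not use Lucene's classes: AnalysisChainSketch and its methods are hypothetical, the tokenizer simply splits on non-alphanumeric characters, and the stop-word set is an arbitrary example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; it only mimics the shape of an analysis chain.
class AnalysisChainSketch {
    // "Tokenizer" stage: text to tokens (split on anything that is not a letter or digit).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // "Token filter" stage: tokens to tokens (lower-casing).
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // "Token filter" stage: tokens to tokens (stop-word removal).
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // "Analyzer": decides which stages run, and in which order.
    static List<String> analyze(String text, Set<String> stopWords) {
        return removeStopWords(lowerCase(tokenize(text)), stopWords);
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "is"));
        System.out.println(analyze("The quick brown Fox", stopWords)); // [quick, brown, fox]
    }
}
```

The order of the stages matters: lower-casing runs before stop-word removal so that "The" is recognized as the stop word "the".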




Text:
San Francisco, the Bay Area’s city-county
http://www.ci.sf.ca.us controller@sfgov.org

WhitespaceAnalyzer:
[San] [Francisco,] [the] [Bay] [Area’s]
[city-county] [http://www.ci.sf.ca.us]
[controller@sfgov.org]

StopAnalyzer:
[san] [francisco] [bay] [area] [s] [city] [county]
[http] [www] [ci] [sf] [ca] [us] [controller]
[sfgov] [org]

StandardAnalyzer:
[san] [francisco] [bay] [area's] [city] [county]
[http] [www.ci.sf.ca.us] [controller] [sfgov.org]
Notable TokenStream Classes
• ASCIIFoldingFilter
  – Converts alphabetic characters into basic forms
• PorterStemFilter
  – Reduces tokens into their stems
• SynonymTokenFilter
  – Converts words to their synonyms
• ShingleFilter
  – Creates shingles (n-grams)


Tokens
• Information about a token
  – Field
  – Text
  – Start offset, end offset
  – Position increment




Attributes
• Past versions of Lucene: Token object
• Recent versions of Lucene: attributes
  – Efficiency, flexibility
  – Ask for attributes you want
  – Receive attribute objects
  – Use these objects for information about tokens




import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Create the token stream.
TokenStream tokenStream =
    analyzer.reusableTokenStream(fieldName, reader);
tokenStream.reset();

// Obtain each attribute you want to know about.
CharTermAttribute term =
    tokenStream.addAttribute(CharTermAttribute.class);
OffsetAttribute offset =
    tokenStream.addAttribute(OffsetAttribute.class);
PositionIncrementAttribute posInc =
    tokenStream.addAttribute(PositionIncrementAttribute.class);

// Go to the next token, then use information about the current token.
while (tokenStream.incrementToken()) {
  doSomething(term.toString(),
              offset.startOffset(),
              offset.endOffset(),
              posInc.getPositionIncrement());
}

// Close the token stream.
tokenStream.end();
tokenStream.close();
Query-time Analysis
• Text in a query is analyzed the same way as field text
• Use the same analyzer that analyzed the
  particular field

 +field1:"quick brown fox" +(field2:"lazy dog" field2:"cozy cat")



    quick    brown      fox        lazy   dog        cozy    cat


Query Formation
• Query parsing
  – A query parser in core code
  – Additional query parsers in contributed code
• Or build query from the Lucene query classes




Term Query
• Matches documents with a particular term
  – Field
  – Text




Term Range Query
• Matches documents with any of the terms in a
  particular range
  – Field
  – Lowest term text
  – Highest term text
  – Include lowest term text?
  – Include highest term text?



Prefix Query
• Matches documents with any of the terms
  with a particular prefix
  – Field
  – Prefix




Wildcard/Regex Query
• Matches documents with any of the terms
  that match a particular pattern
  – Field
  – Pattern
     • Wildcard: * for 0+ characters, ? for exactly 1 character
     • Regular expression
     • Pattern matching on individual terms only
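One way to picture wildcard matching on a single term is to translate the pattern into a regular expression. This is a sketch, not how Lucene evaluates wildcard queries internally; WildcardSketch is a hypothetical class, and it assumes * matches any run of characters (including none) and ? matches exactly one character, as described in the WildcardQuery documentation.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Lucene.
class WildcardSketch {
    // Translates a wildcard pattern into an equivalent regular expression.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");   // any run of characters, possibly empty
            } else if (c == '?') {
                regex.append('.');    // exactly one character (assumption stated above)
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // The pattern must match the whole term, not just a part of it.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("te?t", "test"));  // true
        System.out.println(matches("te*", "te"));     // true
        System.out.println(matches("te?t", "tet"));   // false
    }
}
```

Because matching is done term by term, a pattern such as "quick*fox" cannot match across the two separate terms "quick" and "fox".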




Fuzzy Query
• Matches documents with any of the terms
  that are “similar” to a particular term
  – Levenshtein distance (“edit distance”):
    Number of character insertions, deletions or
    substitutions needed to transform one string into
    another
     • e.g. kitten -> sitten -> sittin -> sitting (3 edits)
  – Field
  – Text
  – Minimum similarity score
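The Levenshtein distance itself can be computed with the standard dynamic-programming recurrence. The sketch below is a textbook version kept to two rows of the table; it is not Lucene's fuzzy-matching code, and the class name LevenshteinSketch is hypothetical.

```java
// Hypothetical illustration class; a textbook edit-distance computation.
class LevenshteinSketch {
    // Returns the minimum number of single-character insertions,
    // deletions, or substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions turn the first i characters of a into ""
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```

A fuzzy query then accepts the terms whose distance to the query text is small enough relative to their length.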

Phrase Query
• Matches documents with all the given words
  present and being “near” each other
  – Field
  – Terms
  – Slop
     • Number of “moves of words” permitted
     • Slop = 0 means exact phrase match required




Boolean Query
• Conceptually similar to boolean operators
  (“AND”, “OR”, “NOT”), but not identical
• Why Not AND, OR, And NOT?
  – http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/
  – In short, boolean operators do not handle more than
    two clauses well



Boolean Query
• Three types of clauses
  – Must
  – Should
  – Must not
• For a boolean query to match a document
  – All “must” clauses must match
  – All “must not” clauses must not match
  – At least one “must” or “should” clause must
    match
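The three matching rules can be written down directly. This is a sketch of the rules as stated on this slide, not Lucene's BooleanQuery implementation; BooleanMatchSketch and hasTerm are hypothetical names, and a "document" is just a whitespace-separated string.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration class; not part of Lucene.
class BooleanMatchSketch {
    // A clause that matches when the document contains the given term.
    static Predicate<String> hasTerm(String term) {
        return doc -> Arrays.asList(doc.trim().split("\\s+")).contains(term);
    }

    static boolean matches(String doc,
                           List<Predicate<String>> must,
                           List<Predicate<String>> should,
                           List<Predicate<String>> mustNot) {
        // No "must not" clause may match.
        for (Predicate<String> clause : mustNot) {
            if (clause.test(doc)) {
                return false;
            }
        }
        // Every "must" clause has to match.
        for (Predicate<String> clause : must) {
            if (!clause.test(doc)) {
                return false;
            }
        }
        // At least one "must" or "should" clause has to match.
        if (!must.isEmpty()) {
            return true;
        }
        for (Predicate<String> clause : should) {
            if (clause.test(doc)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Predicate<String>> must = Arrays.asList(hasTerm("quick"));
        List<Predicate<String>> should = Arrays.asList(hasTerm("fox"), hasTerm("dog"));
        List<Predicate<String>> mustNot = Arrays.asList(hasTerm("lazy"));
        System.out.println(matches("the quick brown fox", must, should, mustNot)); // true
        System.out.println(matches("the quick lazy dog", must, should, mustNot));  // false
    }
}
```

Under these rules a query made only of "must not" clauses matches nothing, because no "must" or "should" clause can match.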

Span Query
• Asks Lucene not only what documents the
  query matches, but also where it matches
  (“spans”)
• Span
  – Particular parts or places in a document
  – <document ID, start position, end position> tuple




T0 = "it is what it is"
      0  1  2    3  4

T1 = "what is it"
      0    1  2

T2 = "it is a banana"
      0  1  2 3


           <doc ID, start pos., end pos.>
“it is”:   <0,      0,          2>
           <0,      3,          5>
           <2,      0,          2>
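The spans listed for "it is" can be reproduced with a small scan over the example documents. This sketch is not Lucene's span machinery: PhraseSpanSketch is a hypothetical class, it handles only exact in-order phrases, and it follows the slide's convention that the end position is exclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
class PhraseSpanSketch {
    // Returns one {docId, startPosition, endPosition} triple for every
    // place where the phrase terms appear consecutively and in order.
    static List<int[]> findSpans(List<String> docs, String... phrase) {
        List<int[]> spans = new ArrayList<>();
        if (phrase.length == 0) {
            return spans; // an empty phrase has no spans
        }
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).trim().split("\\s+");
            for (int start = 0; start + phrase.length <= terms.length; start++) {
                boolean match = true;
                for (int k = 0; k < phrase.length && match; k++) {
                    match = terms[start + k].equals(phrase[k]);
                }
                if (match) {
                    spans.add(new int[] {docId, start, start + phrase.length});
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "it is what it is", "what is it", "it is a banana");
        for (int[] span : findSpans(docs, "it", "is")) {
            System.out.println("<" + span[0] + ", " + span[1] + ", " + span[2] + ">");
        }
    }
}
```

T1 contributes no span because its terms appear as "is it", not "it is".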
Span Query
• SpanTermQuery
  – Same as TermQuery, except you can build other
    span queries with it
• SpanOrQuery
  – Matches spans that are matched by any of some
    span queries
• SpanNotQuery
  – Matches spans that are matched by one span
    query but not the other span query

[Figure: four panels over the span sets "apple" and "orange": spanTerm(apple), spanOr([apple, orange]), spanTerm(orange), spanNot(apple, orange)]
Span Query
• SpanNearQuery
  – Matches spans that are within a certain distance
    (“slop”) of each other
  – Slop: max number of positions between spans
  – Can specify whether order matters




the                quick           brown       fox

                     2                 1             0

1. spanNear([brown, fox, the, quick], slop = 4, inOrder = false)        ✔

2. spanNear([brown, fox, the, quick], slop = 3, inOrder = false)        ✔

3. spanNear([brown, fox, the, quick], slop = 2, inOrder = false)        ✖

4. spanNear([brown, fox, the, quick], slop = 3, inOrder = true)         ✖

5. spanNear([the, quick, brown, fox], slop = 3, inOrder = true)         ✔


Filtering
• A Filter narrows down the search result
  – Creates a set of document IDs
  – Decides what documents get processed further
  – Does not affect scoring, i.e. does not score/rank
    documents that pass the filter
  – Can be cached easily
  – Useful for access control, presets, etc.



Notable Filter classes
• TermsFilter
   – Allows documents with any of the given terms
• TermRangeFilter
   – Filter version of TermRangeQuery
• PrefixFilter
   – Filter version of PrefixQuery
• QueryWrapperFilter
   – “Adapts” a query into a filter
• CachingWrapperFilter
   – Caches the result of the wrapped filter

Sorting
• Score (default)
• Index order
• Field
  – Requires the field to be indexed and not analyzed
  – Specify type (string, int, etc.)
  – Normal or reverse order
  – Single or multiple fields


Interfacing Lucene with “Outside”
• Embedding directly
• Language bridge
  – E.g. PHP/Java Bridge
• Web service
  – E.g. Jetty + your own request handler
• Solr
  – Lucene + Jetty + lots of useful functionality


Books
• Lucene in Action, 2nd Edition
  – Written by 3 committers and PMC members
  – http://www.manning.com/hatcher3/
• Introduction to Information Retrieval
  – Not specific to Lucene, but about IR concepts
  – Free e-book
  – http://nlp.stanford.edu/IR-book/


Web Resources
• Official Website
   – http://lucene.apache.org/
• Tutorial with sample code
   – http://www.lucenetutorial.com/lucene-in-5-minutes.html
• StackOverflow
   – http://stackoverflow.com/questions/tagged/lucene
• Mailing lists
   – http://lucene.apache.org/core/discussion.html
• Blogs
   – http://www.lucidimagination.com/blog/
   – http://blog.mikemccandless.com/
   – http://lucene.grantingersoll.com/

Getting Started
• Getting started
  – Download lucene-3.6.1.zip (or .tgz)
  – Add lucene-core-3.6.1.jar to your classpath
  – Consider using an IDE (e.g. Eclipse)
  – Luke (Lucene Index Toolbox)
    http://code.google.com/p/luke/




A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 

Introduction to search engine-building with Lucene

  • 1. Introduction to Search Engine-Building with Lucene. Kai Chan, SoCal Code Camp, October 2012
  • 2. How to Search • One (common) approach to searching all your documents:
        for each document d {
          if (query is a substring of d’s content) {
            add d to the list of results
          }
        }
        sort the results (or not)
  • 3. How to Search • Problems – Slow: Reads the whole database for each search – Not scalable: If your database grows by 10x, your search slows down by 10x – How to show the most relevant documents first?
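The linear scan from slide 2 can be sketched in plain Java as follows. This is only an illustration: the class and method names are hypothetical, and the three sample documents are borrowed from the inverted-index slides.

```java
import java.util.ArrayList;
import java.util.List;

class LinearScanSketch {
    // Returns the IDs (list positions) of all documents whose content contains the query.
    static List<Integer> search(List<String> docs, String query) {
        List<Integer> results = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // The substring test has to read the whole document on every search.
            if (docs.get(docId).contains(query)) {
                results.add(docId);
            }
        }
        return results; // unsorted: this approach has no notion of relevance
    }

    public static void main(String[] args) {
        List<String> docs = List.of("it is what it is", "what is it", "it is a banana");
        System.out.println(search(docs, "it is"));
    }
}
```

Every call walks every document, which is the 10x-data, 10x-time behaviour listed under "Problems".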
  • 4. Inverted Index • (term -> document list) map
        Documents:
          T0 = "it is what it is"
          T1 = "what is it"
          T2 = "it is a banana"
        Inverted index:
          "a":      {2}
          "banana": {2}
          "is":     {0, 1, 2}
          "it":     {0, 1, 2}
          "what":   {0, 1}
  • 5. Inverted Index • (term -> <document, position> list) map
        T0 = "it is what it is"   positions: 0 1 2 3 4
        T1 = "what is it"         positions: 0 1 2
        T2 = "it is a banana"     positions: 0 1 2 3
  • 6. Inverted Index • (term -> <document, position> list) map
        T0 = "it is what it is"
        T1 = "what is it"
        T2 = "it is a banana"
          "a":      {(2, 2)}
          "banana": {(2, 3)}
          "is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
          "it":     {(0, 0), (0, 3), (1, 2), (2, 0)}
          "what":   {(0, 2), (1, 0)}
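A minimal sketch of how the (term -> <document, position> list) map could be built, assuming whitespace-separated terms and zero-based positions as on the slide. The class name and the nested-list representation of a posting are illustrative choices, not how Lucene stores its index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PositionalIndexSketch {
    // Maps each term to its postings; each posting is [documentId, position].
    static Map<String, List<List<Integer>>> build(List<String> docs) {
        Map<String, List<List<Integer>>> index = new TreeMap<>(); // sorted by term
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                index.computeIfAbsent(terms[pos], t -> new ArrayList<>())
                     .add(List.of(docId, pos));
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("it is what it is", "what is it", "it is a banana");
        build(docs).forEach((term, postings) -> System.out.println(term + ": " + postings));
    }
}
```

Looking up a term is then a single map access, and the postings list already says where in each document the term occurs.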
  • 7. Inverted Index • Speed – Term list • Very small compared to documents’ content • Tends to grow at a slower speed than documents (after a certain level) – Term lookup • O(1) to O(log of the number of terms) – For a particular term: • Document lists: very small • Document + position lists: still small – Few terms per query
  • 8. Inverted Index • Relevance – Extra information in the index • Stored in an easily accessible way • Determines the relevance of each document to the query – Enables sorting by (decreasing) relevance
  • 9. Determining Relevancy • Two models used in the searching process – Boolean model • AND, OR, NOT, etc. • Either a document matches a query, or not – Vector space model • How often a query term appears in a document vs. how often the term appears in all documents • Scoring and sorting by relevancy possible
  • 10. Determining Relevancy • Lucene uses both models: all documents -> filtering (Boolean model) -> some documents (unsorted) -> scoring (vector space model) -> some documents (sorted by score)
  • 11. Vector Space Model • [Figure: document 1, the query, and document 2 drawn as vectors; axes: f(frequency of term A) and f(frequency of term B)]
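One common way to measure how close a document vector is to a query vector is cosine similarity over raw term counts. The sketch below assumes that measure and whitespace tokenization; it is not Lucene's actual scoring, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class VectorSpaceSketch {
    // Term-frequency vector: one dimension per distinct term in the text.
    static Map<String, Integer> vector(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String term : text.split("\\s+")) {
            v.merge(term, 1, Integer::sum);
        }
        return v;
    }

    // Euclidean length of a term-frequency vector.
    static double length(Map<String, Integer> v) {
        double sumOfSquares = 0.0;
        for (int count : v.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between two vectors: 1 = same direction, 0 = no shared terms.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        return dot / (length(a) * length(b));
    }

    public static void main(String[] args) {
        Map<String, Integer> query = vector("banana");
        System.out.println(cosine(query, vector("it is a banana")));
        System.out.println(cosine(query, vector("what is it")));
    }
}
```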
  • 12. Scoring • Term frequency (TF) – How many times does this term (t) appear in this document (d)? – Score proportional to TF • Document frequency (DF) – How many documents have this term (t)? – Score proportional to the inverse of DF (IDF)
  • 13. Scoring • Coordination factor (coord) – Documents that contain all or most query terms get higher scores • Normalizing factor (norm) – Adjusts for field length and query complexity
  • 14. Scoring • Boost – “Manual override”: ask Lucene to give a higher score to some particular thing – Index-time • Document • Field (of a particular document) – Search-time • Query
  • 15. Scoring
        $$\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \right)$$
        – coord(q, d): coordination factor
        – queryNorm(q): query normalizing factor
        – tf(t in d): term frequency
        – idf(t): inverse document frequency
        – boost(t): term boost
        – norm(t, d): document boost, field boost, length normalizing factor
        Source: http://lucene.apache.org/core/3_6_0/scoring.html
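To make the sum in the formula concrete, here is a simplified sketch that keeps only the tf and idf terms and treats coord, queryNorm, boost and norm as 1. The shapes tf = sqrt(frequency) and idf = 1 + ln(numDocs / (docFreq + 1)) are assumed here to mirror Lucene 3.x's default similarity; treat them as an assumption and check the scoring page cited on the slide. All names are hypothetical.

```java
import java.util.List;

class ScoringSketch {
    // Assumed tf shape: grows with frequency, but sub-linearly.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Assumed idf shape: terms found in fewer documents weigh more.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Number of times the term occurs in the (whitespace-tokenized) document.
    static int count(String doc, String term) {
        int n = 0;
        for (String t : doc.split("\\s+")) {
            if (t.equals(term)) n++;
        }
        return n;
    }

    // Sum over query terms of tf * idf^2; every other factor is taken as 1.
    static double score(List<String> queryTerms, int docId, List<String> docs) {
        double sum = 0.0;
        for (String term : queryTerms) {
            int freq = count(docs.get(docId), term);
            if (freq == 0) continue; // term absent: contributes nothing
            int docFreq = 0;
            for (String d : docs) {
                if (count(d, term) > 0) docFreq++;
            }
            double idfValue = idf(docFreq, docs.size());
            sum += tf(freq) * idfValue * idfValue;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("it is what it is", "what is it", "it is a banana");
        System.out.println(score(List.of("banana"), 2, docs));
        System.out.println(score(List.of("is"), 0, docs));
    }
}
```

With the sample documents, the rare term "banana" outweighs the common term "is" even though "is" occurs twice in T0, which is the effect the IDF factor is meant to have.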
  • 16. Work Flow • Indexing – Index: storage of inverted index + documents – Add fields to a document – Add the document to the index – Repeat for every document • Searching – Generate a query – Search with this query – Get back a sorted document list (top N docs)
  • 17. Adding Field to Document • Store? • Index? – Analyzed (split text into multiple terms) – Not analyzed (treat the whole text as ONE term) – Not indexed (this field will not be searchable) – Store norms?
  • 18. Analyzed vs. Not Analyzed • Text: “the quick brown fox” – Analyzed: 4 terms (1. the, 2. quick, 3. brown, 4. fox) – Not analyzed: 1 term (1. the quick brown fox)
  • 19. Index-time Analysis • Analyzer – Determines which TokenStream classes to use • TokenStream – Does the actual hard work – Tokenizer: text to tokens – Token filter: tokens to tokens
  • 20. Text: San Francisco, the Bay Area’s city-county http://www.ci.sf.ca.us controller@sfgov.org
        – WhitespaceAnalyzer: [San] [Francisco,] [the] [Bay] [Area’s] [city-county] [http://www.ci.sf.ca.us] [controller@sfgov.org]
        – StopAnalyzer: [san] [francisco] [bay] [area] [s] [city] [county] [http] [www] [ci] [sf] [ca] [us] [controller] [sfgov] [org]
        – StandardAnalyzer: [san] [francisco] [bay] [area's] [city] [county] [http] [www.ci.sf.ca.us] [controller] [sfgov.org]
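The tokenizer-then-filter pipeline can be imitated in plain Java to see why the first two outputs differ. This sketch does not use Lucene: the two tokenizers and the filter are hypothetical stand-ins that only approximate WhitespaceAnalyzer and StopAnalyzer, and the stop-word list is a made-up four-word set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalysisSketch {
    // Hypothetical, tiny stop-word list for illustration only.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the");

    // Tokenizer (text to tokens): split on whitespace only, keep punctuation.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Tokenizer (text to tokens): split on anything that is not a letter.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Token filters (tokens to tokens): lowercase, then drop stop words.
    static List<String> lowercaseAndRemoveStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String lower = t.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) out.add(lower);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "San Francisco, the Bay Area's city-county http://www.ci.sf.ca.us controller@sfgov.org";
        System.out.println(whitespaceTokens(text));
        System.out.println(lowercaseAndRemoveStopWords(letterTokens(text)));
    }
}
```

Chaining a different tokenizer or adding another filter changes which terms end up in the index, which is why the same analyzer should be used at query time.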
  • 21. Notable TokenStream Classes • ASCIIFoldingFilter – Converts alphabetic characters into basic forms • PorterStemFilter – Reduces tokens into their stems • SynonymTokenFilter – Converts words to their synonyms • ShingleFilter – Creates shingles (n-grams)
  • 22. Tokens • Information about a token – Field – Text – Start offset, end offset – Position increment
  • 23. Attributes • Past versions of Lucene: Token object • Recent versions of Lucene: attributes – Efficiency, flexibility – Ask for attributes you want – Receive attribute objects – Use these objects for information about tokens
  • 24.
        // create token stream
        TokenStream tokenStream = analyzer.reusableTokenStream(fieldName, reader);
        tokenStream.reset();
        // obtain each attribute you want to know
        CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class);
        PositionIncrementAttribute posInc = tokenStream.addAttribute(PositionIncrementAttribute.class);
        // go to the next token
        while (tokenStream.incrementToken()) {
          // use information about the current token
          doSomething(term.toString(), offset.startOffset(), offset.endOffset(), posInc.getPositionIncrement());
        }
        tokenStream.end();
        // close token stream
        tokenStream.close();
  • 25. Query-time Analysis • Text in a query is analyzed like fields • Use the same analyzer that analyzed the particular field
        Query: +field1:"quick brown fox" +(field2:"lazy dog" field2:"cozy cat")
        Text to analyze: "quick brown fox", "lazy dog", "cozy cat"
  • 26. Query Formation • Query parsing – A query parser in core code – Additional query parsers in contributed code • Or build query from the Lucene query classes
  • 27. Term Query • Matches documents with a particular term – Field – Text
  • 28. Term Range Query • Matches documents with any of the terms in a particular range – Field – Lowest term text – Highest term text – Include lowest term text? – Include highest term text?
  • 29. Prefix Query • Matches documents with any of the terms with a particular prefix – Field – Prefix
  • 30. Wildcard/Regex Query • Matches documents with any of the terms that match a particular pattern – Field – Pattern • Wildcard: * for 0+ characters, ? for exactly 1 character • Regular expression • Pattern matching on individual terms only
  • 31. Fuzzy Query • Matches documents with any of the terms that are “similar” to a particular term – Levenshtein distance (“edit distance”): Number of character insertions, deletions or substitutions needed to transform one string into another • e.g. kitten -> sitten -> sittin -> sitting (3 edits) – Field – Text – Minimum similarity score
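The Levenshtein distance itself is a small dynamic-programming computation. The sketch below is the textbook two-row version, with a hypothetical class name; how Lucene turns the distance into a similarity score is not shown.

```java
class EditDistanceSketch {
    // Minimum number of single-character insertions, deletions or substitutions
    // needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous prefix of a
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // empty prefix of a -> insert j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete i characters to reach the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting"));
    }
}
```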
  • 32. Phrase Query • Matches documents with all the given words present and being “near” each other – Field – Terms – Slop • Number of “moves of words” permitted • Slop = 0 means exact phrase match required
  • 33. Boolean Query • Conceptually similar to boolean operators (“AND”, “OR”, “NOT”), but not identical • Why Not AND, OR, And NOT? – http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/ – In short, boolean operators do not handle > 2 clauses well
  • 34. Boolean Query • Three types of clauses – Must – Should – Must not • For a boolean query to match a document – All “must” clauses must match – All “must not” clauses must not match – At least one “must” or “should” clause must match
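The three matching rules can be written down directly. In this sketch each clause is simplified to a single term and a document to a set of terms; the names are hypothetical and this is not Lucene's BooleanQuery implementation.

```java
import java.util.List;
import java.util.Set;

class BooleanClauseSketch {
    static boolean matches(Set<String> docTerms, List<String> must,
                           List<String> should, List<String> mustNot) {
        // All "must" clauses must match.
        for (String t : must) {
            if (!docTerms.contains(t)) return false;
        }
        // All "must not" clauses must not match.
        for (String t : mustNot) {
            if (docTerms.contains(t)) return false;
        }
        // At least one "must" or "should" clause must match.
        if (!must.isEmpty()) return true;
        for (String t : should) {
            if (docTerms.contains(t)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> doc = Set.of("it", "is", "a", "banana");
        System.out.println(matches(doc, List.of("it"), List.of(), List.of("what")));
        System.out.println(matches(doc, List.of(), List.of(), List.of("what")));
    }
}
```

The second call shows a consequence of the last rule: a query made only of "must not" clauses matches nothing in this sketch.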
  • 35. Span Query • Asks Lucene not only what documents the query matches, but also where it matches (“spans”) • Span – Particular parts or places in a document – <document ID, start position, end position> tuple
  • 36. Spans for “it is”
        T0 = "it is what it is"   positions: 0 1 2 3 4
        T1 = "what is it"         positions: 0 1 2
        T2 = "it is a banana"     positions: 0 1 2 3
        <doc ID, start pos., end pos.>: <0, 0, 2>, <0, 3, 5>, <2, 0, 2>
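The span tuples on the slide can be reproduced with a small scan over term positions. This is a sketch for exact phrases only, assuming whitespace tokenization and an end position that is one past the last matched term; the names are hypothetical and no Lucene span classes are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PhraseSpanSketch {
    // Returns [docId, startPosition, endPosition] for every exact occurrence of the phrase.
    static List<List<Integer>> spans(List<String> docs, String phrase) {
        List<String> wanted = Arrays.asList(phrase.split("\\s+"));
        List<List<Integer>> result = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            List<String> terms = Arrays.asList(docs.get(docId).split("\\s+"));
            for (int start = 0; start + wanted.size() <= terms.size(); start++) {
                int end = start + wanted.size(); // one past the last matched position
                if (terms.subList(start, end).equals(wanted)) {
                    result.add(List.of(docId, start, end));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("it is what it is", "what is it", "it is a banana");
        System.out.println(spans(docs, "it is"));
    }
}
```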
  • 37. Span Query • SpanTermQuery – Same as TermQuery, except you can build other span queries with it • SpanOrQuery – Matches spans that are matched by any of some span queries • SpanNotQuery – Matches spans that are matched by one span query but not the other span query
  • 38. [Figure: spans over the words “apple” and “orange” matched by spanTerm(apple), spanTerm(orange), spanOr([apple, orange]), and spanNot(apple, orange)]
  • 39. Span Query • SpanNearQuery – Matches spans that are within a certain distance (“slop”) of each other – Slop: max number of positions between spans – Can specify whether order matters
  • 40. Document: "the quick brown fox" (positions between spans, as marked in the diagram: 2, 1, 0)
        1. spanNear([brown, fox, the, quick], slop = 4, inOrder = false) ✔
        2. spanNear([brown, fox, the, quick], slop = 3, inOrder = false) ✔
        3. spanNear([brown, fox, the, quick], slop = 2, inOrder = false) ✖
        4. spanNear([brown, fox, the, quick], slop = 3, inOrder = true) ✖
        5. spanNear([the, quick, brown, fox], slop = 3, inOrder = true) ✔
  • 41. Filtering • A Filter narrows down the search result – Creates a set of document IDs – Decides what documents get processed further – Does not affect scoring, i.e. does not score/rank documents that pass the filter – Can be cached easily – Useful for access control, presets, etc.
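The idea of a filter as a reusable set of document IDs can be sketched with java.util.BitSet. The names below are hypothetical and this is not Lucene's Filter API; it only shows that filtering decides which matching documents go on to scoring, without ranking them.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class FilterSketch {
    // Builds the filter's document-ID set, e.g. the documents a user may see.
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    // Keeps only the matching documents that pass the filter; no scores are involved.
    static List<Integer> docsToScore(List<Integer> matchingDocIds, BitSet allowed) {
        List<Integer> kept = new ArrayList<>();
        for (int docId : matchingDocIds) {
            if (allowed.get(docId)) kept.add(docId);
        }
        return kept;
    }

    public static void main(String[] args) {
        BitSet allowed = bits(0, 2); // computed once, then reusable across searches
        System.out.println(docsToScore(List.of(2, 1, 0), allowed));
    }
}
```

Because the bit set does not depend on the query, it can be computed once and reused, which is what makes filters easy to cache.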
  • 42. Notable Filter classes • TermsFilter – Allows documents with any of the given terms • TermRangeFilter – Filter version of TermRangeQuery • PrefixFilter – Filter version of PrefixQuery • QueryWrapperFilter – “Adapts” a query into a filter • CachingWrapperFilter – Caches the result of the wrapped filter
  • 43. Sorting • Score (default) • Index order • Field – Requires the field be indexed & not analyzed – Specify type (string, int, etc.) – Normal or reverse order – Single or multiple fields
  • 44. Interfacing Lucene with “Outside” • Embedding directly • Language bridge – E.g. PHP/Java Bridge • Web service – E.g. Jetty + your own request handler • Solr – Lucene + Jetty + lots of useful functionality
  • 45. Books • Lucene in Action, 2nd Edition – Written by 3 committers and PMC members – http://www.manning.com/hatcher3/ • Introduction to Information Retrieval – Not specific to Lucene, but about IR concepts – Free e-book – http://nlp.stanford.edu/IR-book/
  • 46. Web Resources • Official Website – http://lucene.apache.org/ • Tutorial with sample code – http://www.lucenetutorial.com/lucene-in-5-minutes.html • StackOverflow – http://stackoverflow.com/questions/tagged/lucene • Mailing lists – http://lucene.apache.org/core/discussion.html • Blogs – http://www.lucidimagination.com/blog/ – http://blog.mikemccandless.com/ – http://lucene.grantingersoll.com/
  • 47. Getting Started • Getting started – Download lucene-3.6.1.zip (or .tgz) – Add lucene-core-3.6.1.jar to your classpath – Consider using an IDE (e.g. Eclipse) – Luke (Lucene Index Toolbox) http://code.google.com/p/luke/

Editor's Notes

  1. I bet this is exactly how many systems are handling search right now. Perhaps many systems do not think about how to sort the results and just throw the result list back to the user, without considering what should go first.
  2. Imagine the slowdown if your website goes from "nobody besides our employees and friends use it" to being "the next Facebook". People lose interest in your application easily if the first few things your search results present do not look exactly like what they are trying to find.
  3. Expand on the inverted index we just saw. Positions start at zero.
  4. There are only so many words that people commonly use. You can hash the terms, organize them as a prefix tree, sort them and use binary search, and so on. For the purpose of deciding which documents match, you only need to store document IDs (integers).
  5. Extra info: determines how good a match a document is to a query. Put the best matches near the top of the search result list.
  6. The highest-scored (most relevant) document is the first in the result list.
  7. In VSM, documents and queries are represented as vectors in an n-dimensional space, where n is the total number of unique terms in the document collection, and each dimension corresponds to a separate term. A vector's value in a particular dimension is non-zero if the document or the query contains that term. Document vector closer to query vector = document more relevant to the query.
  8. The term might be a common word that appears everywhere.
  9. Existence of the index can help with the search, but the index must be created in the first place before we can search with it.
  10. Storing the field means that the original text is stored in the index and can be retrieved at search time. Indexing the field means that the field is made searchable.
  11. Some fields (e.g. serial numbers) should not be analyzed, as they contain information that cannot be logically broken into pieces.
  12. Token = term, at index time, with start/end position information, and not tied to a document already in the index.
  13. Case-sensitivity, punctuation, apostrophes, how to break URLs and e-mail addresses: what needs to be kept in one piece or broken down, and where. WhitespaceAnalyzer: whitespace as separators; punctuation is part of tokens. StopAnalyzer: non-letters as separators; makes everything lowercase; removes common stop words like "the". StandardAnalyzer: sophisticated rules to handle punctuation, hyphens, etc.; recognizes (and avoids breaking up) e-mail addresses and internet hostnames.
  14. Character folding: turns the "a" with an accent mark above into an "a" without the accent mark. Stemming: the words "consistent" and "consistency" have the same stem, which is "consist". Synonyms: like "country" and "nation". Shingles: "the quick", "quick brown", "brown fox"; useful for searching text in Asian languages like Chinese and Japanese; reduces the number of unique terms in an index and reduces overhead.
  15. Offsets: character offsets of this token from the beginning of the field's text. Position increment: position of this token relative to the previous token; usually 1.
  16. This query has 3 phrase clauses, so you analyze 3 pieces of text and get back 3 sets of tokens. A good practice is to use the same analyzer that analyzed the particular field that you are searching.
  17. Examples of ranges: January 1st to December 31st of 2012 (inclusive); 1 to 10 (excluding 10).
  18. Your pattern describes a term, not a document, so you cannot put a phrase or a sentence in a pattern and expect the query to match that phrase or sentence.
  19. Minimum similarity score is based on the edit distance.
  20. It takes two moves to swap two words in a phrase.
  21. Lucene does not have the standard boolean operators.
  22. Lucene has these instead (of the “standard” boolean operators).
  23. End position is actually one plus the position of the last term in the span.
  24. This "slop" is different from the "slop" in Phrase Query.
  25. Total number of positions between spans = 2 + 1 + 0 = 3. The first two queries match this document because the slops are at least 3. The third query does not match because the slop is less than 3. The fourth query does not match because, even though the slop is large enough, the query requires all the spans to be in the given order, and the spans in this document are not. The fifth query matches because the given order matches the order of the spans in the document.
  26. CachingWrapperFilter is good for filters that don’t change a lot, e.g. access restrictions.
  27. Index order = order in which docs are added to the index. Indexed and not analyzed = whole field as one token/term.
  28. Embedding directly: good when the rest of your application is also in Java. In most use cases, you would be dealing with Solr rather than Lucene directly. But you would still be indirectly using Lucene, and you can still benefit from understanding many of the things discussed in this session.
  29. Eclipse has many useful features such as setting up the classpath and compiling your code for you. The website has Lucene 3 and 4. Lucene 4 is still in beta. The book and most resources out there cover Lucene 3.
  30. It shows you what your index looks like and what fields and terms it has. You can look at individual documents, run queries, and try out different analyzers.