Lucene intro

About Me

• Cristian Vat

• Java Developer / Geek / Enthusiast

• Contact

• @deathy

• ... or TM JUG mailing list

About YOU

• Heard about Lucene / Solr ?

• Used Lucene / Solr ?

Databases

• Select/Search on (usually) exact values or
ranges

• Group/Summarize Results

• Sort results by value(s) of certain result
column(s)

Text Search

• Search for individual words/tokens

• Search long text documents

• More language-aware

• Results “sorted” by relevance by default

IR Quick Intro

• Doc 1: “I did enact Julius Caesar: I was killed i’
the Capitol; Brutus killed me.”

• Doc 2: “So let it be with Caesar. The noble
Brutus hath told you Caesar was ambitious:”

IR Quick Intro

• Index

• “I” -> Doc 1

• “Caesar” -> Doc 1, Doc 2

• “enact” -> Doc 1

• “noble” -> Doc 2
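
A minimal sketch of how such an index could be built for the two example documents, in plain Java with no Lucene classes. The class name `InvertedIndexSketch`, the naive letter-only tokenization and the lower-casing of terms are assumptions made for illustration; a real Lucene index stores much more (positions, frequencies, norms).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class, not part of Lucene.
final class InvertedIndexSketch {

    // The two example documents; list position + 1 is used as the document id.
    static final List<String> DOCS = Arrays.asList(
        "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
        "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:");

    // Maps each lower-cased term to the sorted ids of the documents containing it.
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            // Naive tokenization (assumption): lower-case, split on non-letters.
            for (String term : docs.get(i).toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(i + 1);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> index = build(DOCS);
        for (String term : Arrays.asList("i", "caesar", "enact", "noble")) {
            System.out.println(term + " -> " + index.get(term));
        }
    }
}
```

Looking up a term is then a single map access, which is why term searches stay fast even when the documents themselves are long.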

IR Quick Intro

• Search

• caesar

• c?es*

• caesar AND noble

• “Julius Caesar”

• Caesar NOT Brutus
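
Boolean queries like the ones above can be pictured as set operations on the document lists of an index. The sketch below is plain Java, not Lucene code; the class `BooleanSearchSketch` and its hand-written postings (taken from the two example documents) are assumptions for illustration. Phrase queries such as “Julius Caesar” additionally need term positions and are left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration class, not part of Lucene.
final class BooleanSearchSketch {

    // Hand-written postings for a few terms of the two example documents.
    static final Map<String, SortedSet<Integer>> POSTINGS = new HashMap<>();
    static {
        POSTINGS.put("caesar", new TreeSet<>(Arrays.asList(1, 2)));
        POSTINGS.put("brutus", new TreeSet<>(Arrays.asList(1, 2)));
        POSTINGS.put("julius", new TreeSet<>(Arrays.asList(1)));
        POSTINGS.put("noble", new TreeSet<>(Arrays.asList(2)));
    }

    // Returns a copy of the document ids for one term (empty if the term is unknown).
    static SortedSet<Integer> term(String t) {
        SortedSet<Integer> ids = POSTINGS.get(t.toLowerCase(Locale.ROOT));
        SortedSet<Integer> copy = new TreeSet<>();
        if (ids != null) {
            copy.addAll(ids);
        }
        return copy;
    }

    // a AND b: documents present in both lists.
    static SortedSet<Integer> and(SortedSet<Integer> a, SortedSet<Integer> b) {
        SortedSet<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // a NOT b: documents in a that are absent from b.
    static SortedSet<Integer> not(SortedSet<Integer> a, SortedSet<Integer> b) {
        SortedSet<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        System.out.println("caesar AND noble  -> " + and(term("caesar"), term("noble")));
        System.out.println("Caesar NOT Brutus -> " + not(term("Caesar"), term("Brutus")));
    }
}
```

Note that “Caesar NOT Brutus” matches nothing here, because both example documents mention Brutus.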

Lucene Ecosystem

...and many more

Lucene

• IR Library

• Just API for Indexing/Searching

• No GUI

• No parsers for different file formats

Lucene

• Fast

• Thread-Safe/Multi-Threaded indexing and
searching

• No dependencies! (not even logging
framework)

Solr

• Search Server / Layer over Lucene

• Provides REST-like HTTP (JSON/XML) API

• Client libraries in Java, PHP, Python, Ruby,
Perl, .NET, ...

Solr

• More structured indexes

• Replication / Distribution, Master-Slave, etc.

• Faceted Search / Filtering

• Indexing of rich document types (via Tika)

Tika

• “Content Analysis Toolkit”

• Text and Metadata extraction from various
rich document types

• Used by Solr for indexing rich document
types

Lucene Index Structure

• Index = One or more Documents

• Document = one or more Fields with values

• NO Schema/Structure restrictions
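
To make the “no schema” point concrete, here is a rough plain-Java model of an index whose documents are just bags of named fields. It does not use Lucene's own classes; `SchemaFreeIndexSketch`, the field names and the sample documents are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class, not part of Lucene.
final class SchemaFreeIndexSketch {

    private final List<Map<String, String>> docs = new ArrayList<>();
    // field name -> term -> ids of the documents containing that term in that field
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    // Adds a document; any field name is accepted, none is required.
    int add(Map<String, String> doc) {
        int id = docs.size();
        docs.add(doc);
        for (Map.Entry<String, String> field : doc.entrySet()) {
            Map<String, Set<Integer>> terms =
                postings.computeIfAbsent(field.getKey(), f -> new HashMap<>());
            for (String term : field.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    terms.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
                }
            }
        }
        return id;
    }

    // Ids of the documents whose given field contains the given term.
    Set<Integer> search(String field, String term) {
        Map<String, Set<Integer>> terms = postings.get(field);
        if (terms == null) {
            return Collections.emptySet();
        }
        Set<Integer> ids = terms.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.<Integer>emptySet() : ids;
    }

    // Two sample documents with different sets of fields.
    static SchemaFreeIndexSketch demo() {
        SchemaFreeIndexSketch index = new SchemaFreeIndexSketch();
        Map<String, String> first = new HashMap<>();
        first.put("title", "Lucene intro");
        first.put("author", "Cristian Vat");
        index.add(first);
        Map<String, String> second = new HashMap<>();
        second.put("title", "Solr notes");
        second.put("year", "2011");
        index.add(second);
        return index;
    }
}
```

The two sample documents share only the `title` field, and nothing in the index has to be declared up front for that to work.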

Query Parser

• AND, OR, NOT ( +/- )

• “apache AND lucene NOT solr” ( “+apache
+lucene -solr” )

• Range Queries

• year:[1994 TO 2011]

• Wildcard/Fuzzy:

• “ap?che”, “apac*”, “appche~0.8”
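
As a rough illustration of what the wildcard and range syntax means, the sketch below translates a wildcard pattern into a regular expression and checks an inclusive range. This is plain Java, not how Lucene implements these queries internally; `QuerySyntaxSketch` and its method names are hypothetical. Fuzzy matching is omitted.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of Lucene.
final class QuerySyntaxSketch {

    // '?' matches exactly one character, '*' matches zero or more characters.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                // Quote everything else so it is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    // Square brackets, as in year:[1994 TO 2011], include both end points.
    static boolean inRange(int lower, int upper, int value) {
        return value >= lower && value <= upper;
    }

    public static void main(String[] args) {
        System.out.println(matches("ap?che", "apache"));
        System.out.println(matches("apac*", "apache"));
        System.out.println(inRange(1994, 2011, 2011));
    }
}
```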

Sorting of Results

• Default sort by Relevance

• Possible to use custom sort fields
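
The difference between the two orderings can be sketched with plain Java comparators. The `Hit` class, its fields and the sample values below are hypothetical and only stand in for whatever a search returns; this is not Lucene's sorting API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration class, not part of Lucene.
final class SortSketch {

    // A made-up search hit: a title, one sortable field (year) and a relevance score.
    static final class Hit {
        final String title;
        final int year;
        final double score;

        Hit(String title, int year, double score) {
            this.title = title;
            this.year = year;
            this.score = score;
        }
    }

    static List<Hit> demoHits() {
        return Arrays.asList(
            new Hit("A", 1994, 0.4),
            new Hit("B", 2011, 0.9),
            new Hit("C", 2004, 0.7));
    }

    // Default behaviour: highest score first.
    static List<Hit> byRelevance(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return sorted;
    }

    // Custom sort field: oldest year first, ignoring the score.
    static List<Hit> byYear(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Hit h) -> h.year));
        return sorted;
    }
}
```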

Relevance

• A score is calculated for each document,
based on that document's fields and the current
search query

For the nerds

http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/search/Similarity.html
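
For a feel of what goes into such a score, here is a heavily simplified single-term scoring sketch in plain Java. It is loosely modelled on the term-frequency, inverse-document-frequency and length-normalization factors described on the Similarity page linked above, and it leaves out the other factors (coordination, query normalization, boosts). The class `RelevanceSketch` and the exact way the factors are combined are assumptions, not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class, not part of Lucene.
final class RelevanceSketch {

    // The two example documents from the IR intro, already split into terms.
    static final List<List<String>> DOCS = Arrays.asList(
        tokens("I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me."),
        tokens("So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:"));

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Simplified score of one lower-cased term against one document (0-based id).
    static double score(String term, int docId) {
        List<String> doc = DOCS.get(docId);
        int freq = Collections.frequency(doc, term);
        if (freq == 0) {
            return 0.0;
        }
        int docFreq = 0;
        for (List<String> d : DOCS) {
            if (d.contains(term)) {
                docFreq++;
            }
        }
        double tf = Math.sqrt(freq);                                        // more occurrences, higher score
        double idf = 1.0 + Math.log((double) DOCS.size() / (docFreq + 1)); // rarer terms weigh more
        double lengthNorm = 1.0 / Math.sqrt(doc.size());                   // shorter documents weigh more
        return tf * idf * idf * lengthNorm;
    }
}
```

With these inputs, a search for “caesar” ranks the second document first, because the term occurs twice there and only once in the first document.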

Analysis

• From long continuous text to individual
tokens/words used for indexing

Analysis

• Text -> Tokenizer -> (TokenFilter)* -> Tokens

Tokenizer

• Splits main text into words, by whitespace,
punctuation, other rules

• Text: “So, it has come to this!”

• Tokens: [ “So”, “it”, “has”, “come”, “to”, “this” ]
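
A toy tokenizer that reproduces the example above could look like the following. It is plain Java rather than one of Lucene's tokenizer classes, and the rule of splitting on whitespace and ASCII punctuation is an assumption; real tokenizers apply more careful, language-aware rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of Lucene.
final class TokenizerSketch {

    // Splits on runs of whitespace and ASCII punctuation, dropping empty pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[\\s\\p{Punct}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [So, it, has, come, to, this]
        System.out.println(tokenize("So, it has come to this!"));
    }
}
```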

Token Filters

• Change existing tokens, add new ones, or drop
unwanted ones (e.g. stop words)

• Case-Folding

• Synonyms

• Stemming

Token Filters

• Text: “The Pandorica was constructed to
ensure the safety of the Alliance.”

• Tokens: [“The”, “Pandorica”, “was”,
“constructed”, “to”, “ensure”, “the”, “safety”,
“of”, “the”, “Alliance” ]

• Filtered: [ “pandorica”, “was”, “construct”,
“to”, “ensure”, “safe”, “of”, “alliance” ]
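
The whole chain (tokenizer, then filters) can be sketched in plain Java as below. It reproduces the filtered tokens from this example, but the class `AnalysisChainSketch`, the one-word stop list and the tiny stem lookup table are stand-ins chosen for illustration; they are not Lucene's filters, and a real stemmer works from rules rather than a lookup table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class, not part of Lucene.
final class AnalysisChainSketch {

    // Assumption: only "the" is treated as a stop word, to match the example.
    static final Set<String> STOP_WORDS = Collections.singleton("the");

    // Assumption: a lookup table standing in for a real stemming algorithm.
    static final Map<String, String> STEMS = new HashMap<>();
    static {
        STEMS.put("constructed", "construct");
        STEMS.put("safety", "safe");
    }

    // Tokenizer: split on whitespace and ASCII punctuation.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[\\s\\p{Punct}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Filter 1: case folding.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter 2: drop stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Filter 3: replace each token by its stem when one is known.
    static List<String> stem(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(STEMS.getOrDefault(t, t));
        }
        return out;
    }

    // Text -> Tokenizer -> filters -> tokens used for indexing.
    static List<String> analyze(String text) {
        return stem(removeStopWords(lowerCase(tokenize(text))));
    }
}
```

The same chain has to be applied to query text at search time, otherwise a query for “Constructed” would never meet the indexed token “construct”.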
