Solr Powered Lucene

Erik Hatcher
Lucid Imagination
erik.hatcher@lucidimagination.com

Northern Virginia Java Users Group
December 16, 2009


Erik Hatcher
• Member of Technical Staff, Lucid
Imagination
• Apache Lucene/Solr Committer
• Member, Apache Software Foundation
• Co-author, Lucene in Action and Java
Development with Ant (Manning)


A word from our
sponsor...
• Pizza!

• Lucid Imagination

• commercial entity exclusively dedicated to Apache Lucene/
Solr open source search technology

• Services: Technical Support, Expert Link, Training, Consulting

• Tools: LucidGaze for Lucene and Solr

• Free certified distributions of Solr and Lucene

• more to come...


What is Solr?
• Search server
• Built upon Apache Lucene (Java)
• Very fast
• Scalable in both query load and collection size
• Interoperable
• Extensible

Solr Example
• Start Solr

• java -jar start.jar (Apache Solr distro)

• lucidworks start (Lucid certified distro)

• java -jar post.jar *.xml

• HTML view (via Solritas)

• easily enabled in Apache distro

• built-in to Lucid certified distro


Solr History
• Created by Yonik Seeley for CNET
• Contributed to Apache in January 2006
• December 2006: Version 1.
• June 2007: Version 1.2
• September 2008: Version 1.3
• November 2009: Version 1.4

Features
• Lucene power exposed over HTTP

• Spell checking, highlighting,
more-like-this

• Caching

• Replication

• Faceting

• Distributed search


Solr APIs
• HTTP GET/POST (curl or any other HTTP
client)
• JSON
• SolrJ (embedded or HTTP)
• Ruby: solr-ruby, RSolr, etc.
• Many others: Python, PHP, solrsharp, XSLT

Deployment
Architecture
• Scales from:

• single Solr server

• master/replicants (slaves)

• distributed shards

• Each Solr instance can also
have multiple cores


Lucene Fundamentals


Concepts

• Index
• Document
• Field
• Terms (aka Tokens)


Inverted Index
From "Taming Text" by Grant Ingersoll and Tom Morton
• Commonly used search
engine data structure

• Efficient lookup of terms
across large number of
documents

• Usually stores positional
information to enable
phrase/proximity queries
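A minimal sketch of the idea, not Lucene's actual on-disk format: each term maps to the documents containing it, along with the positions at which it occurs, which is what makes phrase matching possible. The class name, the naive tokenizer, and the method names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical teaching sketch; Lucene's real index is far more compact.
final class InvertedIndexSketch {
    // term -> (document id -> positions at which the term occurs)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Lowercase the text, split on non-word characters, record each position.
    void add(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    // Ids of the documents containing the term, in ascending order.
    List<Integer> docsContaining(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(docs.keySet());
    }

    // Positions of the term within one document (empty if absent).
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null || !docs.containsKey(docId)) {
            return Collections.emptyList();
        }
        return docs.get(docId);
    }

    // True if `second` occurs immediately after `first` in the document.
    boolean phraseMatch(int docId, String first, String second) {
        List<Integer> next = positions(second, docId);
        for (int p : positions(first, docId)) {
            if (next.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Apple iPod with Video");
        index.add(2, "Video playback on the iPod");
        System.out.println(index.docsContaining("ipod"));
        System.out.println(index.phraseMatch(1, "ipod", "with"));
    }
}
```

A term lookup touches only that term's postings, never the full document set, which is where the efficiency comes from.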


What's in a token?


Lucene Scoring
[Figure: document vector d1 and query vector q1, separated by angle Θ]


Relevance
• Term frequency (TF): number of times a term
appears in a document

• Inverse document frequency (IDF): inversely related to the
number of documents in the index that contain the term (1/df)

• Field length normalization: controls the effect that field
length, in number of terms, has on the score

• Boost factors: terms, fields, or documents
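A rough sketch of how these factors might combine for a single query term. The formulas are modeled on those documented for Lucene's default similarity in this era (square-root TF, logarithmic IDF, inverse-square-root length norm); treat them as assumptions, and note that the sketch leaves out the coordination factor and query normalization. The class and method names are hypothetical.

```java
// Hypothetical single-term scoring sketch; not Lucene's actual Similarity class.
final class ScoringSketch {
    // Term frequency: grows with occurrences, but sub-linearly.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: rarer terms weigh more.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Field length normalization: matches in shorter fields weigh more.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Simplified score of one term in one field of one document.
    static double score(int freq, int docFreq, int numDocs, int numTerms, double boost) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * lengthNorm(numTerms) * boost;
    }

    public static void main(String[] args) {
        // Same document shape, but the first term is much rarer in the index.
        System.out.println(score(1, 1, 1000, 10, 1.0));
        System.out.println(score(1, 500, 1000, 10, 1.0));
    }
}
```

Comparing the two printed values shows the rare term outscoring the common one, all else being equal.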


Solr Core

• single primary index
• schema.xml / solrconfig.xml
• (optionally) multiple cores per Solr
instance, configured in solr.xml
• other configuration and data files


schema.xml
• Field types
• Fields
• Unique key (optional*)
* I've never seen a case that didn't require a
unique identifier per document
• copy fields
• similarity and Solr query parser configuration


Schema Analysis
• http://localhost:8983/solr/admin/analysis.jsp

• Document analysis request handler:
curl http://localhost:8983/solr/analysis/document --data-binary @ipod_video.xml -H 'Content-type:text/xml; charset=utf-8'

• Field analysis request handler:
http://localhost:8983/solr/analysis/field?analysis.fieldtype=text&analysis.fieldvalue=Foo%20Bar&q=foo&analysis.showmatch=true


solrconfig.xml
• Lucene indexing parameters
• Cache settings
• Request handler configuration
• HTTP cache settings
• Search components, response writers,
query parsers


Request handlers
• mini-“servlets”
• SearchHandler extensions chain search
components
• Flexible response formatting:
• &wt=[json, ruby, xslt, php, phps, javabin, python, velocity]


Solr XML
<add><doc>
  <field name="id">MA147LL/A</field>
  <field name="name">Apple 60 GB iPod with Video Playback Black</field>
  <field name="manu">Apple Computer Inc.</field>
  <field name="cat">electronics</field>
  <field name="cat">music</field>
  <field name="features">iTunes, Podcasts, Audiobooks</field>
  <field name="features">Stores up to 15,000 songs, 25,000 photos, or 150 hours of video</field>
  <field name="features">2.5-inch, 320x240 color TFT LCD display with LED backlight</field>
  <field name="features">Up to 20 hours of battery life</field>
  <field name="features">Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video</field>
  <field name="features">Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication</field>
  <field name="includes">earbud headphones, USB cable</field>
  <field name="weight">5.5</field>
  <field name="price">399.00</field>
  <field name="popularity">10</field>
  <field name="inStock">true</field>
</doc></add>
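When documents come from application data rather than hand-written files, the same message can be generated in code. The sketch below builds an <add><doc> message from a map of field names to values (multi-valued fields simply repeat the element) and escapes XML special characters. The class and method names are hypothetical, and it uses only the JDK, not SolrJ.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for building a Solr XML update message by hand.
final class SolrXmlSketch {
    // Escape the characters that are special in XML text and attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // One <field> element per value; a multi-valued field repeats its name.
    static String toAddXml(Map<String, List<String>> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, List<String>> entry : fields.entrySet()) {
            for (String value : entry.getValue()) {
                xml.append("<field name=\"").append(escape(entry.getKey())).append("\">")
                   .append(escape(value))
                   .append("</field>");
            }
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("id", Arrays.asList("MA147LL/A"));
        doc.put("cat", Arrays.asList("electronics", "music"));
        System.out.println(toAddXml(doc));
    }
}
```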


Indexing Solr XML

• Via curl:
curl 'http://localhost:8983/solr/update?commit=true' --data-binary @ipod_video.xml -H 'Content-type:text/xml; charset=utf-8'

• Via Solr's Java-based post tool:
java -jar post.jar ipod_video.xml


Indexing CSV

curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'


Content Streams
• Allows Solr server to fetch local or remote data
itself. Must enable remote streaming in
solrconfig.xml
• http://localhost:8983/solr/update

• ?stream.file=<local Solr path to
exampledocs>/ipod_video.xml

• ?stream.url=<url to content>

• Security warning: allows Solr to fetch arbitrary
server-side file or network URL content


Indexing with SolrJ
SolrServer solr =
new CommonsHttpSolrServer(
new URL("http://localhost:8983/solr"));

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "EXAMPLEDOC01");
doc.addField("title", "NOVAJUG SolrJ Example");
solr.add(doc);

solr.commit(); // after a batch, not per document

solr.optimize(); // periodically, if/when needed


Indexing with solr-ruby
solr = Connection.new(
  'http://localhost:8983/solr',
  :autocommit => :on)

solr.add(:id => 123,
  :title => 'Solr in Action')

solr.optimize # periodically, as needed


delete, update, etc
• Delete:

• <delete><id>05991</id></delete>

• <delete>
<query>category:Unused</query>
</delete>

• java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"

• Update: simply <add> doc with same unique key

• <commit/> pending documents

• <optimize/> index, squeezes out deleted documents, collapses segments

• <rollback/> to last commit point
• Update commands via GET: http://localhost:8983/solr/update?stream.body=<commit/>


Data Import Handler

• Indexes relational database, XML data, and e-mail sources

• Supports full and incremental/delta indexing

• Highly extensible with custom data sources,
transformers, etc

• http://wiki.apache.org/solr/DataImportHandler


DIH details
• Put JDBC driver JAR in <solr-home>/lib, configure dataimport request handler
• http://localhost:8983/solr/db/admin/dataimport.jsp - debugging console
• http://localhost:8983/solr/db/dataimport?command=full-import - removes all documents and imports from scratch


Solr Cell
• aka ExtractingRequestHandler
• leveraging Tika, extracts and indexes rich
documents such as Word, PDF, HTML, and
many other types
• curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html"

• http://wiki.apache.org/solr/ExtractingRequestHandler


Standard Search
Request

• http://localhost:8983/solr/select?q=query


Debug Query

• &debugQuery=true is your friend
• Includes parsed query, explanations, and
search component timings in response


Searching
• Send GET HTTP requests
• http://localhost:8983/solr/select?q=solr&start=0&rows=10&fl=id,name

• start: zero-based starting result
• rows: number of hits to return
• fl: list of stored fields to return
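A small sketch of assembling such a request URL in plain Java, assuming zero-based result pages so that start = page * rows. The class and method names are hypothetical; only the q, start, rows, and fl parameters come from the slide. Values are URL-encoded, so the comma in the field list appears as %2C.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds a /select URL; it does not send the request.
final class SelectUrlSketch {
    // URL-encode one parameter value as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // page is zero-based; start is the offset of the first hit on that page.
    static String selectUrl(String baseUrl, String query, int page, int rows, String... fields) {
        int start = page * rows;
        return baseUrl + "/select?q=" + encode(query)
                + "&start=" + start
                + "&rows=" + rows
                + "&fl=" + encode(String.join(",", fields));
    }

    public static void main(String[] args) {
        // Third page of ten hits, returning only the id and name fields.
        System.out.println(selectUrl("http://localhost:8983/solr", "apple ipod", 2, 10, "id", "name"));
    }
}
```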


Query Parser

• Controlled by defType parameter
• &defType=lucene (actually a Solr
extension of Lucene’s QueryParser)
• &defType=dismax
• Local {!...} override syntax


Solr Query Parser
• http://lucene.apache.org/java/2_9_1/queryparsersyntax.html + Solr extensions

• Kitchen sink parser, includes advanced, user-unfriendly syntax

• Syntax errors throw parse exceptions back to
client

• Example: title:ipod* AND price:[0 TO 100]

• http://wiki.apache.org/solr/SolrQuerySyntax


Dismax Query Parser
• Simplified syntax:
loose text “quote phrases” -prohibited
+required
• Spreads query terms across query fields
(qf) with dynamic boosting per field, phrase
construction (pf), and boosting query and
function capabilities (bq and bf)


Searching with SolrJ
SolrServer server = new
CommonsHttpSolrServer("http://localhost:8983/solr");

SolrQuery params = new SolrQuery("author:John");
params.setFields("*,score");
params.setRows(3);

QueryResponse response = server.query(params);

for (SolrDocument document : response.getResults()) {
System.out.println("Doc: " + document);
}


Searching with Ruby
conn = Connection.new(
'http://localhost:8983/solr')

conn.query('my query') do |hit|
puts hit.inspect
end


Built-in search
components
• Standard: query, facet, mlt, highlight,
stats, debug
• Others: elevation, clustering, term,
term vector


Faceting
• Counts per subset within results

• Facet on: field terms, queries, date ranges

• &facet=on&facet.field=cat&facet.query=price:[0 TO 100]

SimpleFacetParameters
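To make "counts per subset within results" concrete, here is a client-side sketch that counts the values of one field across a list of matching documents. Solr computes facet counts on the server using its own data structures; this is only an illustration of what a field facet means, with documents modeled as hypothetical maps of field name to values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of field faceting; not how Solr implements it.
final class FacetCountSketch {
    // Count how many matching documents carry each value of the given field.
    static Map<String, Integer> facetCounts(List<Map<String, List<String>>> results, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, List<String>> doc : results) {
            List<String> values = doc.get(field);
            if (values == null) {
                continue; // this document has no value for the field
            }
            for (String value : values) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, List<String>>> results = new ArrayList<>();
        results.add(Collections.singletonMap("cat", Arrays.asList("electronics", "music")));
        results.add(Collections.singletonMap("cat", Arrays.asList("electronics")));
        System.out.println(facetCounts(results, "cat"));
    }
}
```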


Spell checking
• http://localhost:8983/solr/spell?q=epod&spellcheck=on&spellcheck.build=true
• File or index-based dictionaries
• Supports pluggable distance algorithms: Levenshtein and Jaro-Winkler

Highlighting

q=apple&hl=on&hl.fl=*
HighlightingParameters


More Like This

q=ipod&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score,name
• http://wiki.apache.org/solr/MoreLikeThis


Query Elevation
• http://localhost:8983/solr/elevate?q=ipod&debugQuery=true&enableElevation=true
• Configure an “elevate.xml” to boost/exclude specific documents
QueryElevationComponent


Clustering
• Dynamic grouping of documents into labeled
sets

• http://localhost:8983/solr/clustering?q=*:*&rows=10

ClusteringComponent

• Requires additional steps to install (see
documentation) with Apache Solr distro


Terms

• Enumerates terms from specified fields
• http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=vi


Term Vectors

• Details term vector information: term
frequency, document frequency, position
and offset information
• http://localhost:8983/solr/select/?q=*%3A*&qt=tvrh&tv=true&tv.all=true


stats.jsp
• Not technically a “request handler”, outputs
only XML
• http://localhost:8983/solr/admin/stats.jsp
• Index stats such as number of documents,
searcher open time
• Request handler details, number of
requests and errors, average request time,


Replication
• Master is polled
• Replicant pulls Lucene index and optionally
also Solr configuration files
• Query throughput scaling: replicate and
load balance
• http://wiki.apache.org/solr/SolrReplication


Distributed Search

• Distribute documents to same-schema
shards
• Scaling for when single index becomes too
large, or a single query becomes too slow
DistributedSearch
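One common way to spread documents across same-schema shards is to hash the unique key, sketched below. This assumes the indexing client chooses the shard itself; the hashing scheme, class name, and host names are illustrative assumptions, not something the slides prescribe. At query time, the shard addresses are passed as a comma-separated shards request parameter.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical client-side shard selection by hashing the document id.
final class ShardRouterSketch {
    // Map an id to a shard index in [0, numShards); floorMod avoids negatives.
    static int shardFor(String id, int numShards) {
        return Math.floorMod(id.hashCode(), numShards);
    }

    // Join shard addresses into a value for the shards query parameter.
    static String shardsParam(List<String> shardHosts) {
        return String.join(",", shardHosts);
    }

    public static void main(String[] args) {
        List<String> shards = Arrays.asList("localhost:8983/solr", "localhost:7574/solr");
        int target = shardFor("MA147LL/A", shards.size());
        System.out.println("index MA147LL/A on " + shards.get(target));
        System.out.println("query with shards=" + shardsParam(shards));
    }
}
```

Because the same id always hashes to the same shard, re-adding a document replaces its earlier copy instead of creating a duplicate on another shard. Changing the number of shards changes the mapping, so that typically means reindexing.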


What’s new in Solr 1.4?
• Java-based replication
• VelocityResponseWriter (Solritas)
• AJAX-Solr
• Logging switched to SLF4J
• Rollback, since last commit
• StatsComponent
• TermVectorComponent
• Configurable Directory provider


Lucene 2.9
• IndexReader#reopen()

• Faster filter performance, by 300% in some cases

• Per-segment FieldCache

• Reusable token streams

• Faster numeric/date range queries, thanks to trie

• and tons more, see Lucene 2.9's CHANGES.txt


Performance
Improvements
• Caching
• Concurrent file access
• Per-segment index updates
• Faceting
• DocSet generation, avoids scoring
• Streaming updates for SolrJ

Feature Improvements
• Rich document indexing
• DataImportHandler enhancements
• Smoother replication
• More choices for logging
• Multi-select faceting
• Speedier range queries
• Duplicate detection
• New request handler components


Resources
• http://wiki.apache.org/solr

• solr-user@lucene.apache.org

• Lucid Imagination

• http://www.lucidimagination.com

• Articles, webinars, blogs, and...

• Search the Lucene ecosystem at:
http://search.lucidimagination.com

• support@lucidimagination.com


e-book now available!
print coming soon
http://www.manning.com/lucene


LucidWorks for Solr
• Certified Distribution

• Value-added integration

• KStemmer

• Carrot2 clustering

• LucidGaze for Solr

• installer

• Reference Manual

• Solr 1.4 certified distro coming soon!


LucidGaze for Solr

• Monitoring tool, captures, stores, and
interactively views Solr performance
metrics
• requests/second
• time/request


LucidFind

http://search.lucidimagination.com/?q=novajug
