Solr: Search at the Speed of Light
- 1. Solr
Search at the Speed of Light
JavaZone 2009
September 10
Oslo
Erik Hatcher, Lucid Imagination
erik.hatcher@lucidimagination.com
- 2. Solr History
• Created by Yonik Seeley for CNET
• Contributed to Apache in January 2006
• December 2006: Version 1.1 released
• June 2007: Version 1.2 released
• September 2008: Version 1.3 released
• ~September 2009: Version 1.4
http://lucene.apache.org/solr
© 2008-2009 Lucid Imagination, Inc.
- 3. Solr: Big Picture
[Diagram labels: Data; DB; Documents; Solr; Search Results]
- 4. Features
• Lucene power exposed over HTTP
• Scalability: caching, replication, distributed search
• Faceting
• And more: spell checking, highlighting, clustering, rich document and DB indexing, "more like this"
- 5. Lucene
• Fast, scalable search library
• Lucene index structure
• Index contains documents
• documents have fields
• indexed fields have terms
- 6. Inverted Index
• Commonly used search engine data structure
• Efficient lookup of terms across large number of documents
• Usually stores positional information to enable phrase/proximity queries
Figure: from "Taming Text" by Grant Ingersoll and Tom Morton
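To make the structure concrete, here is a minimal sketch of a positional inverted index in plain Java. It is a toy, not Lucene's implementation: the class name `InvertedIndexSketch` and its methods are made up for illustration, and tokenization is just lowercasing plus whitespace splitting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index: term -> (document id -> positions of the term). */
public class InvertedIndexSketch {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Lowercases, splits on whitespace, and records the position of every term. */
    public void add(int docId, String text) {
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Ids of the documents that contain the term, in ascending order. */
    public List<Integer> docsFor(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term.toLowerCase());
        return docs == null ? new ArrayList<>() : new ArrayList<>(docs.keySet());
    }

    /** True if 'first' is immediately followed by 'second' in the given document. */
    public boolean hasPhrase(int docId, String first, String second) {
        List<Integer> secondPositions = positions(docId, second);
        for (int p : positions(docId, first)) {
            if (secondPositions.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    private List<Integer> positions(int docId, String term) {
        Map<Integer, List<Integer>> docs = postings.get(term.toLowerCase());
        if (docs == null || !docs.containsKey(docId)) {
            return new ArrayList<>();
        }
        return docs.get(docId);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "the quick brown fox");
        index.add(2, "the lazy brown dog");
        System.out.println(index.docsFor("brown"));
        System.out.println(index.hasPhrase(1, "quick", "brown"));
    }
}
```

The stored positions are what make the phrase check possible: a term lookup alone only says which documents match, not where.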
- 8. Analyzing the analyzer
Example phrase
The quick brown fox jumps over the lazy dog.
- 9. WhitespaceAnalyzer
Simplest built-in analyzer
The quick brown fox jumps over the lazy dog.
[The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
- 10. SimpleAnalyzer
Lowercases, splits at non-letter boundaries
The quick brown fox jumps over the lazy dog.
[the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
- 11. StopAnalyzer
Lowercases and removes stop words
The quick brown fox jumps over the lazy dog.
[quick] [brown] [fox] [jumps] [over] [lazy] [dog]
- 12. SnowballAnalyzer
Stemming algorithm
The quick brown fox jumps over the lazy dog.
[the] [quick] [brown] [fox] [jump] [over] [the] [lazi] [dog]
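The first three analyzers can be imitated in a few lines of plain Java. This is only a sketch of the behavior shown on the slides, not the Lucene classes themselves: `AnalyzerSketch` and its method names are hypothetical, and the stop list is a small illustrative set rather than Lucene's default. Stemming is left out because it needs a real stemmer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java imitation of whitespace, simple, and stop-word analysis. */
public class AnalyzerSketch {
    // Illustrative stop list only; an assumption, not Lucene's built-in set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    /** Splits on whitespace only; case and punctuation are kept. */
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** Lowercases and splits at every non-letter character. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Same as simple(), then drops tokens found in the stop list. */
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>(simple(text));
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String phrase = "The quick brown fox jumps over the lazy dog.";
        System.out.println(whitespace(phrase)); // [The, quick, brown, fox, jumps, over, the, lazy, dog.]
        System.out.println(simple(phrase));     // [the, quick, brown, fox, jumps, over, the, lazy, dog]
        System.out.println(stop(phrase));       // [quick, brown, fox, jumps, over, lazy, dog]
    }
}
```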
- 13. What's in a token?
- 14. Relevance
• Term frequency (TF): number of times a term appears in a document
• Inverse document frequency (IDF): one over the number of documents in the index that contain the term (1/df)
• Field length normalization: controls the effect that field length, in number of terms, has on score
• Boost factors: terms, fields, or documents
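As a rough illustration of how these factors combine, the sketch below multiplies them for a single term in a single field. The individual formulas follow the shape of Lucene's classic default similarity (square-root TF, logarithmic IDF, inverse-square-root length norm), but the combination is simplified: Lucene's real score includes further factors such as query normalization and coordination, and `ScoringSketch` is a made-up name.

```java
/** Simplified single-term relevance score built from TF, IDF, length norm, and boost. */
public class ScoringSketch {
    /** Dampened term frequency: more occurrences help, with diminishing returns. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Rare terms (low document frequency) weigh more than common ones. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Matches in short fields count more than matches in long fields. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Product of the factors above; a sketch, not Lucene's full scoring formula. */
    static double score(int freq, int docFreq, int numDocs, int numTerms, double boost) {
        return tf(freq) * idf(docFreq, numDocs) * lengthNorm(numTerms) * boost;
    }

    public static void main(String[] args) {
        // Same term frequency and field length; only the rarity of the term differs.
        System.out.println("rare term:   " + score(1, 1, 1000, 10, 1.0));
        System.out.println("common term: " + score(1, 500, 1000, 10, 1.0));
    }
}
```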
- 16. Solr APIs
• HTTP GET/POST (curl or any other HTTP client)
• JSON
• SolrJ (embedded or HTTP)
• solr-ruby
• python, PHP, solrsharp, XSLT
- 17. Solr in Production
[Diagram labels, top to bottom: Incoming Search Requests; Load Balancer; Solr; Solr Master; Shard Request; Load Balancer (one per shard); Shard Master 1..n; Replicant shards]
- 18. Getting Started: It's This Easy
1. Start Solr
java -jar start.jar
2. Index your data
java -jar post.jar *.xml
3. Search
http://localhost:8983/solr
- 19. Configuration
• schema.xml
• field types and fields
• solrconfig.xml
• request handler mappings
• cache settings: filter, query, document
• warming listeners
• HTTP cache settings
• Lucene index parameters
• plugins: spell checking, highlighting
- 20. Solr add/update XML
<add><doc>
<field name="id">MA147LL/A</field>
<field name="name">Apple 60 GB iPod with Video Playback Black</field>
<field name="manu">Apple Computer Inc.</field>
<field name="cat">electronics</field>
<field name="cat">music</field>
<field name="features">iTunes, Podcasts, Audiobooks</field>
<field name="features">Stores up to 15,000 songs, 25,000 photos, or 150 hours of
video</field>
<field name="features">2.5-inch, 320x240 color TFT LCD display
with LED backlight</field>
<field name="features">Up to 20 hours of battery life</field>
<field name="features">Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless,
H.264 video</field>
<field name="features">Notes, Calendar, Phone book, Hold button, Date display,
Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware,
USB 2.0 compatibility, Playback speed control, Rechargeable capability,
Battery level indication</field>
<field name="includes">earbud headphones, USB cable</field>
<field name="weight">5.5</field>
<field name="price">399.00</field>
<field name="popularity">10</field>
<field name="inStock">true</field>
</doc></add>
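When the add/update message is generated by a program rather than written by hand, field values need XML escaping. The following is a minimal sketch under that assumption; `AddXmlSketch` and `toAddXml` are hypothetical names, and a real client would more likely use SolrJ or an XML library than string concatenation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Builds a Solr <add><doc> message from field names and (possibly repeated) values. */
public class AddXmlSketch {
    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** One <field> element per value, so multi-valued fields simply repeat the name. */
    static String toAddXml(Map<String, List<String>> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            for (String value : field.getValue()) {
                xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
                   .append(escape(value))
                   .append("</field>");
            }
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("id", Arrays.asList("MA147LL/A"));
        doc.put("cat", Arrays.asList("electronics", "music"));
        System.out.println(toAddXml(doc));
    }
}
```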
- 21. Indexing Solr XML
• Via curl:
curl 'http://localhost:8983/solr/update?commit=true' --data-binary @ipod_video.xml -H 'Content-type:text/xml; charset=utf-8'
• Via Solr's Java-based post tool:
java -jar post.jar ipod_video.xml
- 23. Content Streams
• Allows Solr server to fetch local or remote data
itself. Must enable remote streaming in
solrconfig.xml
• http://localhost:8983/solr/update?stream.file=<local
Solr path to exampledocs>/ipod_video.xml
• &stream.url=<url to content>
• Security warning: allows Solr to fetch arbitrary
server-side file or network URL content
- 24. Indexing Rich Documents
curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true&extractOnly=true&wt=ruby&indent=on' -F "myfile=@tutorial.html"
- 25. Indexing with SolrJ
SolrServer solr =
new CommonsHttpSolrServer(new URL("http://localhost:8983/solr"));
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "JAVAZONE_09");
doc.addField("title", "JavaZone 2009 SolrJ Example");
solr.add(doc);
solr.commit(); // after a batch, not per document
solr.optimize(); // periodically, when needed
- 26. Indexing with Ruby
require 'solr' # solr-ruby gem
solr = Solr::Connection.new(
'http://localhost:8983/solr',
:autocommit => :on)
solr.add(:id => 123,
:title => 'Solr in Action')
solr.optimize # periodically, as needed
- 27. Data Import Handler
• Indexes relational database, XML data sources,
e-mail, and more
• Supports full and incremental/delta indexing
• Extensible with custom data sources,
transformers, etc
• http://wiki.apache.org/solr/DataImportHandler
- 29. Example Search Request
• http://localhost:8983/solr/select?q=query
• &start=50
• &rows=25
• &fq=filter+query
• &facet=on&facet.field=category
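A request like the one above is usually assembled from a parameter map, with each value URL-encoded. Here is a small sketch using only the JDK; `SearchUrlSketch` and `build` are hypothetical names, and the map allows one value per parameter, so repeated parameters such as several `fq` filters are not covered.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Assembles a Solr select URL from a map of request parameters. */
public class SearchUrlSketch {
    static String build(String base, Map<String, String> params) {
        StringBuilder url = new StringBuilder(base);
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator)
               .append(encode(param.getKey()))
               .append('=')
               .append(encode(param.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    /** Form-style URL encoding: spaces become '+', reserved characters become %XX. */
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "query");
        params.put("start", "50");
        params.put("rows", "25");
        params.put("fq", "filter query");
        params.put("facet", "on");
        params.put("facet.field", "category");
        System.out.println(build("http://localhost:8983/solr/select", params));
    }
}
```

Here `start=50` with `rows=25` asks for the third page of 25 results, and `fq` restricts the result set without affecting scoring.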
- 30. Debug Query
• &debugQuery=true is your friend
• Includes parsed query, explanations, and
search component timings in response
- 31. Query Parser
• Controlled by defType parameter
• &defType=lucene (actually a Solr
extension of Lucene’s QueryParser)
• &defType=dismax
• Local {!..} override syntax
- 32. Solr Query Parser
• http://lucene.apache.org/java/2_4_0/queryparsersyntax.html + Solr extensions
• Kitchen sink parser, includes advanced, user-unfriendly syntax
• Syntax errors throw parse exceptions back to client
• Example: title:ipod* AND price:[0 TO 100]
- 33. Dismax Query Parser
• Simplified syntax: loose text “quote phrases” -prohibited +required
• Spreads query terms across query fields (qf) with dynamic boosting per field, implicit phrase construction (pf), boosting function (bf), boosting query (bq), and minimum match (mm)
- 34. Searching with SolrJ
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrQuery params = new SolrQuery("author:John");
params.setFields("*,score");
params.setRows(3);
QueryResponse response = server.query(params);
for (SolrDocument document : response.getResults()) {
System.out.println("Doc: " + document);
}
- 35. Searching with Ruby
require 'solr' # solr-ruby gem
conn = Solr::Connection.new(
'http://localhost:8983/solr')
conn.query('my query') do |hit|
puts hit.inspect
end
- 36. delete, update, etc
• Delete:
• <delete><id>05991</id></delete>
• <delete>
<query>category:Unused</query>
</delete>
• java -Ddata=args -jar post.jar
"<delete><query>*:*</query></delete>"
• Update: simply <add> doc with same unique key
• Commit: <commit/>
• Optimize: <optimize/>
- 37. Faceting
• Counts per subset within results
• Facet on: field terms, queries, date ranges
• &facet=on
&facet.field=cat
&facet.query=price:[0 TO 100]
• http://wiki.apache.org/solr/SimpleFacetParameters
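Conceptually, field faceting is a count of field values over the documents that matched the query. The sketch below does this in memory for a list of result documents. It only illustrates the idea: Solr computes facets from the index rather than from materialized documents, and `FacetSketch` and its document representation (a map of field name to values) are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Counts how many result documents carry each value of a given field. */
public class FacetSketch {
    static Map<String, Integer> facetCounts(List<Map<String, List<String>>> results, String field) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by value, not by count
        for (Map<String, List<String>> doc : results) {
            List<String> values = doc.get(field);
            if (values == null) {
                continue; // document has no value for this field
            }
            for (String value : values) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, List<String>>> results = Arrays.asList(
                Collections.singletonMap("cat", Arrays.asList("electronics", "music")),
                Collections.singletonMap("cat", Arrays.asList("electronics")),
                Collections.singletonMap("manu", Arrays.asList("Apple Computer Inc.")));
        System.out.println(facetCounts(results, "cat")); // {electronics=2, music=1}
    }
}
```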
- 38. Spell checking
• Not enabled by default, see example config to wire it in
• http://localhost:8983/solr/spell?q=epod&spellcheck=on&spellcheck.build=true
• File or index-based dictionaries
• Supports pluggable distance algorithms: Levenshtein and JaroWinkler
• http://wiki.apache.org/solr/SpellCheckComponent
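For intuition about the distance measures, here is a textbook Levenshtein edit distance with a naive "closest word" lookup. It is a sketch of the general technique, not the spell checker's implementation: the class and method names are invented, and a real spell checker would narrow down candidates before comparing distances instead of scanning a whole dictionary.

```java
import java.util.Arrays;
import java.util.List;

/** Levenshtein edit distance and a brute-force nearest-word suggestion. */
public class EditDistanceSketch {
    /** Minimum number of single-character inserts, deletes, or substitutions. */
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning "" into the first j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning the first i characters of a into ""
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + substitution);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /** Returns the dictionary word with the smallest distance (first one wins ties). */
    static String closest(String word, List<String> dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : dictionary) {
            int d = distance(word, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(distance("epod", "ipod"));
        System.out.println(closest("epod", Arrays.asList("video", "ipod", "apple")));
    }
}
```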
- 40. More Like This
• http://localhost:8983/solr/select?q=ipod&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score,name
• http://wiki.apache.org/solr/MoreLikeThis
- 41. Scaling: Query Throughput
• Replication
• slaves poll master for index updates
• transfers index files from master to slave
• configuration files can also be transferred
• entirely Java/HTTP-based in Solr 1.4
(prior versions used rsync)
- 42. Scaling: Collection Size
• Distribution
• Index documents across shards
• query single server with shards parameter
• sends requests to each shard
• aggregates results into a single response
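The aggregation step can be pictured as merging each shard's ranked hits into one list ordered by score and cut to the requested number of rows. The sketch below shows only that merge. `ShardMergeSketch` and `Hit` are hypothetical names, and it assumes scores from different shards are directly comparable, which is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Merges per-shard hit lists into a single top-N list ordered by descending score. */
public class ShardMergeSketch {
    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    static List<Hit> merge(List<List<Hit>> shardResults, int rows) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shard : shardResults) {
            all.addAll(shard);
        }
        // Highest score first; the sort is stable, so ties keep shard order.
        all.sort((a, b) -> Double.compare(b.score, a.score));
        return new ArrayList<>(all.subList(0, Math.min(rows, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> shard1 = Arrays.asList(new Hit("a", 0.9), new Hit("c", 0.4));
        List<Hit> shard2 = Arrays.asList(new Hit("b", 0.7));
        for (Hit hit : merge(Arrays.asList(shard1, shard2), 2)) {
            System.out.println(hit.id + " " + hit.score);
        }
    }
}
```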
- 43. Solr-powered UI
• Solritas (from "celeritas"):
VelocityResponseWriter
• easily templated output
• SolrJS: jQuery-based widgets
• see http://solrjs.solrstuff.org/
• Blacklight and Flare: RoR plugins
- 44. Lucene in Action, 2nd Edition
http://www.manning.com/lucene
- 46. Who We Are
• Lucid Imagination is the commercial entity of the Apache Lucene/Solr ecosystem
• Lucid team includes the largest collection of Lucene/Solr committers, contributors and influencers
• Our mission is to serve as The Center of Excellence for Lucene/Solr-based search solutions
• Help our customers get the most out of Lucene/Solr: most widely used open source search software
Lucene, Solr and their logos are trademarks of the Apache Software Foundation
- 47. Lucid Imagination Technical Team
Unique combination of enterprise search and Lucene expertise
• Yonik Seeley: Creator of Solr; Lucene/Solr committer, PMC member
• Grant Ingersoll: Creator of “Lucene Boot Camp”; Lucene/Solr committer, Chair, PMC
• Erik Hatcher: Co-author of “Lucene in Action” book; Lucene/Solr committer, PMC member
• Mark Miller: Lucene/Solr committer, PMC member
• Sami Siren: Nutch/Tika committer, PMC member
• Andrzej Bialecki: Lucene/Nutch/Hadoop committer, PMC member
• Doug Cutting (Advisor): Creator of Lucene, Nutch & Hadoop; Member, Apache Software Foundation
• Marc Krellenstein: Co-founder, CTO/SVP, Northern Light; VP Search, CTO, Elsevier
• Brian Pinkerton: Developed WebCrawler, the web's first comprehensive search engine; Principal Architect at A9
• Simon Rosenthal: Solutions architect, Northern Light
• Jay Hill: Solutions Architect, Wells Fargo
• Ryan McKinley: Lucene/Solr committer, PMC member
• Chris Hostetter (Advisor): Lucene/Solr committer, PMC member
- 48. Lucid Imagination Business Model
[Diagram labels: Free Download; Apache.org; Lucene; Upgrade; Certified Distributions; Enterprise Version; Commercial Support, Training, Consulting]
- 49. Thank you
http://www.lucidimagination.com