Scenario Library et REX Discover industry- and role- based scenarios
Solr Powered Lucene
1. Solr Powered Lucene
Erik Hatcher
Lucid Imagination
erik.hatcher@lucidimagination.com
Northern Virginia Java Users Group
December 16, 2009
1
2. Erik Hatcher
• Member of Technical Staff, Lucid
Imagination
• Apache Lucene/Solr Committer
• Member, Apache Software Foundation
• Co-author, Lucene in Action and Java
Development with Ant (Manning)
2
3. A word from our
sponsor...
• Pizza!
• Lucid Imagination
• commercial entity exclusively dedicated to Apache Lucene/
Solr open source search technology
• Services: Technical Support, Expert Link, Training, Consulting
• Tools: LucidGaze for Lucene and Solr
• Free certified distributions of Solr and Lucene
• more to come...
3
4. What is Solr?
• Search server
• Built upon Apache Lucene (Java)
• Fast, very
• Scalable, query load and collection size
• Interoperable
• Extensible
4
7. Solr History
• Created by Yonik Seeley for CNET
• Contributed to Apache in January 2006
• December 2006:Version 1.
• June 2007:Version 1.2
• September 2008:Version 1.3
• November 2009:Version 1.4
7
8. Features
• Lucene power exposed over HTTP
• Spell checking, highlighting,
more-like-this
• Caching
• Replication
• Faceting
• Distributed search
8
9. Solr APIs
• HTTP GET/POST (curl or any other HTTP
client)
• JSON
• SolrJ (embedded or HTTP)
• Ruby: solr-ruby, RSolr, etc
• Many others: python, PHP, solrsharp, XSLT
9
10. Deployment
Architecture
• Scales from:
• single Solr server
• master/replicants(slaves)
• distributed shards
• Each Solr instance can also
have multiple cores
10
13. Inverted Index
From "Taming Text" by Grant Ingersoll and Tom Morton
• Commonly used search
engine data structure
• Efficient lookup of terms
across large number of
documents
• Usually stores positional
information to enable
phrase/proximity queries
13
17. Relevance
• Term frequency (TF): number of times a term
appears in a document
• Inverse document frequency (IDF): One over
number of times term appears in the index (1/
df)
• Field length normalization: control affect field
length, in number of terms, has on score
• Boost factors: terms, fields, or documents
17
18. Solr Core
• single primary index
• schema.xml / solrconfig.xml
• (optionally) multiple cores per Solr
instance, configured in solr.xml
• other configuration and data files
18
19. schema.xml
• Field types
• Fields
• Unique key (optional*)
* I've never seen a case that didn't require a
unique identifier per document
• copy fields
• similarity and Solr query parser configuration
19
26. Content Streams
• Allows Solr server to fetch local or remote data
itself. Must enable remote streaming in
solrconfig.xml
• http://localhost:8983/solr/update
• ?stream.file=<local Solr path to
exampledocs>/ipod_video.xml
• ?stream.url=<url to content>
• Security warning: allows Solr to fetch arbitrary
server-side file or network URL content
26
27. Indexing with SolrJ
SolrServer solr =
new CommonsHttpSolrServer(
new URL("http://localhost:8983/solr"));
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "EXAMPLEDOC01");
doc.addField("title", "NOVAJUG SolrJ Example");
solr.add(doc);
solr.commit(); // after a batch, not per document
solr.optimize(); // periodically, if/when needed
27
28. Indexing with solr-ruby
solr = Connection.new(
'http://localhost:8983/solr',
:autocommit => :on
solr.add(:id => 123,
:title => 'Solr in Action')
solr.optimize # periodically, as needed
28
29. delete, update, etc
• Delete:
• <delete><id>05991</id></delete>
• <delete>
<query>category:Unused</query>
</delete>
• java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
• Update: simply <add> doc with same unique key
• <commit/> pending documents
• <optimize/> index, squeezes out deleted documents, collapses segments
• <rollback/> to last commit point
Update commands via GET: http://localhost:8983/solr/update?stream.body=<commit/>
29
30. Data Import Handler
• Indexes relational database, XML data, and e-
mail sources
• Supports full and incremental/delta indexing
• Highly extensible with custom data sources,
transformers, etc
• http://wiki.apache.org/solr/DataImportHandler
30
31. DIH details
• Put JDBC driver JAR in <solr-home>/lib,
configure dataimport request handler
• http://localhost:8983/solr/db/admin/
dataimport.jsp - debugging console
• http://localhost:8983/solr/db/dataimport?
command=full-import - removes all
documents and imports from scratch
31
32. Solr Cell
• aka ExtractingRequestHandler
• leveraging Tika, extracts and indexes rich
documents such as Word, PDF, HTML, and
many other types
• curl 'http://localhost:8983/solr/update/
extract?literal.id=doc1&commit=true' -F
"myfile=@tutorial.html"
• http://wiki.apache.org/solr/
ExtractingRequestHandler
32
33. Standard Search
Request
• http://localhost:8983/solr/select?q=query
33
34. Debug Query
• &debugQuery=true is your friend
• Includes parsed query, explanations, and
search component timings in response
34
35. Searching
• Send GET HTTP requests
• http://localhost:8983/solr/select?
q=solr&start=0&rows=10&fl=id,name
• start: zero-based starting result
• rows: number of hits to return
• fl: list of stored fields to return
35
36. Query Parser
• Controlled by defType parameter
• &defType=lucene (actually a Solr
extension of Lucene’s QueryParser)
• &defType=dismax
• Local {!...} override syntax
36
37. Solr Query Parser
• http://lucene.apache.org/java/2_9_1/
queryparsersyntax.html+ Solr extensions
• Kitchen sink parser, includes advanced user-
unfriendly syntax
• Syntax errors throw parse exceptions back to
client
• Example: title:ipod* AND price:[0 TO 100]
• http://wiki.apache.org/solr/SolrQuerySyntax
37
38. Dismax Query Parser
• Simplified syntax:
loose text “quote phrases” -prohibited
+required
• Spreads query terms across query fields
(qf) with dynamic boosting per field, phrase
construction (pf), and boosting query and
function capabilities (bq and bf)
38
39. Searching with SolrJ
SolrServer server = new
CommonsHttpSolrServer("http://localhost:8983/solr");
SolrQuery params = new SolrQuery("author:John");
params.setFields("*,score");
params.setRows(3);
QueryResponse response = server.query(params);
for (SolrDocument document : response.getResults()) {
System.out.println("Doc: " + document);
}
39
40. Searching with Ruby
conn = Connection.new(
'http://localhost:8983/solr')
conn.query('my query') do |hit|
puts hit.inspect
end
40
45. More Like This
• http://localhost:8983/solr/select?
q=ipod&mlt=true&mlt.fl=manu,cat&mlt.min
df=1&mlt.mintf=1&fl=id,score,name
• http://wiki.apache.org/solr/MoreLikeThis
45
46. Query Elevation
• http://localhost:8983/solr/elevate?
q=ipod&debugQuery=true&enableElevation
=true
• Configure an “elevate.xml” to boost/
exclude specific documents
• http://wiki.apache.org/solr/
QueryElevationComponent
46
47. Clustering
• Dynamic grouping of documents into labeled
sets
• http://localhost:8983/solr/clustering?
q=*:*&rows=10
• http://wiki.apache.org/solr/
ClusteringComponent
• Requires additional steps to install (see
documentation) with Apache Solr distro
47
49. Term Vectors
• Details term vector information: term
frequency, document frequency, position
and offset information
• http://localhost:8983/solr/select/?q=*
%3A*&qt=tvrh&tv=true&tv.all=true
49
50. stats.jsp
• Not technically a “request handler”, outputs
only XML
• http://localhost:8983/solr/admin/stats.jsp
• Index stats such as number of documents,
searcher open time
• Request handler details, number of
requests and errors, average request time,
50
51. Replication
• Master is polled
• Replicant pulls Lucene index and optionally
also Solr configuration files
• Query throughput scaling: replicate and
load balance
• http://wiki.apache.org/solr/SolrReplication
51
52. Distributed Search
• Distribute documents to same-schema
shards
• Scaling for when single index becomes too
large, or a single query becomes too slow
• http://wiki.apache.org/solr/
DistributedSearch
52
53. What’s new in Solr 1.4?
• Java-based replication • StatsComponent
• VelocityResponseWriter • TermVectorComponent
(Solritas)
• Configurable Directory
• AJAX-Solr provider
• Logging switched to
SLF4J
• Rollback, since last
commit
53
54. Lucene 2.9
• IndexReader#reopen()
• Faster filter performance, by 300% in some cases
• Per-segment FieldCache
• Reusable token streams
• Faster numeric/date range queries, thanks to trie
• and tons more, see Lucene 2.9's CHANGES.txt
54