Solr Powered Lucene
             Erik Hatcher
           Lucid Imagination
  erik.hatcher@lucidimagination.com



  Northe...
Erik Hatcher
• Member of Technical Staff, Lucid
  Imagination
• Apache Lucene/Solr Committer
• Member, Apache Software Fou...
A word from our
           sponsor...
•   Pizza!

•   Lucid Imagination

    •   commercial entity exclusively dedicated t...
What is Solr?
• Search server
• Built upon Apache Lucene (Java)
• Fast, very
• Scalable, query load and collection size
• ...
5
Solr Example
•   Start Solr

    •   java -jar start.jar (Apache Solr distro)

    •   lucidworks start (Lucid certified di...
Solr History
• Created by Yonik Seeley for CNET
• Contributed to Apache in January 2006
• December 2006:Version 1.
• June ...
Features
•   Lucene power exposed over HTTP

    •   Spell checking, highlighting,
        more-like-this

•   Caching

• ...
Solr APIs
• HTTP GET/POST (curl or any other HTTP
  client)
• JSON
• SolrJ (embedded or HTTP)
• Ruby: solr-ruby, RSolr, et...
Deployment
                   Architecture
•   Scales from:

    •   single Solr server

    •   master/replicants(slaves)...
Lucene Fundamentals




                      11
Concepts

• Index
• Document
• Field
• Terms (aka Tokens)

                       12
Inverted Index
                                From "Taming Text" by Grant Ingersoll and Tom Morton
•   Commonly used sear...
14
What's in a token?




                     15
Lucene Scoring
        d1




    Θ        q1




                  16
Relevance
•   Term frequency (TF): number of times a term
    appears in a document

•   Inverse document frequency (IDF):...
Solr Core

• single primary index
• schema.xml / solrconfig.xml
• (optionally) multiple cores per Solr
  instance, configure...
schema.xml
•   Field types
•   Fields
•   Unique key (optional*)
    * I've never seen a case that didn't require a
    un...
Schema Analysis
•   http://localhost:8983/solr/admin/analysis.jsp

•   Document analysis request handler:
    curl http://...
solrconfig.xml
• Lucene indexing parameters
• Cache settings
• Request handler configuration
• HTTP cache settings
• Search ...
Request handlers
• mini-“servlets”
• SearchHandler extensions chain search
  components
• Flexible response formatting:
 •...
Solr XML
<add><doc>
  <field name="id">MA147LL/A</field>
  <field name="name">Apple 60 GB iPod with Video Playback Black</...
Indexing Solr XML

• Via curl:
  curl 'http://localhost:8983/solr/update?
  commit=true' --data-binary
  @ipod_video.xml -...
Indexing CSV

curl 'http://localhost:8983/solr/update/csv?
commit=true' --data-binary @books.csv -H 'Content-
type:text/pl...
Content Streams
•   Allows Solr server to fetch local or remote data
    itself. Must enable remote streaming in
    solrc...
Indexing with SolrJ
SolrServer solr =
    new CommonsHttpSolrServer(
        new URL("http://localhost:8983/solr"));

Solr...
Indexing with solr-ruby
solr = Connection.new(
  'http://localhost:8983/solr',
  :autocommit => :on

solr.add(:id => 123,
...
delete, update, etc
          •   Delete:

              •   <delete><id>05991</id></delete>

              •   <delete>
 ...
Data Import Handler

•   Indexes relational database, XML data, and e-
    mail sources

•   Supports full and incremental...
DIH details
• Put JDBC driver JAR in <solr-home>/lib,
  configure dataimport request handler
• http://localhost:8983/solr/d...
Solr Cell
• aka ExtractingRequestHandler
• leveraging Tika, extracts and indexes rich
    documents such as Word, PDF, HTM...
Standard Search
          Request


• http://localhost:8983/solr/select?q=query




                                      ...
Debug Query


• &debugQuery=true is your friend
• Includes parsed query, explanations, and
  search component timings in r...
Searching
• Send GET HTTP requests
  •   http://localhost:8983/solr/select?
      q=solr&start=0&rows=10&fl=id,name

• sta...
Query Parser

• Controlled by defType parameter
 • &defType=lucene (actually a Solr
    extension of Lucene’s QueryParser)...
Solr Query Parser
•   http://lucene.apache.org/java/2_9_1/
    queryparsersyntax.html+ Solr extensions

•   Kitchen sink p...
Dismax Query Parser
• Simplified syntax:
  loose text “quote phrases” -prohibited
  +required
• Spreads query terms across ...
Searching with SolrJ
SolrServer server = new
      CommonsHttpSolrServer("http://localhost:8983/solr");

SolrQuery params ...
Searching with Ruby
conn = Connection.new(
    'http://localhost:8983/solr')

conn.query('my query') do |hit|
  puts hit.i...
Built-in search
          components
• Standard: query, facet, mlt, highlight,
             stats, debug
• Others: elevati...
Faceting
•   Counts per subset within results

•   Facet on: field terms, queries,
    date ranges

•   &facet=on
    &face...
Spell checking
• http://localhost:8983/solr/spell?
  q=epod&spellcheck=on&spellcheck.build
  =true
• File or index-based d...
Highlighting

• http://localhost:8983/solr/select?
  q=apple&hl=on&hl.fl=*
• http://wiki.apache.org/solr/
  HighlightingPar...
More Like This

• http://localhost:8983/solr/select?
  q=ipod&mlt=true&mlt.fl=manu,cat&mlt.min
  df=1&mlt.mintf=1&fl=id,scor...
Query Elevation
• http://localhost:8983/solr/elevate?
  q=ipod&debugQuery=true&enableElevation
  =true
• Configure an “elev...
Clustering
•   Dynamic grouping of documents into labeled
    sets

•   http://localhost:8983/solr/clustering?
    q=*:*&r...
Terms

• Enumerates terms from specified fields
• http://localhost:8983/solr/terms?
  terms.fl=name&terms.sort=index&terms.pr...
Term Vectors

• Details term vector information: term
  frequency, document frequency, position
  and offset information
•...
stats.jsp
• Not technically a “request handler”, outputs
  only XML
• http://localhost:8983/solr/admin/stats.jsp
• Index s...
Replication
• Master is polled
• Replicant pulls Lucene index and optionally
  also Solr configuration files
• Query through...
Distributed Search

• Distribute documents to same-schema
  shards
• Scaling for when single index becomes too
  large, or...
What’s new in Solr 1.4?
•   Java-based replication   •   StatsComponent

•   VelocityResponseWriter   •   TermVectorCompon...
Lucene 2.9
•   IndexReader#reopen()

•   Faster filter performance, by 300% in some cases

•   Per-segment FieldCache

•   ...
Performance
       Improvements
• Caching
• Concurrent file access
• Per-segment index updates
• Faceting
• DocSet generati...
Feature Improvements
• Rich document          • Multi-select faceting
  indexing
                         • Speedier range...
Resources
•   http://wiki.apache.org/solr

•   solr-user@lucene.apache.org



•   Lucid Imagination

    •   http://www.lu...
Shout Out




            58
e-book now available!
        print coming soon
http://www.manning.com/lucene

                                59
LucidWorks for Solr
•   Certified Distribution

•   Value-added integration

    •   KStemmer

    •   Carrot2 clustering

...
LucidGaze for Solr

• Monitoring tool, captures, stores, and
  interactively views Solr performance
  metrics
• requests/s...
62
LucidFind




http://search.lucidimagination.com/?q=novajug


                                                63
64
Upcoming SlideShare
Loading in...5
×

Solr Powered Lucene

12,507

Published on

"Solr Powered Lucene", presented to the Northern VA Java Users Group on December 16, 2009

Published in: Technology
0 Comments
13 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
12,507
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
484
Comments
0
Likes
13
Embeds 0
No embeds

No notes for slide

Solr Powered Lucene

  1. 1. Solr Powered Lucene Erik Hatcher Lucid Imagination erik.hatcher@lucidimagination.com Northern Virginia Java Users Group December 16, 2009 1
  2. 2. Erik Hatcher • Member of Technical Staff, Lucid Imagination • Apache Lucene/Solr Committer • Member, Apache Software Foundation • Co-author, Lucene in Action and Java Development with Ant (Manning) 2
  3. 3. A word from our sponsor... • Pizza! • Lucid Imagination • commercial entity exclusively dedicated to Apache Lucene/ Solr open source search technology • Services: Technical Support, Expert Link, Training, Consulting • Tools: LucidGaze for Lucene and Solr • Free certified distributions of Solr and Lucene • more to come... 3
  4. 4. What is Solr? • Search server • Built upon Apache Lucene (Java) • Fast, very • Scalable, query load and collection size • Interoperable • Extensible 4
  5. 5. 5
  6. 6. Solr Example • Start Solr • java -jar start.jar (Apache Solr distro) • lucidworks start (Lucid certified distro) • java -jar post.jar *.xml • HTML view (via Solritas) • easily enabled in Apache distro • built-in to Lucid certified distro 6
  7. 7. Solr History • Created by Yonik Seeley for CNET • Contributed to Apache in January 2006 • December 2006:Version 1. • June 2007:Version 1.2 • September 2008:Version 1.3 • November 2009:Version 1.4 7
  8. 8. Features • Lucene power exposed over HTTP • Spell checking, highlighting, more-like-this • Caching • Replication • Faceting • Distributed search 8
  9. 9. Solr APIs • HTTP GET/POST (curl or any other HTTP client) • JSON • SolrJ (embedded or HTTP) • Ruby: solr-ruby, RSolr, etc • Many others: python, PHP, solrsharp, XSLT 9
  10. 10. Deployment Architecture • Scales from: • single Solr server • master/replicants(slaves) • distributed shards • Each Solr instance can also have multiple cores 10
  11. 11. Lucene Fundamentals 11
  12. 12. Concepts • Index • Document • Field • Terms (aka Tokens) 12
  13. 13. Inverted Index From "Taming Text" by Grant Ingersoll and Tom Morton • Commonly used search engine data structure • Efficient lookup of terms across large number of documents • Usually stores positional information to enable phrase/proximity queries 13
  14. 14. 14
  15. 15. What's in a token? 15
  16. 16. Lucene Scoring d1 Θ q1 16
  17. 17. Relevance • Term frequency (TF): number of times a term appears in a document • Inverse document frequency (IDF): One over number of times term appears in the index (1/ df) • Field length normalization: control affect field length, in number of terms, has on score • Boost factors: terms, fields, or documents 17
  18. 18. Solr Core • single primary index • schema.xml / solrconfig.xml • (optionally) multiple cores per Solr instance, configured in solr.xml • other configuration and data files 18
  19. 19. schema.xml • Field types • Fields • Unique key (optional*) * I've never seen a case that didn't require a unique identifier per document • copy fields • similarity and Solr query parser configuration 19
  20. 20. Schema Analysis • http://localhost:8983/solr/admin/analysis.jsp • Document analysis request handler: curl http://localhost:8983/solr/analysis/ document --data-binary @ipod_video.xml -H 'Content-type:text/xml; charset=utf-8' • Field analysis request handler: http://localhost:8983/solr/analysis/field? analysis.fieldtype=text&analysis.fieldvalu e=Foo%20Bar&q=foo&analysis.showmatch=true 20
  21. 21. solrconfig.xml • Lucene indexing parameters • Cache settings • Request handler configuration • HTTP cache settings • Search components, response writers, query parsers 21
  22. 22. Request handlers • mini-“servlets” • SearchHandler extensions chain search components • Flexible response formatting: • &wt=[json, ruby, xslt, php, phps, javabin, python,velocity] 22
  23. 23. Solr XML <add><doc> <field name="id">MA147LL/A</field> <field name="name">Apple 60 GB iPod with Video Playback Black</field> <field name="manu">Apple Computer Inc.</field> <field name="cat">electronics</field> <field name="cat">music</field> <field name="features">iTunes, Podcasts, Audiobooks</field> <field name="features">Stores up to 15,000 songs, 25,000 photos, or 150 hours of video</field> <field name="features">2.5-inch, 320x240 color TFT LCD display with LED backlight</field> <field name="features">Up to 20 hours of battery life</field> <field name="features">Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video</field> <field name="features">Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication</field> <field name="includes">earbud headphones, USB cable</field> <field name="weight">5.5</field> <field name="price">399.00</field> <field name="popularity">10</field> <field name="inStock">true</field> </doc></add> 23
  24. 24. Indexing Solr XML • Via curl: curl 'http://localhost:8983/solr/update? commit=true' --data-binary @ipod_video.xml -H 'Content-type:text/ xml; charset=utf-8' • Via Solr's Java-based post tool: java -jar post.jar ipod_video.xml 24
  25. 25. Indexing CSV curl 'http://localhost:8983/solr/update/csv? commit=true' --data-binary @books.csv -H 'Content- type:text/plain; charset=utf-8' 25
  26. 26. Content Streams • Allows Solr server to fetch local or remote data itself. Must enable remote streaming in solrconfig.xml • http://localhost:8983/solr/update • ?stream.file=<local Solr path to exampledocs>/ipod_video.xml • ?stream.url=<url to content> • Security warning: allows Solr to fetch arbitrary server-side file or network URL content 26
  27. 27. Indexing with SolrJ SolrServer solr = new CommonsHttpSolrServer( new URL("http://localhost:8983/solr")); SolrInputDocument doc = new SolrInputDocument(); doc.addField("id", "EXAMPLEDOC01"); doc.addField("title", "NOVAJUG SolrJ Example"); solr.add(doc); solr.commit(); // after a batch, not per document solr.optimize(); // periodically, if/when needed 27
  28. 28. Indexing with solr-ruby solr = Connection.new( 'http://localhost:8983/solr', :autocommit => :on solr.add(:id => 123, :title => 'Solr in Action') solr.optimize # periodically, as needed 28
  29. 29. delete, update, etc • Delete: • <delete><id>05991</id></delete> • <delete> <query>category:Unused</query> </delete> • java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>" • Update: simply <add> doc with same unique key • <commit/> pending documents • <optimize/> index, squeezes out deleted documents, collapses segments • <rollback/> to last commit point Update commands via GET: http://localhost:8983/solr/update?stream.body=<commit/> 29
  30. 30. Data Import Handler • Indexes relational database, XML data, and e- mail sources • Supports full and incremental/delta indexing • Highly extensible with custom data sources, transformers, etc • http://wiki.apache.org/solr/DataImportHandler 30
  31. 31. DIH details • Put JDBC driver JAR in <solr-home>/lib, configure dataimport request handler • http://localhost:8983/solr/db/admin/ dataimport.jsp - debugging console • http://localhost:8983/solr/db/dataimport? command=full-import - removes all documents and imports from scratch 31
  32. 32. Solr Cell • aka ExtractingRequestHandler • leveraging Tika, extracts and indexes rich documents such as Word, PDF, HTML, and many other types • curl 'http://localhost:8983/solr/update/ extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html" • http://wiki.apache.org/solr/ ExtractingRequestHandler 32
  33. 33. Standard Search Request • http://localhost:8983/solr/select?q=query 33
  34. 34. Debug Query • &debugQuery=true is your friend • Includes parsed query, explanations, and search component timings in response 34
  35. 35. Searching • Send GET HTTP requests • http://localhost:8983/solr/select? q=solr&start=0&rows=10&fl=id,name • start: zero-based starting result • rows: number of hits to return • fl: list of stored fields to return 35
  36. 36. Query Parser • Controlled by defType parameter • &defType=lucene (actually a Solr extension of Lucene’s QueryParser) • &defType=dismax • Local {!...} override syntax 36
  37. 37. Solr Query Parser • http://lucene.apache.org/java/2_9_1/ queryparsersyntax.html+ Solr extensions • Kitchen sink parser, includes advanced user- unfriendly syntax • Syntax errors throw parse exceptions back to client • Example: title:ipod* AND price:[0 TO 100] • http://wiki.apache.org/solr/SolrQuerySyntax 37
  38. 38. Dismax Query Parser • Simplified syntax: loose text “quote phrases” -prohibited +required • Spreads query terms across query fields (qf) with dynamic boosting per field, phrase construction (pf), and boosting query and function capabilities (bq and bf) 38
  39. 39. Searching with SolrJ SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr"); SolrQuery params = new SolrQuery("author:John"); params.setFields("*,score"); params.setRows(3); QueryResponse response = server.query(params); for (SolrDocument document : response.getResults()) { System.out.println("Doc: " + document); } 39
  40. 40. Searching with Ruby conn = Connection.new( 'http://localhost:8983/solr') conn.query('my query') do |hit| puts hit.inspect end 40
  41. 41. Built-in search components • Standard: query, facet, mlt, highlight, stats, debug • Others: elevation, clustering, term, term vector 41
  42. 42. Faceting • Counts per subset within results • Facet on: field terms, queries, date ranges • &facet=on &facet.field=cat &facet.query=price:[0 TO 100] • http://wiki.apache.org/solr/ SimpleFacetParameters 42
  43. 43. Spell checking • http://localhost:8983/solr/spell? q=epod&spellcheck=on&spellcheck.build =true • File or index-based dictionaries • Supports pluggable distance algorithms: Levenstein and JaroWinkler • http://wiki.apache.org/solr/ 43
  44. 44. Highlighting • http://localhost:8983/solr/select? q=apple&hl=on&hl.fl=* • http://wiki.apache.org/solr/ HighlightingParameters 44
  45. 45. More Like This • http://localhost:8983/solr/select? q=ipod&mlt=true&mlt.fl=manu,cat&mlt.min df=1&mlt.mintf=1&fl=id,score,name • http://wiki.apache.org/solr/MoreLikeThis 45
  46. 46. Query Elevation • http://localhost:8983/solr/elevate? q=ipod&debugQuery=true&enableElevation =true • Configure an “elevate.xml” to boost/ exclude specific documents • http://wiki.apache.org/solr/ QueryElevationComponent 46
  47. 47. Clustering • Dynamic grouping of documents into labeled sets • http://localhost:8983/solr/clustering? q=*:*&rows=10 • http://wiki.apache.org/solr/ ClusteringComponent • Requires additional steps to install (see documentation) with Apache Solr distro 47
  48. 48. Terms • Enumerates terms from specified fields • http://localhost:8983/solr/terms? terms.fl=name&terms.sort=index&terms.pr efix=vi 48
  49. 49. Term Vectors • Details term vector information: term frequency, document frequency, position and offset information • http://localhost:8983/solr/select/?q=* %3A*&qt=tvrh&tv=true&tv.all=true 49
  50. 50. stats.jsp • Not technically a “request handler”, outputs only XML • http://localhost:8983/solr/admin/stats.jsp • Index stats such as number of documents, searcher open time • Request handler details, number of requests and errors, average request time, 50
  51. 51. Replication • Master is polled • Replicant pulls Lucene index and optionally also Solr configuration files • Query throughput scaling: replicate and load balance • http://wiki.apache.org/solr/SolrReplication 51
  52. 52. Distributed Search • Distribute documents to same-schema shards • Scaling for when single index becomes too large, or a single query becomes too slow • http://wiki.apache.org/solr/ DistributedSearch 52
  53. 53. What’s new in Solr 1.4? • Java-based replication • StatsComponent • VelocityResponseWriter • TermVectorComponent (Solritas) • Configurable Directory • AJAX-Solr provider • Logging switched to SLF4J • Rollback, since last commit 53
  54. 54. Lucene 2.9 • IndexReader#reopen() • Faster filter performance, by 300% in some cases • Per-segment FieldCache • Reusable token streams • Faster numeric/date range queries, thanks to trie • and tons more, see Lucene 2.9's CHANGES.txt 54
  55. 55. Performance Improvements • Caching • Concurrent file access • Per-segment index updates • Faceting • DocSet generation, avoids scoring • Streaming updates for SolrJ 55
  56. 56. Feature Improvements • Rich document • Multi-select faceting indexing • Speedier range • DataImportHandler queries enhancements • Duplicate detection • Smoother replication • New request handler • More choices for components logging 56
  57. 57. Resources • http://wiki.apache.org/solr • solr-user@lucene.apache.org • Lucid Imagination • http://www.lucidimagination.com • Articles, webinars, blogs, and... • Search the Lucene ecosystem at: http://search.lucidimagination.com • support@lucidimagination.com 57
  58. 58. Shout Out 58
  59. 59. e-book now available! print coming soon http://www.manning.com/lucene 59
  60. 60. LucidWorks for Solr • Certified Distribution • Value-added integration • KStemmer • Carrot2 clustering • LucidGaze for Solr • installer • Reference Manual • Solr 1.4 certified distro coming soon! 60
  61. 61. LucidGaze for Solr • Monitoring tool, captures, stores, and interactively views Solr performance metrics • requests/second • time/request 61
  62. 62. 62
  63. 63. LucidFind http://search.lucidimagination.com/?q=novajug 63
  64. 64. 64
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×