Solr Powered Lucene
Upcoming SlideShare
Loading in...5
×

Like this? Share it with your network

Share

Solr Powered Lucene

  • 14,200 views
Uploaded on

"Solr Powered Lucene", presented to the Northern VA Java Users Group on December 16, 2009

"Solr Powered Lucene", presented to the Northern VA Java Users Group on December 16, 2009

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
14,200
On Slideshare
13,983
From Embeds
217
Number of Embeds
7

Actions

Shares
Downloads
469
Comments
0
Likes
13

Embeds 217

http://www.lucidimagination.com 111
http://www.slideshare.net 71
http://blog.newitfarmer.com 26
http://searchhub.org 4
http://www.mefeedia.com 2
http://webcache.googleusercontent.com 2
http://lucidsearchhub.stephenz.com 1

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Solr Powered Lucene Erik Hatcher Lucid Imagination erik.hatcher@lucidimagination.com Northern Virginia Java Users Group December 16, 2009 1
  • 2. Erik Hatcher • Member of Technical Staff, Lucid Imagination • Apache Lucene/Solr Committer • Member, Apache Software Foundation • Co-author, Lucene in Action and Java Development with Ant (Manning) 2
  • 3. A word from our sponsor... • Pizza! • Lucid Imagination • commercial entity exclusively dedicated to Apache Lucene/ Solr open source search technology • Services: Technical Support, Expert Link, Training, Consulting • Tools: LucidGaze for Lucene and Solr • Free certified distributions of Solr and Lucene • more to come... 3
  • 4. What is Solr? • Search server • Built upon Apache Lucene (Java) • Fast, very • Scalable, query load and collection size • Interoperable • Extensible 4
  • 5. 5
  • 6. Solr Example • Start Solr • java -jar start.jar (Apache Solr distro) • lucidworks start (Lucid certified distro) • java -jar post.jar *.xml • HTML view (via Solritas) • easily enabled in Apache distro • built-in to Lucid certified distro 6
  • 7. Solr History • Created by Yonik Seeley for CNET • Contributed to Apache in January 2006 • December 2006:Version 1. • June 2007:Version 1.2 • September 2008:Version 1.3 • November 2009:Version 1.4 7
  • 8. Features • Lucene power exposed over HTTP • Spell checking, highlighting, more-like-this • Caching • Replication • Faceting • Distributed search 8
  • 9. Solr APIs • HTTP GET/POST (curl or any other HTTP client) • JSON • SolrJ (embedded or HTTP) • Ruby: solr-ruby, RSolr, etc • Many others: python, PHP, solrsharp, XSLT 9
  • 10. Deployment Architecture • Scales from: • single Solr server • master/replicants(slaves) • distributed shards • Each Solr instance can also have multiple cores 10
  • 11. Lucene Fundamentals 11
  • 12. Concepts • Index • Document • Field • Terms (aka Tokens) 12
  • 13. Inverted Index From "Taming Text" by Grant Ingersoll and Tom Morton • Commonly used search engine data structure • Efficient lookup of terms across large number of documents • Usually stores positional information to enable phrase/proximity queries 13
  • 14. 14
  • 15. What's in a token? 15
  • 16. Lucene Scoring d1 Θ q1 16
  • 17. Relevance • Term frequency (TF): number of times a term appears in a document • Inverse document frequency (IDF): One over number of times term appears in the index (1/ df) • Field length normalization: control affect field length, in number of terms, has on score • Boost factors: terms, fields, or documents 17
  • 18. Solr Core • single primary index • schema.xml / solrconfig.xml • (optionally) multiple cores per Solr instance, configured in solr.xml • other configuration and data files 18
  • 19. schema.xml • Field types • Fields • Unique key (optional*) * I've never seen a case that didn't require a unique identifier per document • copy fields • similarity and Solr query parser configuration 19
  • 20. Schema Analysis • http://localhost:8983/solr/admin/analysis.jsp • Document analysis request handler: curl http://localhost:8983/solr/analysis/ document --data-binary @ipod_video.xml -H 'Content-type:text/xml; charset=utf-8' • Field analysis request handler: http://localhost:8983/solr/analysis/field? analysis.fieldtype=text&analysis.fieldvalu e=Foo%20Bar&q=foo&analysis.showmatch=true 20
  • 21. solrconfig.xml • Lucene indexing parameters • Cache settings • Request handler configuration • HTTP cache settings • Search components, response writers, query parsers 21
  • 22. Request handlers • mini-“servlets” • SearchHandler extensions chain search components • Flexible response formatting: • &wt=[json, ruby, xslt, php, phps, javabin, python,velocity] 22
  • 23. Solr XML <add><doc> <field name="id">MA147LL/A</field> <field name="name">Apple 60 GB iPod with Video Playback Black</field> <field name="manu">Apple Computer Inc.</field> <field name="cat">electronics</field> <field name="cat">music</field> <field name="features">iTunes, Podcasts, Audiobooks</field> <field name="features">Stores up to 15,000 songs, 25,000 photos, or 150 hours of video</field> <field name="features">2.5-inch, 320x240 color TFT LCD display with LED backlight</field> <field name="features">Up to 20 hours of battery life</field> <field name="features">Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video</field> <field name="features">Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication</field> <field name="includes">earbud headphones, USB cable</field> <field name="weight">5.5</field> <field name="price">399.00</field> <field name="popularity">10</field> <field name="inStock">true</field> </doc></add> 23
  • 24. Indexing Solr XML • Via curl: curl 'http://localhost:8983/solr/update? commit=true' --data-binary @ipod_video.xml -H 'Content-type:text/ xml; charset=utf-8' • Via Solr's Java-based post tool: java -jar post.jar ipod_video.xml 24
  • 25. Indexing CSV curl 'http://localhost:8983/solr/update/csv? commit=true' --data-binary @books.csv -H 'Content- type:text/plain; charset=utf-8' 25
  • 26. Content Streams • Allows Solr server to fetch local or remote data itself. Must enable remote streaming in solrconfig.xml • http://localhost:8983/solr/update • ?stream.file=<local Solr path to exampledocs>/ipod_video.xml • ?stream.url=<url to content> • Security warning: allows Solr to fetch arbitrary server-side file or network URL content 26
  • 27. Indexing with SolrJ SolrServer solr = new CommonsHttpSolrServer( new URL("http://localhost:8983/solr")); SolrInputDocument doc = new SolrInputDocument(); doc.addField("id", "EXAMPLEDOC01"); doc.addField("title", "NOVAJUG SolrJ Example"); solr.add(doc); solr.commit(); // after a batch, not per document solr.optimize(); // periodically, if/when needed 27
  • 28. Indexing with solr-ruby solr = Connection.new( 'http://localhost:8983/solr', :autocommit => :on solr.add(:id => 123, :title => 'Solr in Action') solr.optimize # periodically, as needed 28
  • 29. delete, update, etc • Delete: • <delete><id>05991</id></delete> • <delete> <query>category:Unused</query> </delete> • java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>" • Update: simply <add> doc with same unique key • <commit/> pending documents • <optimize/> index, squeezes out deleted documents, collapses segments • <rollback/> to last commit point Update commands via GET: http://localhost:8983/solr/update?stream.body=<commit/> 29
  • 30. Data Import Handler • Indexes relational database, XML data, and e- mail sources • Supports full and incremental/delta indexing • Highly extensible with custom data sources, transformers, etc • http://wiki.apache.org/solr/DataImportHandler 30
  • 31. DIH details • Put JDBC driver JAR in <solr-home>/lib, configure dataimport request handler • http://localhost:8983/solr/db/admin/ dataimport.jsp - debugging console • http://localhost:8983/solr/db/dataimport? command=full-import - removes all documents and imports from scratch 31
  • 32. Solr Cell • aka ExtractingRequestHandler • leveraging Tika, extracts and indexes rich documents such as Word, PDF, HTML, and many other types • curl 'http://localhost:8983/solr/update/ extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html" • http://wiki.apache.org/solr/ ExtractingRequestHandler 32
  • 33. Standard Search Request • http://localhost:8983/solr/select?q=query 33
  • 34. Debug Query • &debugQuery=true is your friend • Includes parsed query, explanations, and search component timings in response 34
  • 35. Searching • Send GET HTTP requests • http://localhost:8983/solr/select? q=solr&start=0&rows=10&fl=id,name • start: zero-based starting result • rows: number of hits to return • fl: list of stored fields to return 35
  • 36. Query Parser • Controlled by defType parameter • &defType=lucene (actually a Solr extension of Lucene’s QueryParser) • &defType=dismax • Local {!...} override syntax 36
  • 37. Solr Query Parser • http://lucene.apache.org/java/2_9_1/ queryparsersyntax.html+ Solr extensions • Kitchen sink parser, includes advanced user- unfriendly syntax • Syntax errors throw parse exceptions back to client • Example: title:ipod* AND price:[0 TO 100] • http://wiki.apache.org/solr/SolrQuerySyntax 37
  • 38. Dismax Query Parser • Simplified syntax: loose text “quote phrases” -prohibited +required • Spreads query terms across query fields (qf) with dynamic boosting per field, phrase construction (pf), and boosting query and function capabilities (bq and bf) 38
  • 39. Searching with SolrJ SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr"); SolrQuery params = new SolrQuery("author:John"); params.setFields("*,score"); params.setRows(3); QueryResponse response = server.query(params); for (SolrDocument document : response.getResults()) { System.out.println("Doc: " + document); } 39
  • 40. Searching with Ruby conn = Connection.new( 'http://localhost:8983/solr') conn.query('my query') do |hit| puts hit.inspect end 40
  • 41. Built-in search components • Standard: query, facet, mlt, highlight, stats, debug • Others: elevation, clustering, term, term vector 41
  • 42. Faceting • Counts per subset within results • Facet on: field terms, queries, date ranges • &facet=on &facet.field=cat &facet.query=price:[0 TO 100] • http://wiki.apache.org/solr/ SimpleFacetParameters 42
  • 43. Spell checking • http://localhost:8983/solr/spell? q=epod&spellcheck=on&spellcheck.build =true • File or index-based dictionaries • Supports pluggable distance algorithms: Levenstein and JaroWinkler • http://wiki.apache.org/solr/ 43
  • 44. Highlighting • http://localhost:8983/solr/select? q=apple&hl=on&hl.fl=* • http://wiki.apache.org/solr/ HighlightingParameters 44
  • 45. More Like This • http://localhost:8983/solr/select? q=ipod&mlt=true&mlt.fl=manu,cat&mlt.min df=1&mlt.mintf=1&fl=id,score,name • http://wiki.apache.org/solr/MoreLikeThis 45
  • 46. Query Elevation • http://localhost:8983/solr/elevate? q=ipod&debugQuery=true&enableElevation =true • Configure an “elevate.xml” to boost/ exclude specific documents • http://wiki.apache.org/solr/ QueryElevationComponent 46
  • 47. Clustering • Dynamic grouping of documents into labeled sets • http://localhost:8983/solr/clustering? q=*:*&rows=10 • http://wiki.apache.org/solr/ ClusteringComponent • Requires additional steps to install (see documentation) with Apache Solr distro 47
  • 48. Terms • Enumerates terms from specified fields • http://localhost:8983/solr/terms? terms.fl=name&terms.sort=index&terms.pr efix=vi 48
  • 49. Term Vectors • Details term vector information: term frequency, document frequency, position and offset information • http://localhost:8983/solr/select/?q=* %3A*&qt=tvrh&tv=true&tv.all=true 49
  • 50. stats.jsp • Not technically a “request handler”, outputs only XML • http://localhost:8983/solr/admin/stats.jsp • Index stats such as number of documents, searcher open time • Request handler details, number of requests and errors, average request time, 50
  • 51. Replication • Master is polled • Replicant pulls Lucene index and optionally also Solr configuration files • Query throughput scaling: replicate and load balance • http://wiki.apache.org/solr/SolrReplication 51
  • 52. Distributed Search • Distribute documents to same-schema shards • Scaling for when single index becomes too large, or a single query becomes too slow • http://wiki.apache.org/solr/ DistributedSearch 52
  • 53. What’s new in Solr 1.4? • Java-based replication • StatsComponent • VelocityResponseWriter • TermVectorComponent (Solritas) • Configurable Directory • AJAX-Solr provider • Logging switched to SLF4J • Rollback, since last commit 53
  • 54. Lucene 2.9 • IndexReader#reopen() • Faster filter performance, by 300% in some cases • Per-segment FieldCache • Reusable token streams • Faster numeric/date range queries, thanks to trie • and tons more, see Lucene 2.9's CHANGES.txt 54
  • 55. Performance Improvements • Caching • Concurrent file access • Per-segment index updates • Faceting • DocSet generation, avoids scoring • Streaming updates for SolrJ 55
  • 56. Feature Improvements • Rich document • Multi-select faceting indexing • Speedier range • DataImportHandler queries enhancements • Duplicate detection • Smoother replication • New request handler • More choices for components logging 56
  • 57. Resources • http://wiki.apache.org/solr • solr-user@lucene.apache.org • Lucid Imagination • http://www.lucidimagination.com • Articles, webinars, blogs, and... • Search the Lucene ecosystem at: http://search.lucidimagination.com • support@lucidimagination.com 57
  • 58. Shout Out 58
  • 59. e-book now available! print coming soon http://www.manning.com/lucene 59
  • 60. LucidWorks for Solr • Certified Distribution • Value-added integration • KStemmer • Carrot2 clustering • LucidGaze for Solr • installer • Reference Manual • Solr 1.4 certified distro coming soon! 60
  • 61. LucidGaze for Solr • Monitoring tool, captures, stores, and interactively views Solr performance metrics • requests/second • time/request 61
  • 62. 62
  • 63. LucidFind http://search.lucidimagination.com/?q=novajug 63
  • 64. 64