Introduction to Solr

  1. Introduction to Solr
     NFJS - Boston, September 2011
     Presented by Erik Hatcher
     erik.hatcher@lucidimagination.com
     Lucid Imagination
     http://www.lucidimagination.com
  2. About me...
     • Co-author, "Lucene in Action" (and "Java Development with Ant" / "Ant in Action" once upon a time)
     • "Apache guy" - Lucene/Solr committer; member of Lucene PMC, member of Apache Software Foundation
     • Co-founder, evangelist, trainer, coder @ Lucid Imagination
  3. About Lucid Imagination...
     • Lucid Imagination provides commercial-grade support, training, high-level consulting and value-added software for Lucene and Solr.
     • We make Lucene 'enterprise-ready' by offering:
       • Free, certified distributions and downloads.
       • Support, training, and consulting.
       • LucidWorks Enterprise, a commercial search platform built on top of Solr.
  4. Abstract
     Apache Solr serves search requests at enterprises and at the largest companies around the world. Built on top of the top-notch Apache Lucene library, Solr makes it straightforward to integrate indexing and searching into your applications. Solr provides faceted navigation, spell checking, highlighting, clustering, grouping, and other search features. Solr also scales query volume with replication and collection size with distributed capabilities. Solr can index rich documents such as PDF, Word, HTML, and other file types.
  5. What is Solr?
     • An open source search server
     • Indexes content sources, processes query requests, returns search results
     • Uses Lucene as the "engine", but adds full enterprise search server features and capabilities
     • A web-based application that processes HTTP requests and returns HTTP responses.
     • Initially started in 2004 and developed by CNET as an in-house project to add search capability for the company website.
     • Donated to ASF in 2006.
  6. Who uses Solr?
     And many many many many more...!
  7. Which Solr version?
     • There's more than one answer!
     • The current, released, stable version is 3.3 (soon to be 3.4)
     • The development release is referred to as "trunk".
     • This is where the new, less tested work goes on
     • Also referred to as 4.0
     • LucidWorks Enterprise is built on a trunk snapshot + additional features.
  8. What is Lucene?
     • An open source search library (not an application)
     • 100% Java
     • Continuously improved and tuned for more than 10 years
     • Compact, portable index representation
     • Programmable text analyzers, spell checking and highlighting
     • Not, itself, a crawler or a text extraction tool
  9. Inverted Index
     • Lucene stores input data in what is known as an inverted index
     • In an inverted index, each indexed term points to a list of documents that contain the term
     • Similar to the index provided at the end of a book
     • In this case "inverted" simply means that terms point to documents, rather than documents pointing to their terms
     • It is much faster to find a term in an index than to scan all the documents
  10. Inverted Index Example
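
The idea can be sketched in a few lines of Python. This is only a toy illustration, not how Lucene actually stores its index: it assumes plain whitespace tokenization and lowercasing, and the three sample documents are invented for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased, whitespace-separated term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Invented sample documents, keyed by document id.
docs = {
    1: "Solr is a search server",
    2: "Lucene is a search library",
    3: "Solr uses Lucene",
}

index = build_inverted_index(docs)
print(index["solr"])    # [1, 3]
print(index["search"])  # [1, 2]
```

Looking up a term is a single dictionary access, which is the point of the last bullet on the previous slide: no document has to be scanned at query time.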
  11. Ingestion
      • API / Solr XML, JSON, and javabin/SolrJ
      • CSV
      • Relational databases
      • File system
      • Web crawl (using Nutch, or others)
      • Others - XML feeds (e.g. RSS/Atom), e-mail
  12. Solr indexing options
  13. Solr XML
      POST to /update
      <add>
        <doc>
          <field name="id">rawxml1</field>
          <field name="content_type">text/xml</field>
          <field name="category">index example</field>
          <field name="title">Simple Example</field>
          <field name="filename">addExample.xml</field>
          <field name="text">A very simple example of adding a document to the index.</field>
        </doc>
      </add>
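
When documents come from a program rather than a hand-written file, the same `<add>` message can be generated. A minimal sketch using only the Python standard library; `build_add_xml` is a helper name made up for this example, and the field values are taken from the slide.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialize one document (a dict of field name -> value) as a Solr <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


add_xml = build_add_xml({"id": "rawxml1", "title": "Simple Example"})
print(add_xml)
# <add><doc><field name="id">rawxml1</field><field name="title">Simple Example</field></doc></add>
```

Building the message with an XML library instead of string concatenation means special characters in field values are escaped for you.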
  14. Solr JSON
      POST to /update/json
      [
        {"id" : "TestDoc1", "title" : "test1"},
        {"id" : "TestDoc2", "title" : "another test"}
      ]
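
A sketch of preparing that POST from Python with the standard library. The URL assumes the default example install on localhost:8983, the `application/json` content type is an assumption about what the handler expects, and `build_json_update` is a made-up helper name. The request is only constructed here, not sent, so the snippet runs without a Solr server.

```python
import json
import urllib.request

# Assumed default example install; adjust host, port and path for your setup.
SOLR_JSON_UPDATE = "http://localhost:8983/solr/update/json"


def build_json_update(docs, url=SOLR_JSON_UPDATE):
    """Prepare (but do not send) a POST request carrying the documents as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = [
    {"id": "TestDoc1", "title": "test1"},
    {"id": "TestDoc2", "title": "another test"},
]
request = build_json_update(docs)
print(request.get_method(), request.full_url)

# With a Solr instance running, the request could be sent with:
# urllib.request.urlopen(request)
```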
  15. CSV indexing
      • http://localhost:8983/solr/update/csv
      • Files can be sent over HTTP:
        • curl http://localhost:8983/solr/update/csv --data-binary @data.csv -H 'Content-type:text/plain; charset=utf-8'
      • or streamed from the file system:
        • curl "http://localhost:8983/solr/update/csv?stream.file=exampledocs/data.csv&stream.contentType=text/plain;charset=utf-8"
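
A sketch of the two pieces a client would typically produce: the CSV body itself, and the `stream.file` URL with its parameters properly encoded. The helper names and sample rows are invented for this example; the sketch assumes the first CSV line names the fields.

```python
import csv
import io
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed default example install; adjust for your setup.
CSV_UPDATE = "http://localhost:8983/solr/update/csv"


def rows_to_csv(fieldnames, rows):
    """Render dict rows as CSV text with a header line, quoting values where needed."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def stream_file_url(path, base=CSV_UPDATE):
    """Build the URL that asks Solr to read a CSV file from its own file system."""
    params = {
        "stream.file": path,
        "stream.contentType": "text/plain;charset=utf-8",
    }
    return base + "?" + urlencode(params)


csv_text = rows_to_csv(
    ["id", "title"],
    [{"id": 1, "title": "Hello, world"}, {"id": 2, "title": "Plain"}],
)
print(csv_text)

url = stream_file_url("exampledocs/data.csv")
print(parse_qs(urlsplit(url).query)["stream.file"])  # ['exampledocs/data.csv']
```

Letting `urlencode` handle the `;`, `=` and `/` characters avoids the shell and URL quoting problems that hand-assembled URLs tend to run into.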
  16. Rich documents
      • Solr uses Tika for extraction. Tika is a toolkit for detecting and extracting metadata and structured text content from various document formats using existing parser libraries.
      • Tika identifies MIME types and then uses the appropriate parser to extract text.
      • The ExtractingRequestHandler uses Tika to identify types and extract text, and then indexes the extracted text.
      • The ExtractingRequestHandler is sometimes called "Solr Cell", which stands for Content Extraction Library.
      • File formats include MS Office, Adobe PDF, XML, HTML, MPEG and many more.
  17. Solr Cell parameters
      • The literal parameter is very important.
        • A way to add fields to a document that Tika does not extract from the file itself.
        • &literal.id=12345
        • &literal.category=sports
      • Using curl to index a file on the file system:
        • curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"
      • Streaming a file from the file system:
        • curl "http://localhost:8983/solr/update/extract?stream.file=/some/path/news.doc&stream.contentType=application/msword&literal.id=12345"
  18. Streaming remote docs
      • Streaming a file from a URL:
        • curl "http://localhost:8983/solr/update/extract?literal.id=123&stream.url=http://www.solr.com/content/file.pdf" -H 'Content-type:application/pdf'
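
The extract URLs above follow one pattern: any number of `literal.<field>` parameters plus one stream source. A sketch of a builder for them; `extract_url` is a helper name made up for this example, and only parameters shown on the slides are used.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed default example install; adjust for your setup.
EXTRACT = "http://localhost:8983/solr/update/extract"


def extract_url(literals, stream_file=None, stream_url=None, commit=False, base=EXTRACT):
    """Build a Solr Cell URL; each entry in `literals` becomes a literal.<field> parameter."""
    params = {f"literal.{name}": value for name, value in literals.items()}
    if stream_file:
        params["stream.file"] = stream_file
    if stream_url:
        params["stream.url"] = stream_url
    if commit:
        params["commit"] = "true"
    return base + "?" + urlencode(params)


remote = extract_url(
    {"id": "123", "category": "sports"},
    stream_url="http://www.solr.com/content/file.pdf",
)
print(remote)

decoded = parse_qs(urlsplit(remote).query)
print(decoded["stream.url"])  # ['http://www.solr.com/content/file.pdf']
```

Encoding matters most for `stream.url`, whose value is itself a URL and would otherwise collide with the outer query string.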
  19. DataImportHandler
      • An "in-process" module that can be used to index data directly from relational databases and other data sources
      • Configuration driven
      • A tool that can aggregate data from multiple database tables, or even multiple data sources to be indexed as a single Solr document
      • Provides powerful and customizable data transformation tools
      • Can do full import or delta import
      • Pluggable to allow indexing of any type of data source
  20. DIH Examples
      • Rich documents
      • Relational database
      • E-mail
  21. Other commands
      • <commit/> and <optimize/>
      • <delete>...</delete>
        • <id>Q-36</id>
        • <query>category:electronics</query>
      • To update a document, simply add a document with same unique key
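
The two delete forms can be generated the same way as add messages. A small sketch with the standard library; the helper names are invented for this example, and the messages would be POSTed to /update like any other XML command.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Build a <delete> message that removes the document with the given unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Build a <delete> message that removes every document matching the query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


print(delete_by_id("Q-36"))                      # <delete><id>Q-36</id></delete>
print(delete_by_query("category:electronics"))   # <delete><query>category:electronics</query></delete>
```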
  22. Configuring Solr
      • schema.xml
        • defines field types, fields, and unique key
      • solrconfig.xml
        • Lucene settings
        • request handler, component, and plugin definitions and customizations
  23. Searching Basics
      • http://localhost:8983/solr/select?q=*:*
        • q - main query
        • rows - maximum number of "hits" to return
        • start - zero-based hit starting point
        • fl - comma-separated field list
          • * for all stored fields, score for computed Lucene score
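
Because `start` is a zero-based offset rather than a page number, paging is a matter of multiplying. A sketch of a URL builder for the four parameters above; `select_url` and its `page` argument are conveniences made up for this example, not Solr parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed default example install; adjust for your setup.
SELECT = "http://localhost:8983/solr/select"


def select_url(q="*:*", page=0, rows=10, fl="*,score", base=SELECT):
    """Build a search URL; `page` is zero-based and is translated into Solr's `start` offset."""
    params = {"q": q, "start": page * rows, "rows": rows, "fl": fl}
    return base + "?" + urlencode(params)


url = select_url(q="ipod", page=2, rows=10)
params = parse_qs(urlsplit(url).query)
print(params["start"], params["rows"])  # ['20'] ['10']
```

The third page of ten results therefore starts at hit 20.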
  24. Other Common Search Parameters
      • sort - specify sort criteria either by field(s) or function(s) in ascending or descending order
      • fq - filter queries, multiple values supported
      • wt - writer type - format of Solr response
      • debugQuery - adds debugging info to response
  25. Filtering results
      • Use fq to filter results in addition to main query constraints
      • fq results are independently cached in Solr's filterCache
      • Filter queries do not contribute to ranking scores
      • Commonly used for filtering on facets
  26. Typical Solr Request
      • http://localhost:8983/solr/select
          ?q=ipod
          &facet=on
          &facet.field=cat
          &fq=cat:electronics
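
Since `fq` and `facet.field` may each appear more than once, a request builder needs to emit repeated keys. A sketch that does so by passing a list of pairs to `urlencode`; the helper name is invented, and the second filter in the usage (`inStock:true`) assumes a field that is not on the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed default example install; adjust for your setup.
SELECT = "http://localhost:8983/solr/select"


def faceted_search_url(q, facet_fields=(), filters=(), base=SELECT):
    """Build a search URL with optional facet fields and any number of fq filters."""
    params = [("q", q)]
    if facet_fields:
        params.append(("facet", "on"))
        params.extend(("facet.field", field) for field in facet_fields)
    # One fq parameter per filter, so each can be cached independently.
    params.extend(("fq", fq) for fq in filters)
    return base + "?" + urlencode(params)


url = faceted_search_url("ipod", facet_fields=["cat"], filters=["cat:electronics"])
print(url)
# http://localhost:8983/solr/select?q=ipod&facet=on&facet.field=cat&fq=cat%3Aelectronics

# Hypothetical second filter on an assumed inStock field.
two_filters = faceted_search_url("ipod", filters=["cat:electronics", "inStock:true"])
print(parse_qs(urlsplit(two_filters).query)["fq"])  # ['cat:electronics', 'inStock:true']
```

Keeping each constraint in its own `fq`, rather than ANDing them into `q`, matches the caching and scoring behavior described on the filtering slide.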
  27. Features
      • Faceting
      • Highlighting
      • Spellchecking
      • More-like-this
      • Clustering
      • Grouping
      • Distributed search
      • Replication
      • Suggest
      • Geospatial support
      • UIMA integration
      • Extensible
  28. Integration
      • It's just HTTP
        • and CSV, JSON, XML, etc. on the requests and responses
      • Any language or environment can work with Solr easily
      • Many libraries/layers exist on top
  29. Ruby indexing example
  30. SolrJ searching example
      SolrServer solrServer = new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrQuery query = new SolrQuery();
      query.setQuery(userQuery);
      query.setFacet(true);
      query.setFacetMinCount(1);
      query.addFacetField("category");
      QueryResponse queryResponse = solrServer.query(query);
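
On the response side, a client without a library such as SolrJ has to unpack the facet counts itself. A sketch, assuming the response was requested with `wt=json` and that facet counts arrive as a flat list alternating value and count; the sample response below is hand-written for illustration, not output captured from a real server.

```python
def facet_counts(response, field):
    """Turn the flat [value, count, value, count, ...] facet list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hand-written sample in the assumed shape of a parsed wt=json response.
sample = {
    "response": {"numFound": 3, "start": 0, "docs": []},
    "facet_counts": {
        "facet_queries": {},
        "facet_fields": {"category": ["electronics", 2, "music", 1]},
    },
}

counts = facet_counts(sample, "category")
print(counts)  # {'electronics': 2, 'music': 1}
```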
  31. Devilish Details
      • analysis: tokenization and token filtering
      • query parsing
      • relevancy tuning
      • performance and scalability
  32. SolrMeter
      http://code.google.com/p/solrmeter/
  33. e.g. data.gov
  34. Data.gov CSV catalog
      URL,Title,Agency,Subagency,Category,Date Released,Date Updated,Time Period,Frequency,Description,Data.gov Data Category Type,Specialized Data Category Designation,Keywords,Citation,Agency Program Page,Agency Data Series Page,Unit of Analysis,Granularity,Geographic Coverage,Collection Mode,Data Collection Instrument,Data Dictionary/Variable List,Applicable Agency Information Quality Guideline Designation,Data Quality Certification,Privacy and Confidentiality,Technical Documentation,Additional Metadata,FGDC Compliance (Geospatial Only),Statistical Methodology,Sampling,Estimation,Weighting,Disclosure Avoidance,Questionnaire Design,Series Breaks,Non-response Adjustment,Seasonal Adjustment,Statistical Characteristics,Feeds Access Point,Feeds File Size,XML Access Point,XML File Size,CSV/TXT Access Point,CSV/TXT File Size,XLS Access Point,XLS File Size,KML/KMZ Access Point,KML File Size,ESRI Access Point,ESRI File Size,Map Access Point,Data Extraction Access Point,Widget Access Point
      "http://www.data.gov/details/4","Next Generation Radar (NEXRAD) Locations","Department of Commerce","National Oceanic and Atmospheric Administration","Geography and Environment","1991","Irregular as needed","1991 to present","Between 4 and 10 minutes","This geospatial rendering of weather radar sites gives access to an historical archive of Terminal Doppler Weather Radar data and is used primarily for research purposes. The archived data includes base data and derived products of the National Weather Service (NWS) Weather Surveillance Radar 88 Doppler (WSR-88D) next generation (NEXRAD) weather radar. Weather radar detects the three meteorological base data quantities: reflectivity, mean radial velocity, and spectrum width. From these quantities, computer processing generates numerous meteorological analysis products for forecasts, archiving and dissemination. There are 159 operational NEXRAD radar systems deployed throughout the United States and at selected overseas locations. At the Radar Operations Center (ROC) in Norman OK, personnel from the NWS, Air Force, Navy, and FAA use this distributed weather radar system to collect the data needed to warn of impending severe weather and possible flash floods; support air traffic safety and assist in the management of air traffic flow control; facilitate resource protection at military bases; and optimize the management of water, agriculture, forest, and snow removal. This data set is jointly owned by the National Oceanic and Atmospheric Administration, Federal Aviation Administration, and Department of Defense.","Raw Data Catalog",...
  35. Debugging
      http://localhost:8983/solr/data.gov?q=searching&debugQuery=true
  36. Custom pages
      • Document detail page
      • Multiple query intersection comparison with Venn visualization
  37. Document detail
      http://localhost:8983/solr/data.gov/document?id=http%3A%2F%2Fwww.data.gov%2Fdetails%2F61
  38. Query intersection
      • Just showing off... how easy it is to do something with a bit of visual impact
      • Compare three independent queries, intersecting them in a Venn diagram visualization
  39. What now?
      • Download Solr
      • "install" it (unzip it)
      • Start Solr: java -jar start.jar
      • Ingest your data
      • Iterate on schema & config
      • Ship It!
  40. UI / prototyping
      • Solritas - aka VelocityResponseWriter
      • Blacklight - projectblacklight.org
  41. Blacklight @ UVa
  42. Blacklight @ Stanford
  43. For more information...
      • http://www.lucidimagination.com
      • LucidFind
        • search Lucene ecosystem: mailing lists, wikis, JIRA, etc.
        • http://search.lucidimagination.com
      • Getting started with LucidWorks Enterprise:
        • http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise
      • http://lucene.apache.org/solr - wiki, e-mail lists
  44. LucidFind
      http://www.lucidimagination.com/search/?q=user+interface
  45. Thank You!