Rapid Prototyping with Solr: Presentation Transcript
Rapid Prototyping with Solr
NFJS - Raleigh, August 2011
Presented by Erik Hatcher, erik.firstname.lastname@example.org
Lucid Imagination - http://www.lucidimagination.com
About me...
• Co-author, "Lucene in Action" (and "Java Development with Ant" / "Ant in Action" once upon a time)
• "Apache guy" - Lucene/Solr committer; member of the Lucene PMC; member of the Apache Software Foundation
• Co-founder, evangelist, trainer, coder @ Lucid Imagination
About Lucid Imagination...
• Lucid Imagination provides commercial-grade support, training, high-level consulting, and value-added software for Lucene and Solr.
• We make Lucene "enterprise-ready" by offering:
  • Free, certified distributions and downloads.
  • Support, training, and consulting.
  • LucidWorks Enterprise, a commercial search platform built on top of Solr.
Abstract
Got data? Let's make it searchable! Rapid Prototyping with Solr will demonstrate getting documents into Solr quickly, provide some tips on adjusting Solr's schema to better match your needs, and finally discuss how to showcase your data in a flexible search user interface. We'll see how to rapidly leverage faceting, highlighting, spell checking, and debugging. Even after all that, there will be enough time left to outline the next steps in developing your search application and taking it to production.
What is Lucene?
• An open source Java-based IR library with best-practice indexing and query capabilities; fast and lightweight search and indexing.
• 100% Java (.NET, Perl, and other versions too).
• Stable, mature API.
• Continuously improved and tuned over more than 10 years.
• Cleanly implemented, easy to embed in an application.
• Compact, portable index representation.
• Programmable text analyzers, spell checking, and highlighting.
• Not a crawler or a text extraction tool.
Lucene's History
• Created by Doug Cutting in 1999
  • Built on ideas from search projects Doug created at Xerox PARC and Apple.
• Donated to the Apache Software Foundation (ASF) in 2001.
• Became an Apache top-level project in 2005.
• Has grown and morphed through the years and is now both:
  • A search library.
  • An ASF Top-Level Project (TLP) encompassing several sub-projects.
• Lucene and Solr "merged" development in early 2010.
What is Solr?
• An open source search engine.
• Indexes content sources, processes query requests, returns search results.
• Uses Lucene as the "engine", but adds full enterprise search server features and capabilities.
• A web-based application that processes HTTP requests and returns HTTP responses.
• Initially started in 2004 and developed by CNET as an in-house project to add search capability for the company website.
• Donated to the ASF in 2006.
What Version of Solr?
• There's more than one answer!
• The current, released, stable version is 3.3.
• The development release is referred to as "trunk".
  • This is where the new, less tested work goes on.
  • Also referred to as 4.0.
• LucidWorks Enterprise is built on a trunk snapshot + additional features.
Why prototype?
• Demonstrate Solr can handle your needs
• Stake/purse-holder buy-in
• It's quick, easy, and fun!
• The User Interface is the app
Schema tinkering
• Removed all example field definitions
• Uncomment and adjust the catch-all dynamic field:
  <dynamicField name="*" type="string" multiValued="false"/>
• Ensure uniqueKey is appropriate
  • Unusual in this example; disabled it
• Make every document/field fully searchable!
  <copyField source="*" dest="text"/>
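The combined effect of the catch-all dynamicField and the copyField can be sketched outside Solr. This is purely illustrative, not Solr code: every incoming field is kept as a string, and all values are also concatenated into one searchable "text" field. The sample field names are assumptions.

```python
# Sketch of what the catch-all dynamicField plus copyField achieve at
# index time: each field is stored as a string (type="string"), and all
# values are also copied into a single "text" field for searching.
def index_document(raw_doc):
    doc = {name: str(value) for name, value in raw_doc.items()}
    doc["text"] = " ".join(doc.values())  # copyField source="*" dest="text"
    return doc

doc = index_document({"Title": "NEXRAD Locations", "Agency": "NOAA"})
print(doc["text"])  # "NEXRAD Locations NOAA"
```

This is why a query against the single "text" field matches content from any original field.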
After adjusting config...
• Restart Solr
• Or... reload the core (when in multicore mode)
Clean import

# Delete all documents
curl "http://localhost:8983/solr/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E&commit=true"

# Index your data
curl "http://localhost:8983/solr/update/csv?commit=true&stream.file=EuroCon2010.csv&fieldnames=first,last,company,title,country&header=true"
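The escaped gibberish in the delete URL is just the XML command `<delete><query>*:*</query></delete>` URL-encoded for the stream.body parameter. A small sketch showing how to produce it (keeping `*`, `:`, and `/` literal to match the curl example above):

```python
from urllib.parse import quote

# The XML command Solr's /update handler reads from stream.body.
delete_all = "<delete><query>*:*</query></delete>"

# URL-encode it; only the angle brackets need escaping here, so mark
# '*', ':', and '/' as safe to match the curl example.
encoded = quote(delete_all, safe="*:/")
print(encoded)  # %3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E
```

Generating the encoding this way avoids hand-escaping mistakes when you change the delete query.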
Data.gov CSV catalog

URL,Title,Agency,Subagency,Category,Date Released,Date Updated,Time Period,Frequency,Description,Data.gov Data Category Type,Specialized Data Category Designation,Keywords,Citation,Agency Program Page,Agency Data Series Page,Unit of Analysis,Granularity,Geographic Coverage,Collection Mode,Data Collection Instrument,Data Dictionary/Variable List,Applicable Agency Information Quality Guideline Designation,Data Quality Certification,Privacy and Confidentiality,Technical Documentation,Additional Metadata,FGDC Compliance (Geospatial Only),Statistical Methodology,Sampling,Estimation,Weighting,Disclosure Avoidance,Questionnaire Design,Series Breaks,Non-response Adjustment,Seasonal Adjustment,Statistical Characteristics,Feeds Access Point,Feeds File Size,XML Access Point,XML File Size,CSV/TXT Access Point,CSV/TXT File Size,XLS Access Point,XLS File Size,KML/KMZ Access Point,KML File Size,ESRI Access Point,ESRI File Size,Map Access Point,Data Extraction Access Point,Widget Access Point
"http://www.data.gov/details/4","Next Generation Radar (NEXRAD) Locations","Department of Commerce","National Oceanic and Atmospheric Administration","Geography and Environment","1991","Irregular as needed","1991 to present","Between 4 and 10 minutes","This geospatial rendering of weather radar sites gives access to an historical archive of Terminal Doppler Weather Radar data and is used primarily for research purposes. The archived data includes base data and derived products of the National Weather Service (NWS) Weather Surveillance Radar 88 Doppler (WSR-88D) next generation (NEXRAD) weather radar. Weather radar detects the three meteorological base data quantities: reflectivity, mean radial velocity, and spectrum width. From these quantities, computer processing generates numerous meteorological analysis products for forecasts, archiving and dissemination. There are 159 operational NEXRAD radar systems deployed throughout the United States and at selected overseas locations. At the Radar Operations Center (ROC) in Norman OK, personnel from the NWS, Air Force, Navy, and FAA use this distributed weather radar system to collect the data needed to warn of impending severe weather and possible flash floods; support air traffic safety and assist in the management of air traffic flow control; facilitate resource protection at military bases; and optimize the management of water, agriculture, forest, and snow removal. This data set is jointly owned by the National Oceanic and Atmospheric Administration, Federal Aviation Administration, and Department of Defense.","Raw Data Catalog",...
Splitting keywords
• CSV handler: f.keywords.split=true
  • Stored values are split, multivalued
• Or via the schema
  • Stored value remains as in the original, single valued

<fieldType name="comma_separated" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
  </analyzer>
</fieldType>
...
<field name="keywords" type="comma_separated" indexed="true" stored="true"/>
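The schema-based approach keeps the stored value intact while index-time analysis produces one token per keyword. The split behavior of that `\s*,\s*` pattern can be checked with an equivalent regex (a stand-in for the tokenizer, with made-up sample keywords):

```python
import re

# Simulate the PatternTokenizerFactory's pattern="\s*,\s*": split on
# commas and swallow any surrounding whitespace, dropping empty tokens.
def split_keywords(raw):
    return [t for t in re.split(r"\s*,\s*", raw) if t]

print(split_keywords("radar, weather , NEXRAD,forecasting"))
# ['radar', 'weather', 'NEXRAD', 'forecasting']
```

Testing the pattern like this before putting it in the schema saves a reindex cycle when the delimiter handling is wrong.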
Suggest
• Suggest terms as the user types in the search box
• Technique: jQuery autocomplete, Solr's TermsComponent, Velocity template

http://localhost:8983/solr/terms?terms.fl=suggest&terms.prefix=sola&terms.sort=count&wt=velocity&v.template=suggest

#foreach($t in $response.response.terms.suggest)
  $t.key
#end
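Conceptually, TermsComponent with terms.prefix and terms.sort=count does a prefix match over indexed terms and orders matches by document count. A tiny stand-in (the terms and counts are invented for illustration; the real component reads them from the index):

```python
# Toy model of TermsComponent's prefix + count-sorted suggestions.
# Keys play the role of indexed terms; values are document counts.
terms = {"solr": 120, "solaris": 7, "solar": 15, "lucene": 200}

def suggest(prefix, limit=10):
    matches = [(t, c) for t, c in terms.items() if t.startswith(prefix)]
    # Highest-count terms first, as terms.sort=count does.
    return [t for t, _ in sorted(matches, key=lambda tc: -tc[1])][:limit]

print(suggest("sol"))  # ['solr', 'solar', 'solaris']
```

The jQuery autocomplete plugin then just renders whatever list the terms URL returns.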
Solritas
• Pronounced: so-LAIR-uh-toss
• Celeritas is a Latin word, translated as "swiftness" or "speed". It is often given as the origin of the symbol c, the universal notation for the speed of light - http://en.wikipedia.org/wiki/Celeritas
• VelocityResponseWriter - simply passes the Solr response through the Apache Velocity templating engine
• http://wiki.apache.org/solr/VelocityResponseWriter
Solr Flare
• Ruby on Rails plugin
• Facet field detection, autosuggest, saved searches, inverted facets, pie charts, Simile Timeline and Exhibit integration
• Useful for rapid prototyping
• See Flare's big brother, Blacklight, for production quality
Tang on Flare
• UVA radiation = blacklight
• Libraries are much more than books
• Opinionated
  • Ruby on Rails: best choice for an extensible user interface development framework
Test
• Performance
• Scalability
• Relevance
• Automate all of the above, start baselines early, avoid regressions
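One way to automate the relevance piece is a baseline of golden queries: known queries whose expected documents must stay near the top across schema and config changes. A minimal sketch, with a stubbed search function standing in for a real Solr query (all names, queries, and document IDs here are invented):

```python
# Relevance-regression sketch: each golden query's expected document
# must appear in the top 3 results, or the check reports a failure.
def search(query):
    # Stub standing in for a real Solr /select request.
    fake_index = {"nexrad": ["doc4", "doc9"], "radar": ["doc4", "doc2"]}
    return fake_index.get(query, [])

GOLDEN = {"nexrad": "doc4", "radar": "doc4"}  # assumed baseline judgments

def check_relevance():
    return [q for q, doc in GOLDEN.items() if doc not in search(q)[:3]]

print(check_relevance())  # [] means no regressions
```

Run a check like this after every schema or analysis change so relevance regressions surface as test failures rather than user complaints.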
Then what?
• Script the indexing process: full & delta
• Work with real users on actual needs
• Integrate with production systems
• Iterate on schema enhancements and configuration tweaks such as caching
• Deploy to staging/production environments and work at scale: collection size, real queries and performance, hardware and JVM settings