Rapid prototyping with solr - By Erik Hatcher

Rapid Prototyping
with Solr

Erik Hatcher, Lucid Imagination
erik.hatcher @ lucidimagination.com, May 25, 2011

Abstract
§  Got data? Let's make it searchable! This interactive
presentation will demonstrate getting documents into
Solr quickly, will provide some tips in adjusting Solr's
schema to match your needs better, and finally will
discuss how to showcase your data in a flexible search
user interface. We'll see how to rapidly leverage
faceting, highlighting, spell checking, and debugging.
Even after all that, there will be enough time left to
outline the next steps in developing your search
application and taking it to production.

My Background
§  Erik Hatcher
•  Lucid Imagination
§  Technical Staff
•  Co-author
§  Java Development with Ant / Ant in Action (Manning)
§  Lucene in Action (Manning)
•  Apache Software Foundation
§  Committer – Lucene / Solr
§  PMC – Lucene TLP
§  Member

Why prototype?
§  Demonstrate Solr can handle your data and
searching needs; mitigate risk, learn the
unknown
§  It’s quick and easy, with very little time
investment
§  Immediate functional user interface impresses
decision makers and target users;
get buy-in
•  The user interface IS the app

Prior Art
§  Hoss’ amazing ISFDB work
•  http://www.lucidimagination.com/blog/tag/isfdb/
§  Previous “Rapid Prototyping with Solr” presentations
•  Data.gov Catalog on Solr:
http://www.lucidimagination.com/blog/2010/11/05/data-gov-on-solr/
•  Rich text files on Solr:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Rapid-Prototyping-Search-Applications-Solr
•  CSV (conference attendee data) on Solr:
http://www.slideshare.net/erikhatcher/rapid-prototyping-with-solr-4312681

Rapid Prototyping using CSV
§  Fired up Solr’s example configuration
§  /update/csv
•  http://localhost:8983/solr/update/csv?commit=true&stream.file=EuroCon2010.csv&fieldnames=first,last,company,title,country&header=true&f.country.map=Great+Britain:United+Kingdom
§  Tweak configuration
•  schema: domain-centric field names
•  solrconfig: /browse request handler
•  Template adjustments
§  Instant classic search results view, tree map
visualization of facet data, and random selection of
contest winners
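
The /update/csv request above can also be assembled in a script, which keeps the parameters readable while they are being tweaked. The following is a minimal Ruby sketch, assuming the Solr example server from the slide. The helper name csv_update_uri is made up for illustration; the file name, field names and country mapping are the ones in the URL above.

```ruby
require 'uri'

# Hypothetical helper: builds the /update/csv URL from a hash of parameters.
# The base URL and all parameter values are taken from the slide.
def csv_update_uri(base, params)
  uri = URI.parse("#{base}/update/csv")
  uri.query = URI.encode_www_form(params)
  uri
end

uri = csv_update_uri('http://localhost:8983/solr',
  'commit'        => 'true',
  'stream.file'   => 'EuroCon2010.csv',
  'fieldnames'    => 'first,last,company,title,country',
  'header'        => 'true',
  'f.country.map' => 'Great Britain:United Kingdom')

puts uri

# To actually send the request (needs a running Solr with remote
# streaming enabled, since stream.file reads a file on the server side):
#   require 'net/http'
#   puts Net::HTTP.get_response(uri).code
```

URI.encode_www_form percent-encodes the commas and the colon, so the printed URL looks different from the hand-written one but decodes to the same parameters.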

… using rich text files
§  curl "http://localhost:8983/solr/update/extract?stream.file=/docs/file.pdf&literal.id=/docs/file.pdf"
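
For more than a handful of files, the same request is easy to loop over. This is a minimal Ruby sketch, assuming the example server from the slide and, like the curl command, using each file path as the document id. The constant SOLR and the helpers extract_uri and extract_file are hypothetical names, not part of any library.

```ruby
require 'uri'
require 'net/http'

SOLR = 'http://localhost:8983/solr' # assumed: the example server from the slide

# Hypothetical helper: /update/extract URL for one file, with the file
# path doubling as the document id, as in the curl command.
def extract_uri(path)
  uri = URI.parse("#{SOLR}/update/extract")
  uri.query = URI.encode_www_form('stream.file' => path, 'literal.id' => path)
  uri
end

# Hypothetical helper: sends one file to Solr and returns the HTTP status code.
def extract_file(path)
  Net::HTTP.get_response(extract_uri(path)).code
end

puts extract_uri('/docs/file.pdf')

# Example use, left commented out because it needs a running Solr:
#   Dir.glob('/docs/*.pdf').each { |path| puts "#{path}: #{extract_file(path)}" }
```

Neither the curl command nor this sketch sends a commit, so unless autocommit is configured a commit is still needed before the documents show up in search results.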

… using Data.Gov catalog data
§  /update/csv – again!

E-commerce data
§  http://bbyopen.com/
§  Product data, via easy HTTP JSON API

Ingesting the data
require 'solr'
require 'json'
# ...
1.upto(max_pages) do |page|
  puts "Processing page #{page}"
  json = fetch_page(page)

  response = JSON.parse(json, :symbolize_names => true)
  puts "Total products: #{response[:total]}" if page == 1

  # Solr field name => product key, or a Proc that computes the value
  mapping = {
    :id => :sku,
    :name_t => :name,
    :thumbnail_s => :thumbnailImage,
    :url_s => :url,
    :type_s => :type,
    # Flatten the category path into a single "A >> B >> C" string
    :category_s => Proc.new { |prod|
      prod[:categoryPath].collect { |cat| cat[:name] }.join(' >> ') },
    :department_s => :department,
    :class_s => :class,
    :subclass_s => :subclass,
    :sale_price_f => :salePrice
  }

  # Map each product on this page and send the documents to Solr
  Solr::Indexer.new(response[:products], mapping,
                    {:debug => debug, :buffer_docs => 500}).index
end

solr-ruby’s secret power
§  Solr::Indexer.new(
source, mapping, options
).index
§  “Quacks like a duck”
§  source simply #each’s
§  mapping simply #[]’s
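
The duck typing on this slide can be illustrated without the gem. The sketch below is not solr-ruby's code: map_record and TinySource are made-up names and the records are invented. It only assumes what the ingest script shows, namely a source that yields records from #each and a mapping whose values are either keys looked up with #[] or Procs.

```ruby
# Illustration only: NOT solr-ruby's implementation, just the duck-typing idea.
# A mapping value is either a key to look up in the record with #[] or a
# Proc (anything responding to #call) that computes the value from the record.
def map_record(record, mapping)
  doc = {}
  mapping.each do |solr_field, rule|
    doc[solr_field] = rule.respond_to?(:call) ? rule.call(record) : record[rule]
  end
  doc
end

# Any object with #each can act as a source; this one yields made-up records.
class TinySource
  def each
    yield({ :sku => 1, :name => 'Widget',
            :categoryPath => [{ :name => 'A' }, { :name => 'B' }] })
    yield({ :sku => 2, :name => 'Gadget',
            :categoryPath => [{ :name => 'C' }] })
  end
end

mapping = {
  :id         => :sku,
  :name_t     => :name,
  :category_s => Proc.new { |prod|
    prod[:categoryPath].collect { |cat| cat[:name] }.join(' >> ') }
}

docs = []
TinySource.new.each { |record| docs << map_record(record, mapping) }
docs.each { |doc| p doc }
```

Because nothing here checks classes, an Array of hashes, a database cursor, or a paged API wrapper could stand in for TinySource as long as it responds to #each.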

What is Prism?
§  Yet another opinionated brainstorm from Erik
§  https://github.com/lucidimagination/Prism
§  Under the covers
•  Ruby
§  because it’s beautiful
•  Sinatra
§  to be lightweight and have elegant flexible routing
•  Velocity
§  because it is easy to learn and use, has powerful features, and
facilitates edit/refresh work
§  Separate from Solr, Rack-savvy, allows easy coding of new routes
and capabilities
§  Designed to work with any arbitrary Solr instance, and already has
some basic LucidWorks Enterprise capability
§  Totally a proof-of-concept at this point – just a quick hack

Solritas?
§  Pronounced: so-LAIR-uh-toss
§  Celeritas is a Latin word, translated as "swiftness" or
"speed". It is often given as the origin of the symbol c,
the universal notation for the speed of light - http://en.wikipedia.org/wiki/Celeritas
§  Technically it’s the VelocityResponseWriter
(wt=velocity)
•  simply passes the Solr response through the Apache
Velocity templating engine
§  http://wiki.apache.org/solr/VelocityResponseWriter
§  Built into Solr, available instantly out of the box at:
http://localhost:8983/solr/browse

… on Blacklight

Blacklight?
§  http://projectblacklight.org/
§  Blacklight is a free and open source Ruby on Rails based
discovery interface (a.k.a. “next-generation catalog”) especially
optimized for heterogeneous collections. You can use it as a library
catalog, as a front end for a digital repository, or as a single-search
interface to aggregate digital content that would otherwise be
siloed.
§  Production sites:
•  http://search.lib.virginia.edu/
•  http://searchworks.stanford.edu/
§  Features:
•  Authentication
•  Saved searches
•  Bookmarks – saved result items
•  Selected items – for exporting to 3rd party systems
•  Customizable / extensible UI

Prototyping Tips and Tools
§  Get data into Solr in the simplest possible way
•  CSV – if it fits, it’s really nice
§  Schema adjusting
•  <dynamicField name="*" type="string" multiValued="true"/>
•  <copyField source="*" dest="text"/>
§  Data analysis
•  Understand what Solr is doing with your fields
•  Solr’s Schema Browser and /admin/luke request handler
§  UI
•  /browse – easy tweaking of <solr-home>/conf/velocity/*.vm
templates
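
For the data analysis step, the output of the /admin/luke request handler can be summarized in a few lines of Ruby. This is a rough sketch: it assumes the JSON response (wt=json) has a top-level "fields" object keyed by field name with a "type" entry per field, which should be checked against what your Solr version returns. The helper name field_types is hypothetical and the sample JSON is made up in that assumed shape; it is not real Solr output.

```ruby
require 'json'

# Summarizes the "fields" section of a /admin/luke JSON response as a
# hash of field name => field type, sorted by field name.
# The response shape is an assumption, see the note above.
def field_types(luke_json)
  fields = JSON.parse(luke_json).fetch('fields', {})
  fields.map { |name, info| [name, info['type']] }.sort.to_h
end

# Made-up sample in the assumed shape, not real Solr output.
sample = '{"fields":{"name_t":{"type":"text","docs":120},' \
         '"id":{"type":"string","docs":120}}}'

field_types(sample).each { |name, type| puts "#{name}: #{type}" }

# Against a running Solr, the input could be fetched with something like:
#   require 'net/http'
#   json = Net::HTTP.get(URI('http://localhost:8983/solr/admin/luke?wt=json'))
```

A listing like this makes it quick to spot fields that fell through to the catch-all string type and deserve a proper definition in the schema.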

Now what?
§  Script the indexing process: full and
incremental/delta
§  Work with real users on real needs
§  Integrate into production systems
§  Iterate on schema enhancements and
configuration tweaks
§  Deploy to staging/production environments and
work at scale: collection size, real queries and
volume, hardware and JVM settings
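
The full versus incremental/delta distinction in the first bullet can be as simple as filtering on a last-modified timestamp. This is a minimal sketch under stated assumptions: the :updated_at key, the helper name records_to_index and the sample records are all hypothetical, and a real script would also need to persist the time of the last successful run.

```ruby
require 'time'

# Hypothetical record shape: each record carries an :updated_at timestamp.
# Full run:  pass since = nil, every record is returned.
# Delta run: pass the time of the last successful run, only newer records are returned.
def records_to_index(records, since)
  return records if since.nil?
  records.select { |r| Time.parse(r[:updated_at]) > since }
end

# Made-up records for illustration.
records = [
  { :id => 1, :updated_at => '2011-05-01T00:00:00Z' },
  { :id => 2, :updated_at => '2011-05-20T00:00:00Z' }
]

last_run = Time.parse('2011-05-10T00:00:00Z')

puts "full:  #{records_to_index(records, nil).map { |r| r[:id] }.inspect}"
puts "delta: #{records_to_index(records, last_run).map { |r| r[:id] }.inspect}"
```

Deletions are not covered by a filter like this; they need separate handling, for example a periodic full reindex.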

Test
§  Performance
§  Scalability
§  Relevance
§  Automate all of the above, start baselines,
avoid regressions
