Rapid Prototyping
           with Solr

          Erik Hatcher, Lucid Imagination
erik.hatcher @ lucidimagination.com, May 25, 2011
Abstract
§  Got data? Let's make it searchable! This interactive
    presentation will demonstrate getting documents into
    Solr quickly, will provide some tips in adjusting Solr's
    schema to match your needs better, and finally will
    discuss how showcase your data in a flexible search
    user interface. We'll see how to rapidly leverage
    faceting, highlighting, spell checking, and debugging.
    Even after all that, there will be enough time left to
    outline the next steps in developing your search
    application and taking it to production.




                                                               3
My Background
§  Erik Hatcher
   •  Lucid Imagination
      §  Technical Staff
   •  Co-author
      §  Java Development with Ant / Ant in Action (Manning)
      §  Lucene in Action (Manning)
   •  Apache Software Foundation
      §  Committer – Lucene / Solr
      §  PMC – Lucene TLP
      §  Member




                                                                4
Why prototype?
§  Demonstrate Solr can handle your data and
    searching needs; mitigate risk, learn the
    unknown
§  It’s quick and easy, with very little time
    investment
§  Immediate functional user interface impresses
    decision makers and target users;
    get buy-in
  •  The user interface IS the app



                                                    5
Prior Art
§  Hoss’ amazing ISFDB work
   •  http://www.lucidimagination.com/blog/tag/isfdb/
§  Previous “Rapid Prototyping with Solr” presentations
   •  Data.gov Catalog on Solr:
      http://www.lucidimagination.com/blog/2010/11/05/data-gov-
      on-solr/
   •  Rich text files on Solr:
      http://www.lucidimagination.com/Community/Hear-from-
      the-Experts/Podcasts-and-Videos/Rapid-Prototyping-
      Search-Applications-Solr
   •  CSV (conference attendee data) on Solr:
      http://www.slideshare.net/erikhatcher/rapid-prototyping-
      with-solr-4312681



                                                                  6
Rapid Prototyping using CSV
§  Fired up Solr’s example configuration
§  /update/csv
   •  http://localhost:8983/solr/update/csv?
      commit=true&stream.file=EuroCon2010.csv&fieldnames=fi
      rst,last,company,title,country&header=true&f.country.map
      =Great+Britain:United+Kingdom
§  Tweak configuration
   •  schema: domain-centric field names
   •  solrconfig: /browse request handler
   •  Template adjustments
§  Instant classic search results view, tree map
    visualization of facet data, and random selection of
    contest winners

                                                                 7
CSV results




              8
… using rich text files
§  curl "http://localhost:8983 /solr/update/extract?
    stream.file=/docs/file.pdf &literal.id=/docs/file.pdf




                                                            9
… using Data.Gov catalog data
§  /update/csv – again!




                                 10
Explaining




             11
Suggest




          12
Venn Viz




           13
E-commerce data
§  http://bbyopen.com/
§  Product data, via easy HTTP JSON API




                                           14
Ingesting the data
require 'solr’!
#...!
1.upto(max_pages) do |page|!
  puts "Processing page #{page}"!
  json = fetch_page(page)!
  !
  response = JSON.parse(json, :symbolize_names=>true)!
  puts "Total products: #{response[:total]}" if page == 1!
!
  mapping = {!
     :id           => :sku,!
     :name_t       => :name,!
     :thumbnail_s => :thumbnailImage,!
     :url_s        => :url,!
     :type_s       => :type,!
     :category_s   => Proc.new {|prod| !
                        prod[:categoryPath].collect {|cat| cat[:name]}.join(' >> ')},!
     :department_s => :department,!
     :class_s      => :class,!
     :subclass_s   => :subclass,!
     :sale_price_f => :salePrice!
  }!
!
  Solr::Indexer.new(response[:products], mapping, !
                     {:debug => debug, :buffer_docs => 500}).index!
end!



                                                                                         15
solr-ruby’s secret power
§  Solr::Indexer.new(
        source, mapping, options
    ).index
§  “Quacks like a duck”
§  source simply #each’s
§  mapping simply #[]’s




                                   16
… on Prism




             17
What is Prism?
§  Yet another opinionated brainstorm from Erik
§  https://github.com/lucidimagination/Prism
§  Under the covers
    •  Ruby
        §  because it’s beautiful
    •  Sinatra
        §  to be lightweight and have elegant flexible routing
    •  Velocity
        §  because it is easy to learn and use, and has powerful features, facilitates
            edit/refresh work
§  Separate from Solr, Rack-savvy, allows easy coding of new routes
    and capabilities
§  Designed to work with any arbitrary Solr instance, and already has
    some basic LucidWorks Enterprise capability
§  Totally a proof-of-concept at this point – just a quick hack

                                                                                          18
… on Solritas




                19
Solritas?
§  Pronounced: so-LAIR-uh-toss
§  Celeritas is a Latin word, translated as "swiftness" or
    "speed". It is often given as the origin of the symbol c,
    the universal notation for the speed of light - http://
    en.wikipedia.org/wiki/Celeritas
§  Technically it’s the VelocityResponseWriter
    (wt=velocity)
   •  simply passes the Solr response through the Apache
      Velocity templating engine
§  http://wiki.apache.org/solr/VelocityResponseWriter
§  Built into Solr, available instantly out of the box at:
    http://localhost:8983/solr/browse

                                                                20
… on Blacklight




                  21
Blacklight?
§  http://projectblacklight.org/
§  Blacklight is a free and open source Ruby on Rails based
    discovery interface (a.k.a. “next-generation catalog”) especially
    optimized for heterogeneous collections. You can use it as a library
    catalog, as a front end for a digital repository, or as a single-search
    interface to aggregate digital content that would otherwise be
    siloed.
§  Production sites:
       •  http://search.lib.virginia.edu/
       •  http://searchworks.stanford.edu/
§    Features:
       •  Authentication
       •  Saved searches
       •  Bookmarks – saved result items
       •  Selected items – for exporting to 3rd party systems
       •  Customizable / extensible UI

                                                                              22
Prototyping Tips and Tools
§  Get data into Solr in the simplest possible way
    •  CSV – if it fits, it’s really nice
§  Schema adjusting
    •  <dynamicField name="*" type="string" multiValued="true"/>
    •  <copyField source="*" dest="text"/>
§  Data analysis
    •  Understand what Solr is doing with your fields
    •  Solr’s Schema Browser and /admin/luke request handler
§  UI
    •  /browse – easy tweaking of <solr-home>/conf/velocity/*.vm
       templates




                                                                   23
Now what?
§  Script the indexing process: full and
    incremental/delta
§  Work with real users on real needs
§  Integrate into production systems
§  Iterate on schema enhancements and
    configuration tweaks
§  Deploy to staging/production environments and
    work at scale: collection size, real queries and
    volume, hardware and JVM settings


                                                       24
Test
§    Performance
§    Scalability
§    Relevance
§    Automate all of the above, start baselines,
      avoid regressions




                                                    25
Thanks!




          26

Rapid Prototyping with Solr

  • 1.
    Rapid Prototyping with Solr Erik Hatcher, Lucid Imagination erik.hatcher @ lucidimagination.com, May 25, 2011
  • 2.
    Abstract §  Got data?Let's make it searchable! This interactive presentation will demonstrate getting documents into Solr quickly, will provide some tips in adjusting Solr's schema to match your needs better, and finally will discuss how showcase your data in a flexible search user interface. We'll see how to rapidly leverage faceting, highlighting, spell checking, and debugging. Even after all that, there will be enough time left to outline the next steps in developing your search application and taking it to production. 3
  • 3.
    My Background §  ErikHatcher •  Lucid Imagination §  Technical Staff •  Co-author §  Java Development with Ant / Ant in Action (Manning) §  Lucene in Action (Manning) •  Apache Software Foundation §  Committer – Lucene / Solr §  PMC – Lucene TLP §  Member 4
  • 4.
    Why prototype? §  DemonstrateSolr can handle your data and searching needs; mitigate risk, learn the unknown §  It’s quick and easy, with very little time investment §  Immediate functional user interface impresses decision makers and target users; get buy-in •  The user interface IS the app 5
  • 5.
    Prior Art §  Hoss’amazing ISFDB work •  http://www.lucidimagination.com/blog/tag/isfdb/ §  Previous “Rapid Prototyping with Solr” presentations •  Data.gov Catalog on Solr: http://www.lucidimagination.com/blog/2010/11/05/data-gov- on-solr/ •  Rich text files on Solr: http://www.lucidimagination.com/Community/Hear-from- the-Experts/Podcasts-and-Videos/Rapid-Prototyping- Search-Applications-Solr •  CSV (conference attendee data) on Solr: http://www.slideshare.net/erikhatcher/rapid-prototyping- with-solr-4312681 6
  • 6.
    Rapid Prototyping usingCSV §  Fired up Solr’s example configuration §  /update/csv •  http://localhost:8983/solr/update/csv? commit=true&stream.file=EuroCon2010.csv&fieldnames=fi rst,last,company,title,country&header=true&f.country.map =Great+Britain:United+Kingdom §  Tweak configuration •  schema: domain-centric field names •  solrconfig: /browse request handler •  Template adjustments §  Instant classic search results view, tree map visualization of facet data, and random selection of contest winners 7
  • 7.
  • 8.
    … using richtext files §  curl "http://localhost:8983 /solr/update/extract? stream.file=/docs/file.pdf &literal.id=/docs/file.pdf 9
  • 9.
    … using Data.Govcatalog data §  /update/csv – again! 10
  • 10.
  • 11.
  • 12.
  • 13.
    E-commerce data §  http://bbyopen.com/ § Product data, via easy HTTP JSON API 14
  • 14.
    Ingesting the data require'solr’! #...! 1.upto(max_pages) do |page|! puts "Processing page #{page}"! json = fetch_page(page)! ! response = JSON.parse(json, :symbolize_names=>true)! puts "Total products: #{response[:total]}" if page == 1! ! mapping = {! :id => :sku,! :name_t => :name,! :thumbnail_s => :thumbnailImage,! :url_s => :url,! :type_s => :type,! :category_s => Proc.new {|prod| ! prod[:categoryPath].collect {|cat| cat[:name]}.join(' >> ')},! :department_s => :department,! :class_s => :class,! :subclass_s => :subclass,! :sale_price_f => :salePrice! }! ! Solr::Indexer.new(response[:products], mapping, ! {:debug => debug, :buffer_docs => 500}).index! end! 15
  • 15.
    solr-ruby’s secret power § Solr::Indexer.new( source, mapping, options ).index §  “Quacks like a duck” §  source simply #each’s §  mapping simply #[]’s 16
  • 16.
  • 17.
    What is Prism? § Yet another opinionated brainstorm from Erik §  https://github.com/lucidimagination/Prism §  Under the covers •  Ruby §  because it’s beautiful •  Sinatra §  to be lightweight and have elegant flexible routing •  Velocity §  because it is easy to learn and use, and has powerful features, facilitates edit/refresh work §  Separate from Solr, Rack-savvy, allows easy coding of new routes and capabilities §  Designed to work with any arbitrary Solr instance, and already has some basic LucidWorks Enterprise capability §  Totally a proof-of-concept at this point – just a quick hack 18
  • 18.
  • 19.
    Solritas? §  Pronounced: so-LAIR-uh-toss § Celeritas is a Latin word, translated as "swiftness" or "speed". It is often given as the origin of the symbol c, the universal notation for the speed of light - http:// en.wikipedia.org/wiki/Celeritas §  Technically it’s the VelocityResponseWriter (wt=velocity) •  simply passes the Solr response through the Apache Velocity templating engine §  http://wiki.apache.org/solr/VelocityResponseWriter §  Built into Solr, available instantly out of the box at: http://localhost:8983/solr/browse 20
  • 20.
  • 21.
    Blacklight? §  http://projectblacklight.org/ §  Blacklightis a free and open source Ruby on Rails based discovery interface (a.k.a. “next-generation catalog”) especially optimized for heterogeneous collections. You can use it as a library catalog, as a front end for a digital repository, or as a single-search interface to aggregate digital content that would otherwise be siloed. §  Production sites: •  http://search.lib.virginia.edu/ •  http://searchworks.stanford.edu/ §  Features: •  Authentication •  Saved searches •  Bookmarks – saved result items •  Selected items – for exporting to 3rd party systems •  Customizable / extensible UI 22
  • 22.
    Prototyping Tips andTools §  Get data into Solr in the simplest possible way •  CSV – if it fits, it’s really nice §  Schema adjusting •  <dynamicField name="*" type="string" multiValued="true"/> •  <copyField source="*" dest="text"/> §  Data analysis •  Understand what Solr is doing with your fields •  Solr’s Schema Browser and /admin/luke request handler §  UI •  /browse – easy tweaking of <solr-home>/conf/velocity/*.vm templates 23
  • 23.
    Now what? §  Scriptthe indexing process: full and incremental/delta §  Work with real users on real needs §  Integrate into production systems §  Iterate on schema enhancements and configuration tweaks §  Deploy to staging/production environments and work at scale: collection size, real queries and volume, hardware and JVM settings 24
  • 24.
    Test §  Performance §  Scalability §  Relevance §  Automate all of the above, start baselines, avoid regressions 25
  • 25.