Nov HUG 2009: Searching At Scale



Jason Rutherglen and Jason Venner presented this talk at the Hadoop User Group meetup at the Yahoo! campus in Sunnyvale on 11/18/09. In it, they detail an interesting solution for searching at scale using Katta, Solr, Lucene and Hadoop.

Published in: Technology


  1. Search at Scale: Hadoop, Katta & Solr, the s'mores of Search
  2. Issues
     - Managing the raw data
     - Building Indexes
     - Handling updates
     - Reliable Search
     - Search Latency
  3. The Tools
     - HDFS for Raw Data Storage
     - SolrRecordWriter for building indexes (SOLR-1301)
     - Katta for Search Latency
     - Katta for Reliable Search
     - Brute Force Map/Reduce for index updates
     - Near Real Time updates – Jason Rutherglen
  4. HOW-TO SolrRecordWriter
     - Solr config with schema
     - An implementation of SolrDocumentConverter
     - A Hadoop cluster you can trash; wrong tuning will crash your machines.
     - ZipFile output – some compression, fewer files in your HDFS, easy deployment. Use jar xf to unpack; zip will fail.
  5. SolrRecordWriter and your Cluster
     - Each SolrRecordWriter instance uses substantial quantities of system resources:
     - Processor – analyzing the input records
     - Memory – buffering the processed index records
     - I/O – optimize saturates storage devices
     - Be very careful about how many instances you run per machine.
  6. Katta
     - Distributed Search
     - Replicated Indexes
     - Fault Tolerant
     - Direct deployment from HDFS
  7. Katta Issues
     - Solr is a pig; run few instances per machine.
     - Large indexes can take time to copy in and start, consuming substantial I/O resources.
     - Use hftp: to reference your indexes; it passes through firewalls and is independent of the HDFS version.
     - Use one of the balancing distribution policies.
     - Nodes don’t handle Solr OOMs gracefully.
  8. Search Latency
     - Run as many replicas of your indexes as needed to keep latency low enough.
     - Run as many Solr front ends as needed to manage latency.
  9. Solr Issues
     - Poorly chosen facets can cause OOMs; be careful.
     - Solr is slow to start, so rolling new indexes in takes time.
     - Solr is a black box to Katta, unlike Lucene, which Katta works with intimately.
  10. Updates
     - Brute force: rebuild the entire corpus and redeploy
     - Distribute updates to deployed indexes (not implemented)
     - Merge indexes (Jason Rutherglen)
     - Distribute new indexes and handle the merge in the fronting Solr instances (not implemented)
  11. Code and Configurations
     - We run a 12-node Katta cluster, with 3 masters and 3 ZooKeeper machines, for 18 machines in total.
     - We give each Katta node JVM 4 GB of heap.
     - I run 1-3 Solr front-end instances with 6 GB of heap.
     - Code and configurations will be on , for members.
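
The sizing figures on the last slide, together with the earlier warning about how many instances to run per machine, can be checked with a little arithmetic. The following is only a sketch, not the presenters' tooling: the `ClusterLayout` class and the `max_instances_per_machine` helper are hypothetical names, and the machine RAM and reserved-memory figures in the usage lines are assumptions that do not appear in the talk. Only the node counts and heap sizes come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterLayout:
    """Cluster figures quoted on the 'Code and Configurations' slide."""
    katta_nodes: int = 12            # Katta nodes (from the slide)
    masters: int = 3                 # master machines (from the slide)
    zookeeper_machines: int = 3      # ZooKeeper machines (from the slide)
    katta_heap_gb: int = 4           # heap per Katta node JVM (from the slide)
    solr_frontend_heap_gb: int = 6   # heap per Solr front end (from the slide)

    def total_machines(self) -> int:
        # 12 + 3 + 3 = 18, matching the slide's total.
        return self.katta_nodes + self.masters + self.zookeeper_machines

    def katta_heap_total_gb(self) -> int:
        # Aggregate heap across all Katta node JVMs, one JVM per node.
        return self.katta_nodes * self.katta_heap_gb

    def frontend_heap_total_gb(self, frontends: int) -> int:
        # Aggregate heap for the given number of Solr front ends (1-3 in the talk).
        return frontends * self.solr_frontend_heap_gb


def max_instances_per_machine(ram_gb: int, heap_gb: int, reserved_gb: int) -> int:
    """Hypothetical helper: how many JVMs of heap_gb fit in ram_gb after
    setting aside reserved_gb for the OS and page cache.

    Heap is only a lower bound on a JVM's real footprint, so treat the
    result as an upper limit rather than a recommendation.
    """
    usable_gb = ram_gb - reserved_gb
    return max(usable_gb // heap_gb, 0)


layout = ClusterLayout()
print(layout.total_machines())              # machines in the cluster
print(layout.katta_heap_total_gb())         # total Katta heap in GB
print(layout.frontend_heap_total_gb(3))     # heap for three front ends in GB
# ASSUMPTION: 16 GB machines with 8 GB held back; neither number is from the talk.
print(max_instances_per_machine(ram_gb=16, heap_gb=4, reserved_gb=8))
```

Holding back a generous share of RAM reflects the slides' point that index copying and optimize are I/O-heavy, so the page cache matters as much as heap.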