Searching At Scale


Published in: Technology

  1. Search at Scale: Hadoop, Katta & Solr, the s'mores of Search
  2. Issues
       • Managing the raw data
       • Building indexes
       • Handling updates
       • Reliable search
       • Search latency
  3. The Tools
       • HDFS for raw data storage
       • SolrIndexWriter for building indexes (SOLR-1301)
       • Katta for search latency
       • Katta for reliable search
       • Brute-force Map/Reduce for index updates
       • Near-real-time updates – Jason Rutherglen
  4. HOW-TO SolrRecordWriter
       • A Solr config with a schema
       • An implementation of SolrDocumentConverter
       • A Hadoop cluster you can trash: wrong tuning will crash your machines.
       • ZipFile output: some compression, fewer files in your HDFS, easy deployment. Use jar xf to unpack; zip will fail.
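Slide 4 recommends `jar xf` for unpacking the zipped index output. As a rough alternative sketch, Python's standard `zipfile` module (which handles ZIP64 archives) can also unpack an archive. The function name `unpack_index` and all paths are hypothetical, and whether this copes with the archives the slide has in mind has not been tested here.

```python
import zipfile
from pathlib import Path


def unpack_index(archive_path, target_dir):
    """Extract a zipped index archive into target_dir.

    Returns the sorted list of member names found in the archive.
    Hypothetical helper for illustration; not part of Solr, Katta or Hadoop.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        # testzip() returns the name of the first corrupt member,
        # or None when every member passes its CRC check.
        bad_member = archive.testzip()
        if bad_member is not None:
            raise ValueError(f"corrupt member in archive: {bad_member}")
        archive.extractall(target)
        return sorted(archive.namelist())
```

Checking the archive before extraction keeps a damaged copy from being deployed half-unpacked.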
  5. SolrRecordWriter and your Cluster
       • Each SolrRecordWriter instance uses substantial quantities of system resources:
       • Processor: analyzing the input records
       • Memory: buffering the processed index records
       • I/O operations: optimize saturates storage devices
       • Be very careful about how many instances you have running per machine.
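The warning on slide 5 can be turned into a back-of-the-envelope budget. This is a minimal sketch, not a tuning rule from the talk: the function `max_writers_per_machine`, its parameters, and the numbers in the usage note are all assumptions, and the I/O limit is just a cap you measure and supply yourself.

```python
def max_writers_per_machine(cores, ram_gb, writer_heap_gb,
                            reserved_ram_gb=2.0, cores_per_writer=1,
                            io_cap=None):
    """Estimate how many index-writer instances one machine can host.

    The result is the smallest of the CPU limit, the memory limit and an
    optional, externally measured I/O limit (io_cap). All inputs are
    illustrative assumptions, not values from the slides.
    """
    if cores_per_writer <= 0 or writer_heap_gb <= 0:
        raise ValueError("cores_per_writer and writer_heap_gb must be positive")
    by_cpu = cores // cores_per_writer
    # Leave some RAM for the OS and the Hadoop daemons before dividing.
    usable_ram = max(ram_gb - reserved_ram_gb, 0)
    by_ram = int(usable_ram // writer_heap_gb)
    limits = [by_cpu, by_ram]
    if io_cap is not None:
        limits.append(io_cap)
    return max(min(limits), 0)
```

For a hypothetical 8-core machine with 16 GB of RAM and 4 GB per writer, memory is the binding limit and the estimate is 3 instances; passing `io_cap=2` lowers it to 2.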
  6. Katta
       • Distributed search
       • Replicated indexes
       • Fault tolerant
       • Direct deployment from HDFS
  7. Katta Issues
       • Solr is a pig; run few instances per machine.
       • Large indexes can take time to copy in and start, consuming substantial I/O resources.
       • Use hftp: to reference your indexes; it passes through firewalls and is independent of the HDFS version.
       • Use one of the balancing distribution policies.
       • Nodes don’t handle Solr OOMs gracefully.
  8. Search Latency
       • Run as many replicas of your indexes as needed to keep latency low enough.
       • Run as many Solr front ends as needed to manage latency.
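Slide 8 says to add replicas until latency is acceptable but gives no formula. One simple way to get a starting number is a throughput estimate, sketched below. The model (each replica sustains a fixed query rate at acceptable latency, with some headroom held back) and the names `replicas_needed`, `qps_per_replica` and `headroom` are assumptions for illustration, not something the talk specifies.

```python
import math


def replicas_needed(peak_qps, qps_per_replica, headroom=0.3):
    """Estimate the replica count needed to serve peak_qps.

    qps_per_replica is the measured rate one replica sustains while still
    meeting the latency target; headroom is the fraction of that capacity
    deliberately left unused. At least one replica is always returned.
    """
    if qps_per_replica <= 0:
        raise ValueError("qps_per_replica must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_qps = qps_per_replica * (1 - headroom)
    return max(1, math.ceil(peak_qps / usable_qps))
```

The estimate is only a first guess; the slide's advice still applies, so measure latency under load and add replicas or front ends until it is low enough.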
  9. Solr Issues
       • Poorly chosen facets can cause OOMs; be careful.
       • Solr is slow to start, so rolling new indexes in takes time.
       • Solr is a black box to Katta, unlike Lucene, which is intimate.
  10. Updates
       • Brute force: rebuild the entire corpus and redeploy
       • Distribute updates to deployed indexes (not implemented)
       • Merge indexes (Jason Rutherglen)
       • Distribute new indexes and handle the merge in the fronting Solr instances (not implemented)
  11. Code and Configurations
       • We run a 12-node Katta cluster, with 3 masters and 3 ZooKeeper machines, for 18 machines in total.
       • We give each Katta node JVM 4 GB of heap.
       • I run 1-3 Solr front-end instances with 6 GB of heap.
       • Code and configurations will be on , for members.
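The figures on slide 11 can be checked with a few lines of arithmetic. The sketch below only restates the numbers from the slide; the class `ClusterLayout` and its method names are hypothetical, and it assumes the 6 GB heap applies to each Solr front-end instance, which the slide does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterLayout:
    """Machine and heap figures as given on the slide (names are illustrative)."""
    katta_nodes: int = 12
    masters: int = 3
    zookeeper_machines: int = 3
    katta_heap_gb: int = 4
    front_end_heap_gb: int = 6  # assumed to be per front-end instance

    def total_machines(self):
        # Katta nodes, masters and ZooKeeper each run on their own machines.
        return self.katta_nodes + self.masters + self.zookeeper_machines

    def katta_heap_total_gb(self):
        # Combined JVM heap across all Katta nodes.
        return self.katta_nodes * self.katta_heap_gb

    def front_end_heap_total_gb(self, instances):
        # The slide mentions running between 1 and 3 front-end instances.
        if not 1 <= instances <= 3:
            raise ValueError("the slide describes 1 to 3 front-end instances")
        return instances * self.front_end_heap_gb


layout = ClusterLayout()
print(layout.total_machines())            # 18
print(layout.katta_heap_total_gb())       # 48
print(layout.front_end_heap_total_gb(3))  # 18
```

The 18-machine total only works out if the masters and ZooKeeper machines are separate from the 12 Katta nodes, which is how the sketch counts them.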