Searching At Scale

Published in: Technology

  1. Search at Scale: Hadoop, Katta & Solr, the s'mores of Search
  2. Issues
     • Managing the raw data
     • Building indexes
     • Handling updates
     • Reliable search
     • Search latency
  3. The Tools
     • HDFS for raw data storage
     • SolrIndexWriter for building indexes (SOLR-1301)
     • Katta for search latency
     • Katta for reliable search
     • Brute-force Map/Reduce for index updates
     • Near Real Time updates – Jason Rutherglen
  4. HOW-TO SolrRecordWriter
     • Solr config with schema
     • An implementation of SolrDocumentConverter
     • A Hadoop cluster you can afford to trash; the wrong tuning will crash your machines.
     • ZipFile output: some compression, fewer files in your HDFS, easy deployment. Unpack with jar xf; zip will fail.
  5. SolrRecordWriter and your Cluster
     • Each SolrRecordWriter instance uses substantial quantities of system resources:
       • Processor – analyzing the input records
       • Memory – buffering the processed index records
       • I/O – optimize saturates storage devices
     • Be very careful about how many instances you run per machine.
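
The three resource limits on this slide can be turned into a rough per-machine cap. The sketch below is only an illustration: the function name, its parameters, and the "one task per disk" rule are assumptions made for the example, not settings from SOLR-1301 or Hadoop.

```python
def max_writer_slots(cores, ram_gb, heap_per_writer_gb,
                     reserved_ram_gb=2.0, disks=1):
    """Estimate how many index-writer tasks one machine can run at once.

    The estimate is the tightest of three hypothetical limits:
    one task per core (analysis is CPU bound), one task per heap-sized
    slice of free RAM (buffering), and one task per disk (optimize is
    I/O bound).
    """
    if heap_per_writer_gb <= 0:
        raise ValueError("heap_per_writer_gb must be positive")
    by_cpu = cores
    by_ram = int((ram_gb - reserved_ram_gb) // heap_per_writer_gb)
    by_disk = disks
    # Never report a negative slot count when RAM is below the reserve.
    return max(0, min(by_cpu, by_ram, by_disk))
```

Whatever number such an estimate gives, treat it as a starting point and confirm it on a cluster you can afford to crash.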
  6. Katta
     • Distributed search
     • Replicated indexes
     • Fault tolerant
     • Direct deployment from HDFS
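
To make "distributed, replicated, fault tolerant" concrete, here is a minimal scatter-gather sketch. It does not use Katta's API; `NodeDown`, `search_shard`, and `distributed_search` are hypothetical names, and each replica is modeled as a plain callable that returns `(score, doc_id)` pairs.

```python
import heapq


class NodeDown(Exception):
    """Raised by a replica that cannot answer (hypothetical failure signal)."""


def search_shard(replicas, query, k):
    """Try each replica of one shard in turn; return the first answer."""
    last_error = None
    for replica in replicas:
        try:
            return replica(query, k)
        except NodeDown as err:
            last_error = err  # fall through to the next replica
    raise NodeDown(f"all replicas failed: {last_error}")


def distributed_search(shards, query, k=10):
    """Query every shard, then merge per-shard hits into a global top k.

    `shards` maps a shard name to a list of replica callables, each
    returning a list of (score, doc_id) pairs.
    """
    hits = []
    for replicas in shards.values():
        hits.extend(search_shard(replicas, query, k))
    return heapq.nlargest(k, hits)
```

A query only fails when every replica of some shard is down, which is the property replication is meant to buy.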
  7. Katta Issues
     • Solr is a pig; run few instances per machine.
     • Large indexes can take time to copy in and start, consuming substantial I/O resources.
     • Use hftp: URIs to reference your indexes; they pass through firewalls and are independent of the HDFS version.
     • Use one of the balancing distribution policies.
     • Nodes don't handle Solr OOMs gracefully.
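
As an illustration of what a balancing distribution policy does, the sketch below greedily places each shard replica on the least-loaded node. This is not Katta's implementation; `assign_shards` and its parameters are hypothetical, and node names are assumed to be unique.

```python
def assign_shards(shards, nodes, replication=2):
    """Place each shard replica on the currently least-loaded node.

    Returns {node: [shard, ...]}. Replicas of one shard never share a node.
    """
    if replication > len(nodes):
        raise ValueError("replication level exceeds node count")
    placement = {node: [] for node in nodes}
    for shard in shards:
        # Sort by current load; ties keep the caller's node order (stable sort).
        targets = sorted(nodes, key=lambda n: len(placement[n]))[:replication]
        for node in targets:
            placement[node].append(shard)
    return placement
```

Spreading shards evenly matters here because each node pays the copy-in and Solr start-up cost for every shard it hosts.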
  8. Search Latency
     • Run as many replicas of your indexes as needed to keep latency low enough.
     • Run as many Solr front ends as needed to manage latency.
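
"As many as needed" can be estimated with Little's law (queries in flight = arrival rate × service time). The sketch below is a back-of-the-envelope calculation under assumed inputs; the function name, the per-replica concurrency, and the 50% utilization target are all hypothetical and should be replaced with measured values.

```python
import math


def replicas_needed(peak_qps, service_time_s,
                    concurrency_per_replica=1, target_utilization=0.5):
    """Estimate the replica count that keeps each replica lightly loaded."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Little's law: average number of queries being served at once.
    in_flight = peak_qps * service_time_s
    # Usable capacity of one replica at the chosen utilization.
    capacity = concurrency_per_replica * target_utilization
    return max(1, math.ceil(in_flight / capacity))
```

The same arithmetic applies to the Solr front ends, with their own service time and concurrency.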
  9. Solr Issues
     • Poorly chosen facets can cause OOMs; be careful.
     • Solr is slow to start, so rolling new indexes in takes time.
     • Solr is a black box to Katta, unlike Lucene, which Katta works with directly.
  10. Updates
     • Brute force: rebuild the entire corpus and redeploy
     • Distribute updates to deployed indexes (not implemented)
     • Merge indexes (Jason Rutherglen)
     • Distribute new indexes and handle the merge in the fronting Solr instances (not implemented)
  11. Code and Configurations
     • We run a 12-node Katta cluster with 3 masters and 3 ZooKeeper machines, 18 machines in total.
     • We give each Katta node JVM 4 GB of heap.
     • I run 1-3 Solr front-end instances with 6 GB of heap.
     • Code and configurations will be on www.prohadoop.com, for members.
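
The figures on this slide can be written down as a small inventory. The sketch below only restates the slide's numbers; the role names are made up for the example, and reading the 6 GB front-end heap as a per-instance figure is an assumption. `-Xmx` is the standard JVM maximum-heap flag.

```python
# Machine counts as stated on the slide.
machines = {"katta_node": 12, "katta_master": 3, "zookeeper": 3}

# Heap sizes in GB; the front-end figure is read as per instance (assumption).
heap_gb = {"katta_node": 4, "solr_front_end": 6}


def jvm_heap_flag(role):
    """Return the JVM maximum-heap flag for a role, e.g. -Xmx4g."""
    return f"-Xmx{heap_gb[role]}g"


total_machines = sum(machines.values())
katta_heap_total_gb = machines["katta_node"] * heap_gb["katta_node"]
```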
