Transcript

  • 1. Search at Scale: Hadoop, Katta & Solr, the s'mores of Search
  • 2. Issues
    • Managing the raw data
    • Building Indexes
    • Handling updates
    • Reliable Search
    • Search Latency
  • 3. The Tools
    • HDFS for Raw Data Storage
    • SolrRecordWriter for building indexes (SOLR-1301)
    • Katta for Search Latency
    • Katta for Reliable Search
    • Brute Force Map/Reduce for index updates
    • Near Real Time updates – Jason Rutherglen
  • 4. HOW-TO SolrRecordWriter
    • Solr config with schema
    • An implementation of SolrDocumentConverter
    • A Hadoop cluster you can afford to trash; the wrong tuning will crash your machines.
    • Zip file output – some compression, fewer files in your HDFS, easy deployment. Use jar xf to unpack; zip will fail.
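The converter's job can be illustrated with a small conceptual sketch. This is a Python analog only, not the Java SolrDocumentConverter class from the SOLR-1301 patch. The tab-separated record layout and the field names id, title and body are assumptions made for illustration; real field names have to match the Solr schema.

```python
def convert(key, value):
    """Map one raw input record to a list of documents (field name -> value).

    Assumes (hypothetically) that each record value is "title<TAB>body".
    """
    parts = value.rstrip("\n").split("\t")
    if len(parts) != 2:
        # Skip malformed records instead of failing the whole job.
        return []
    title, body = parts
    return [{"id": key, "title": title, "body": body}]


if __name__ == "__main__":
    print(convert("doc-1", "Hello\tWorld\n"))
```

Returning a list lets one input record produce zero, one, or several documents, which is handy when malformed input has to be dropped quietly.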
  • 5. SolrRecordWriter and your Cluster
    • Each SolrRecordWriter instance uses substantial quantities of system resources:
    • Processor – analyzing the input records
    • Memory – buffering the processed index records
    • I/O – an optimize pass saturates storage devices
    • Be very careful about how many instances you run per machine.
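One rough way to pick the number of instances per machine is to take the tighter of the memory and processor limits. The sketch below assumes you know the heap each writer needs and how much memory to reserve for the OS and other daemons. The function name and all figures are hypothetical, and it does not model the I/O saturation caused by optimize.

```python
def max_writer_instances(total_mem_gb, reserved_mem_gb, heap_per_writer_gb,
                         cores, cores_per_writer=1):
    """Return how many writer instances fit on one machine.

    The result is the smaller of the memory-bound and CPU-bound counts.
    """
    if heap_per_writer_gb <= 0 or cores_per_writer <= 0:
        raise ValueError("per-writer resources must be positive")
    usable_mem_gb = max(total_mem_gb - reserved_mem_gb, 0)
    by_memory = int(usable_mem_gb // heap_per_writer_gb)
    by_cpu = int(cores // cores_per_writer)
    return min(by_memory, by_cpu)


if __name__ == "__main__":
    # Hypothetical machine: 16 GB RAM, 4 GB reserved, 2 GB per writer, 8 cores.
    print(max_writer_instances(16, 4, 2, 8))
```

In practice the result would be treated as an upper bound for the per-node task slot limit and then lowered if the disks saturate during optimize.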
  • 6. Katta
    • Distributed Search
    • Replicated Indexes
    • Fault Tolerant
    • Direct deployment from hdfs
  • 7. Katta Issues
    • Solr is a pig; run few instances per machine.
    • Large indexes can take time to copy in and start, consuming substantial I/O resources.
    • Use hftp: URIs to reference your indexes; they pass through firewalls and are independent of the HDFS version.
    • Use one of the balancing distribution policies
    • Nodes don’t handle Solr OOMs gracefully
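What a balancing distribution policy aims for can be sketched as a greedy assignment: each shard replica goes to the currently least-loaded nodes. This is an illustration of the idea only, not Katta's implementation; the function and the shard and node names are hypothetical.

```python
def distribute_shards(shards, nodes, replication):
    """Assign each shard to `replication` distinct nodes, least-loaded first."""
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    load = {node: 0 for node in nodes}
    placement = {}
    for shard in shards:
        # Pick the least-loaded nodes; ties are broken by node name.
        chosen = sorted(nodes, key=lambda n: (load[n], n))[:replication]
        placement[shard] = chosen
        for node in chosen:
            load[node] += 1
    return placement


if __name__ == "__main__":
    shards = ["s%d" % i for i in range(6)]
    for shard, hosts in distribute_shards(shards, ["a", "b", "c"], 2).items():
        print(shard, hosts)
```

Counting shards is the simplest load measure; weighting by index size would be a natural refinement, since large indexes are the ones that are expensive to copy in and start.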
  • 8. Search Latency
    • Run as many replicas of your indexes as needed to ensure that your latency is low enough
    • Run as many Solr front ends as needed to manage latency.
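A back-of-the-envelope estimate for "as many as needed" is possible if you have benchmarked how many queries per second one replica sustains while staying within the latency target. The sketch below assumes that measurement exists; the function name and the numbers are hypothetical.

```python
import math


def replicas_needed(peak_qps, qps_per_replica_at_target_latency):
    """Estimate the replica count needed to serve peak load within target latency."""
    if qps_per_replica_at_target_latency <= 0:
        raise ValueError("per-replica throughput must be positive")
    # Always keep at least one replica, even with no measured load.
    return max(1, math.ceil(peak_qps / qps_per_replica_at_target_latency))


if __name__ == "__main__":
    # Hypothetical: 250 queries/s at peak, 60 queries/s per replica.
    print(replicas_needed(250, 60))
```

The same arithmetic applies to the front ends; fault tolerance usually argues for adding at least one replica beyond the estimate.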
  • 9. Solr Issues
    • Poorly chosen facets can cause OOMs; be careful.
    • Solr is slow to start, so rolling new indexes in takes time
    • Solr is a black box to Katta, unlike Lucene, with which Katta is intimate.
  • 10. Updates
    • Brute force: rebuild the entire corpus and redeploy
    • Distribute updates to deployed indexes (not implemented)
    • Merge indexes (Jason Rutherglen)
    • Distribute new indexes and handle the merge in the fronting Solr instances (not implemented)
  • 11. Code and Configurations
    • We run a 12-node Katta cluster, with 3 masters and 3 ZooKeeper machines, for 18 machines in total.
    • We give each Katta node JVM 4 GB of heap.
    • I run 1-3 Solr front-end instances with 6 GB of heap.
    • Code and configurations will be on www.prohadoop.com, for members.