Hadoop Summit 2010 Data Management On Grid


Usage Rights

© All Rights Reserved


    Presentation Transcript

    • Data Management on Hadoop @ Yahoo!
      Srikanth Sundarrajan, Principal Engineer
    • Why is Data Management important?
      • Large datasets are incentives for users to come to the grid
      • Volume of data movement
      • Cluster access / partitioning (research & production purposes)
      • Resource consumption
      • SLAs on data availability
      • Data retention
      • Regulatory compliance
      • Data conversion
    • Data volumes
      • Steady growth in data volumes
      [Chart: data movement into the grid per day, in TB; vertical axis from 0 to 40]
    • Data Acquisition Service
      [Diagram: a Source feeds the Data Acquisition Service, which loads data into target Clusters 1, 2, and 3, each with its own JT and HDFS]
      • Replication & Retention are additional services that handle cross-cluster data movement and data purge, respectively
    • Pluggable interfaces
      • Different warehouses may use different interfaces to expose data (e.g. http, scp, ftp, or some proprietary mechanism)
      • The acquisition service should be generic and make it easy to plug in interfaces to support newer warehouses
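
One way to picture the pluggable-interface idea is a small registry of fetchers keyed by URI scheme. The following is a minimal Python sketch under stated assumptions, not the implementation described in the talk: `Fetcher`, `register`, `fetcher_for`, and the in-memory `MemFetcher` (with its `mem://` scheme) are all hypothetical names introduced for illustration.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class Fetcher(ABC):
    """Hypothetical plugin interface: one implementation per warehouse protocol."""

    @abstractmethod
    def fetch(self, uri: str) -> bytes:
        """Return the raw bytes behind `uri`."""


# Maps a URI scheme to the Fetcher class that handles it.
_REGISTRY = {}


def register(scheme):
    """Class decorator that associates a URI scheme with a Fetcher class."""
    def wrap(cls):
        _REGISTRY[scheme] = cls
        return cls
    return wrap


def fetcher_for(uri):
    """Pick the plugin for `uri` based on its scheme."""
    scheme = urlparse(uri).scheme
    try:
        return _REGISTRY[scheme]()
    except KeyError:
        raise ValueError(f"no fetcher registered for scheme {scheme!r}") from None


@register("mem")
class MemFetcher(Fetcher):
    """Stand-in plugin for illustration; a real one would speak http, scp, ftp, etc."""

    DATA = {"mem://warehouse/feed1": b"a\tb\tc\n"}

    def fetch(self, uri):
        return self.DATA[uri]
```

With this shape, supporting a newer warehouse would mean adding one more registered class, while the code that calls `fetcher_for(uri).fetch(uri)` stays unchanged.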
    • Data load & conversion
      • Heavy lifting is delegated to map-reduce jobs, keeping the acquisition service light
      • Data load is executed as a map-reduce job
      • Data conversion runs as a map-reduce job (to enable faster data processing post acquisition)
        • Field inclusion/removal
        • Data filtering
        • Data anonymization
        • Data format conversion (raw delimited / Hadoop sequence file)
      • Cluster-to-cluster copy is a map-reduce job
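
The conversion steps listed above (field inclusion/removal, filtering, anonymization) amount to a per-record transformation of the kind a map task would apply. Here is a minimal sketch in plain Python, assuming tab-delimited input; the function `convert_record` and its parameters are hypothetical, and the truncated SHA-256 hash is only a placeholder for whatever anonymization scheme is actually used.

```python
import hashlib


def convert_record(line, columns, keep, anonymize=(), predicate=None, sep="\t"):
    """Map-style conversion of one delimited record.

    Returns the converted line, or None when the record is filtered out.
    """
    row = dict(zip(columns, line.rstrip("\n").split(sep)))

    # Data filtering: drop records the predicate rejects.
    if predicate is not None and not predicate(row):
        return None

    out = []
    for name in keep:  # field inclusion/removal: only `keep` columns survive
        value = row[name]
        if name in anonymize:
            # Placeholder anonymization (an assumption): one-way hash, truncated.
            value = hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]
        out.append(value)
    return sep.join(out)
```

In a map-reduce setting, each mapper would apply a function like this to the records in its input split and emit only the non-`None` results.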
    • Warehouse & Cluster isolation
      • Source warehouses have diverse capacity, often constrained
      • Different clusters can have different versions of Hadoop, and cluster performance may not be uniform
      • Need for isolation at a warehouse & cluster level, and resource usage limits at a warehouse level
    • Job throttling
      [Diagram: discovery threads feed a queue per source; job execution threads take work from the queues and, after resource negotiation, launch asynchronous map-reduce jobs on Cluster 1 through Cluster N]
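
The queue-per-source arrangement on this slide can be illustrated with a small dispatcher that enforces a concurrency limit for each source. This is a single-threaded Python sketch under assumptions: the class `SourceThrottle`, its method names, and the per-source limits are hypothetical, and the threading and cluster resource negotiation mentioned on the slide are deliberately left out.

```python
from collections import deque


class SourceThrottle:
    """Keeps one FIFO queue per source and caps concurrent jobs per source."""

    def __init__(self, limits):
        # limits: source name -> max concurrent jobs (a hypothetical setting)
        self.limits = dict(limits)
        self.queues = {source: deque() for source in self.limits}
        self.running = {source: 0 for source in self.limits}

    def discover(self, source, job):
        """Called by discovery: enqueue a newly found job for its source."""
        self.queues[source].append(job)

    def dispatch(self):
        """Start queued jobs while each source is under its limit.

        Returns the (source, job) pairs that were started in this pass.
        """
        started = []
        for source, queue in self.queues.items():
            while queue and self.running[source] < self.limits[source]:
                self.running[source] += 1
                started.append((source, queue.popleft()))
        return started

    def finished(self, source):
        """Called when a job for `source` completes, freeing one slot."""
        self.running[source] -= 1
```

Because each source has its own queue and counter, a slow or constrained warehouse only delays its own jobs; the other sources keep dispatching.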
    • Other things in consideration
      • SLA, feed priority, and frequency are taken into account when scheduling data loads
      • Retention removes old data (as required for legal compliance and for capacity purposes)
      • Interoperability across Hadoop versions
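
As a rough illustration of the retention point, selecting what to purge can be reduced to comparing each data partition's date against a cutoff. The sketch below assumes date-partitioned feeds and a retention period in days; `expired_partitions` and its parameters are hypothetical names, and the actual deletion from HDFS is not shown.

```python
from datetime import date, timedelta


def expired_partitions(partition_dates, retention_days, today):
    """Return, oldest first, the partition dates older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    # A partition dated exactly on the cutoff is still retained.
    return sorted(d for d in partition_dates if d < cutoff)
```

Passing `today` explicitly, rather than reading the clock inside the function, keeps the selection repeatable and easy to test.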
    • Thanks!