Hadoop Summit 2010 Data Management On Grid

Transcript

  • 1. Data Management on Hadoop @ Yahoo!
    Srikanth Sundarrajan, Principal Engineer
  • 2. Why is Data Management important?
    • Large datasets are the incentive for users to come to the grid
    • Volume of data movement
    • Cluster access / partitioning (for research and production purposes)
    • Resource consumption
    • SLAs on data availability
    • Data retention
    • Regulatory compliance
    • Data conversion
  • 3. Data volumes
    • Steady growth in data volumes
    [Chart: data movement per day into the grid, in TB]
  • 4. Data Acquisition Service
    [Diagram: the Data Acquisition Service moves data from a source warehouse into multiple target clusters (Cluster 1 to Cluster 3), each running a JobTracker (JT) on top of HDFS]
    • Replication and Retention are additional services that handle cross-cluster data movement and data purge, respectively
  • 5. Pluggable interfaces
    • Different warehouses may use different interfaces to expose data (e.g. HTTP, SCP, FTP, or some proprietary mechanism)
    • The acquisition service should be generic, with the ability to plug in interfaces easily to support newer warehouses (see the interface sketch below)
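A minimal Java sketch of what such a plugin boundary might look like. The WarehouseConnector name and both method signatures are assumptions for illustration, not the actual Yahoo! interface; the point is only that each warehouse protocol hides behind one small API.

import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Hypothetical plugin boundary: one implementation per warehouse protocol
// (HTTP, SCP, FTP, or a proprietary transport), chosen by configuration.
public interface WarehouseConnector {

  // List the remote files that make up one partition of a feed,
  // for example a day's worth of a log feed.
  List<String> discover(String feed, String partition) throws IOException;

  // Open a single remote file so a copy task can stream it into HDFS.
  InputStream open(String path) throws IOException;
}

An implementation could then be selected per warehouse from configuration, for instance by instantiating a configured class name via reflection, the same pattern Hadoop itself uses for pluggable components.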
  • 6. Data load & conversion
    • Heavy lifting is delegated to map-reduce jobs, keeping the acquisition service light
    • Data load is executed as a map-reduce job
    • Data conversion runs as a map-reduce job, to enable faster data processing post-acquisition (see the mapper sketch below):
      – Field inclusion/removal
      – Data filtering
      – Data anonymization
      – Data format conversion (raw delimited / Hadoop sequence file)
    • Cluster-to-cluster copy is a map-reduce job
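To make the conversion step concrete, here is a hedged sketch of such a mapper on the Hadoop 0.20 mapreduce API: it keeps a subset of fields, drops malformed rows, and anonymizes one column by hashing it. The tab delimiter, the column positions, and MD5-based anonymization are illustrative assumptions, not the actual pipeline.

import java.io.IOException;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical conversion mapper: field projection, row filtering and
// anonymization in a single map-only pass over raw delimited input.
public class ConversionMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t", -1);
    if (fields.length < 4) {
      return;                       // data filtering: drop malformed rows
    }
    // field inclusion/removal: keep only columns 0, 2 and 3;
    // anonymization: replace the user id in column 2 with an MD5 hash
    String anonId = DigestUtils.md5Hex(fields[2]);
    context.write(NullWritable.get(),
        new Text(fields[0] + "\t" + anonId + "\t" + fields[3]));
  }
}

Pointing the job at SequenceFileOutputFormat in its setup would cover the raw-delimited-to-sequence-file conversion case from the slide.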
  • 7. Warehouse & cluster isolation
    • Source warehouses have diverse capacity and are often constrained
    • Different clusters can run different versions of Hadoop, and cluster performance may not be uniform
    • Hence the need for isolation at the warehouse and cluster level, and for resource usage limits per warehouse
  • 8. Job throttling
    [Diagram: discovery threads feed a queue per source; job execution threads drain those queues and submit map-reduce jobs asynchronously, after resource negotiation, to Cluster 1 through Cluster N]
    (see the throttling sketch below)
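One plausible shape for this throttling, sketched in plain Java under stated assumptions: a semaphore per source caps in-flight jobs for any one warehouse, and a fixed pool bounds the total number of job execution threads. The class name and both limits are hypothetical.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Hypothetical scheduler: a global pool bounds job execution threads,
// while a per-source semaphore keeps any one warehouse within its limit.
public class ThrottledScheduler {

  private static final int PER_SOURCE_LIMIT = 4;  // assumed per-warehouse cap
  private final ExecutorService jobThreads = Executors.newFixedThreadPool(16);
  private final ConcurrentMap<String, Semaphore> perSource =
      new ConcurrentHashMap<String, Semaphore>();

  // Called by a discovery thread; blocks when the source is at its cap,
  // which applies back-pressure to discovery for that warehouse only.
  public void submit(String source, final Runnable mapReduceJob)
      throws InterruptedException {
    perSource.putIfAbsent(source, new Semaphore(PER_SOURCE_LIMIT));
    final Semaphore slots = perSource.get(source);
    slots.acquire();
    jobThreads.submit(new Runnable() {
      public void run() {
        try {
          mapReduceJob.run();        // asynchronous map-reduce submission
        } finally {
          slots.release();           // free the slot when the job returns
        }
      }
    });
  }
}

Blocking the discovery thread on acquire() is the simplest form of back-pressure; a fuller service would more likely park the work item in the per-source queue shown in the diagram.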
  • 9. Other considerations
    • SLA, feed priority, and frequency are considered when scheduling data loads
    • Retention removes old data, as required for legal compliance and for capacity reasons (see the purge sketch below)
    • Interoperability across Hadoop versions
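As one example, the core of a retention purge can be written against the standard Hadoop FileSystem API: delete feed partitions whose modification time falls outside the retention window. The directory-per-partition layout and the use of modification time are assumptions about the feed layout, not the actual service.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical retention purge: remove partition directories under a feed
// root once they age past the retention window.
public class RetentionPurge {

  public static void purge(Path feedRoot, long retentionMillis)
      throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    long cutoff = System.currentTimeMillis() - retentionMillis;
    FileStatus[] partitions = fs.listStatus(feedRoot);
    if (partitions == null) {
      return;                                 // feed directory does not exist
    }
    for (FileStatus partition : partitions) {
      if (partition.isDir() && partition.getModificationTime() < cutoff) {
        fs.delete(partition.getPath(), true); // recursive delete of old data
      }
    }
  }
}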
  • 10. Thanks!
