Data Management on Hadoop @ Yahoo!

Srikanth Sundarrajan
Principal Engineer
Hadoop Summit 2010 Data Management On Grid

Why is Data Management important?

• Large datasets are incentives for users to come to the grid
• Volume of data movement
• Cluster access / partitioning (research & production purposes)
• Resource consumption
• SLAs on data availability
• Data retention
• Regulatory compliance
• Data conversion
Data volumes

• Steady growth in data volumes (data movement per day, into the grid)

[Bar chart: TB of data moved into the grid per day]
Data Acquisition Service

[Diagram: source warehouses feed a central Data Acquisition Service, which loads data into the JobTracker (JT) / HDFS of target Clusters 1–3]

• Replication & Retention are additional services that handle cross-cluster data movement and data purge respectively
Pluggable interfaces

• Different warehouses may use different interfaces to expose data (e.g. HTTP, SCP, FTP, or some proprietary mechanism)
• The acquisition service should be generic, with the ability to plug in interfaces easily to support newer warehouses
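One way to realize this pluggability is a connector registry keyed by URI scheme, so that adding support for a new warehouse means registering one new class. The sketch below is illustrative only; the class and method names (`SourceConnector`, `list_files`, `fetch`) are hypothetical, not the actual service's API.

```python
from abc import ABC, abstractmethod


class SourceConnector(ABC):
    """Common interface every warehouse connector implements (hypothetical)."""

    @abstractmethod
    def list_files(self, path):
        ...

    @abstractmethod
    def fetch(self, remote_path, local_path):
        ...


# Registry keyed by URI scheme; a new warehouse plugs in here.
CONNECTORS = {}


def register(scheme):
    def wrap(cls):
        CONNECTORS[scheme] = cls
        return cls
    return wrap


@register("http")
class HttpConnector(SourceConnector):
    def list_files(self, path):
        return [path]  # in practice, parse a directory listing

    def fetch(self, remote_path, local_path):
        return f"GET {remote_path} -> {local_path}"


@register("ftp")
class FtpConnector(SourceConnector):
    def list_files(self, path):
        return [path]

    def fetch(self, remote_path, local_path):
        return f"RETR {remote_path} -> {local_path}"


def connector_for(url):
    """Pick a connector based on the URL's scheme."""
    scheme = url.split("://", 1)[0]
    return CONNECTORS[scheme]()
```

The acquisition service itself only ever sees `SourceConnector`, so supporting a proprietary protocol is a matter of shipping one more registered class.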
Data load & conversion

• Heavy lifting delegated to MapReduce jobs, keeping the acquisition service light
• Data load executed as a MapReduce job
• Data conversion as a MapReduce job (to enable faster data processing post-acquisition)
  – Field inclusion/removal
  – Data filtering
  – Data anonymization
  – Data format conversion (raw delimited / Hadoop sequence file)
• Cluster-to-cluster copy is a MapReduce job
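The per-record conversion work (field inclusion, filtering, anonymization) is naturally map-only: each mapper transforms one delimited line independently. A minimal sketch of such a record-level transform, assuming a hypothetical four-field layout where column 2 holds a user identifier:

```python
import hashlib

# Hypothetical layout: keep columns 0, 2, 3 and anonymize column 2.
KEEP_FIELDS = [0, 2, 3]
ANON_FIELDS = {2}


def anonymize(value):
    """One-way hash so the raw identifier never lands on the grid."""
    return hashlib.sha1(value.encode()).hexdigest()[:12]


def convert_record(line, delimiter="\t"):
    """Apply field inclusion, filtering, and anonymization to one record.

    Returns the converted line, or None to filter the record out.
    """
    fields = line.rstrip("\n").split(delimiter)
    # Data filtering: drop malformed or keyless rows.
    if len(fields) <= max(KEEP_FIELDS) or fields[0] == "":
        return None
    out = []
    for i in KEEP_FIELDS:
        value = fields[i]
        if i in ANON_FIELDS:
            value = anonymize(value)
        out.append(value)
    return delimiter.join(out)
```

Wrapped in a mapper (e.g. via Hadoop Streaming), this runs in parallel across input splits, which is what keeps the acquisition service itself light.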
Warehouse & Cluster isolation

• Source warehouses have diverse capacity and are often constrained
• Different clusters can have different versions of Hadoop, and cluster performance may not be uniform
• Need isolation at the warehouse & cluster level, and resource-usage limits at the warehouse level
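Warehouse-level limits can be expressed as a small per-source configuration that the scheduler consults before launching another load. The warehouse names and limit values below are invented for illustration; the slide does not specify the actual configuration format.

```python
# Per-warehouse resource limits (illustrative names and values).
WAREHOUSE_LIMITS = {
    "weblogs": {"max_parallel_loads": 4, "max_bandwidth_mbps": 200},
    "clicks":  {"max_parallel_loads": 2, "max_bandwidth_mbps": 50},
}


def can_schedule(warehouse, active_loads):
    """Admit a new load only while the warehouse is under its cap,
    so a constrained source is never overwhelmed by the grid."""
    limit = WAREHOUSE_LIMITS[warehouse]["max_parallel_loads"]
    return active_loads < limit
```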
Job throttling

[Diagram: discovery threads enqueue work into a queue per source; job execution threads drain the queues and launch asynchronous MapReduce jobs, post resource negotiation, on Clusters 1–N]
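The diagram's structure — discovery threads that only enqueue, a queue per source, and execution threads that respect a per-source cap — can be sketched as a small throttler. This is a guess at the mechanics, not the actual implementation; the class and method names are hypothetical.

```python
import queue
import threading


class Throttler:
    """Per-source queues; execution threads drain them under a per-source cap."""

    def __init__(self, max_jobs_per_source=2):
        self.queues = {}    # source -> queue of discovered feeds
        self.running = {}   # source -> count of in-flight jobs
        self.limit = max_jobs_per_source
        self.lock = threading.Lock()

    def discovered(self, source, feed):
        # Discovery threads only enqueue; they never block on execution.
        self.queues.setdefault(source, queue.Queue()).put(feed)

    def next_job(self, source):
        # Execution threads take work only while the source is under its cap.
        with self.lock:
            if self.running.get(source, 0) >= self.limit:
                return None
            q = self.queues.get(source)
            if q is None or q.empty():
                return None
            self.running[source] = self.running.get(source, 0) + 1
            return q.get_nowait()

    def finished(self, source):
        # Called when the async MapReduce job completes, freeing a slot.
        with self.lock:
            self.running[source] -= 1
```

Because the launched MapReduce jobs are asynchronous, `finished` is the hook that keeps the in-flight count honest and lets the next queued feed through.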
Other considerations

• SLA, feed priority & frequency considered when scheduling data loads
• Retention to remove old data (as required for legal compliance and for capacity purposes)
• Interoperability across Hadoop versions
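Retention, as described, reduces to selecting date-partitioned paths older than a cutoff and purging them. A minimal sketch, assuming partitions are tracked as path-to-date pairs (the representation is an assumption, not from the slides):

```python
from datetime import date, timedelta


def partitions_to_purge(partitions, retention_days, today=None):
    """Return partition paths older than the retention window.

    partitions: dict mapping partition path -> partition date.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=retention_days)
    return sorted(path for path, d in partitions.items() if d < cutoff)
```

A retention service would run this periodically per feed and delete the returned paths from HDFS, satisfying both the compliance and capacity goals above.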
Thanks!