Optimizing Bursty Hadoop on AWS - Big Data Cloud - June 3rd Meetup

Presentation by Paul Baclace

Published in: Technology, Business

  1. Big Data Cloud Meetup: Big Data & Cloud Computing - Help, Educate & Demystify. June 3rd, 2011
  2. Optimizing Bursty Hadoop (Optimizing Bursty Hadoop on AWS, June 3rd 2011 Meetup)
     • Who I am: Paul Baclace
     • Hadoop/Nutch work:
         • 2005-2006 Internet Archive with Doug Cutting
         • 2008-2010 AT&T Interactive
         • 2010-present Euclid Elements, Yoterra
     • Contributed patches to Hadoop/Nutch
  3. Options
     • Storage on S3, EBS, local disk
     • Latencies, prices, and stretchy clusters
     • Amazon Elastic-MapReduce and customized EC2
  4. Goals
     • Optimize bursty Hadoop analysis demands
     • Optimize testing demands
  5. Logical Information Flow
     • DataSource--> CloudStorage--> MapReduce--> CloudStorage--> Reports
  6. Variable Cost Factors
     • Storage, GB per month
     • Access, IO operations
     • Latency (human attention)
     • Compute cores
  7. Price-Insensitive, Permanent EC2 Solution
     • HDFS(local_disk)-->EC2nodes-->HDFS(local_disk)
     • Local disks only (fast access, low latency)
     • All data lost if master node terminates
     • Difficult to migrate to new machines
     • Cluster start/stop latency
  8. Elastic-MapReduce Solution
     • Keep data in S3 and run EMR jobs
     • S3-->Elastic-MapReduce-->S3
     • Cluster start/stop latency
     • S3 data load time: 5-10 min for 2 GB in 1500 parts
     • S3 data store time: 1 hour
     • Rate of writing to S3 is about 5X slower than reading
  9. EBS HDFS Solution
     • HDFS(EBS)-->EC2nodes-->HDFS(EBS)
     • Cluster start/stop latency
         • in standby mode with minimal nodes, no waiting
         • no waiting after map-reduce job finishes
     • Keep a minimal standby HDFS cluster for HDFS queries and low-cost testing
  10. EBS HDFS Solution (2)
      • Can be shut down and resumed if the OS is also on EBS
      • Data blocks on EBS networked storage
      • Task-only nodes need no EBS storage and can be added while a job is running
  11. Performance: typical performance of EBS and S3
      • EBS: write latency: 5-25 msec
      • EBS: read rate: 65 MB/sec
      • EBS: write rate: 21 MB/sec
      • S3: write latency: 400 msec
      • S3: read rate: 15 MB/sec
      • S3: write rate: 1.5 MB/sec
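
As a rough illustration of what the rates on the Performance slide mean for the 100GB and 1TB data sizes used in the results, the sketch below converts a throughput in MB/sec into a transfer time. It assumes a single sequential stream at the quoted rate and decimal units (1 GB = 1000 MB). The function name and the rate table layout are illustrative and not from the talk; real jobs read and write in parallel across nodes, so actual times would likely be shorter.

```python
# Rates quoted on the Performance slide, in MB/sec.
RATES_MB_PER_SEC = {
    ("EBS", "read"): 65.0,
    ("EBS", "write"): 21.0,
    ("S3", "read"): 15.0,
    ("S3", "write"): 1.5,
}


def transfer_hours(size_gb, store, op):
    """Hours to move size_gb through ONE stream at the quoted rate.

    Assumption (not from the slides): 1 GB = 1000 MB and no parallelism.
    """
    seconds = size_gb * 1000.0 / RATES_MB_PER_SEC[(store, op)]
    return seconds / 3600.0


if __name__ == "__main__":
    for store, op in RATES_MB_PER_SEC:
        hours = transfer_hours(100, store, op)
        print(f"{store} {op}: 100 GB in about {hours:.1f} hours (single stream)")
```

Dividing the single-stream time by the number of nodes reading or writing concurrently gives a first-order estimate for a cluster, assuming the storage service does not become the bottleneck.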
  12. Results: 4 jobs per month, 100GB
      Configuration        StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
      Perm_EC2_Cluster              0    5.525         0.000           0.00000     3978.0000
      S3_and_EMR                 1200   14.000         0.000           6.35375       39.4150
      EBS_HDFS_EC2_Tasks            0   20.000         0.425           5.67880       42.7152
  13. Results: 40 jobs per month, 100GB
      Configuration        StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
      Perm_EC2_Cluster              0    5.525         0.000           0.00000      3978.000
      S3_and_EMR                 1200   14.000         0.000           6.35375       268.150
      EBS_HDFS_EC2_Tasks            0   20.000         0.425           5.67880       247.152
  14. Results: 4 jobs per month, 1TB
      Configuration        StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
      Perm_EC2_Cluster              0    5.525         0.000           0.00000      3978.000
      S3_and_EMR                 1200  140.000         0.000           6.35375       165.415
      EBS_HDFS_EC2_Tasks            0  200.000         0.425           7.06300       228.252
  15. Results: 40 jobs per month, 1TB
      Configuration        StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
      Perm_EC2_Cluster              0    5.525         0.000           0.00000       3978.00
      S3_and_EMR                 1200  140.000         0.000           6.35375        394.15
      EBS_HDFS_EC2_Tasks            0  200.000         0.425           7.06300        482.52
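
The CostPerMonth figures in the four results slides appear to follow simple arithmetic. For S3_and_EMR and EBS_HDFS_EC2_Tasks, the monthly cost matches CostOff plus jobs per month times CostPerMapReduce; for Perm_EC2_Cluster, it matches 5.525 per hour over a 720-hour month. This formula is inferred from the numbers and is not stated on the slides, and StandbyPerHr does not seem to be included in the EBS totals. A minimal Python sketch, with hypothetical function names, that recomputes the column:

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month, inferred from 5.525 * 720 = 3978


def on_demand_monthly(cost_off, cost_per_job, jobs_per_month):
    """Inferred model: cost with the cluster off, plus a per-job cost."""
    return cost_off + jobs_per_month * cost_per_job


def permanent_monthly(cost_per_hour):
    """Inferred model: an always-on cluster billed for every hour of the month."""
    return cost_per_hour * HOURS_PER_MONTH


# (configuration, data size, jobs/month, CostOff, CostPerMapReduce, CostPerMonth on slide)
SLIDE_ROWS = [
    ("S3_and_EMR", "100GB", 4, 14.0, 6.35375, 39.415),
    ("S3_and_EMR", "100GB", 40, 14.0, 6.35375, 268.15),
    ("EBS_HDFS_EC2_Tasks", "100GB", 4, 20.0, 5.6788, 42.7152),
    ("EBS_HDFS_EC2_Tasks", "100GB", 40, 20.0, 5.6788, 247.152),
    ("S3_and_EMR", "1TB", 4, 140.0, 6.35375, 165.415),
    ("S3_and_EMR", "1TB", 40, 140.0, 6.35375, 394.15),
    ("EBS_HDFS_EC2_Tasks", "1TB", 4, 200.0, 7.063, 228.252),
    ("EBS_HDFS_EC2_Tasks", "1TB", 40, 200.0, 7.063, 482.52),
]

if __name__ == "__main__":
    print(f"Perm_EC2_Cluster: {permanent_monthly(5.525):.3f} per month")
    for name, size, jobs, cost_off, per_job, on_slide in SLIDE_ROWS:
        computed = on_demand_monthly(cost_off, per_job, jobs)
        print(f"{name} {size} {jobs} jobs: computed {computed:.4f}, slide {on_slide}")
```

Under this reading, the per-job cost dominates once the job count rises, which is consistent with the two on-demand options ending up far below the permanent cluster in every scenario shown.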
  16. EC2 Hadoop Set Up Tips
      • Put CDH distribution on custom AMI for task-only nodes
      • Use Whirr from Cloudera
      • One security group for HDFS cluster
      • Temporary tasktracker nodes should be excluded from serving blocks
  17. EC2 Hadoop Set Up Tips (2)
      • For large map-reduce, spawn tasktracker-only nodes
      • Spot check with nmon to determine whether machines are limited by disk, network, or CPU
      • Resources must all be in the same availability zone
      • Billing is rounded up to hours, so provision jobs to take just under N hours of runtime
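
The billing tip on the last slide can be made concrete with a small calculation. The sketch below assumes per-node billing rounded up to whole hours, as the slide describes, plus two simplifications that are not from the talk: a hypothetical hourly price, and a job whose total work splits evenly across nodes. The function names are illustrative.

```python
import math


def billed_cost(runtime_hours, nodes, price_per_node_hour):
    """Cost when each node's runtime is rounded up to a whole hour."""
    return math.ceil(runtime_hours) * nodes * price_per_node_hour


def ideal_runtime(total_node_hours, nodes):
    """Idealized runtime, assuming the work divides evenly across nodes."""
    return total_node_hours / nodes


if __name__ == "__main__":
    price = 0.50  # hypothetical price per node-hour
    work = 19.0   # hypothetical total work, in node-hours
    for nodes in (9, 10, 18, 20):
        runtime = ideal_runtime(work, nodes)
        cost = billed_cost(runtime, nodes, price)
        print(f"{nodes} nodes: runtime {runtime:.2f} h, billed {cost:.2f}")
```

With these made-up numbers, 10 or 20 nodes finish just under an hour boundary and are billed for 20 node-hours, while 9 or 18 nodes run slightly past a boundary and are billed for 27 and 36 node-hours respectively.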
