Optimizing Bursty Hadoop on AWS - Big Data Cloud - June 3rd Meetup

Presentation by Paul Baclace

Published in: Technology, Business

Transcript

  • 1. Big Data Cloud Meetup: Big Data & Cloud Computing - Help, Educate & Demystify. June 3rd 2011
  • 2. Optimizing Bursty Hadoop
    • Who I am: Paul Baclace
    • Hadoop/Nutch work:
    • 2005-2006 Internet Archive with Doug Cutting
    • 2008-2010 AT&T Interactive
    • 2010-present Euclid Elements, Yoterra
    • Contributed patches to Hadoop/Nutch
    Optimizing Bursty Hadoop on AWS June 3rd 2011 Meetup
  • 3. Options
    • Storage on S3, EBS, local disk
    • Latencies, Prices, and stretchy clusters
    • Amazon Elastic-MapReduce and customized EC2
  • 4. Goals
    • Optimize bursty Hadoop analysis demands
    • Optimize testing demands
  • 5. Logical Information Flow
    • DataSource --> CloudStorage --> MapReduce --> CloudStorage --> Reports
  • 6. Variable Cost Factors
    • Storage, GB per-month
    • Access, IO operations
    • Latency (human attention)
    • Compute Cores
  • 7. Price Insensitive, Permanent EC2 Solution:
    • HDFS(local_disk)-->EC2nodes-->HDFS(local_disk)
    • Local disks only (fast access, low latency)
    • All data lost if master node terminates
    • Difficult to migrate to new machines
    • Cluster start/stop latency
  • 8. Elastic-MapReduce Solution
    • Keep data in S3 and run EMR jobs
    • S3-->Elastic-MapReduce-->S3
    • Cluster start/stop latency
    • S3 data load time: 5-10 min for 2 GB in 1,500 parts
    • S3 data store time: 1 hour
    • Writing to S3 is about 5X slower than reading
  • 9. EBS HDFS Solution
    • HDFS(EBS)-->EC2nodes-->HDFS(EBS)
    • Cluster start/stop latency
      • in standby mode with minimal nodes, no waiting
      • no waiting after map-reduce job finishes
    • Keep a minimal standby HDFS Cluster for HDFS queries and low cost testing
  • 10. EBS HDFS Solution (2)
    • Can be shut down and resumed if the OS volume is also on EBS
    • Data blocks live on EBS networked storage
    • Task-only nodes need no EBS storage and can be added while a job is running
  • 11. Performance
    • Typical Performance of EBS and S3
    • EBS write latency: 5-25 ms
    • EBS read rate: 65 MB/sec
    • EBS write rate: 21 MB/sec
    • S3 write latency: 400 ms
    • S3 read rate: 15 MB/sec
    • S3 write rate: 1.5 MB/sec
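To get a feel for what the throughput figures on the Performance slide mean for a bursty job, the rough sketch below converts them into transfer times. The function name `transfer_seconds`, the `streams` parameter, and the use of decimal gigabytes are assumptions for illustration; the slides give no data on how throughput scales across parallel streams, so the linear scaling here is an idealization.

```python
# Throughput figures quoted on the Performance slide, in MB/sec.
RATES_MB_PER_SEC = {
    ("EBS", "read"): 65.0,
    ("EBS", "write"): 21.0,
    ("S3", "read"): 15.0,
    ("S3", "write"): 1.5,
}


def transfer_seconds(size_gb, store, op, streams=1):
    """Estimate transfer time in seconds.

    Assumes each of `streams` parallel streams sustains the quoted rate
    (an idealization, not something the slides measure).
    """
    size_mb = size_gb * 1000.0  # decimal GB, an assumption
    return size_mb / (RATES_MB_PER_SEC[(store, op)] * streams)


if __name__ == "__main__":
    for store, op in RATES_MB_PER_SEC:
        hours = transfer_seconds(100, store, op) / 3600
        print(f"{store} {op}: {hours:.2f} h for 100 GB on a single stream")
```

On a single stream the S3 write rate dominates everything else, which is one reason the EBS-backed HDFS option avoids a final store step to S3.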
  • 12. Results: 4 jobs per month, 100GB
                        StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
    Perm_EC2_Cluster             0    5.525         0.000           0.00000     3978.0000
    S3_and_EMR                1200   14.000         0.000           6.35375       39.4150
    EBS_HDFS_EC2_Tasks           0   20.000         0.425           5.67880       42.7152
  • 13. Results: 40 jobs per month, 100GB
                        StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
    Perm_EC2_Cluster             0    5.525         0.000           0.00000      3978.000
    S3_and_EMR                1200   14.000         0.000           6.35375       268.150
    EBS_HDFS_EC2_Tasks           0   20.000         0.425           5.67880       247.152
  • 14. Results: 4 jobs per month, 1TB
                        StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
    Perm_EC2_Cluster             0    5.525         0.000           0.00000      3978.000
    S3_and_EMR                1200  140.000         0.000           6.35375       165.415
    EBS_HDFS_EC2_Tasks           0  200.000         0.425           7.06300       228.252
  • 15. Results: 40 jobs per month, 1TB
                        StartupSec  CostOff  StandbyPerHr  CostPerMapReduce  CostPerMonth
    Perm_EC2_Cluster             0    5.525         0.000           0.00000       3978.00
    S3_and_EMR                1200  140.000         0.000           6.35375        394.15
    EBS_HDFS_EC2_Tasks           0  200.000         0.425           7.06300        482.52
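The slides do not state how CostPerMonth is derived, but the four Results tables are numerically consistent with a simple model: for the two bursty options, CostPerMonth = CostOff + jobs x CostPerMapReduce, and for the permanent cluster, 5.525 x 720 hours. The sketch below reproduces the table values under that inferred model. The function names and the 720-hour month are assumptions, and StandbyPerHr does not appear to enter the published monthly totals, so it is left out here as well.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 h; inferred from 5.525 * 720 = 3978


def bursty_monthly_cost(cost_off, cost_per_job, jobs):
    """Monthly cost for S3_and_EMR or EBS_HDFS_EC2_Tasks (inferred model)."""
    return cost_off + jobs * cost_per_job


def permanent_monthly_cost(cost_per_hour):
    """Monthly cost for a cluster that runs around the clock."""
    return cost_per_hour * HOURS_PER_MONTH


def break_even_jobs(off_a, per_job_a, off_b, per_job_b):
    """Jobs per month at which options A and B cost the same.

    Returns None when the lines never cross for a positive job count.
    """
    if per_job_a == per_job_b:
        return None
    jobs = (off_b - off_a) / (per_job_a - per_job_b)
    return jobs if jobs > 0 else None


if __name__ == "__main__":
    # 100 GB rows from the Results tables.
    print("S3_and_EMR, 4 jobs:", round(bursty_monthly_cost(14.0, 6.35375, 4), 4))
    print("EBS_HDFS, 40 jobs:", round(bursty_monthly_cost(20.0, 5.6788, 40), 4))
    print("Perm_EC2_Cluster:", round(permanent_monthly_cost(5.525), 2))
    print("Break-even (100 GB):", break_even_jobs(14.0, 6.35375, 20.0, 5.6788))
```

Under this model the EBS option overtakes S3 plus EMR at roughly nine jobs per month for 100GB, which matches the tables: S3_and_EMR is cheaper at 4 jobs and EBS_HDFS_EC2_Tasks is cheaper at 40. For 1TB the EBS option has both the higher CostOff and the higher per-job cost, so there is no crossover.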
  • 16. EC2 Hadoop Set Up Tips
    • Put CDH distribution on custom AMI for task-only nodes
    • Use Whirr from Cloudera
    • One security group for HDFS Cluster
    • Temporary tasktracker nodes should be excluded from serving blocks
  • 17. EC2 Hadoop Set Up Tips (2)
    • For large map-reduce, spawn tasktracker-only nodes
    • Spot check with nmon to determine whether machines are limited by disk, network, or CPU
    • All resources must be in the same availability zone
    • Billing is rounded up to whole hours, so size jobs to finish in just under N hours of runtime
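The hourly billing tip can be illustrated with a small calculation. This is a minimal sketch assuming every node is billed for each started hour; the function name `billed_cost` is hypothetical, and the 0.425 per node-hour price is borrowed from the StandbyPerHr column of the Results slides purely as an example figure.

```python
import math


def billed_cost(runtime_hours, nodes, price_per_node_hour):
    """Cost when each node is charged for every started hour."""
    return math.ceil(runtime_hours) * nodes * price_per_node_hour


if __name__ == "__main__":
    price = 0.425  # example per-node hourly price (assumption)
    # Same 10-node cluster: finishing just over vs. just under one hour.
    over = billed_cost(1.05, 10, price)
    under = billed_cost(0.95, 10, price)
    print(f"1.05 h run: ${over:.2f}")
    print(f"0.95 h run: ${under:.2f}")
```

A job that overruns the hour boundary by a few minutes is billed for two full hours on every node, so trimming the runtime or adding a few task-only nodes to stay under the boundary can halve the bill in this example.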