Hadoop AWS infrastructure cost evaluation



How do you calculate the cost of a Hadoop infrastructure on Amazon AWS, given data volume estimates and a rough use case?
This presentation compares the different options available on AWS.

Published in: Technology
  • On-demand instances are the most flexible option, but also the most expensive. With spot instances, you specify the maximum price you'll pay for an instance, and if capacity is available at that price, you get the instance. If you're outbid, your instance can be terminated. If you have large jobs that don't need to complete by a specific time, you can use spot instances to run them when it's most economical.
  • Hadoop AWS infrastructure cost evaluation

    1. Hadoop Platform infrastructure cost evaluation
    2. Agenda
       • High level requirements
       • Cloud architecture
       • Major architecture components
       • Amazon AWS
       • Hadoop distributions
       • Capacity Planning
       • Amazon AWS – EMR
       • Hadoop distributions
       • On-premise hardware costs
       • Gotchas
    3. High Level Requirements
       • Build an analytical and BI platform for web log analytics
       • Ingest multiple data sources:
         • Log data
         • Internal user data
       • Apply complex business rules
         • Manage events, filter crawler-driven logs, apply industry- and domain-specific rules
       • Populate/export to a BI tool for visualization
    4. Non-Functional Requirements
       • Today's baseline: ~42 TB per year (~3.5 TB raw data per month), stored for 3 years
       • SLA: data should be processed every day; currently this is done once a month
       • Predefined processing via Hive; no exploratory analysis
       • Everything in the cloud:
         • Store (HDFS), compute (M/R), analysis (BI tool)
    5. Non-Functional Requirements [2]
       • Seed S3 with 3 years' worth of data
       • Add only net-new data each month
       • Speed is not of primary importance
    6. Data Estimates for Capacity planning [2]
       • Cleaned-up log data per year: 42 TB (3 years = 126 TB)
       • Total disk space required should consider:
         • Compression (LZO, 40%): reduces disk space required to ~25 TB *
         • Replication factor of 3: ~75 TB
         • 75% maximum disk utilization in Hadoop: ~100 TB
       • Total disk capacity required for DN: ~100 TB/year (17.5 TB/mo)
       • (* Disclaimer: depends on codec and data input)
    7. Data Estimates for Capacity planning: reduced logs
       Data      Expected log data   After compression   Replication   70% disk utilization
       volume    volume (TB)         (Gzip 40%)          on 3 nodes    maximum (TB)
       1 month   3.6                 2.16                6.5           9.2
       1 year    42                  25                  75            107
       3 years   126                 75.6                226           322
       • Total disk capacity required for DN: ~10 TB/month
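The sizing chain used on slides 6 and 7 (compression, then replication, then a disk-utilization ceiling) can be sketched in Python. The function and parameter names are illustrative assumptions and do not come from the deck; the 40% compression saving, replication factor of 3, and 70%/75% utilization ceilings are the slides' own figures. Computed values can differ slightly from the table where the slide rounds intermediate results (for example, about 108 TB versus the slide's 107 TB for one year).

```python
def required_disk_tb(raw_tb, compression_saving=0.40, replication=3, max_utilization=0.75):
    """Return (compressed, replicated, provisioned) sizes in TB.

    compression_saving: fraction of space saved by the codec (0.40 on the slides).
    replication:        HDFS replication factor.
    max_utilization:    fraction of raw disk the cluster is allowed to fill.
    """
    compressed = raw_tb * (1.0 - compression_saving)
    replicated = compressed * replication
    provisioned = replicated / max_utilization
    return compressed, replicated, provisioned


if __name__ == "__main__":
    # Rows of the slide 7 table, which uses a 70% utilization ceiling.
    for label, raw in (("1 month", 3.6), ("1 year", 42.0), ("3 years", 126.0)):
        c, r, p = required_disk_tb(raw, max_utilization=0.70)
        print(f"{label:8s} raw={raw:6.1f} compressed={c:6.2f} "
              f"replicated={r:6.1f} provisioned={p:6.1f}")
```

Keeping the three factors as separate parameters makes it easy to see how sensitive the provisioned capacity is to the codec, which the slide itself flags as data-dependent.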
    8. Cloud Solution Architecture
       [Diagram: client logs from webservers, metadata extraction, S3, Hadoop on Amazon AWS (HDFS, M/R, Hive tables), BI tool, user]
       1. Copy data to S3
       2. Export data to HDFS
       3. Process in M/R
       4. Display in BI tool
       5. Retain results in S3
    9. Hadoop on AWS: EC2
       • Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud.
       • Manual setup of Hadoop on EC2
       • Use EBS for storage capacity (HDFS)
       • Storage on S3
    10. Running Hadoop on AWS: EC2
        • EC2 instance options
          • Choose instance type
          • Choose instance type availability
          • Choose instance family
        • Choose where the data resides:
          • S3: high latency, but highly available
          • EBS
            • Permanent storage?
            • Snapshots to S3?
        • Apache Whirr for setup
    11. Amazon EC2 – Instance features
        • Other choices:
          • EBS-optimized instances: dedicated throughput between Amazon EC2 and Amazon EBS, with options between 500 Mbps and 1000 Mbps depending on the instance type used.
          • Inter-region data transfer
          • Dedicated instances: run on single-tenant hardware dedicated to a single customer.
          • Spot instances: name your price
    12. Amazon Instance Families
        • Amazon EC2 instances are grouped into six families: general purpose, memory optimized, compute optimized, storage optimized, micro and GPU.
        • General-purpose instances have memory-to-CPU ratios suitable for most general-purpose apps.
        • Memory-optimized instances offer larger memory sizes for high-throughput applications.
        • Compute-optimized instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.
        • Storage-optimized instances are optimized for very high random I/O performance, or for very high storage density, low storage cost, and high sequential I/O performance.
        • Micro instances provide a small amount of CPU with the ability to burst to higher amounts for brief periods.
        • GPU instances, for dynamic applications.
        • Callout on slide: Data nodes
    13. Amazon instance types availability
        • On-Demand Instances: let you pay for compute capacity by the hour with no long-term commitments. This frees you from the costs and complexities of planning, purchasing, and maintaining hardware.
        • Reserved Instances: give you the option to make a one-time payment for each instance you want to reserve and in turn receive a discount on the hourly charge for that instance. There are three Reserved Instance types (Light, Medium, and Heavy Utilization Reserved Instances) that let you balance the amount you pay upfront against your effective hourly price.
        • Spot Instances: allow customers to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand, and customers whose bids meet or exceed it gain access to the available Spot Instances. If you have flexibility in when your applications can run, Spot Instances can significantly lower your Amazon EC2 costs.
    14. Amazon EC2 – Storage
    15. Amazon EC2 – Instance types
        • Callouts on slide: Data nodes, BI instances, Master nodes
    16. Systems Architecture – EC2
        [Diagram: client logs, S3, AWS Hadoop cluster (NN, SN, DNs, EN) with HDFS on EBS drives, BI nodes]
        • Hadoop cluster is initiated when analytics is run
        • Data is streamed from S3 to EBS volumes
        • Results from analytics are stored to S3 once computed
        • BI nodes are permanent
    17. Hadoop on AWS: EC2
        • Probably not the best choice:
          • EBS volumes make the solution costly
          • With instance storage instead, the available EC2 instances are either too small (a few GB) or too big (48 TB per instance)
          • Don't need the flexibility; just want to use Hive
    18. Hadoop on AWS: EMR
        • Amazon Elastic MapReduce (EMR) is a web service that provides a hosted Hadoop framework running on EC2 and Amazon Simple Storage Service (S3).
    19. Running Hadoop on AWS - EMR
        • Elastic MapReduce
        • For occasional jobs: ephemeral clusters
        • Ease of use, but 20% costlier
        • Data stored in S3; highly tuned for S3 storage
        • Hive and Pig available
        • Only pay for S3 plus instance time while jobs are running
        • Or: leave it always on
    20. Hadoop on AWS - EMR
        • EC2 instances with Amazon's own flavor of Hadoop
        • Amazon's Apache Hadoop is version 1.0.3. You can also choose MapR M3 or M5 (version 0.20.205).
        • You can run Hive (0.7.1 or 0.8.1), custom JAR, Streaming, Pig or HBase.
    21. Systems Architecture – EMR
        [Diagram: client logs, S3, AWS Hadoop EMR cluster (NN, SN, DNs) with HDFS loaded from S3, BI instances]
        • Hadoop cluster is created elastically
        • Data is streamed from S3 to initiate the Hadoop cluster dynamically
        • Results from analytics are stored to S3 once computed
        • BI nodes are permanent
    22. Amazon EMR – Instance types
        • Callouts on slide: Data nodes, BI instances, Master nodes
    23. AWS calculator – EMR calculation
        • Calculate and add:
          • S3 cost (seeded data)
          • Incremental S3 cost, per month
          • EC2 cost
          • EMR cost
          • In/out data transfer cost
          • Amazon support cost
          • Infrastructure support engineer cost
    24. AWS calculator – EMR calculation
        • Say for 24 hrs/day, EMR cost:
    25. AWS calculator – EMR calculation
        • Say for 24 hrs/day, 3-year S3:
    26. AWS calculator – EMR calculation
        • Say for 24 hrs/day, 3-year EC2:
    27. Amazon EMR Pricing – Reduced log volume
        Instance types: 10 instances. Data nodes: m1.xlarge; NN: m2.2xlarge; BI: m2.2xlarge; load balancer: t1.micro; 1-year reserved; 10 EMR instances (subject to change depending on actual load)
        Data volume (in year)          Price/year, running        Price/year, running       Price/year, running
                                       24 hours/day               8 hours/day               8 hours/week
        1 year storing 42 TB on S3     $14.1k/mo * 12 = $169.2k   $8.9k * 12 = $106k        $6.6k * 12 = $79.2k
        3 years storing 126 TB on S3   $19.5k * 36 mos = $684k    $15.5k * 36 mos = $558k   $13.2k * 36 mos = $475k
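The totals on slide 27 are a monthly cost multiplied by the number of months, which is easy to re-check. The sketch below is illustrative only: the row labels, names, and the $1k tolerance are assumptions, the last printed total is taken to mean $475k, and all figures are in thousands of dollars as printed on the slide. Run against the slide's numbers, it flags the 3-year, 24 hours/day entry, where $19.5k * 36 works out to $702k rather than the printed $684k.

```python
# (label, monthly cost in $k, months, total in $k as printed on the slide)
SLIDE_ROWS = [
    ("1 year, 24 h/day", 14.1, 12, 169.2),
    ("1 year, 8 h/day", 8.9, 12, 106.0),
    ("1 year, 8 h/week", 6.6, 12, 79.2),
    ("3 years, 24 h/day", 19.5, 36, 684.0),
    ("3 years, 8 h/day", 15.5, 36, 558.0),
    ("3 years, 8 h/week", 13.2, 36, 475.0),
]


def period_total_k(monthly_k, months):
    """Total cost in $k for a period, given a flat monthly cost in $k."""
    return monthly_k * months


def mismatches(rows, tolerance_k=1.0):
    """Labels of rows whose printed total differs from monthly * months
    by more than tolerance_k (a hypothetical rounding allowance)."""
    return [
        label
        for label, monthly_k, months, printed_k in rows
        if abs(period_total_k(monthly_k, months) - printed_k) > tolerance_k
    ]


if __name__ == "__main__":
    for label, monthly_k, months, printed_k in SLIDE_ROWS:
        computed = period_total_k(monthly_k, months)
        print(f"{label:18s} computed=${computed:6.1f}k printed=${printed_k:6.1f}k")
    print("Rows to double-check:", mismatches(SLIDE_ROWS))
```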
    28. Hadoop on AWS: trade-offs
        • Ease of use
          • EC2: Hard; IT Ops costs
          • EMR: Easy; Hadoop clusters can be of any size; can have multiple clusters
        • Cost
          • EC2: Cheaper
          • EMR: Costlier: pay for EC2 + EMR
        • Flexibility
          • EC2: Better: access to full stack of Hadoop ecosystem
        • Portability
          • EC2: Easier to move to dedicated hardware
        • Speed
          • EC2: Faster
          • EMR: Lower performance: all data is streamed from S3 for each job
        • Maintainability
          • EC2: Can choose any vendor; can be updated to latest version
          • EMR: Debugging tricky: cluster terminated, no logs
        • On-demand Hadoop cluster: ease of use; Hadoop installed, but with limited options
    29. EC2 Pricing Gotchas
        • EMR with Spot instances seems to be the trend for minimal cost, if SLA timeliness is not of primary importance.
        • Use reserved instances to bring down cost drastically (60%).
        • Compression on S3?
        • Need to account for secondary NN?
        • Ability to estimate better how many EMR nodes are needed with AWS's AMI task configuration
    30. EMR Technical Gotchas
        • Transferring data between S3 and EMR clusters is very fast (and free), so long as your S3 bucket and Hadoop cluster are in the same Amazon region
        • EMR's S3 File System streams data directly to S3 instead of buffering to intermediate local files.
        • EMR's S3 File System adds Multipart Upload, which splits your writes into smaller chunks and uploads them in parallel.
        • Store fewer, larger files instead of many smaller ones
        • http://blog.mortardata.com/post/58920122308/s3-hadoop-performance
    31. In house Hadoop cluster
        Data volume (in year)   Storage for data nodes   Instances                 Price, first year
        126 TB                  6*12x2TB                 10 data nodes, 3 master   $10.6k * 6 DN + $7.3k * 3 = $128k
                                                                                   + Vendor support ($50k) + full-time person ($150k) = $328k
                                                         BI: 4 nodes               $43k
        • Dell PowerEdge R720:
          • Processor: E5-2640 2.50GHz, 8 cores, 12M cache, Turbo
          • Memory: 64GB, quad-ranked RDIMM for 2 processors, low volt
          • Hard drives: 12 x 2TB 7.2K RPM SATA 3.5in hot-plug hard drives
          • Network card: Intel 82599 dual-port 10GE mezzanine card
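The first-year arithmetic on slide 31 can be reproduced with a small Python sketch. The function and parameter names are assumptions made for illustration; the unit prices, support cost, and staffing cost are the slide's figures in thousands of dollars. The slide mentions both "10 data nodes" and "$10.6k * 6 DN"; the printed $128k hardware subtotal corresponds to 10 data nodes (about $127.9k), while 6 data nodes would give about $85.5k. The $43k for BI nodes is left out here, as it is in the slide's $328k total.

```python
def first_year_cost_k(data_nodes, master_nodes,
                      dn_price_k=10.6, master_price_k=7.3,
                      vendor_support_k=50.0, staff_k=150.0):
    """Return (hardware subtotal, first-year total) in $k.

    Defaults are the figures printed on the slide; BI nodes are excluded.
    """
    hardware_k = data_nodes * dn_price_k + master_nodes * master_price_k
    total_k = hardware_k + vendor_support_k + staff_k
    return hardware_k, total_k


if __name__ == "__main__":
    for dn in (6, 10):
        hardware_k, total_k = first_year_cost_k(data_nodes=dn, master_nodes=3)
        print(f"{dn:2d} data nodes: hardware=${hardware_k:.1f}k, first year=${total_k:.1f}k")
```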
    32. Licensing and support costs
    33. Hadoop Distributions:
        • Cloudera or Hortonworks
        • Enterprise 24x7 production support: phone and support portal access (support datasheet attached)
        • Minimum $50k
    34. Amazon – Support EC2 & EMR
        • Business
          • Response time: 1 hour
          • Access: phone, chat and email, 24/7
          • Costs: greater of $100 or
            • 10% of monthly AWS usage for the first $0–$10K
            • 7% of monthly AWS usage from $10K–$80K
            • 5% of monthly AWS usage from $80K–$250K
            • 3% of monthly AWS usage from $250K+
            (about $800/yr)
        • Enterprise
          • Response time: 15 minutes
          • Access: phone, chat, TAM and email, 24/7
          • Costs: greater of $15,000 or
            • 10% of monthly AWS usage for the first $0–$150K
            • 7% of monthly AWS usage from $150K–$500K
            • 5% of monthly AWS usage from $500K–$1M
            • 3% of monthly AWS usage from $1M+
        • http://aws.amazon.com/premiumsupport/
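The tiered support pricing on slide 34 is a marginal-rate schedule with a floor, which a short Python sketch makes concrete. The names below are illustrative assumptions, and the sketch assumes the minimum fee applies per month, like the usage tiers; the tier boundaries and rates are the ones listed on the slide and may not match current AWS pricing.

```python
INF = float("inf")

# (minimum fee in $, [(upper bound of tier in $, marginal rate), ...])
BUSINESS = (100.0, [(10_000, 0.10), (80_000, 0.07), (250_000, 0.05), (INF, 0.03)])
ENTERPRISE = (15_000.0, [(150_000, 0.10), (500_000, 0.07), (1_000_000, 0.05), (INF, 0.03)])


def monthly_support_fee(monthly_usage, plan):
    """Support fee in $ for one month: the greater of the plan minimum
    and the sum of each tier's rate applied to the usage inside that tier."""
    minimum, tiers = plan
    fee = 0.0
    lower = 0.0
    for upper, rate in tiers:
        if monthly_usage > lower:
            fee += (min(monthly_usage, upper) - lower) * rate
        lower = upper
    return max(minimum, fee)


if __name__ == "__main__":
    usage = 14_100.0  # the $14.1k/month figure from the EMR pricing slide
    print(f"Business:   ${monthly_support_fee(usage, BUSINESS):,.2f}/month")
    print(f"Enterprise: ${monthly_support_fee(usage, ENTERPRISE):,.2f}/month")
```

At that usage level the Business fee comes from the first two tiers, while the Enterprise plan is dominated by its minimum.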
    35. Thank You