Where to Deploy Hadoop: Bare Metal or Cloud?


Deciding on the deployment model is critical when enterprises adopt Hadoop. Initially, the bare-metal model (an on-premise cluster of physical servers) was popular because it avoids the I/O overhead of virtualized environments. These days, however, the cloud is also a contender, with compelling cost savings and ease of operation. To aid in assessing the deployment options, Accenture Technology Labs developed the Accenture Data Platform Benchmark suite and a total cost of ownership (TCO) model, and tuned and compared the performance of bare-metal Hadoop clusters and a Hadoop cloud service. Interestingly, the study found that the price/performance ratio is not a critical factor in making a Hadoop deployment decision. Employing empirical and systemic analyses, the study found comparable price/performance ratios for bare-metal Hadoop clusters and Hadoop-as-a-service. Moreover, in many cases the cheaper cloud purchasing options (e.g., long-term contracts) provide a better ratio than bare metal. This result debunks the idea that the cloud is not suitable for Hadoop MapReduce workloads because of their heavy I/O requirements. Furthermore, the study finds that the Hadoop default configuration leaves ample headroom for performance tuning, and that cloud infrastructure opens up even further tuning opportunities.

  • Introduction: Michael Wendt, R&D Developer in the Data Insights R&D Group at ATL.
    Accenture Technology Labs: the forward-looking R&D group of Accenture, in San Jose and 4 other locations globally.
    When enterprises decide to adopt Hadoop, they are faced with having to answer the question: Where to deploy Hadoop: Bare-metal or Cloud?
  • Four main deployment models for businesses:
    - On-premise full custom: purchase commodity hardware, install software and operate it themselves -> gives businesses full control of the Hadoop cluster.
    - Hadoop appliance: preconfigured Hadoop cluster -> bypass detailed technical configuration and jumpstart data analysis.
    Transitioning outside of the corporation…
    - Hadoop hosting: similar to ISP model -> rely on a service provider to deploy and operate Hadoop clusters.
    - Hadoop-as-a-Service: instant access to Hadoop clusters, pay-per-use consumption model -> providing greater business agility.
    Deciding which deployment model is appropriate depends on the five key areas below (BM = bare-metal, CL = cloud):
    - Price-Performance Ratio: with a limited budget, how can we get the biggest ROI?
      -- BM: requires a larger upfront investment, limiting scale
      -- CL: can scale with demand
    - Data Privacy: concerns with corporate data
      -- BM: security, contains all data in-house
      -- CL: need for comprehensive cloud-data privacy strategy
    - Data Gravity: once data volume grows, physical migration becomes slow -> locked into current platform
      -- need to consider portability, future growth and location of data
    - Data Enrichment: leveraging multiple datasets to uncover new insights, determining where to host, co-locate data
    - Productivity: ability to test ideas, "sandbox", deploy to production
      -- CL: advantage for deploying test clusters
    For this study we focus on the extreme ends of the spectrum: On-premise & HaaS.
    Dive deeper into Price-Performance Ratio.
  • Price-Performance Ratio has two divergent views for Hadoop:
    1. A virtualized Hadoop cluster is slower because Hadoop's workload has intensive I/O operations.
    2. The cloud-based model provides compelling cost savings: nodes are less expensive, and Hadoop is horizontally scalable.
  • In the Hadoop Deployment Comparison Study, we compare the price-performance ratio of a bare-metal Hadoop cluster with Hadoop-as-a-service at the matched total cost of ownership (TCO) level, using real-world applications modeled by the Accenture Data Platform Benchmark.
  • Let’s first take a look at the TCO analysis
  • *3 times replication factor
    Server hardware: depreciation accounted for over 3 years; full details in white paper
    Data center: tier-3 data center, 10,000 sq. ft; full details in white paper
    Tech support: third-party vendors
    Staff: 3 full-time employees
  • Staff: one full-time employee; reduced need
    Tech support: AWS Premium Support
    Different needs based on cloud environment; no need for data center
    Storage services: Amazon S3
    No need for servers, only virtual instances of Hadoop service: Amazon EMR
    Subtracted from budget to determine number of affordable instances
    Calculated the
  • Time and cost prohibitive to test all 42 combinations.
    Selected these three instance types since they were the largest of their respective instance families.
  • Assumed 50% utilization
  • Now let’s look at the Accenture Data Platform Benchmark
  • Sessionization: Constructing sessions from raw log data. One of several prerequisite steps for log analysis use cases (individual website optimization, infrastructure optimization, security analytics, etc.).
  • Filtering algorithms are basic and simple, yet widely used.
  • *3 TB compressed
  • Experiment setup, how did everything come together?
  • Let’s switch gears…
    8x improvement relative to default parameter settings.
    Each iteration took about ½ - 1 full day, including performance analysis, tuning, and execution.
    The merit of Starfish is to achieve performance increases with much less cost than manual tuning.
  • Executive summary available in limited quantities

Transcript

  • 1. Where to Deploy Hadoop: Bare-metal or Cloud? Michael Wendt, Sewook Wee Data Insights R&D Group
  • 2. Copyright © 2013 Accenture All rights reserved. Big Data: Bare-metal vs. Cloud
    Deployment models, from bare-metal to cloud: On-premise full custom, Hadoop Appliance, Hadoop Hosting, Hadoop-as-a-Service
  • 3. Big Data: Bare-metal vs. Cloud
    Deployment models, from bare-metal to cloud: On-premise full custom, Hadoop Appliance, Hadoop Hosting, Hadoop-as-a-Service
    Key areas: Data Privacy, Data Gravity, Price-Performance Ratio, Productivity of Developers & Data Scientists, Data Enrichment
  • 4. Big Data: Bare-metal vs. Cloud
    Deployment models, from bare-metal to cloud: On-premise full custom, Hadoop Appliance, Hadoop Hosting, Hadoop-as-a-Service
    Key areas: Data Privacy, Data Gravity, Price-Performance Ratio, Productivity of Developers & Data Scientists, Data Enrichment
  • 5. Price-Performance Ratio Views
    On-premise full custom (bare-metal): "Cloud? Virtualized? Slow!"
    Hadoop-as-a-Service (cloud): "Who cares! I’m cheap, just throw more in!"
    Servers designed by Daniel Campos from The Noun Project
  • 6. Hadoop Deployment Comparison Study
    Price-Performance Ratio: On-premise full custom (bare-metal) vs. Hadoop-as-a-Service (cloud)
    Accenture Data Platform Benchmark + TCO analysis
  • 7. Hadoop Deployment Comparison Study: TCO Analysis
    Price-Performance Ratio: On-premise full custom (bare-metal) vs. Hadoop-as-a-Service (cloud)
    Accenture Data Platform Benchmark + TCO analysis
  • 8. TCO of Bare-metal Hadoop Cluster (On-premise full custom)
    24 server nodes and 50 TB of HDFS capacity*; small-scale initial production deployment
    Cost components: Server hardware, Staff for operation, Data center facility and electricity, Technical support
    Amounts shown: $3,000.00, $2,914.58, $6,656.00, $9,274.46; total $21,845.04
    Servers designed by Daniel Campos from The Noun Project
  • 9. TCO of Hadoop-as-a-Service
    Used bare-metal TCO for budget; calculated the number of affordable instances
    Cost components: Hadoop service, Staff for operation, Storage services, Technical support
    Amounts shown: $15,318.28, $2,063.00, $1,372.27, $3,091.49; total $21,845.04
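The matched-TCO budgeting on slides 8 and 9 can be checked with a little arithmetic. The sketch below is only an illustration: it uses the amounts exactly as they appear on the slides, without assuming which amount belongs to which cost component, and the variable names are made up for this example. It shows that the bare-metal amounts add up to the stated total, and that subtracting the three smaller cloud amounts from that total leaves the figure shown for the Hadoop service.

```python
from decimal import Decimal

# Amounts as shown on slide 8 (bare-metal); the per-component mapping is not assumed here.
bare_metal_amounts = [
    Decimal("3000.00"),
    Decimal("2914.58"),
    Decimal("6656.00"),
    Decimal("9274.46"),
]

# The bare-metal TCO becomes the budget for the cloud deployment.
budget = sum(bare_metal_amounts)

# The three smaller amounts on slide 9 (staff, storage, technical support, in some order).
cloud_non_service_amounts = [
    Decimal("2063.00"),
    Decimal("1372.27"),
    Decimal("3091.49"),
]

# Whatever is left after those costs can be spent on Hadoop service instances.
hadoop_service_budget = budget - sum(cloud_non_service_amounts)

print(f"Matched TCO budget:    ${budget:,.2f}")
print(f"Hadoop service budget: ${hadoop_service_budget:,.2f}")
```

Using `Decimal` rather than floats keeps the currency arithmetic exact, so the totals can be compared directly with the figures on the slides.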
  • 10. TCO of Hadoop-as-a-Service – Instances
    Hadoop service: 14 instance types, 3 pricing models, 42 combinations
  • 11. TCO of Hadoop-as-a-Service – Instances
    Selected 3 representative instance types: m1.xlarge (m1.xl), m2.4xlarge (m2.4xl), cc2.8xlarge (cc2.8xl)
  • 12. TCO of Hadoop-as-a-Service – Affordable Instances
    Hadoop service: $15,318.28
    50% cluster utilization assumed; 1/3 of budget allocated for Spot instances
    Instance type    On-demand instances (ODI)    Reserved instances (RI)    Reserved + Spot instances (RI + SI)
    m1.xlarge        68                           112                        192
    m2.4xlarge       20                           41                         77
    cc2.8xlarge      13                           28                         53
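The slides do not give the instance prices or the exact formula behind the affordable-instance counts, so the following is only a rough sketch of how such a count could be derived from a fixed budget. The function name, the hourly price, and the hours-per-period figure are all hypothetical; only the 50% utilization assumption comes from the slide.

```python
import math

def affordable_instances(budget, hourly_price, utilization=0.5, hours_in_period=730):
    """Return how many instances a budget can pay for over one billing period.

    All parameters are illustrative assumptions:
    - budget: money available for instances in the period
    - hourly_price: hypothetical price of one instance-hour
    - utilization: fraction of the period the cluster is running (the study assumed 50%)
    - hours_in_period: hypothetical length of the billing period in hours
    """
    billed_hours = hours_in_period * utilization
    cost_per_instance = hourly_price * billed_hours
    return math.floor(budget / cost_per_instance)

# Hypothetical example: $1,000 budget, $1.00 per instance-hour, 100-hour period.
example_count = affordable_instances(1000.0, 1.0, utilization=0.5, hours_in_period=100)
print(example_count)
```

Under this simple model, a lower effective hourly price (for instance through reserved or spot pricing) directly raises the number of instances the same budget can buy, which matches the trend in the table on slide 12.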
  • 13. Hadoop Deployment Comparison Study: Accenture Data Platform Benchmark
    Price-Performance Ratio: On-premise full custom (bare-metal) vs. Hadoop-as-a-Service (cloud)
    Accenture Data Platform Benchmark + TCO analysis
  • 14. Accenture Data Platform Benchmark
    Suite of real-world Hadoop MapReduce applications
    From client experience, internal roadmap, public literature; categorized & selected common use cases
    Open-source libraries & public datasets
    Use cases and workloads:
    - Log management: Sessionization
    - Customer preference prediction: Recommendation engine
    - Text analytics: Document clustering
  • 15. Accenture Data Platform Benchmark: Sessionization
    A session is a sequence of related interactions, useful to analyze as a group
    Steps: Log data -> Bucketing -> Sorting -> Slicing -> Sessions
    ~150 billion log entries, ~24 TB
    1 million users, 1.1 billion sessions
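The bucketing, sorting, and slicing steps named on the sessionization slide can be illustrated on a single machine. This is a minimal sketch, not the benchmark's MapReduce implementation: the `(user_id, timestamp)` record shape and the 30-minute inactivity gap used to slice sessions are assumptions made for the example.

```python
from collections import defaultdict

def sessionize(log_entries, gap_seconds=1800):
    """Group (user_id, timestamp) log entries into per-user sessions.

    gap_seconds is an assumed inactivity threshold: a new session starts
    whenever two consecutive entries of a user are further apart than this.
    """
    # Bucketing: collect each user's timestamps together.
    buckets = defaultdict(list)
    for user_id, timestamp in log_entries:
        buckets[user_id].append(timestamp)

    sessions = {}
    for user_id, timestamps in buckets.items():
        # Sorting: order each user's entries by time.
        timestamps.sort()
        # Slicing: cut the ordered entries wherever the gap is too large.
        user_sessions = [[timestamps[0]]]
        for timestamp in timestamps[1:]:
            if timestamp - user_sessions[-1][-1] > gap_seconds:
                user_sessions.append([timestamp])
            else:
                user_sessions[-1].append(timestamp)
        sessions[user_id] = user_sessions
    return sessions

example_log = [("a", 100), ("b", 50), ("a", 0), ("a", 5000)]
example_sessions = sessionize(example_log)
print(example_sessions)
```

In a MapReduce setting the bucketing and sorting would typically fall out of the shuffle phase (keyed by user), leaving the slicing to the reducer; the sketch keeps all three steps explicit for readability.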
  • 16. Accenture Data Platform Benchmark: Recommendation Engine
    Steps:
    - Ratings data: Who rated what item?
    - Co-occurrence matrix: How many people rated the pair of items?
    - Recommendation: Given the way the person rated these items, he/she is likely to be interested in these other items.
    Used item-based collaborative filtering algorithm
    Mahout example library used as foundation
    Generated 300 million ratings
    3 million population, 50,000 items
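To make the co-occurrence idea on the recommendation slide concrete, here is a small in-memory sketch. It is a simplification under stated assumptions, not the Mahout-based benchmark code: ratings are reduced to `(user, item)` pairs, rating values are ignored, and the function names are invented for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(ratings):
    """Count, for every ordered item pair, how many users rated both items."""
    items_by_user = defaultdict(set)
    for user, item in ratings:
        items_by_user[user].add(item)

    counts = Counter()
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts, items_by_user

def recommend(user, counts, items_by_user, top_n=3):
    """Score unseen items by how often they co-occur with the user's rated items."""
    seen = items_by_user.get(user, set())
    scores = Counter()
    for (a, b), n in counts.items():
        if a in seen and b not in seen:
            scores[b] += n
    return [item for item, _ in scores.most_common(top_n)]

example_ratings = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "A"),
]
counts, items_by_user = build_cooccurrence(example_ratings)
print(recommend("u3", counts, items_by_user))
```

Here user `u3` has only rated item `A`; item `B` was co-rated with `A` by two users and item `C` by one, so `B` is ranked ahead of `C`.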
  • 17. Accenture Data Platform Benchmark: Document Clustering
    Groups similar documents
    Steps: Corpus of crawled web pages -> Filtered and tokenized documents -> Term dictionary -> TF vectors -> TF-IDF vectors -> K-means -> Clustered documents
    Application components used in many areas (e.g., search engines, e-commerce site optimization)
    Common Crawl dataset, 10 TB corpus*
    ~31,000 ARC files or ~300 million HTML pages
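The step from TF vectors to TF-IDF vectors on the document clustering slide can be sketched in a few lines. This is an illustrative example only: it assumes documents are already tokenized, uses raw term counts for TF and the plain `log(N / df)` form for IDF (one common variant among several), and stops before the K-means step.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Turn tokenized documents into sparse TF-IDF vectors (dicts of term -> weight)."""
    n_docs = len(documents)
    # TF vectors: raw term counts per document.
    tf_vectors = [Counter(doc) for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tf in tf_vectors:
        doc_freq.update(tf.keys())
    # TF-IDF: down-weight terms that appear in many documents.
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in tf.items()}
        for tf in tf_vectors
    ]

example_docs = [["hadoop", "cloud", "cloud"], ["hadoop", "metal"]]
example_vectors = tf_idf(example_docs)
print(example_vectors)
```

In this toy corpus the term `hadoop` appears in every document, so its weight drops to zero, while `cloud` and `metal` keep positive weights and would drive any subsequent clustering.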
  • 18. Hadoop Deployment Comparison Study: Experiment Setup/Results
    Price-Performance Ratio: On-premise full custom (bare-metal) vs. Hadoop-as-a-Service (cloud)
    Accenture Data Platform Benchmark + TCO analysis
  • 19. Experiment Setup: Price-Performance Ratio Comparison
    Bare-metal Hadoop cluster vs. Amazon EMR clusters: 1 bare-metal cluster vs. 9 Amazon EMR clusters
    Manual and automated tuning
    Fixed budget for cluster size
    Measure execution time of benchmark
  • 20. Experiment Setup: Starfish Automated Performance Tuning Tool
    Starfish (now Unravel) is an automated performance tuning tool for MapReduce jobs
    Phases: Profile phase, Optimize phase
    For the experiment we ran each benchmark twice using Starfish
    Measure execution time of optimize phase
    Speedometer designed by Filippo Camedda from The Noun Project
  • 21. Experiment Results: Starfish Automated Performance Tuning Tool
    Starfish tuned: Recommendation Engine workload w/ 11 cascaded MapReduce jobs
    Manually tuned: Sessionization workload
    2+ weeks of manual tuning, ½ - 1 day iterations
    8x improvement in one tuning cycle
    Achieve performance increases with less cost using Starfish
  • 22. Experiment Results: Sessionization
    Chart: execution time (minutes) by Amazon EMR configuration (ODI, RI, RI+SI) for cc2.8xlarge, m2.4xlarge, m1.xlarge
    Plotted values: 408.07, 229.25, 125.82, 381.55, 204.10, 166.82, 250.13, 172.23, 114.35
    Bare-metal: 533
    Instance counts (cc2.8xlarge, m2.4xlarge, m1.xlarge): ODI 13, 20, 68; RI 28, 41, 112; RI+SI 53, 77, 192
  • 23. Experiment Results: Recommendation Engine
    Chart: execution time (minutes) by Amazon EMR configuration (ODI, RI, RI+SI) for cc2.8xlarge, m2.4xlarge, m1.xlarge
    Plotted values: 23.33, 21.97, 18.48, 20.13, 19.97, 16.92, 14.28, 16.30, 15.08
    Bare-metal: 21.59
    Instance counts (cc2.8xlarge, m2.4xlarge, m1.xlarge): ODI 13, 20, 68; RI 28, 41, 112; RI+SI 53, 77, 192
  • 24. Experiment Results: Document Clustering
    Chart: execution time (minutes) by Amazon EMR configuration (ODI, RI, RI+SI) for cc2.8xlarge, m2.4xlarge, m1.xlarge
    Plotted values: 1661.03, 1157.37, 784.82, 1649.98, 1112.68, 629.98, 914.35, 779.98, 742.38
    Bare-metal: 1186.37
    Instance counts (cc2.8xlarge, m2.4xlarge, m1.xlarge): ODI 13, 20, 68; RI 28, 41, 112; RI+SI 53, 77, 192
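Because every Amazon EMR cluster was sized to the same budget as the bare-metal cluster, comparing price-performance reduces to comparing execution times. The sketch below works through that comparison using the times from the three result slides, in the order they appear there; it deliberately does not assume which value belongs to which instance type or pricing model, and the helper name is made up for this example.

```python
# Execution times in minutes, copied from the result slides in slide order.
# Each entry: (bare-metal time, the nine Amazon EMR cluster times).
results = {
    "Sessionization": (
        533.0,
        [408.07, 229.25, 125.82, 381.55, 204.10, 166.82, 250.13, 172.23, 114.35],
    ),
    "Recommendation Engine": (
        21.59,
        [23.33, 21.97, 18.48, 20.13, 19.97, 16.92, 14.28, 16.30, 15.08],
    ),
    "Document Clustering": (
        1186.37,
        [1661.03, 1157.37, 784.82, 1649.98, 1112.68, 629.98, 914.35, 779.98, 742.38],
    ),
}

def summarize(bare_metal_minutes, emr_minutes):
    """Compare equal-budget EMR runs against the bare-metal baseline."""
    faster = [t for t in emr_minutes if t < bare_metal_minutes]
    return {
        "faster_than_bare_metal": len(faster),
        "clusters": len(emr_minutes),
        # Speedup of the fastest EMR cluster relative to bare-metal.
        "best_speedup": bare_metal_minutes / min(emr_minutes),
    }

summary = {name: summarize(bm, emr) for name, (bm, emr) in results.items()}
for name, s in summary.items():
    print(
        f"{name}: {s['faster_than_bare_metal']} of {s['clusters']} EMR clusters "
        f"beat bare-metal; best speedup {s['best_speedup']:.2f}x"
    )
```

At equal cost, a shorter execution time means a better price-performance ratio, so the counts printed here are a compact way to read the three charts.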
  • 25. Key Takeaways
    - Hadoop-as-a-Service offers a better price-performance ratio
    - Cloud expands the performance tuning opportunities
    - Automated performance tuning tools are a necessity
    Servers designed by Daniel Campos from The Noun Project
  • 26. Acknowledgement
  • 27. More details
    Contact us for the full white paper: Hadoop Deployment Comparison Study
    Michael Wendt, R&D Developer, Data Insights R&D, Accenture Technology Labs, (408) 817-2190, michael.e.wendt@accenture.com
    Scott Kurth, Group Lead, Data Insights R&D, Accenture Technology Labs, (408) 817-2775, scott.kurth@accenture.com