Where to Deploy Hadoop:
Bare-metal or Cloud?
Michael Wendt, Sewook Wee
Data Insights R&D Group
Copyright © 2013 Accenture All rights reserved.

Deciding on a deployment model is critical when enterprises adopt Hadoop. Initially, the bare-metal model (an on-premise cluster of physical servers) was popular because it avoids the I/O overhead of virtualized environments. Today, however, the cloud is also a contender, with compelling cost savings and ease of operation. To help assess the deployment options, Accenture Technology Labs developed the Accenture Data Platform Benchmark suite and a total cost of ownership (TCO) model, then tuned and compared the performance of bare-metal Hadoop clusters and a Hadoop cloud service. Interestingly, the study found that the price/performance ratio is not a critical factor in making a Hadoop deployment decision: its empirical and systemic analyses yielded comparable price/performance ratios for bare-metal Hadoop clusters and Hadoop-as-a-Service. Moreover, cheaper purchasing options (e.g., long-term contracts) give the cloud a better ratio than bare metal in many cases. This result debunks the idea that the cloud is unsuitable for Hadoop MapReduce workloads because of their heavy I/O requirements. Furthermore, the study finds that the Hadoop default configuration leaves ample headroom for performance tuning, and that cloud infrastructure opens up even further tuning opportunities.
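
Because the cloud clusters in the study were sized to the same TCO as the bare-metal cluster, comparing execution times amounts to comparing price/performance. The following is a minimal Python sketch of that comparison, using the execution times (in minutes) shown on the experiment-results slides. The names `RESULTS` and `summarize` are illustrative, and the sketch deliberately does not assume which bar belongs to which instance type or pricing model, since the extracted slide text does not preserve that mapping.

```python
# Execution times in minutes, as shown on the experiment-results slides.
# Each workload has one bare-metal baseline and nine Amazon EMR clusters.
RESULTS = {
    "Sessionization": (533.0, [408.07, 229.25, 125.82, 381.55, 204.10,
                               166.82, 250.13, 172.23, 114.35]),
    "Recommendation Engine": (21.59, [23.33, 21.97, 18.48, 20.13, 19.97,
                                      16.92, 14.28, 16.30, 15.08]),
    "Document Clustering": (1186.37, [1661.03, 1157.37, 784.82, 1649.98,
                                      1112.68, 629.98, 914.35, 779.98,
                                      742.38]),
}


def summarize(bare_metal, emr_times):
    """Return (number of EMR clusters faster than bare metal, best speedup)."""
    faster = sum(1 for t in emr_times if t < bare_metal)
    best_speedup = bare_metal / min(emr_times)
    return faster, best_speedup


for workload, (bare_metal, emr_times) in RESULTS.items():
    faster, best = summarize(bare_metal, emr_times)
    print(f"{workload}: {faster}/{len(emr_times)} EMR clusters beat bare metal; "
          f"best speedup {best:.2f}x")
```

At equal cost, a shorter execution time means a better price/performance ratio, so the count of EMR clusters that beat the bare-metal baseline is one simple way to read the "in many cases" claim.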

Published in: Technology
  • Introduction – Michael Wendt, R&D Developer in Data Insights R&D Group at ATL. Accenture Technology Labs – the forward-looking R&D group of Accenture, in San Jose and 4 other locations globally. When enterprises decide to adopt Hadoop, they are faced with having to answer the question: Where to deploy Hadoop: Bare-metal or Cloud?
  • Four main deployment models for businesses:
    - On-premise full custom: purchase commodity hardware, install software and operate it themselves -> gives businesses full control of the Hadoop cluster.
    - Hadoop appliance: preconfigured Hadoop cluster -> bypass detailed technical configuration and jumpstart data analysis.
    Transitioning outside of the corporation…
    - Hadoop hosting: similar to ISP model -> rely on a service provider to deploy and operate Hadoop clusters.
    - Hadoop-as-a-Service: instant access to Hadoop clusters, pay-per-use consumption model -> providing greater business agility.
    Deciding which deployment model is appropriate depends on the five key areas below:
    - Price-Performance Ratio: with a limited budget, how can we get the biggest ROI?
      -- BM: requires a larger upfront investment, limiting scale
      -- CL: can scale with demand
    - Data Privacy: concerns with corporate data
      -- BM: security, contains all data in-house
      -- CL: need for comprehensive cloud-data privacy strategy
    - Data Gravity: once data volume grows, physical migration becomes slow -> locked into current platform
      -- need to consider portability, future growth and location of data
    - Data Enrichment: leveraging multiple datasets to uncover new insights, determining where to host, co-locate data
    - Productivity: ability to test ideas, “sandbox”, deploy to production
      -- CL: advantage for deploying test clusters
    For this study we focus on the extreme ends of the spectrum: On-premise & HaaS.
    Dive deeper into Price-Performance Ratio.
  • Price-Performance Ratio has two divergent views for Hadoop: --click-- 1. Virtualized Hadoop cluster is slower because Hadoop’s workload has intensive I/O operations. --click-- 2. Cloud-based model provides compelling cost savings: nodes are less expensive; Hadoop is horizontally scalable.
  • In the Hadoop Deployment Comparison Study, we compare the price-performance ratio of a bare-metal Hadoop cluster with Hadoop-as-a-Service --click-- at the matched total cost of ownership (TCO) level --click-- using real-world applications modeled by the Accenture Data Platform Benchmark.
  • Let’s first take a look at the TCO analysis
  • *3 times replication factor. Server hardware – depreciation accounted for over 3 years; full details in white paper. Data center – tier-3 data center, 10,000 sq. ft; full details in white paper. Tech support – third party vendors. Staff – 3 full time employees.
  • Staff – one full time employee; reduced need. Tech Support – AWS Premium Support. Different needs based on cloud environment, no need for data center. Storage Services – Amazon S3. No need for servers, only virtual instances of Hadoop service – Amazon EMR. --click-- Subtracted from budget to determine number of affordable instances. --click-- Calculated the
  • Time and cost prohibitive to test all 42 combinations. Selected these three instance types since they were the largest of their respective instance family.
  • Assumed 50% utilization
  • Now let’s look at the Accenture Data Platform Benchmark
  • Sessionization: Constructing sessions from raw log data. One of several prerequisite steps for log analysis use cases (individual website optimization, infrastructure optimization, security analytics, etc.).
  • Filtering algorithms: basic and simple, yet widely used.
  • *3 TB compressed
  • Experiment setup, how did everything come together?
  • Let’s switch gears… --click-- 8x improvement relative to default parameter settings. Each iteration took about ½ - 1 full day including performance analysis, tuning, and execution. The merit of Starfish is to achieve performance increases with much less cost than manual tuning.
  • Executive summary available in limited quantities
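
The TCO notes above describe taking the bare-metal TCO as the cloud budget and subtracting the non-instance costs to find what is left for Hadoop service instances. Here is a minimal Python sketch of that arithmetic, using the dollar amounts shown on the TCO slides. The extracted slide text does not preserve which amount belongs to which label, so treating the three smaller Hadoop-as-a-Service amounts as the non-instance costs is an assumption, and the names (`instance_budget`, `budget_per_instance`) are illustrative. The slides do not state the budget period, so none is assumed here.

```python
from decimal import Decimal

# Amounts as shown on the bare-metal TCO slide (USD).
BARE_METAL_ITEMS = [Decimal("3000.00"), Decimal("2914.58"),
                    Decimal("6656.00"), Decimal("9274.46")]
BARE_METAL_TCO = Decimal("21845.04")

# Assumption: these three Hadoop-as-a-Service amounts are the non-instance
# costs (staff, storage services, technical support, in unknown order).
HAAS_OTHER_COSTS = [Decimal("2063.00"), Decimal("1372.27"),
                    Decimal("3091.49")]

# On-demand instance counts from the "Affordable Instances" slide.
ON_DEMAND_COUNTS = {"m1.xlarge": 68, "m2.4xlarge": 20, "cc2.8xlarge": 13}


def instance_budget(total_budget, other_costs):
    """Budget left for Hadoop service instances after the other costs."""
    return total_budget - sum(other_costs)


def budget_per_instance(budget, count):
    """Implied budget per instance, rounded to cents."""
    return (budget / count).quantize(Decimal("0.01"))


# The bare-metal line items add up to the stated total.
assert sum(BARE_METAL_ITEMS) == BARE_METAL_TCO

budget = instance_budget(BARE_METAL_TCO, HAAS_OTHER_COSTS)
print("Hadoop service budget:", budget)
for instance_type, count in ON_DEMAND_COUNTS.items():
    print(instance_type, count, budget_per_instance(budget, count))
```

The remaining budget matches the $15,318.28 that the slides show next to the Hadoop service label. The reserved and spot counts depend on pricing details that are not in the slides, so the sketch stops at the on-demand case.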
  • Transcript of "Where to Deploy Hadoop: Bare Metal or Cloud?"

    1. Where to Deploy Hadoop: Bare-metal or Cloud? Michael Wendt, Sewook Wee. Data Insights R&D Group.
    2. Big Data: Bare-metal vs. Cloud. Deployment models, from bare-metal to cloud: On-premise full custom; Hadoop Appliance; Hadoop Hosting; Hadoop-as-a-Service.
    3. Big Data: Bare-metal vs. Cloud. Deployment models, from bare-metal to cloud: On-premise full custom; Hadoop Appliance; Hadoop Hosting; Hadoop-as-a-Service. Decision factors: Data Privacy; Data Gravity; Price-Performance Ratio; Productivity of Developers & Data Scientists; Data Enrichment.
    4. Big Data: Bare-metal vs. Cloud. Deployment models, from bare-metal to cloud: On-premise full custom; Hadoop Appliance; Hadoop Hosting; Hadoop-as-a-Service. Decision factors: Data Privacy; Data Gravity; Price-Performance Ratio; Productivity of Developers & Data Scientists; Data Enrichment.
    5. Price-Performance Ratio Views. Bare-metal (On-premise full custom): "Cloud? Virtualized? Slow!" Cloud (Hadoop-as-a-Service): "Who cares! I’m cheap, just throw more in!" Servers designed by Daniel Campos from The Noun Project.
    6. Hadoop Deployment Comparison Study. Price-Performance Ratio of bare-metal (On-premise full custom) vs. cloud (Hadoop-as-a-Service), using the Accenture Data Platform Benchmark + TCO analysis.
    7. Hadoop Deployment Comparison Study: TCO Analysis. Price-Performance Ratio of bare-metal (On-premise full custom) vs. cloud (Hadoop-as-a-Service), using the Accenture Data Platform Benchmark + TCO analysis.
    8. TCO of Bare-metal Hadoop Cluster (On-premise full custom). 24 server nodes and 50 TB of HDFS capacity*; small-scale initial production deployment. Cost components: Server hardware; Staff for operation; Data center facility and electricity; Technical support. Amounts shown: $3,000.00; $2,914.58; $6,656.00; $9,274.46. Total: $21,845.04.
    9. TCO of Hadoop-as-a-Service. Used bare-metal TCO for budget; calculated the number of affordable instances. Cost components: Hadoop service; Staff for operation; Storage services; Technical support. Amounts shown: $15,318.28; $2,063.00; $1,372.27; $3,091.49. Total: $21,845.04.
    10. TCO of Hadoop-as-a-Service – Instances. Hadoop service: 14 instance types × 3 pricing models = 42 combinations.
    11. TCO of Hadoop-as-a-Service – Instances. Hadoop service: selected 3 representative instance types: m1.xlarge, m2.4xlarge, cc2.8xlarge.
    12. TCO of Hadoop-as-a-Service – Affordable Instances. Hadoop service: $15,318.28. 50% cluster utilization assumed; 1/3 of budget allocated for Spot instances.

        Instance type   On-demand instances (ODI)   Reserved instances (RI)   Reserved + Spot instances (RI + SI)
        m1.xlarge       68                          112                       192
        m2.4xlarge      20                          41                        77
        cc2.8xlarge     13                          28                        53

    13. Hadoop Deployment Comparison Study: Accenture Data Platform Benchmark. Price-Performance Ratio of bare-metal (On-premise full custom) vs. cloud (Hadoop-as-a-Service), using the Accenture Data Platform Benchmark + TCO analysis.
    14. Accenture Data Platform Benchmark. Suite of real-world Hadoop MapReduce applications. Use cases and workloads: Log management (Sessionization); Customer preference prediction (Recommendation engine); Text analytics (Document clustering). Sources: client experience, internal roadmap, public literature; categorized & selected common use cases; open-source libraries & public datasets.
    15. Accenture Data Platform Benchmark: Sessionization. Log data -> Bucketing -> Sorting -> Slicing -> Sessions. A session is a sequence of related interactions, useful to analyze as a group. ~150 billion log entries, ~24 TB; 1 million users, 1.1 billion sessions.
    16. Accenture Data Platform Benchmark: Recommendation Engine. Ratings data: who rated what item? Co-occurrence matrix: how many people rated the pair of items? Recommendation: given the way the person rated these items, he/she is likely to be interested in these other items. Used item-based collaborative filtering algorithm; Mahout example library used as foundation. Generated 300 million ratings; 3 million population, 50,000 items.
    17. Accenture Data Platform Benchmark: Document Clustering. Corpus of crawled web pages -> Filtered and tokenized documents -> Term dictionary -> TF vectors -> TF-IDF vectors -> K-means -> Clustered documents. Groups similar documents. Application components used in many areas (e.g., search engines, e-commerce site optimization). Common Crawl dataset, 10 TB corpus*; ~31,000 ARC files or ~300 million HTML pages.
    18. Hadoop Deployment Comparison Study: Experiment Setup/Results. Price-Performance Ratio of bare-metal (On-premise full custom) vs. cloud (Hadoop-as-a-Service), using the Accenture Data Platform Benchmark + TCO analysis.
    19. Experiment Setup: Price-Performance Ratio Comparison. Bare-metal Hadoop Cluster vs. Amazon EMR Clusters: 1 bare-metal cluster vs. 9 Amazon EMR clusters. Manual and automated tuning. Fixed budget for cluster size. Measure execution time of benchmark.
    20. Experiment Setup: Starfish Automated Performance Tuning Tool. Starfish (now Unravel) is an automated performance tuning tool for MapReduce jobs. Phases: Profile phase; Optimize phase. For the experiment we ran each benchmark twice using Starfish. Manual and automated tuning. Measure execution time of optimize phase. Speedometer designed by Filippo Camedda from The Noun Project.
    21. Experiment Results: Starfish Automated Performance Tuning Tool. Manual and automated tuning. Starfish tuned Recommendation Engine workload w/ 11 cascaded MapReduce jobs. Manually tuned Sessionization workload. 2+ weeks of manual tuning, ½ - 1 day iterations. 8x improvement in one tuning cycle. Achieve performance increases with less cost using Starfish.
    22. Experiment Results: Sessionization. [Bar chart: Execution Time (minutes) by Amazon EMR Configuration (ODI, RI, RI+SI) for cc2.8xlarge, m2.4xlarge and m1.xlarge.] Bare-metal: 533. Bar values: 408.07, 229.25, 125.82, 381.55, 204.10, 166.82, 250.13, 172.23, 114.35. Instance counts (cc2.8xlarge, m2.4xlarge, m1.xlarge): ODI 13, 20, 68; RI 28, 41, 112; RI+SI 53, 77, 192.
    23. Experiment Results: Recommendation Engine. [Bar chart: Execution Time (minutes) by Amazon EMR Configuration (ODI, RI, RI+SI) for cc2.8xlarge, m2.4xlarge and m1.xlarge.] Bare-metal: 21.59. Bar values: 23.33, 21.97, 18.48, 20.13, 19.97, 16.92, 14.28, 16.30, 15.08. Instance counts (cc2.8xlarge, m2.4xlarge, m1.xlarge): ODI 13, 20, 68; RI 28, 41, 112; RI+SI 53, 77, 192.
    24. Experiment Results: Document Clustering. [Bar chart: Execution Time (minutes) by Amazon EMR Configuration (ODI, RI, RI+SI) for cc2.8xlarge, m2.4xlarge and m1.xlarge.] Bare-metal: 1186.37. Bar values: 1661.03, 1157.37, 784.82, 1649.98, 1112.68, 629.98, 914.35, 779.98, 742.38. Instance counts (cc2.8xlarge, m2.4xlarge, m1.xlarge): ODI 13, 20, 68; RI 28, 41, 112; RI+SI 53, 77, 192.
    25. Key Takeaways. Hadoop-as-a-Service offers a better price-performance ratio. Cloud expands the performance tuning opportunities. Automated performance tuning tools are a necessity.
    26. Acknowledgement
    27. More details. Contact us for the full white paper: Hadoop Deployment Comparison Study. Michael Wendt, R&D Developer, Data Insights R&D, Accenture Technology Labs, (408) 817-2190, michael.e.wendt@accenture.com. Scott Kurth, Group Lead, Data Insights R&D, Accenture Technology Labs, (408) 817-2775, scott.kurth@accenture.com.
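
The Sessionization slide describes the workload as bucketing, sorting and slicing log data into sessions. The single-machine Python sketch below illustrates one plausible reading of those three steps; it is not the benchmark's MapReduce implementation. The record layout `(user_id, timestamp, action)`, the 30-minute inactivity gap and the function name `sessionize` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical inactivity timeout; the slides do not state the rule used.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(log_entries, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp, action) entries into per-user sessions."""
    # Bucketing: collect each user's entries together.
    buckets = defaultdict(list)
    for user_id, timestamp, action in log_entries:
        buckets[user_id].append((timestamp, action))

    sessions = []
    for user_id, entries in buckets.items():
        # Sorting: order each user's entries by timestamp.
        entries.sort()
        # Slicing: start a new session when the inactivity gap is exceeded.
        current = [entries[0]]
        for prev, entry in zip(entries, entries[1:]):
            if entry[0] - prev[0] > gap:
                sessions.append((user_id, current))
                current = []
            current.append(entry)
        sessions.append((user_id, current))
    return sessions


log = [
    ("alice", 0, "/home"),
    ("bob", 10, "/home"),
    ("alice", 120, "/cart"),
    ("alice", 4000, "/home"),  # more than 30 minutes after alice's last entry
    ("bob", 60, "/search"),
]
for user, session in sessionize(log):
    print(user, session)
```

In a MapReduce setting the bucketing step would typically be handled by the shuffle on the user key, leaving sorting and slicing to the reducers; the sketch keeps everything in memory to show only the logic.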