Hadoop's cloud-friendly architecture has made it attractive for public cloud service providers to offer hosted Hadoop services, charged on a pay-for-what-you-use basis. For enterprises that have already adopted Hadoop, the data infrastructure has long been seen as a cost element in their budgets. As a result, enterprises considering Hadoop are increasingly debating between on-premise and cloud-based models for their data processing needs.
We lay out a set of criteria and methodical approaches to help enterprises that have not yet adopted Hadoop evaluate their options, and discuss the pros and cons of both models. For enterprises that have already made significant investments or have plans to build a Hadoop-based infrastructure, we present an approach to manage Hadoop as a Service with a P&L, transparency in costs, and metering & billing provisions.
As we discuss these approaches, we will share insights gathered from the exercise conducted on one of the largest Hadoop footprints in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs for usage, measure the resource usage for services, optimize for higher utilization, and benchmark costs.
URL: http://strataconf.com/stratany2013/public/schedule/detail/30824
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Business
1. Running On-premise Hadoop as a Business
Sumeet Singh
Head of Products, Cloud Services and Hadoop, Yahoo! Inc.
Strata Conference + Hadoop World 2013, NY
2. Motivation
Infrastructure Cost
§ Cost bucket with no direct revenue (private cloud); increasing spend on newer configs, network, and services
Public Cloud's Popularity
§ Popularity of public clouds, and rising concern that public cloud is a viable and cost-effective alternative to our operations
Average Utilization
§ Utilization on average considered too low for good ROI; lack of good usage metrics by BUs and of chargeback/showback provisions
What we set out to do:
§ Institute the Hadoop platform's P&L structure
§ Set up metering and billing provisions
§ Provide transparency in costs and benchmarks
3. Hadoop Evolution at Yahoo
[Chart: Hadoop growth at Yahoo by year, 2006–2013 — number of nodes (0 to 45,000) and raw HDFS storage (0 to 500 PB).]
Milestones along the way:
§ Yahoo! commits to scaling Hadoop for production use
§ Research workloads in search and advertising
§ Production (modeling) with machine learning & WebMap
§ Revenue systems with security, multi-tenancy, and SLAs
§ Open sourced with Apache; Hortonworks spinoff for enterprise hardening
§ Nextgen Hadoop (H 0.23 YARN)
§ New services (multi-tenant HBase, Storm, etc.)
§ Increased user base with partitioned namespaces
4. Outline
1. Criteria and considerations for evaluating options
2. Total Cost of Ownership (TCO) models for understanding true costs
3. Deeper understanding of (resource) usage patterns
4. P&L, metering, and billing provisions
5. Benchmark costs
6. Improve utilization and ROI
5. Deployment Models at Yahoo
On-Premise:
§ Private (dedicated) Clusters — new technology introduced before platformization; relics that should not be there; data movement and regulation issues
§ Hosted Multi-tenant (Private Cloud) Clusters — source of truth for all data; app delivery agility; operational efficiency and cost savings through economies of scale
§ Hosted Compute Clusters
Public Cloud:
§ Acquisitions with pre-existing use of public cloud infrastructure; exploration and learning; benchmarks
6. Criteria to Consider
§ Cost — On-premise: fixed, does not vary with utilization; favors scale and 24x7 operation. Public cloud: variable with usage; typically favors a run-and-done model.
§ Data — On-premise: aggregated from disparate or distributed sources. Public cloud: typically generated and stored in the cloud.
§ SLA — On-premise: job queues, capacity scheduling, BCP, catch-up capacity; controlled latency and throughput. Public cloud: no guarantees (beyond uptime) without provisioning additional resources.
§ Tech Stack — On-premise: control over deployed technology; requires platform team/vendor support. Public cloud: little to no control over the tech stack; no need for platform R&D headcount.
§ Security — On-premise: shared environment with control over data and movement, PIIs, ACLs, and pluggable security. Public cloud: data typically not shared among users in the cloud.
§ Multi-tenancy — On-premise: matters, and is complex to develop and operate. Public cloud: does not matter; clusters are dynamic and dedicated.
7. Approach to Evaluate Options
[Matrix: score each deployment option (on-premise vs. public cloud) against the six criteria — Cost, Data, SLA, Tech Stack, Security, and Multi-tenancy — to evaluate where your workloads fit.]
8. Calculating Total Cost of Ownership
Monthly TCO (illustrative): $2.1 M across seven components (component shares: 60%, 12%, 10%, 7%, 6%, 3%, and 2%, summing to 100%):
1. Cluster Hardware — data nodes, name nodes, job trackers, gateways, load proxies, monitoring, aggregator, and web servers
2. R&D Headcount — headcount for platform software development, quality, and release engineering
3. Active Use and Operations (recurring) — recurring datacenter ops costs (power, space, labor support, and facility maintenance)
4. Network Hardware — aggregated network component costs, including switches, wiring, terminal servers, power strips, etc.
5. Acquisition/Install (one-time) — labor, POs, transportation, space, support, upgrades, decommissions, shipping/receiving, etc.
6. Operations Engineering — headcount for service engineering and data operations teams responsible for day-to-day ops and support
7. Network Bandwidth — data transferred into and out of clusters for all colos, including cross-colo transfers
(ILLUSTRATIVE)
9. Determining Unit Costs
Four resources are metered:
§ Compute — slots or containers where apps can perform computation and access HDFS if needed. Unit: $/Slot-Hour (H 1.0) or $/GB-Hour (H 0.23/2.0). Total capacity: number of slots / GBs of memory available for an hour. Unit cost: Monthly Compute Cost ÷ Available Compute Capacity.
§ Storage — HDFS (usable) space needed by an app, with the default replication factor of three. Unit: $/GB of data stored. Total capacity: usable storage space (less replication and overheads). Unit cost: Monthly Storage Cost ÷ Available Usable Storage.
§ Bandwidth — network bandwidth needed by the app to move data into/out of the clusters. Unit: $/GB for inter-region data transfers. Total capacity: inter-region (peak) link capacity. Unit cost: Monthly Bandwidth Cost ÷ Monthly GB In + Out.
§ Namespace — files and directories used by the apps, tracked to understand/limit the load on the NameNode. Unit, total capacity, and unit cost: N/A.
10. Working Through An Example
Compute:
§ Monthly TCO (less bandwidth) = $2 M; compute share @ 50% = $1 M
§ 185 K slots = 185 K × 24 × 30 = 133 M slot-hours → $1 M ÷ 133 M slot-hours ≈ $0.007 per slot-hour per month
§ 315 TB of memory = 315 K GB × 24 × 30 = 227 M GB-hours → $1 M ÷ 227 M GB-hours ≈ $0.004 per GB-hour per month
Storage:
§ Monthly TCO (less bandwidth) = $2 M; storage share @ 50% = $1 M
§ Raw HDFS = 200 PB; usable HDFS = [200 × 0.8 (20% overhead)] ÷ 3 ≈ 53 PB → $1 M ÷ 53 PB ≈ $0.019 per GB per month
Bandwidth:
§ Monthly charges = $0.1 M; total data in + out = 5 PB → $0.1 M ÷ 5 PB = $0.02 per GB transferred
(ILLUSTRATIVE)
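The unit-cost arithmetic above can be sketched in a few lines. The figures are the slide's illustrative numbers; the helper function names are ours, not from the talk.

```python
# Unit-cost math from the worked example above (illustrative figures).

def cost_per_slot_hour(monthly_cost_usd, slots, hours=24 * 30):
    """Dollars per slot-hour: monthly cost over slot-hours available."""
    return monthly_cost_usd / (slots * hours)

def cost_per_usable_gb(monthly_cost_usd, raw_pb, overhead=0.20, replication=3):
    """Dollars per usable GB per month: strip overhead, divide by replication."""
    usable_gb = raw_pb * (1 - overhead) / replication * 1e6  # PB -> GB
    return monthly_cost_usd / usable_gb

compute = cost_per_slot_hour(1_000_000, 185_000)   # ~ $0.007 / slot-hour
storage = cost_per_usable_gb(1_000_000, 200)       # ~ $0.019 / GB / month
bandwidth = 100_000 / (5 * 1e6)                    # $0.1 M / 5 PB ~ $0.02 / GB
print(f"${compute:.4f}/slot-hr  ${storage:.4f}/GB-mo  ${bandwidth:.3f}/GB")
```

The same two helpers cover both the slot-based (H 1.0) and memory-based (H 0.23) compute rates; only the capacity argument changes.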
11. Understanding Compute Units
Hadoop 1.0/0.20:
§ Each node in the cluster has its memory divided into a number of map and reduce slots
§ A map task runs in one or more map slots, and a reduce task runs in one or more reduce slots
Hadoop 2.0/0.23:
§ Each node has its memory carved up into fixed-size partitions based on a configured minimum
§ Map and reduce tasks run in YARN containers (the memory needed/reserved for a task)
12. Understanding Compute Units (Cont’d)
13. Metering / Determining Usage
Compute:
§ Map Slot-Hours = #S(M1) × T(M1) + #S(M2) × T(M2) + …
§ Reduce Slot-Hours = #S(R1) × T(R1) + #S(R2) × T(R2) + …
§ Monthly job and task cost: Cost = (M + R) Slot-Hours × $0.007 per slot-hour per month = $ for the job per month (the memory-based approach in H 0.23 is identical)
§ Monthly roll-ups: (M + R) slot-hours for all jobs can be summed up for the month for a user, app, BU, or the entire platform
Storage:
§ /projects (app) directory quota in GB (peak monthly storage used)
§ /user directory quota in GB (peak monthly storage used)
§ /data is accounted for with each user accountable for their portion of use, e.g., GB Read (U1) ÷ [GB Read (U1) + GB Read (U2) + …]
§ Roll-ups through the relationships among user, file ownership, app, and their BU
Bandwidth:
§ Measured at the cluster level and divided among select apps and users of the data, based on average volume in/out
§ Roll-ups through the relationships among user, app, and their BU
(ILLUSTRATIVE)
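As a sketch, the per-job compute metering above amounts to summing slots × runtime over all tasks and pricing the result at the unit rate. The task tuples below are hypothetical; real inputs would come from JobHistory logs.

```python
# Slot-hour metering sketch for one job, per the formula above.
# Each task is (slots_used, hours_run); the sample tasks are hypothetical.

RATE = 0.007  # $ / slot-hour / month, from the illustrative unit cost

def slot_hours(tasks):
    """Sum of #S(i) x T(i) over tasks."""
    return sum(slots * hours for slots, hours in tasks)

def job_cost(map_tasks, reduce_tasks, rate=RATE):
    """(M + R) slot-hours x rate = $ for the job per month."""
    return (slot_hours(map_tasks) + slot_hours(reduce_tasks)) * rate

maps = [(2, 1.5), (1, 2.0)]   # two map tasks: 3 + 2 = 5 slot-hours
reduces = [(1, 3.0)]          # one reduce task: 3 slot-hours
print(job_cost(maps, reduces))  # 8 slot-hours at $0.007 each
```

Monthly roll-ups are then just this sum taken over every job attributed to a user, app, or BU.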
14. Hadoop Analytics Warehouse (Starling)
[Architecture: source clusters (Cluster 1 … Cluster N) feed usage logs into warehouse clusters — Oozie for ingestion workflows, with HCatalog, Hive, and HDFS for storage and query (Starling) — which in turn serve a dashboard, a customer support portal, and a query server.]
15. One Warehouse, Many Uses
Information Management
§ Gather all usage data (JobHistory, Task, Job-counter, etc.) from source clusters into a central warehouse
§ 1 TB of raw logs processed per day; 24 TB of processed data
Business Intelligence
§ Store processed logs in HCatalog
§ Run historical analysis (Hive, Pig, MapReduce) or usage graphs/reports with BI tools
Predictive Analytics
§ Project the growth trend of datasets
§ Plan capacity/headroom and CapEx across all business units in advance
Uses: data storage efficiency, metering and chargebacks, utilization improvements, product improvements, best practices and patterns, tuning for efficiency
16. Billing, Chargebacks, or Showbacks
Hadoop Services Billing Rate Card [Monthly Rates]:
§ HDFS (Storage) — unit: GB; measured: monthly peak storage used; cost: $0.019/GB
§ Compute — unit: Map-Reduce slot-hours; measured: number of slots used by mappers and reducers and the hours they ran; cost: $0.007/slot-hour
§ Network Bandwidth — unit: GB; measured: monthly total in/out; cost: $0.02/GB

Monthly Bill for Sep 2013 (ILLUSTRATIVE):
BU     | HDFS Used | Effective Used | Storage Cost | Compute Used     | Compute Cost | Transferred | BW Cost | Total
BU1    | 15 PB     | 4.0 PB         | $0.07 M      | 7.1 M slot-hrs   | $0.05 M      | 1.25 PB     | $0.02 M | $0.14 M
BU2    | 10 PB     | 2.7 PB         | $0.05 M      | 3.5 M slot-hrs   | $0.02 M      | 0.50 PB     | $0.01 M | $0.08 M
…      | …         | …              | …            | …                | …            | …           | …       | …
Total  | 148 PB    | 39.5 PB        | $0.75 M      | 71.4 M slot-hrs  | $0.5 M       | 5.0 PB      | $0.1 M  | $1.35 M
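A minimal sketch of how a BU's bill rolls up from the rate card; the rates are from the illustrative card above, and the BU1 usage row is plugged in. The computed total differs slightly from the slide's $0.14 M because the slide rounds each component before summing.

```python
# Chargeback roll-up sketch using the illustrative rate card above.

RATES = {"storage_per_gb": 0.019,      # $/GB of peak monthly storage used
         "compute_per_slot_hr": 0.007,
         "bw_per_gb": 0.02}

def monthly_bill(effective_pb, slot_hours, transferred_pb, rates=RATES):
    """Return (storage, compute, bandwidth, total) charges in dollars."""
    storage = effective_pb * 1e6 * rates["storage_per_gb"]   # PB -> GB
    compute = slot_hours * rates["compute_per_slot_hr"]
    bw = transferred_pb * 1e6 * rates["bw_per_gb"]
    return storage, compute, bw, storage + compute + bw

# BU1 row: 4.0 PB effective storage, 7.1 M slot-hours, 1.25 PB transferred
s, c, b, total = monthly_bill(4.0, 7.1e6, 1.25)
print(f"${total/1e6:.2f} M")  # ~ $0.15 M before per-component rounding
```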
17. Hadoop P&L (LTM)
Line items, tracked by quarter (Q4'12, Q1'13, Q2'13, Q3'13) with totals and total % (ILLUSTRATIVE):
§ Y! Gross Revenues
§ Cost of revenues (less Hadoop CapEx)
§ Gross Profit
§ Hadoop OpEx: R&D headcount, SE&O headcount, acquisition/install, active use/ops, network bandwidth → Total Hadoop OpEx
§ Hadoop CapEx: Hadoop grid services → Total Hadoop CapEx
§ Contribution Margin
§ Indirect costs: G&A, sales and marketing
18. Hadoop P&L (LTM)
[Charts (ILLUSTRATIVE): Hadoop costs as a % of gross revenue, roughly 1.0, 1.3, 1.2, and 1.5 over Q4'12–Q3'13; and a quarterly Hadoop cost breakdown (0–100%) across CapEx, R&D headcount, Ops headcount, and other OpEx.]
19. An Approach to Benchmarking Costs
Quantity equivalents (monthly):
§ M/R: 71.4 M slot-hours used + 61.6 M unused = 133 M total ↔ 1,000 compute instances/hr (normalized for time, RAM, 32/64-bit ops, I/O, etc.)
§ HDFS: 148 PB used + 52 PB unused = 200 PB total ↔ 30 PB/month of storage (accounting for 3x replication and job/app space)
§ Avg. data processed: 75 PB total ↔ 2.5 PB of instance storage daily
Cost equivalents (monthly):
§ M/R: $0.50 M used + $0.50 M unused = $1 M ↔ 1,000 × $0.70/instance/hr × 24 × 30 = $0.5 M
§ HDFS: $0.75 M used + $0.25 M unused = $1 M ↔ 30 PB × $0.04/GB/month = $1.2 M
§ Other costs (if any), such as reads, writes, data services/hour, etc. ↔ $0.25 M
§ Total*: $1.25 M used + $0.75 M unused = $2 M on-premise ↔ $1.95 M public cloud
* Bandwidth ignored, assumed equivalent. (ILLUSTRATIVE)
20. Utilization Matters
[Chart: total cost ($) vs. utilization/consumption (compute and storage). On-premise Hadoop as a Service has a high starting cost but scales up cheaply; on-demand and terms-based public cloud services start low but grow with consumption. Below the crossover points, the public cloud service is favored; above them, on-premise Hadoop as a Service is favored.]
Sensitivity analysis on costs, based on current and expected (or target) utilization, can provide further insight into your operations and cost competitiveness.
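The crossover argument can be made concrete with a toy sensitivity check. The fixed monthly cost and the cloud rate below are hypothetical, chosen only to show where the breakeven lands, not taken from the talk.

```python
# Toy utilization sensitivity: on-premise cost is roughly fixed per month,
# public-cloud cost scales with consumption. All numbers are hypothetical.

ONPREM_FIXED = 2.0e6   # $ / month, paid regardless of utilization
CLOUD_RATE = 0.02      # $ per slot-hour-equivalent consumed

def cheaper_option(used_slot_hours):
    cloud = used_slot_hours * CLOUD_RATE
    return "public cloud" if cloud < ONPREM_FIXED else "on-premise"

for used_m in (25, 50, 100, 150):   # millions of slot-hours / month
    print(f"{used_m} M slot-hours -> {cheaper_option(used_m * 1e6)}")
```

Sweeping the consumption figure like this over your expected and target utilization is one simple way to run the sensitivity analysis the slide describes.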
21. Focus on ROI
§ Apps continue to grow on the platform (Phase I → 2012–2013 (H 0.23) → 2014 & future), so the cost amortized over apps declines over time: Cost(t) = C at time t, Cost(t′) = C′ at time t′
§ At time t, BU profits are R(t) − C(t) = π(t)
§ The platform's goal is to keep increasing ROI while supporting new technologies and services: R(t′) − C(t′) = π(t′), where C(t′) < C(t) and π(t′) > π(t) for the same or bigger revenues
22. Cluster Organization
Requirements → cluster (cost) implications:
§ Project and software rollout phases (development, test/staging, and go-live) → sandbox, research, and production clusters
§ Service level agreements (processing time, failures, late data arrival) → dedicated resources, BCP, catch-up capacity
§ Technology needs (MapReduce, HDFS, HBase, Storm, Spark) → batch, performance, and low-latency clusters
§ Multi-tenancy (multiple customers with the same tech needs) → multi-tenant clusters
23. Improving ROI with Service Innovation
The stack: HDFS (file system) at the base, YARN (resource manager) above it, with services growing on YARN:
§ Available today: MapReduce (batch processing), Spark (iterative processing), Storm (stream processing)
§ Coming soon on YARN: video processing, graph processing, predictive analytics, search & indexing, …
24. YARN Proof Points
[Charts: tasktracker vs. mrappmaster totals around the H 0.23 rollouts — advertising production cluster (Jan 2013), audience production and historical data warehouse (Oct 2012).]
25. Going Forward
Why required → what is getting done:
§ High availability — data processing pipelines have "zero"-downtime requirements → NN HA, rolling upgrades, RM HA, Oozie HA, etc.
§ Scalability — bigger clusters with most of the data in one place → NN vertical and horizontal scaling
§ RM/YARN — support for long-running jobs for low-latency processing and queries → YARN support for long-running jobs, YARN API improvements, CPU as a resource, fluid compute resources
§ Scheduling — preempt low-priority jobs to meet SLAs for high-priority jobs → CPU-based scheduling, preemption with job/app priority, gang scheduling
26. Thank You
Sumeet Singh, <sumeetsi@yahoo-inc.com>
@sumeetksingh