Hadoop's cloud-friendly architecture has made it attractive for public cloud service providers to offer hosted Hadoop services, charged on a pay-for-what-you-use basis. For enterprises that have already adopted Hadoop, the data infrastructure has long been seen as a cost element in their budgets. As a result, enterprises considering Hadoop are increasingly debating between on-premise and cloud-based models for their data processing needs.
We lay out a set of criteria and methodical approaches to help enterprises that have not yet adopted Hadoop evaluate their options, and discuss the pros and cons of both models. For enterprises that have already made significant investments or have plans to build a Hadoop-based infrastructure, we present an approach to manage Hadoop as a Service with a P&L, transparency in costs, and metering & billing provisions.
As we discuss these approaches, we will share insights gathered from the exercise conducted on one of the largest Hadoop footprints in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs for usage, measure the resource usage for services, optimize for higher utilization, and benchmark costs.
URL: http://strataconf.com/stratany2013/public/schedule/detail/30824
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Business
1. Running On-premise Hadoop as a Business
Sumeet Singh
Head of Products, Cloud Services and Hadoop, Yahoo! Inc.
Strata Conference + Hadoop World 2013, NY
2. Motivation
Infrastructure Cost
§ Cost bucket with no direct revenue (private cloud); increasing spend on newer configs, network, and services
Public Cloud's Popularity
§ Popularity of public clouds, and rising concern that public cloud is a viable and cost-effective alternative to our operations
Average Utilization
§ Utilization on average considered too low for good ROI; lack of good usage metrics by BUs and of chargeback/showback provisions
What we set out to do:
§ Institute the Hadoop platform's P&L structure
§ Set up metering and billing provisions
§ Provide transparency in costs and benchmarks
3. Hadoop Evolution at Yahoo
[Chart: Hadoop growth at Yahoo by year, 2006–2013 — number of nodes (0 to 45,000) and raw HDFS storage (0 to 500 PB).]
Milestones along the way:
§ Yahoo! commits to scaling Hadoop for production use
§ Research workloads in search and advertising
§ Production (modeling) with machine learning & WebMap
§ Revenue systems with security, multi-tenancy, and SLAs
§ Open sourced with Apache; Hortonworks spinoff for enterprise hardening
§ Nextgen Hadoop (H 0.23 YARN)
§ New services (multi-tenant HBase, Storm, etc.)
§ Increased user base with partitioned namespaces
4. Outline
1. Criteria and considerations for evaluating options
2. Total Cost of Ownership (TCO) models for understanding true costs
3. Deeper understanding of (resource) usage patterns
4. P&L, metering, and billing provisions
5. Benchmark costs
6. Improve utilization and ROI
5. Deployment Models at Yahoo
On-Premise:
§ Private (dedicated) Clusters — new technology introduced before platformization; relics that should not be there; data movement and regulation issues
§ Hosted Multi-tenant (Private Cloud) Clusters — source of truth for all data; app delivery agility; operational efficiency and cost savings through economies of scale
§ Hosted Compute Clusters
Public Cloud:
§ Acquisitions with pre-existing use of public cloud infrastructure; exploration and learning; benchmarks
6. Criteria to Consider
§ Cost — On-premise: fixed, does not vary with utilization; favors scale and 24x7 operation. Public cloud: variable with usage; typically favors a run-and-done model.
§ Data — On-premise: aggregated from disparate or distributed sources. Public cloud: typically generated and stored in the cloud.
§ SLA — On-premise: job queues, capacity scheduling, BCP, catch-up capacity; controlled latency and throughput. Public cloud: no guarantees (beyond uptime) without provisioning additional resources.
§ Tech Stack — On-premise: control over deployed technology; requires platform team/vendor support. Public cloud: little to no control over the tech stack; no need for platform R&D headcount.
§ Security — On-premise: shared environment with control over data and movement, PIIs, ACLs, and pluggable security. Public cloud: data typically not shared among users in the cloud.
§ Multi-tenancy — On-premise: matters, and is complex to develop and operate. Public cloud: does not matter; clusters are dynamic and dedicated.
7. Approach to Evaluate Options
[Matrix: score each deployment option (on-premise vs. public cloud) against the six criteria — Cost, Data, SLA, Tech Stack, Security, and Multi-tenancy — to evaluate where your workloads fit.]
8. Calculating Total Cost of Ownership
Monthly TCO (illustrative): $2.1 M across seven components (component shares: 60%, 12%, 10%, 7%, 6%, 3%, and 2%, summing to 100%):
1. Cluster Hardware — data nodes, name nodes, job trackers, gateways, load proxies, monitoring, aggregator, and web servers
2. R&D Headcount — headcount for platform software development, quality, and release engineering
3. Active Use and Operations (recurring) — recurring datacenter ops costs (power, space, labor support, and facility maintenance)
4. Network Hardware — aggregated network component costs, including switches, wiring, terminal servers, power strips, etc.
5. Acquisition/Install (one-time) — labor, POs, transportation, space, support, upgrades, decommissions, shipping/receiving, etc.
6. Operations Engineering — headcount for service engineering and data operations teams responsible for day-to-day ops and support
7. Network Bandwidth — data transferred into and out of clusters for all colos, including cross-colo transfers
(ILLUSTRATIVE)
9. Determining Unit Costs
Four resources are metered:
§ Compute — slots or containers where apps can perform computation and access HDFS if needed. Unit: $/Slot-Hour (H 1.0) or $/GB-Hour (H 0.23/2.0). Total capacity: number of slots / GBs of memory available for an hour. Unit cost: Monthly Compute Cost ÷ Available Compute Capacity.
§ Storage — HDFS (usable) space needed by an app, with the default replication factor of three. Unit: $/GB of data stored. Total capacity: usable storage space (less replication and overheads). Unit cost: Monthly Storage Cost ÷ Available Usable Storage.
§ Bandwidth — network bandwidth needed by the app to move data into/out of the clusters. Unit: $/GB for inter-region data transfers. Total capacity: inter-region (peak) link capacity. Unit cost: Monthly Bandwidth Cost ÷ Monthly GB In + Out.
§ Namespace — files and directories used by the apps, tracked to understand/limit the load on the NameNode. Unit, total capacity, and unit cost: N/A.
10. Working Through An Example
Compute:
§ Monthly TCO (less bandwidth) = $2 M; compute share @ 50% = $1 M
§ 185 K slots = 185 K × 24 × 30 = 133 M slot-hours → $1 M ÷ 133 M slot-hours ≈ $0.007 per slot-hour per month
§ 315 TB of memory = 315 K GB × 24 × 30 = 227 M GB-hours → $1 M ÷ 227 M GB-hours ≈ $0.004 per GB-hour per month
Storage:
§ Monthly TCO (less bandwidth) = $2 M; storage share @ 50% = $1 M
§ Raw HDFS = 200 PB; usable HDFS = [200 × 0.8 (20% overhead)] ÷ 3 ≈ 53 PB → $1 M ÷ 53 PB ≈ $0.019 per GB per month
Bandwidth:
§ Monthly charges = $0.1 M; total data in + out = 5 PB → $0.1 M ÷ 5 PB = $0.02 per GB transferred
(ILLUSTRATIVE)
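The unit-cost arithmetic above can be sketched in a few lines. The figures are the slide's illustrative numbers; the helper function names are ours, not from the talk.

```python
# Unit-cost math from the worked example above (illustrative figures).

def cost_per_slot_hour(monthly_cost_usd, slots, hours=24 * 30):
    """Dollars per slot-hour: monthly cost over slot-hours available."""
    return monthly_cost_usd / (slots * hours)

def cost_per_usable_gb(monthly_cost_usd, raw_pb, overhead=0.20, replication=3):
    """Dollars per usable GB per month: strip overhead, divide by replication."""
    usable_gb = raw_pb * (1 - overhead) / replication * 1e6  # PB -> GB
    return monthly_cost_usd / usable_gb

compute = cost_per_slot_hour(1_000_000, 185_000)   # ~ $0.007 / slot-hour
storage = cost_per_usable_gb(1_000_000, 200)       # ~ $0.019 / GB / month
bandwidth = 100_000 / (5 * 1e6)                    # $0.1 M / 5 PB ~ $0.02 / GB
print(f"${compute:.4f}/slot-hr  ${storage:.4f}/GB-mo  ${bandwidth:.3f}/GB")
```

The same two helpers cover both the slot-based (H 1.0) and memory-based (H 0.23) compute rates; only the capacity argument changes.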
11. Understanding Compute Units
Hadoop 1.0/0.20:
§ Each node in the cluster has its memory divided into a number of map and reduce slots
§ A map task runs in one or more map slots, and a reduce task runs in one or more reduce slots
Hadoop 2.0/0.23:
§ Each node has its memory carved up into fixed-size partitions based on a configured minimum
§ Map and reduce tasks run in YARN containers (the memory needed/reserved for a task)
12. Understanding Compute Units (Cont’d)
13. Metering / Determining Usage
Compute:
§ Map Slot-Hours = #S(M1) × T(M1) + #S(M2) × T(M2) + …
§ Reduce Slot-Hours = #S(R1) × T(R1) + #S(R2) × T(R2) + …
§ Monthly job and task cost: Cost = (M + R) Slot-Hours × $0.007 per slot-hour per month = $ for the job per month (the memory-based approach in H 0.23 is identical)
§ Monthly roll-ups: (M + R) slot-hours for all jobs can be summed up for the month for a user, app, BU, or the entire platform
Storage:
§ /projects (app) directory quota in GB (peak monthly storage used)
§ /user directory quota in GB (peak monthly storage used)
§ /data is accounted for with each user accountable for their portion of use, e.g., GB Read (U1) ÷ [GB Read (U1) + GB Read (U2) + …]
§ Roll-ups through the relationships among user, file ownership, app, and their BU
Bandwidth:
§ Measured at the cluster level and divided among select apps and users of the data, based on average volume in/out
§ Roll-ups through the relationships among user, app, and their BU
(ILLUSTRATIVE)
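As a sketch, the per-job compute metering above amounts to summing slots × runtime over all tasks and pricing the result at the unit rate. The task tuples below are hypothetical; real inputs would come from JobHistory logs.

```python
# Slot-hour metering sketch for one job, per the formula above.
# Each task is (slots_used, hours_run); the sample tasks are hypothetical.

RATE = 0.007  # $ / slot-hour / month, from the illustrative unit cost

def slot_hours(tasks):
    """Sum of #S(i) x T(i) over tasks."""
    return sum(slots * hours for slots, hours in tasks)

def job_cost(map_tasks, reduce_tasks, rate=RATE):
    """(M + R) slot-hours x rate = $ for the job per month."""
    return (slot_hours(map_tasks) + slot_hours(reduce_tasks)) * rate

maps = [(2, 1.5), (1, 2.0)]   # two map tasks: 3 + 2 = 5 slot-hours
reduces = [(1, 3.0)]          # one reduce task: 3 slot-hours
print(job_cost(maps, reduces))  # 8 slot-hours at $0.007 each
```

Monthly roll-ups are then just this sum taken over every job attributed to a user, app, or BU.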
14. Hadoop Analytics Warehouse (Starling)
[Architecture: source clusters (Cluster 1 … Cluster N) feed usage logs into warehouse clusters — Oozie for ingestion workflows, with HCatalog, Hive, and HDFS for storage and query (Starling) — which in turn serve a dashboard, a customer support portal, and a query server.]
15. One Warehouse, Many Uses
Information Management
§ Gather all usage data (JobHistory, Task, Job-counter, etc.) from source clusters into a central warehouse
§ 1 TB of raw logs processed per day; 24 TB of processed data
Business Intelligence
§ Store processed logs in HCatalog
§ Run historical analysis (Hive, Pig, MapReduce) or usage graphs/reports with BI tools
Predictive Analytics
§ Project the growth trend of datasets
§ Plan capacity/headroom and CapEx across all business units in advance
Uses: data storage efficiency, metering and chargebacks, utilization improvements, product improvements, best practices and patterns, tuning for efficiency
16. Billing, Chargebacks, or Showbacks
Hadoop Services Billing Rate Card [Monthly Rates]:
§ HDFS (Storage) — unit: GB; measured: monthly peak storage used; cost: $0.019/GB
§ Compute — unit: Map-Reduce slot-hours; measured: number of slots used by mappers and reducers and the hours they ran; cost: $0.007/slot-hour
§ Network Bandwidth — unit: GB; measured: monthly total in/out; cost: $0.02/GB

Monthly Bill for Sep 2013 (ILLUSTRATIVE):
BU     | HDFS Used | Effective Used | Storage Cost | Compute Used     | Compute Cost | Transferred | BW Cost | Total
BU1    | 15 PB     | 4.0 PB         | $0.07 M      | 7.1 M slot-hrs   | $0.05 M      | 1.25 PB     | $0.02 M | $0.14 M
BU2    | 10 PB     | 2.7 PB         | $0.05 M      | 3.5 M slot-hrs   | $0.02 M      | 0.50 PB     | $0.01 M | $0.08 M
…      | …         | …              | …            | …                | …            | …           | …       | …
Total  | 148 PB    | 39.5 PB        | $0.75 M      | 71.4 M slot-hrs  | $0.5 M       | 5.0 PB      | $0.1 M  | $1.35 M
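A minimal sketch of how a BU's bill rolls up from the rate card; the rates are from the illustrative card above, and the BU1 usage row is plugged in. The computed total differs slightly from the slide's $0.14 M because the slide rounds each component before summing.

```python
# Chargeback roll-up sketch using the illustrative rate card above.

RATES = {"storage_per_gb": 0.019,      # $/GB of peak monthly storage used
         "compute_per_slot_hr": 0.007,
         "bw_per_gb": 0.02}

def monthly_bill(effective_pb, slot_hours, transferred_pb, rates=RATES):
    """Return (storage, compute, bandwidth, total) charges in dollars."""
    storage = effective_pb * 1e6 * rates["storage_per_gb"]   # PB -> GB
    compute = slot_hours * rates["compute_per_slot_hr"]
    bw = transferred_pb * 1e6 * rates["bw_per_gb"]
    return storage, compute, bw, storage + compute + bw

# BU1 row: 4.0 PB effective storage, 7.1 M slot-hours, 1.25 PB transferred
s, c, b, total = monthly_bill(4.0, 7.1e6, 1.25)
print(f"${total/1e6:.2f} M")  # ~ $0.15 M before per-component rounding
```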
17. Hadoop P&L (LTM)
Line items, tracked by quarter (Q4'12, Q1'13, Q2'13, Q3'13) with totals and total % (ILLUSTRATIVE):
§ Y! Gross Revenues
§ Cost of revenues (less Hadoop CapEx)
§ Gross Profit
§ Hadoop OpEx: R&D headcount, SE&O headcount, acquisition/install, active use/ops, network bandwidth → Total Hadoop OpEx
§ Hadoop CapEx: Hadoop grid services → Total Hadoop CapEx
§ Contribution Margin
§ Indirect costs: G&A, sales and marketing
18. Hadoop P&L (LTM)
[Charts (ILLUSTRATIVE): Hadoop costs as a % of gross revenue, roughly 1.0, 1.3, 1.2, and 1.5 over Q4'12–Q3'13; and a quarterly Hadoop cost breakdown (0–100%) across CapEx, R&D headcount, Ops headcount, and other OpEx.]
19. An Approach to Benchmarking Costs
Quantity equivalents (monthly):
§ M/R: 71.4 M slot-hours used + 61.6 M unused = 133 M total ↔ 1,000 compute instances/hr (normalized for time, RAM, 32/64-bit ops, I/O, etc.)
§ HDFS: 148 PB used + 52 PB unused = 200 PB total ↔ 30 PB/month of storage (accounting for 3x replication and job/app space)
§ Avg. data processed: 75 PB total ↔ 2.5 PB of instance storage daily
Cost equivalents (monthly):
§ M/R: $0.50 M used + $0.50 M unused = $1 M ↔ 1,000 × $0.70/instance/hr × 24 × 30 = $0.5 M
§ HDFS: $0.75 M used + $0.25 M unused = $1 M ↔ 30 PB × $0.04/GB/month = $1.2 M
§ Other costs (if any), such as reads, writes, data services/hour, etc. ↔ $0.25 M
§ Total*: $1.25 M used + $0.75 M unused = $2 M on-premise ↔ $1.95 M public cloud
* Bandwidth ignored, assumed equivalent. (ILLUSTRATIVE)
20. Utilization Matters
[Chart: total cost ($) vs. utilization/consumption (compute and storage). On-premise Hadoop as a Service has a high starting cost but scales up cheaply; on-demand and terms-based public cloud services start low but grow with consumption. Below the crossover points, the public cloud service is favored; above them, on-premise Hadoop as a Service is favored.]
Sensitivity analysis on costs, based on current and expected (or target) utilization, can provide further insight into your operations and cost competitiveness.
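The crossover argument can be made concrete with a toy sensitivity check. The fixed monthly cost and the cloud rate below are hypothetical, chosen only to show where the breakeven lands, not taken from the talk.

```python
# Toy utilization sensitivity: on-premise cost is roughly fixed per month,
# public-cloud cost scales with consumption. All numbers are hypothetical.

ONPREM_FIXED = 2.0e6   # $ / month, paid regardless of utilization
CLOUD_RATE = 0.02      # $ per slot-hour-equivalent consumed

def cheaper_option(used_slot_hours):
    cloud = used_slot_hours * CLOUD_RATE
    return "public cloud" if cloud < ONPREM_FIXED else "on-premise"

for used_m in (25, 50, 100, 150):   # millions of slot-hours / month
    print(f"{used_m} M slot-hours -> {cheaper_option(used_m * 1e6)}")
```

Sweeping the consumption figure like this over your expected and target utilization is one simple way to run the sensitivity analysis the slide describes.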
21. Focus on ROI
§ Apps continue to grow on the platform (Phase I → 2012–2013 (H 0.23) → 2014 & future), so the cost amortized over apps declines over time: Cost(t) = C at time t, Cost(t′) = C′ at time t′
§ At time t, BU profits are R(t) − C(t) = π(t)
§ The platform's goal is to keep increasing ROI while supporting new technologies and services: R(t′) − C(t′) = π(t′), where C(t′) < C(t) and π(t′) > π(t) for the same or bigger revenues
22. Cluster Organization
Requirements → cluster (cost) implications:
§ Project and software rollout phases (development, test/staging, and go-live) → sandbox, research, and production clusters
§ Service level agreements (processing time, failures, late data arrival) → dedicated resources, BCP, catch-up capacity
§ Technology needs (MapReduce, HDFS, HBase, Storm, Spark) → batch, performance, and low-latency clusters
§ Multi-tenancy (multiple customers with the same tech needs) → multi-tenant clusters
23. Improving ROI with Service Innovation
The stack: HDFS (file system) at the base, YARN (resource manager) above it, with services growing on YARN:
§ Available today: MapReduce (batch processing), Spark (iterative processing), Storm (stream processing)
§ Coming soon on YARN: video processing, graph processing, predictive analytics, search & indexing, …
24. YARN Proof Points
[Charts: tasktracker vs. mrappmaster totals around the H 0.23 rollouts — advertising production cluster (Jan 2013), audience production and historical data warehouse (Oct 2012).]
25. Going Forward
Why required → what is getting done:
§ High availability — data processing pipelines have "zero"-downtime requirements → NN HA, rolling upgrades, RM HA, Oozie HA, etc.
§ Scalability — bigger clusters with most of the data in one place → NN vertical and horizontal scaling
§ RM/YARN — support for long-running jobs for low-latency processing and queries → YARN support for long-running jobs, YARN API improvements, CPU as a resource, fluid compute resources
§ Scheduling — preempt low-priority jobs to meet SLAs for high-priority jobs → CPU-based scheduling, preemption with job/app priority, gang scheduling
26. Thank You
Sumeet Singh, <sumeetsi@yahoo-inc.com>
@sumeetksingh