Running On-premise Hadoop as a Business
Sumeet Singh
Head of Products, Cloud Services and Hadoop, Yahoo! Inc.
Strata Conference + Hadoop World 2013, NY
Motivation
Infrastructure Cost
Cost bucket with no direct revenue (private cloud), increasing spend on newer configs, network and services
Public Cloud's Popularity
Popularity of public clouds and rising concern that public cloud is a viable and cost-effective alternative to our operations
Average Utilization
Average utilization considered low for good ROI, lack of good usage metrics by BUs and of chargeback/ showback provisions
§  Institute Hadoop platform's P&L structure
§  Set up metering and billing provisions
§  Provide transparency in costs and benchmarks
Hadoop Evolution at Yahoo
[Chart: Number of Nodes (axis 0 to 45,000) and Raw HDFS Storage in PB (axis 0 to 500) by Year, 2006 to 2013]
Chart annotations:
§  Yahoo! Commits to Scaling Hadoop for Production Use
§  Research Workloads in Search and Advertising
§  Production (Modeling) with machine learning & WebMap
§  Revenue Systems with Security, Multi-tenancy, and SLAs
§  Open Sourced with Apache
§  Hortonworks Spinoff for Enterprise hardening
§  Nextgen Hadoop (H 0.23 YARN)
§  New Services (Multi-tenant HBase, Storm etc.)
§  Increased User-base with partitioned namespaces
Outline
1. Criteria and considerations for evaluating options
2. Total Cost of Ownership (TCO) models for understanding true costs
3. Deeper understanding of (resource) usage patterns
4. P&L, metering and billing provisions
5. Benchmark costs
6. Improve utilization and ROI
Deployment Models at Yahoo
Columns: On-Premise | Public Cloud
Private (dedicated) Clusters
Hosted Multi-tenant (Private Cloud) Clusters
Hosted Compute Clusters
§  New technology introduced before platformization
§  Relic; should not be there
§  Data movement and regulation issues
§  Acquisitions with pre-existing use of public cloud infrastructure
§  Exploration and learning
§  Benchmarks
§  Source of truth for all data
§  App delivery agility
§  Operational efficiency and cost savings through economies of scale
Criteria to Consider
Criteria | On-Premise | Public Cloud
Cost | Fixed, does not vary with utilization; favors scale and 24x7 operation | Variable with usage; typically favors run and done model
Data | Aggregated from disparate or distributed sources | Typically generated and stored in the cloud
SLA | Job queues, cap. scheduling, BCP, catchup; controlled latency and throughput | No guarantees (beyond uptime) without provisioning additional resources
Tech Stack | Control over deployed technology; requires platform team/ vendor support | Little to no control over tech stack; no need for platform R&D headcount
Security | Shared environment, control over data and movement, PIIs, ACLs, pluggable security | Data typically not shared among users in the cloud
Multi-tenancy | Matters, complex to develop and operate | Does not matter, clusters are dynamic and dedicated
Approach to Evaluate Options
[Diagram: On-Premise and Public Cloud compared across Cost, Data, SLA, Tech Stack, Security, and Multi-tenancy]
Calculating Total Cost of Ownership
ILLUSTRATIVE
Monthly TCO: $2.1 M
[Pie chart: share of monthly TCO by component; extracted labels 60%, 12%, 7%, 6%, 3%, 2%, 10%]
TCO Components
1. Cluster Hardware
§  Data nodes, name nodes, job trackers, gateways, load proxies, monitoring, aggregator, and web servers
2. R&D HC
§  Headcount for platform software development, quality, and release engineering
3. Active Use and Operations (Recurring)
§  Recurring datacenter ops cost (power, space, labor support, and facility maintenance)
4. Network Hardware
§  Aggregated network component costs, including switches, wiring, terminal servers, power strips etc.
5. Acquisition/ Install (One-time)
§  Labor, POs, transportation, space, support, upgrades, decommissions, shipping/ receiving etc.
6. Operations Engineering
§  Headcount for service engineering and data operations teams responsible for day-to-day ops and support
7. Network Bandwidth
§  Data transferred into and out of clusters for all colos, including cross-colo transfers
Determining Unit Costs
Compute
§  Slots or Containers where apps can perform computation and access HDFS if needed
§  Unit: $ / Slot-Hour (H 1.0), $ / GB-Hour (H 0.23/2.0)
§  Total Capacity: Number of Slots / GBs of Memory available for an hour
§  Unit Cost: Monthly Compute Cost / Avail. Compute Capacity
Storage
§  HDFS (usable) space needed by an app with default replication factor of three
§  Unit: $ / GB of data stored
§  Total Capacity: Usable storage space (less replication and overheads)
§  Unit Cost: Monthly Storage Cost / Avail. Usable Storage
Bandwidth
§  Network bandwidth needed to move data into/out of the clusters by the app
§  Unit: $ / GB for Inter-region data transfers
§  Total Capacity: Inter-region (peak) link capacity
§  Unit Cost: Monthly BW Cost / Monthly GB In + Out
Namespace
§  Files and directories used by the apps to understand/ limit the load on NN
§  Unit: N/A
§  Total Capacity: N/A
§  Unit Cost: N/A
Working Through An Example
Compute
§  Monthly Cost: Monthly TCO (less bw.) = $2 M; Compute @ 50% = $1 M
§  Monthly Capacity: 185 K slots = 185 K x 24 x 30 = 133 M Slot-Hours
§  Monthly Capacity: 315 TB mem = 315 TB x 24 x 30 = 227 M GB-Hours
§  Unit Cost: $1 M / 133 M Slot-Hours = $0.007 / Slot-Hour / Month
§  Unit Cost: $1 M / 227 M GB-Hours = $0.004 / GB-Hour / Month
Storage
§  Monthly Cost: Monthly TCO (less bw.) = $2 M; Storage @ 50% = $1 M
§  Monthly Capacity: RAW HDFS = 200 PB; Usable HDFS = [ 200 x 0.8 (20% overhead) ] / 3 = 53 PB
§  Unit Cost: $1 M / 53 PB = $0.019 / GB / Month
Bandwidth
§  Monthly Cost: Monthly Charges = $0.1 M
§  Monthly Capacity: Total Data In + Out = 5 PB
§  Unit Cost: $0.1 M / 5 PB = $0.02 / GB transferred
ILLUSTRATIVE
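The arithmetic on this slide can be reproduced with a short sketch. The inputs are the illustrative figures from the slide; the function and constant names are hypothetical, and decimal units (1 PB = 1,000,000 GB) are an assumption. Note that the unrounded compute rate comes out near $0.0075 per Slot-Hour, so the slide's $0.007 appears to be truncated.

```python
# Minimal sketch of the unit-cost arithmetic, using the slide's
# illustrative inputs. Names and decimal units are assumptions.
HOURS_PER_MONTH = 24 * 30
GB_PER_TB = 1_000
GB_PER_PB = 1_000_000


def unit_cost(monthly_cost, monthly_capacity):
    """Unit cost = monthly cost / available monthly capacity."""
    return monthly_cost / monthly_capacity


def usable_hdfs_pb(raw_pb, overhead=0.20, replication=3):
    """Usable HDFS space after overhead and replication."""
    return raw_pb * (1 - overhead) / replication


# Compute: $1 M spread over slot-hours (H 1.0) or GB-hours (H 0.23/2.0)
slot_hours = 185_000 * HOURS_PER_MONTH
gb_hours = 315 * GB_PER_TB * HOURS_PER_MONTH
cost_per_slot_hour = unit_cost(1_000_000, slot_hours)
cost_per_gb_hour = unit_cost(1_000_000, gb_hours)

# Storage: $1 M spread over usable (not raw) HDFS capacity
usable_gb = usable_hdfs_pb(200) * GB_PER_PB
cost_per_gb_stored = unit_cost(1_000_000, usable_gb)

# Bandwidth: $0.1 M spread over 5 PB transferred in and out
cost_per_gb_transferred = unit_cost(100_000, 5 * GB_PER_PB)

print(f"$ / Slot-Hour:      {cost_per_slot_hour:.4f}")
print(f"$ / GB-Hour:        {cost_per_gb_hour:.4f}")
print(f"$ / GB stored:      {cost_per_gb_stored:.4f}")
print(f"$ / GB transferred: {cost_per_gb_transferred:.4f}")
```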
Understanding Compute Units
Hadoop 1.0/ 0.20
§  Each node in the cluster has its memory divided into a number of map and reduce slots
§  A map task runs in one or more map slots, and a reduce task runs in one or more reduce slots
[Diagram: Map Task 1, Map Task 2, Reduce Task, Reduce Task]
Hadoop 2.0/ 0.23
§  Each node has its memory carved up into fixed-sized partitions based on configured minimum
§  Map and reduce tasks run in a YARN container (memory needed / reserved for a task)
[Diagram: Task 1, Task 2, Task 3]
Understanding Compute Units (Cont’d)
Metering / Determining Usage
Compute
§  Monthly Job and Task Cost:
Map Slot-Hours = #S(M1) x T(M1) + #S(M2) x T(M2) + …
Reduce Slot-Hours = #S(R1) x T(R1) + #S(R2) x T(R2) + …
Cost = (M + R) Slot-Hours x $0.007 / Slot-Hour / Month = $ for the Job/ Month *
* Memory based approach in H0.23 is identical
§  Monthly Roll-ups: (M+R) Slot-Hours for all jobs can be summed up for the month for a user, app, BU, or the entire platform
Storage
§  Monthly Job and Task Cost:
/ project (app) directory quota in GB (peak monthly storage used)
/ user directory quota in GB (peak monthly storage used)
/ data is accounted for with each user accountable for their portion of use, e.g. GB Read (U1) / [ GB Read (U1) + GB Read (U2) + … ]
§  Monthly Roll-ups: Roll-ups through relationship among user, file ownership, app, and their BU
Bandwidth
§  Monthly Job and Task Cost: Bandwidth measured at the cluster level and divided among select apps and users of data based on average volume In/Out
§  Monthly Roll-ups: Roll-ups through relationship among user, app, and their BU
ILLUSTRATIVE
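The compute metering formula above can be sketched as follows: each task contributes (slots used) x (hours run), and the slot-hours are then rolled up per user and priced at the illustrative $0.007 rate. The `TaskRecord` type, its field names, and the sample records are hypothetical; in practice the inputs would come from job history logs.

```python
# Minimal sketch of slot-hour metering with a per-user monthly roll-up.
# TaskRecord and the sample data are assumptions for illustration.
from collections import defaultdict
from dataclasses import dataclass

RATE_PER_SLOT_HOUR = 0.007  # illustrative rate from the slides


@dataclass
class TaskRecord:
    user: str
    kind: str      # "map" or "reduce"
    slots: int     # number of slots the task occupied
    hours: float   # how long the task ran


def slot_hours(tasks):
    """Sum of #S(task) x T(task) over map and reduce tasks."""
    return sum(t.slots * t.hours for t in tasks)


def monthly_rollup(tasks, rate=RATE_PER_SLOT_HOUR):
    """Roll (M + R) slot-hours up per user and convert to dollars."""
    totals = defaultdict(float)
    for t in tasks:
        totals[t.user] += t.slots * t.hours
    return {user: hours * rate for user, hours in totals.items()}


tasks = [
    TaskRecord("u1", "map", 1, 0.5),
    TaskRecord("u1", "map", 2, 0.25),
    TaskRecord("u1", "reduce", 1, 1.0),
    TaskRecord("u2", "map", 4, 2.5),
]
total_slot_hours = slot_hours(tasks)
costs = monthly_rollup(tasks)
print(total_slot_hours, costs)
```

The same roll-up could be keyed by app or BU instead of user, given a mapping from users to those groups.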
Hadoop Analytics Warehouse (Starling)
[Diagram: Source Clusters (Cluster 1, Cluster 2, Cluster 3, … Cluster N); Warehouse Clusters (Oozie, HCatalog, HDFS, Hive); Query Server; Starling Dashboard; Customer Support Portal]
One Warehouse, Many Use
Information Management
§  Gather all usage data (JobHistory, Task, Job-counter etc.) from source clusters into a central warehouse
§  1 TB of raw logs processed / day, 24 TB of processed data
Business Intelligence
§  Store processed logs in HCatalog
§  Run historical analysis (Hive, Pig, MapReduce) or usage graphs/ reports with BI tools
Predictive Analytics
§  Project growth-trend of datasets
§  Plan capacity/ headroom and CapEx across all business units in advance
Uses: Data Storage Efficiency; Metering and Chargebacks; Utilization Improvements; Product Improvements; Best Practices and Patterns; Tuning for Efficiency
Billing, Chargebacks, or Showbacks
Hadoop Services Billing Rate Card [ Monthly Rates ]
Resource | Unit | Aggregated / Measured | Cost
HDFS (Storage) | GB | Monthly, Peak storage used | $0.019/GB
Compute | Map-Reduce Slot Hours | Number of slots used by mappers and reducers and hours they ran for | $0.007/Slot-Hour
Network Bandwidth | GB | Monthly, total in/out | $0.02/GB
Monthly Bill for Sep 2013
BU | HDFS Used | HDFS Effective Used | HDFS Cost ($ M) | Compute Used (Slot-Hours) | Compute Cost ($ M) | Network Transferred | Network Cost ($ M) | Total Cost ($ M)
BU1 | 15 PB | 4.0 PB | $0.07 | 7.1 M | $0.05 | 1.25 PB | $0.02 | $0.14 M
BU2 | 10 PB | 2.7 PB | $0.05 | 3.5 M | $0.02 | 0.50 PB | $0.01 | $0.08 M
… | … | … | … | … | … | … | … | …
BU N | … | … | … | … | … | … | … | …
Total | 148 PB | 39.5 PB | $0.75 M | 71.4 M | $0.5 M | 5.0 PB | $0.1 M | $1.35 M
ILLUSTRATIVE
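A bill line such as BU1's can be sketched from the rate card. This is a minimal sketch under two assumptions that are not stated on the slide: "Effective Used" is taken to be raw used space less 20% overhead and divided by the replication factor of three (which matches 15 PB to 4.0 PB and 148 PB to 39.5 PB), and 1 PB is treated as 1,000,000 GB. The function names are hypothetical. The unrounded total is close to $0.15 M, so the slide's $0.14 M appears to sum per-line rounded figures.

```python
# Minimal sketch: compute one BU's monthly bill from the rate card.
# The effective-storage rule and decimal units are assumptions.
GB_PER_PB = 1_000_000

RATE_CARD = {
    "storage_per_gb": 0.019,          # $ / GB, peak monthly storage used
    "compute_per_slot_hour": 0.007,   # $ / Slot-Hour
    "bandwidth_per_gb": 0.02,         # $ / GB transferred in + out
}


def effective_storage_pb(raw_used_pb, overhead=0.20, replication=3):
    """Assumed mapping from raw used HDFS space to billable space."""
    return raw_used_pb * (1 - overhead) / replication


def monthly_bill(raw_used_pb, slot_hours, transferred_pb, rates=RATE_CARD):
    storage = (effective_storage_pb(raw_used_pb) * GB_PER_PB
               * rates["storage_per_gb"])
    compute = slot_hours * rates["compute_per_slot_hour"]
    bandwidth = transferred_pb * GB_PER_PB * rates["bandwidth_per_gb"]
    return {
        "storage": storage,
        "compute": compute,
        "bandwidth": bandwidth,
        "total": storage + compute + bandwidth,
    }


# BU1 from the illustrative bill: 15 PB used, 7.1 M slot-hours, 1.25 PB moved
bu1 = monthly_bill(raw_used_pb=15, slot_hours=7_100_000, transferred_pb=1.25)
for item, dollars in bu1.items():
    print(f"{item:>9}: ${dollars / 1e6:.3f} M")
```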
Hadoop P&L (LTM)
Line Item Q4’12 Q1’13 Q2’13 Q3 ’13 Total Total %
Y! Gross Revenues
Cost of revenues (less Hadoop CapEx)
Gross Profit
Hadoop OpEx
R&D Headcount
SE&O Headcount
Acquisition/Install
Active Use/ Ops
Network Bandwidth
Total Hadoop OpEx
Hadoop CapEx
Hadoop Grid Services
Total Hadoop CapEx
Contribution Margin
Indirect Costs
G&A
Sales and Marketing
ILLUSTRATIVE
Hadoop P&L (LTM)
[Chart: Hadoop Costs as % of Gross Profit, by quarter Q4'12 to Q3'13; y-axis: Hadoop Costs as % of Gross Rev; data labels 1, 1.3, 1.2, 1.5]
[Chart: Hadoop Cost Breakdown, 0% to 100%, by quarter Q4'12 to Q3'13; series: Capex, R&D HC, Ops HC, Other Opex; grouped as OpEx and CapEx]
ILLUSTRATIVE
An Approach to Benchmarking Costs
ILLUSTRATIVE
Quantity equivalent
Monthly | On-Premise Used | On-Premise Unused | On-Premise Total | Public Cloud (Public Pricing or Terms-based) | Used On-Premise Eqv.
M/R | 71.4 M | 61.6 M | 133 M | Compute Instances (normalized time, RAM, 32/64 ops, I/O etc.) | 1,000 instances/ hr.
HDFS | 148 PB | 52 PB | 200 PB | Storage (account for 3x repl., job/ app space) | 30 PB/ month
Avg. Data Processed | - | - | 75 PB | Instance Storage | 2.5 PB daily
Cost equivalent
Monthly | On-Premise Used | On-Premise Unused | On-Premise Total | Public Cloud (Public Pricing or Terms-based) | Used On-Premise Eqv.
M/R | $0.50 M | $0.50 M | $1 M | 1,000 x $0.70/ instance/ hr. x 24 x 30 | $0.5 M
HDFS | $0.75 M | $0.25 M | $1 M | 30 PB x $0.04/GB/month | $1.2 M
Other | | | | Other Costs (if any) such as reads, writes, data services/ hour etc. | $0.25 M
Total * | $1.25 M | $0.75 M | $2 M | Total | $1.95 M
* Ignored bandwidth, assumed equivalent
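The public cloud side of the benchmark can be recomputed with a short sketch. The quantities and prices are the illustrative ones from the table; the function name is hypothetical and 1 PB is assumed to be 1,000,000 GB.

```python
# Minimal sketch of the public cloud cost-equivalent calculation.
# Inputs are the slide's illustrative figures; names are assumptions.
HOURS_PER_MONTH = 24 * 30
GB_PER_PB = 1_000_000


def public_cloud_monthly_cost(instances, price_per_instance_hour,
                              storage_pb, price_per_gb_month, other_costs):
    compute = instances * price_per_instance_hour * HOURS_PER_MONTH
    storage = storage_pb * GB_PER_PB * price_per_gb_month
    return {
        "compute": compute,
        "storage": storage,
        "other": other_costs,
        "total": compute + storage + other_costs,
    }


cloud = public_cloud_monthly_cost(
    instances=1_000,
    price_per_instance_hour=0.70,
    storage_pb=30,
    price_per_gb_month=0.04,
    other_costs=250_000,
)
on_premise_total = 2_000_000  # monthly TCO less bandwidth, used + unused

print(f"Public cloud equivalent: ${cloud['total'] / 1e6:.2f} M")
print(f"On-premise total:        ${on_premise_total / 1e6:.2f} M")
```

Because the on-premise total includes unused capacity while the cloud figure covers only the used equivalent, the comparison shifts as utilization changes.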
Utilization Matters
[Chart: Total Cost ($) against Utilization / Consumption (Compute and Storage); curves: On-premise Hadoop as a Service, On-demand public cloud service, Terms-based public cloud service; crossover points marked x separate the region that favors public cloud service from the region that favors on-premise Hadoop as a Service; annotations: High starting cost, Scaling up]
Sensitivity analysis on costs based on current and expected utilization or target utilization can provide further insights into your operations and cost competitiveness
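A simple way to start such a sensitivity analysis is a break-even calculation. The sketch below assumes a deliberately simplified model that is not taken from the slide: on-premise cost is fixed regardless of utilization, and on-demand public cloud cost grows linearly with utilization. The dollar figures and function names are hypothetical.

```python
# Minimal break-even sketch under a simplified, assumed cost model:
# fixed on-premise cost versus linearly scaling on-demand cloud cost.
def break_even_utilization(on_prem_fixed_cost, cloud_cost_at_full_utilization):
    """Utilization (0..1) at which both options cost the same."""
    return on_prem_fixed_cost / cloud_cost_at_full_utilization


def cheaper_option(utilization, on_prem_fixed_cost,
                   cloud_cost_at_full_utilization):
    cloud_cost = utilization * cloud_cost_at_full_utilization
    return "on-premise" if on_prem_fixed_cost < cloud_cost else "public cloud"


# Hypothetical monthly figures in $ M
ON_PREM = 2.0
CLOUD_AT_100_PERCENT = 3.0

threshold = break_even_utilization(ON_PREM, CLOUD_AT_100_PERCENT)
print(f"Break-even utilization: {threshold:.0%}")
for u in (0.5, 0.9):
    print(u, cheaper_option(u, ON_PREM, CLOUD_AT_100_PERCENT))
```

Below the break-even utilization the variable-cost option is cheaper; above it the fixed-cost platform wins, which is the pattern the chart describes.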
Focus on ROI
[Chart: Cost Amortized over Apps ($) against Time; phases: Phase I, 2012 – 2013 (H 0.23), 2014 & Future; Cost (t) = C at Time = t and Cost (t') = C' at Time = t']
§  The number of apps continues to grow on the platform
§  At time t, BU profits are R(t) – C(t) = π(t)
§  Platform's goal is to continue to increase the ROI while supporting new technology and services
§  R(t') – C(t') = π(t'), where C(t') < C(t) and π(t') > π(t) for same or bigger revenues
Cluster Organization
Requirements | Cluster (Cost) Implications
Project and Software Rollout Phases: Development, test/ staging, and go-live | Sandbox, Research, and Production Clusters
Service Level Agreements: Processing time, failures, late data arrival | Dedicated Resources, BCP, Catch-up Capacity
Technology Needs: MapReduce, HDFS, HBase, Storm, Spark | Batch, Performance, Low Latency Clusters
Multi-tenancy: Multiple customers with same tech needs | Multi-tenant Clusters
Improving ROI with Service Innovation
[Stack diagram: Growing with YARN]
§  HDFS (File System)
§  YARN (Resource Manager)
§  MapReduce (Batch Processing)
§  Spark (Iterative Processing)
§  Storm (Stream Processing)
§  New Services on YARN: Video Processing, Graph Processing, Predictive Analytics, Search & Indexing, …
Legend: Available today; Coming soon on YARN
YARN Proof Points
[Charts: tasktracker totals | mrappmaster totals, with H 0.23 rollout markers (Oct 2012, Jan 2013); panels: Advertising Production Cluster, Audience Production, Historical Data Warehouse]
Going Forward
Area | Why Required | What is Getting Done
High Availability | Data processing pipelines "zero" downtime requirement | NN HA, Rolling Upgrades, RM HA, Oozie HA etc.
Scalability | Bigger clusters with most of the data at one place | NN vertical and horizontal scaling
RM/ YARN | Support for long-running jobs for low latency processing and queries | YARN support for long-running jobs, YARN API improvements, CPU as a resource, Fluid compute resources
Scheduling | Preempt low priority jobs to meet SLAs for high-priority jobs | CPU based scheduling, Pre-emption with job/ app priority, Gang scheduling
Thank You
Sumeet Singh, <sumeetsi@yahoo-inc.com>
@sumeetksingh

More Related Content

What's hot

A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkeldariof
 
Hadoop: Distributed data processing
Hadoop: Distributed data processingHadoop: Distributed data processing
Hadoop: Distributed data processingroyans
 
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons LearnedHadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons LearnedDataWorks Summit
 
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
Benchmarking data warehouse systems in the cloud: new requirements & new metricsBenchmarking data warehouse systems in the cloud: new requirements & new metrics
Benchmarking data warehouse systems in the cloud: new requirements & new metricsRim Moussa
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?cneudecker
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
HGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBaseHGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBaseDan Han
 
Building a geospatial processing pipeline using Hadoop and HBase and how Mons...
Building a geospatial processing pipeline using Hadoop and HBase and how Mons...Building a geospatial processing pipeline using Hadoop and HBase and how Mons...
Building a geospatial processing pipeline using Hadoop and HBase and how Mons...DataWorks Summit
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingCloudera, Inc.
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 
Parallel Sequence Generator
Parallel Sequence GeneratorParallel Sequence Generator
Parallel Sequence GeneratorRim Moussa
 
The next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engineThe next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engineG. Bruce Berriman
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methodspaperpublications3
 

What's hot (20)

A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce framework
 
Hadoop: Distributed data processing
Hadoop: Distributed data processingHadoop: Distributed data processing
Hadoop: Distributed data processing
 
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons LearnedHadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
 
Asd 2015
Asd 2015Asd 2015
Asd 2015
 
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
Benchmarking data warehouse systems in the cloud: new requirements & new metricsBenchmarking data warehouse systems in the cloud: new requirements & new metrics
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
HGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBaseHGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBase
 
Building a geospatial processing pipeline using Hadoop and HBase and how Mons...
Building a geospatial processing pipeline using Hadoop and HBase and how Mons...Building a geospatial processing pipeline using Hadoop and HBase and how Mons...
Building a geospatial processing pipeline using Hadoop and HBase and how Mons...
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Parallel Sequence Generator
Parallel Sequence GeneratorParallel Sequence Generator
Parallel Sequence Generator
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Hadoop
HadoopHadoop
Hadoop
 
ArcGIS and Multi-D: Tools & Roadmap
ArcGIS and Multi-D: Tools & RoadmapArcGIS and Multi-D: Tools & Roadmap
ArcGIS and Multi-D: Tools & Roadmap
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
The next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engineThe next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engine
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 

Similar to Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Business

Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Sumeet Singh
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successDataWorks Summit
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...DataWorks Summit
 
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
Show me the Money! Cost & Resource  Tracking for Hadoop and Storm Show me the Money! Cost & Resource  Tracking for Hadoop and Storm
Show me the Money! Cost & Resource Tracking for Hadoop and Storm DataWorks Summit/Hadoop Summit
 
Time Series Analytics Azure ADX
Time Series Analytics Azure ADXTime Series Analytics Azure ADX
Time Series Analytics Azure ADXRiccardo Zamana
 
New Technologies For The Sustainable Enterprise; keynote @Wharton
New Technologies For The Sustainable Enterprise; keynote @WhartonNew Technologies For The Sustainable Enterprise; keynote @Wharton
New Technologies For The Sustainable Enterprise; keynote @WhartonPaul Hofmann
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Sumeet Singh
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsAlluxio, Inc.
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Edwin Poot
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemGregg Barrett
 
Unified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIUnified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIAlluxio, Inc.
 
The True Cost of NoSQL DBaaS Options
The True Cost of NoSQL DBaaS OptionsThe True Cost of NoSQL DBaaS Options
The True Cost of NoSQL DBaaS OptionsScyllaDB
 
apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...
apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...
apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...apidays
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopDatabricks
 

Similar to Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Business (20)

Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
Benefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a ServiceBenefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a Service
 
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
 
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
Show me the Money! Cost & Resource  Tracking for Hadoop and Storm Show me the Money! Cost & Resource  Tracking for Hadoop and Storm
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
 
Time Series Analytics Azure ADX
Time Series Analytics Azure ADXTime Series Analytics Azure ADX
Time Series Analytics Azure ADX
 
New Technologies For The Sustainable Enterprise; keynote @Wharton
New Technologies For The Sustainable Enterprise; keynote @WhartonNew Technologies For The Sustainable Enterprise; keynote @Wharton
New Technologies For The Sustainable Enterprise; keynote @Wharton
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Unified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIUnified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AI
 
The True Cost of NoSQL DBaaS Options
The True Cost of NoSQL DBaaS OptionsThe True Cost of NoSQL DBaaS Options
The True Cost of NoSQL DBaaS Options
 
apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...
apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...
apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 

Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Business

  • 1. Running On-premise Hadoop as a Business. Sumeet Singh, Head of Products, Cloud Services and Hadoop, Yahoo! Inc. Strata Conference + Hadoop World 2013, NY
  • 2. Motivation
      - Infrastructure cost: a cost bucket with no direct revenue (private cloud), with increasing spend on newer configs, network, and services
      - Public cloud's popularity: public clouds are popular, and there is rising concern that public cloud is a viable and cost-effective alternative to our operations
      - Average utilization: utilization is, on average, considered low for good ROI; BUs lack good usage metrics and chargeback/showback provisions
      - Institute a P&L structure for the Hadoop platform
      - Set up metering and billing provisions
      - Provide transparency in costs and benchmarks
  • 3. Hadoop Evolution at Yahoo
      - [Chart: number of nodes (axis 0 to 45,000) and raw HDFS storage in PB (axis 0 to 500) by year, 2006 to 2013]
      - Chart annotations: Yahoo! commits to scaling Hadoop for production use; research workloads in Search and Advertising; production (modeling) with machine learning and WebMap; revenue systems with security, multi-tenancy, and SLAs; open sourced with Apache; Hortonworks spinoff for enterprise hardening; nextgen Hadoop (H 0.23 YARN); new services (multi-tenant HBase, Storm, etc.); increased user base with partitioned namespaces
  • 4. Outline
      1. Criteria and considerations for evaluating options
      2. Total Cost of Ownership (TCO) models for understanding true costs
      3. Deeper understanding of (resource) usage patterns
      4. P&L, metering and billing provisions
      5. Benchmark costs
      6. Improve utilization and ROI
  • 5. Deployment Models at Yahoo
      - On-Premise, private (dedicated) clusters: new technology introduced before platformization; relic, should not be there; data movement and regulation issues
      - On-Premise, hosted multi-tenant (private cloud) clusters: source of truth for all data; app delivery agility; operational efficiency and cost savings through economies of scale
      - Public Cloud, hosted compute clusters: acquisitions with pre-existing use of public cloud infrastructure; exploration and learning; benchmarks
  • 6. Criteria to Consider
      - Cost. On-Premise: fixed, does not vary with utilization; favors scale and 24x7 operation. Public Cloud: variable with usage; typically favors a run-and-done model.
      - Data. On-Premise: aggregated from disparate or distributed sources. Public Cloud: typically generated and stored in the cloud.
      - SLA. On-Premise: job queues, capacity scheduling, BCP, catch-up; controlled latency and throughput. Public Cloud: no guarantees (beyond uptime) without provisioning additional resources.
      - Tech Stack. On-Premise: control over deployed technology; requires platform team/vendor support. Public Cloud: little to no control over the tech stack; no need for platform R&D headcount.
      - Security. On-Premise: shared environment; control over data and movement, PII, ACLs, pluggable security. Public Cloud: data typically not shared among users in the cloud.
      - Multi-tenancy. On-Premise: matters; complex to develop and operate. Public Cloud: does not matter; clusters are dynamic and dedicated.
  • 7. Approach to Evaluate Options
      - [Figure: On-Premise versus Public Cloud across Cost, Data, SLA, Tech Stack, Security, and Multi-tenancy]
  • 8. Calculating Total Cost of Ownership (illustrative)
      - Monthly TCO: $2.1 M [Chart: component shares of 60%, 12%, 7%, 6%, 3%, 2%, and 10%]
      - TCO components:
          1. Cluster Hardware: data nodes, name nodes, job trackers, gateways, load proxies, monitoring, aggregator, and web servers
          2. R&D HC: headcount for platform software development, quality, and release engineering
          3. Active Use and Operations (recurring): recurring datacenter ops cost (power, space, labor support, and facility maintenance)
          4. Network Hardware: aggregated network component costs, including switches, wiring, terminal servers, power strips, etc.
          5. Acquisition/Install (one-time): labor, POs, transportation, space, support, upgrades, decommissions, shipping/receiving, etc.
          6. Operations Engineering: headcount for service engineering and data operations teams responsible for day-to-day ops and support
          7. Network Bandwidth: data transferred into and out of clusters for all colos, including cross-colo transfers
  • 9. Determining Unit Costs
      - Compute: slots or containers where apps can perform computation and access HDFS if needed. Unit: $ / Slot-Hour (H 1.0) or $ / GB-Hour (H 0.23/2.0). Total capacity: number of slots or GBs of memory available for an hour. Unit cost: monthly compute cost / available compute capacity.
      - Storage: HDFS (usable) space needed by an app with the default replication factor of three. Unit: $ / GB of data stored. Total capacity: usable storage space (less replication and overheads). Unit cost: monthly storage cost / available usable storage.
      - Bandwidth: network bandwidth needed to move data into/out of the clusters by the app. Unit: $ / GB for inter-region data transfers. Total capacity: inter-region (peak) link capacity. Unit cost: monthly BW cost / monthly GB in + out.
      - Namespace: files and directories used by the apps, tracked to understand/limit the load on the NN. Unit, total capacity, and unit cost: N/A.
  • 10. Working Through an Example (illustrative)
      - Compute
          Monthly cost: monthly TCO (less bandwidth) = $2 M; compute @ 50% = $1 M
          Monthly capacity: 185 K slots x 24 x 30 = 133 M Slot-Hours; 315 TB memory x 24 x 30 = 227 M GB-Hours
          Unit cost: $1 M / 133 M Slot-Hours = $0.007 / Slot-Hour / Month; $1 M / 227 M GB-Hours = $0.004 / GB-Hour / Month
      - Storage
          Monthly cost: monthly TCO (less bandwidth) = $2 M; storage @ 50% = $1 M
          Monthly capacity: raw HDFS = 200 PB; usable HDFS = [200 x 0.8 (20% overhead)] / 3 = 53 PB
          Unit cost: $1 M / 53 PB = $0.019 / GB / Month
      - Bandwidth
          Monthly cost: monthly charges = $0.1 M
          Monthly capacity: total data in + out = 5 PB
          Unit cost: $0.1 M / 5 PB = $0.02 / GB transferred
  • 11. Understanding Compute Units
      - Hadoop 1.0/0.20: each node in the cluster has its memory divided into a number of map and reduce slots. A map task runs in one or more map slots, and a reduce task runs in one or more reduce slots.
      - Hadoop 2.0/0.23: each node has its memory carved up into fixed-size partitions based on a configured minimum. Map and reduce tasks run in a YARN container (the memory needed / reserved for a task).
  • 12. Understanding Compute Units (Cont'd)
  • 13. Metering / Determining Usage (illustrative)
      - Compute, monthly job and task cost:
          Map Slot-Hours = #S(M1) x T(M1) + #S(M2) x T(M2) + …
          Reduce Slot-Hours = #S(R1) x T(R1) + #S(R2) x T(R2) + …
          Cost = (M + R) Slot-Hours x $0.007 / Slot-Hour / Month = $ for the job per month
          (The memory-based approach in H 0.23 is identical.)
      - Compute, monthly roll-ups: (M + R) Slot-Hours for all jobs can be summed up for the month for a user, an app, a BU, or the entire platform.
      - Storage, monthly cost:
          /project (app): directory quota in GB (peak monthly storage used)
          /user: directory quota in GB (peak monthly storage used)
          /data: each user is accountable for their portion of use, e.g. GB Read (U1) / [GB Read (U1) + GB Read (U2) + …]
      - Storage, monthly roll-ups: through the relationship among user, file ownership, app, and their BU.
      - Bandwidth, monthly cost: measured at the cluster level and divided among select apps and users of data based on average volume in/out.
      - Bandwidth, monthly roll-ups: through the relationship among user, app, and their BU.
  • 14. Hadoop Analytics Warehouse (Starling)
      - [Diagram: source clusters (Cluster 1, Cluster 2, Cluster 3, … Cluster N); warehouse clusters (Oozie, HCatalog, HDFS, Hive, Starling); Query Server; Dashboard; Customer Support Portal]
  • 15. One Warehouse, Many Uses
      - Information Management: gather all usage data (JobHistory, Task, Job-counter, etc.) from source clusters into a central warehouse; 1 TB of raw logs processed per day, 24 TB of processed data
      - Business Intelligence: store processed logs in HCatalog; run historical analysis (Hive, Pig, MapReduce) or usage graphs/reports with BI tools
      - Predictive Analytics: project the growth trend of datasets; plan capacity/headroom and CapEx across all business units in advance
      - Uses: data storage efficiency; metering and chargebacks; utilization improvements; product improvements; best practices and patterns; tuning for efficiency
  • 16. Billing, Chargebacks, or Showbacks (illustrative)
      - Hadoop Services Billing Rate Card [Monthly Rates]
          HDFS (Storage): unit GB; aggregated as monthly peak storage used; $0.019/GB
          Compute: unit Map-Reduce Slot-Hours; measured as the number of slots used by mappers and reducers and the hours they ran for; $0.007/Slot-Hour
          Network Bandwidth: unit GB; aggregated as monthly total in/out; $0.02/GB
      - Monthly Bill for Sep 2013
          BU1: HDFS used 15 PB (effective 4.0 PB), $0.07 M; compute 7.1 M Slot-Hours, $0.05 M; bandwidth 1.25 PB transferred, $0.02 M; total $0.14 M
          BU2: HDFS used 10 PB (effective 2.7 PB), $0.05 M; compute 3.5 M Slot-Hours, $0.02 M; bandwidth 0.50 PB transferred, $0.01 M; total $0.08 M
          … BU N: …
          Total: HDFS used 148 PB (effective 39.5 PB), $0.75 M; compute 71.4 M Slot-Hours, $0.5 M; bandwidth 5.0 PB transferred, $0.1 M; total $1.35 M
  • 17. Hadoop P&L (LTM) (illustrative)
      - Columns: Q4'12, Q1'13, Q2'13, Q3'13, Total, Total %
      - Line items:
          Y! Gross Revenues
          Cost of Revenues (less Hadoop CapEx)
          Gross Profit
          Hadoop OpEx: R&D Headcount; SE&O Headcount; Acquisition/Install; Active Use/Ops; Network Bandwidth; Total Hadoop OpEx
          Hadoop CapEx: Hadoop Grid Services; Total Hadoop CapEx
          Contribution Margin
          Indirect Costs: G&A; Sales and Marketing
  • 18. Hadoop P&L (LTM) (illustrative)
      - [Chart: "Hadoop Costs as % of Gross Profit" by quarter (y-axis labeled "Hadoop Costs as % of Gross Rev"), with data labels 1, 1.3, 1.2, and 1.5 for Q4'12, Q1'13, Q2'13, and Q3'13]
      - [Chart: "Hadoop Cost Breakdown" by quarter, 0% to 100%, split into CapEx, R&D HC, Ops HC, and Other OpEx]
  • 19. An Approach to Benchmarking Costs (illustrative)
      - On-Premise, monthly quantity (used / unused / total)
          M/R: 71.4 M / 61.6 M / 133 M
          HDFS: 148 PB / 52 PB / 200 PB
          Avg. data processed: - / - / 75 PB
      - On-Premise, monthly cost (used / unused / total)
          M/R: $0.50 M / $0.50 M / $1 M
          HDFS: $0.75 M / $0.25 M / $1 M
          Total*: $1.25 M / $0.75 M / $2 M
      - Public Cloud, quantity equivalent (public pricing or terms-based, for the used on-premise equivalent)
          Compute instances (normalized for time, RAM, 32/64 ops, I/O, etc.): 1,000 instances/hr.
          Storage (accounting for 3x replication and job/app space): 30 PB/month
          Instance storage: 2.5 PB daily
      - Public Cloud, cost equivalent
          Compute: 1,000 x $0.70/instance/hr. x 24 x 30 = $0.5 M
          Storage: 30 PB x $0.04/GB/month = $1.2 M
          Other costs (if any), such as reads, writes, data services/hour, etc.: $0.25 M
          Total: $1.95 M
      - * Bandwidth ignored, assumed equivalent
  • 20. Utilization Matters
      - [Chart: total cost ($) versus utilization/consumption (compute and storage) for three options: on-premise Hadoop as a Service, on-demand public cloud service, and terms-based public cloud service. Annotations: "High starting cost", "Scaling up", two points marked x, and regions labeled "Favors public cloud service" and "Favors on-premise Hadoop as a Service"]
      - Sensitivity analysis on costs, based on current and expected or target utilization, can provide further insights into your operations and cost competitiveness.
  • 21. Focus on ROI
      - [Chart: cost amortized over apps ($) versus time, across Phase I, 2012-2013 (H 0.23), and 2014 & future; marked points: Cost(t) = C at time t and Cost(t') = C' at time t']
      - The number of apps continues to grow on the platform.
      - At time t, BU profits are R(t) - C(t) = π(t).
      - The platform's goal is to continue to increase ROI while supporting new technology and services: R(t') - C(t') = π(t'), where C(t') < C(t) and π(t') > π(t) for the same or bigger revenues.
  • 22. Cluster Organization
      - Requirements and their cluster (cost) implications:
          Project and software rollout phases (development, test/staging, and go-live): sandbox, research, and production clusters
          Service level agreements (processing time, failures, late data arrival): dedicated resources, BCP, catch-up capacity
          Technology needs (MapReduce, HDFS, HBase, Storm, Spark): batch, performance, low-latency clusters
          Multi-tenancy (multiple customers with the same tech needs): multi-tenant clusters
  • 23. Improving ROI with Service Innovation
      - [Diagram: HDFS (file system) and YARN (resource manager) at the base; services shown: MapReduce (batch processing), Spark (iterative processing), Storm (stream processing), video processing, graph processing, predictive analytics, search & indexing, …; legend and captions: "Available today", "Coming soon on YARN", "New Services on YARN", "Growing with YARN"]
  • 24. YARN Proof Points
      - [Charts: tasktracker totals | mrappmaster totals for the Advertising Production Cluster, Audience Production, and Historical Data Warehouse clusters, with markers for the H 0.23 rollouts (Oct 2012 and Jan 2013)]
  • 25. Going Forward
      - High Availability. Why required: data processing pipelines have a "zero" downtime requirement. What is getting done: NN HA, rolling upgrades, RM HA, Oozie HA, etc.
      - Scalability. Why required: bigger clusters with most of the data in one place. What is getting done: NN vertical and horizontal scaling.
      - RM/YARN. Why required: support for long-running jobs for low-latency processing and queries. What is getting done: YARN support for long-running jobs, YARN API improvements, CPU as a resource, fluid compute resources.
      - Scheduling. Why required: preempt low-priority jobs to meet SLAs for high-priority jobs. What is getting done: CPU-based scheduling, pre-emption with job/app priority, gang scheduling.
  • 26. Thank You. Sumeet Singh, <sumeetsi@yahoo-inc.com>, @sumeetksingh
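
The unit-cost arithmetic on slide 10 can be reproduced with a short Python sketch. The function names, the decimal unit conversions (1 TB = 1,000 GB, 1 PB = 1,000,000 GB), and the default overhead and replication arguments are assumptions made for illustration; the input figures are the illustrative ones from the slide. The exact compute quotient comes out near $0.0075 per slot-hour, which the deck states as $0.007.

```python
"""Unit-cost arithmetic using the illustrative figures from slide 10."""

HOURS_PER_MONTH = 24 * 30  # slide 10 uses a 30-day month
GB_PER_TB = 1_000          # decimal units (assumption)
GB_PER_PB = 1_000_000      # decimal units (assumption)


def compute_unit_costs(monthly_cost_usd, slots, memory_tb):
    """Return ($ per slot-hour, $ per GB-hour) for one month of capacity."""
    slot_hours = slots * HOURS_PER_MONTH
    gb_hours = memory_tb * GB_PER_TB * HOURS_PER_MONTH
    return monthly_cost_usd / slot_hours, monthly_cost_usd / gb_hours


def usable_storage_pb(raw_pb, overhead=0.20, replication=3):
    """Usable HDFS space: raw space less overhead, divided by replication."""
    return raw_pb * (1 - overhead) / replication


def storage_unit_cost(monthly_cost_usd, raw_pb, overhead=0.20, replication=3):
    """Return $ per usable GB stored per month."""
    usable_gb = usable_storage_pb(raw_pb, overhead, replication) * GB_PER_PB
    return monthly_cost_usd / usable_gb


def bandwidth_unit_cost(monthly_cost_usd, transferred_pb):
    """Return $ per GB transferred in or out."""
    return monthly_cost_usd / (transferred_pb * GB_PER_PB)


if __name__ == "__main__":
    per_slot_hour, per_gb_hour = compute_unit_costs(1_000_000, 185_000, 315)
    print(f"compute:   ${per_slot_hour:.4f}/slot-hour, ${per_gb_hour:.4f}/GB-hour")
    print(f"storage:   ${storage_unit_cost(1_000_000, 200):.4f}/GB/month")
    print(f"bandwidth: ${bandwidth_unit_cost(100_000, 5):.4f}/GB")
```

Keeping overhead and replication as parameters makes it easy to see how sensitive the storage rate is to those two assumptions.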
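
The metering formulas on slide 13 and the rate card on slide 16 can be combined into a minimal chargeback sketch. This is not Yahoo's implementation: the function names, the `(slots, hours)` task representation, and the sample job are hypothetical, and only the three rates and the BU1 quantities come from the slides. Because the slide appears to show coarsely rounded line items, the computed BU1 total (about $0.15 M) differs slightly from the $0.14 M printed on the slide.

```python
from collections import defaultdict

# Monthly rates from the slide 16 rate card (illustrative in the deck).
RATE_PER_SLOT_HOUR = 0.007
RATE_PER_GB_STORED = 0.019
RATE_PER_GB_TRANSFERRED = 0.02
GB_PER_PB = 1_000_000  # decimal units (assumption)


def slot_hours(tasks):
    """Sum slots x hours over (slots, hours) pairs, one pair per task."""
    return sum(slots * hours for slots, hours in tasks)


def job_cost(map_tasks, reduce_tasks, rate=RATE_PER_SLOT_HOUR):
    """Cost of one job: (map + reduce) slot-hours times the slot-hour rate."""
    return (slot_hours(map_tasks) + slot_hours(reduce_tasks)) * rate


def roll_up(jobs):
    """Total job cost per owner; jobs are (owner, map_tasks, reduce_tasks)."""
    totals = defaultdict(float)
    for owner, map_tasks, reduce_tasks in jobs:
        totals[owner] += job_cost(map_tasks, reduce_tasks)
    return dict(totals)


def read_share(gb_read_by_user, user):
    """A user's portion of shared data: GB read by user / GB read by all."""
    return gb_read_by_user[user] / sum(gb_read_by_user.values())


def monthly_bill(effective_storage_pb, slot_hours_used, transferred_pb):
    """Line items in dollars for one business unit."""
    bill = {
        "storage": effective_storage_pb * GB_PER_PB * RATE_PER_GB_STORED,
        "compute": slot_hours_used * RATE_PER_SLOT_HOUR,
        "bandwidth": transferred_pb * GB_PER_PB * RATE_PER_GB_TRANSFERRED,
    }
    bill["total"] = sum(bill.values())
    return bill


if __name__ == "__main__":
    # Hypothetical job: two map tasks and one reduce task.
    print(job_cost([(1, 0.5), (2, 0.25)], [(1, 1.0)]))
    # BU1 row from slide 16: 4.0 PB effective, 7.1 M slot-hours, 1.25 PB moved.
    print(monthly_bill(4.0, 7_100_000, 1.25))
```

The same `roll_up` function works at any level of the hierarchy (user, app, or BU) as long as the owner key is chosen accordingly.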
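
The benchmark comparison on slide 19 reduces to two sums, sketched below in Python. The function names and the decimal PB-to-GB conversion are assumptions; the quantities and prices are the illustrative ones from slides 16 and 19, with the on-premise storage figure taken as the 39.5 PB effective usage billed on slide 16. As on the slide, bandwidth is left out of both sides.

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month, as used throughout the deck
GB_PER_PB = 1_000_000      # decimal units (assumption)


def on_premise_used_cost(slot_hours_used, rate_per_slot_hour,
                         effective_storage_pb, rate_per_gb):
    """Monthly cost of the capacity actually used on-premise, in dollars."""
    compute = slot_hours_used * rate_per_slot_hour
    storage = effective_storage_pb * GB_PER_PB * rate_per_gb
    return compute + storage


def public_cloud_cost(instances, price_per_instance_hour,
                      storage_pb, price_per_gb_month, other_costs=0.0):
    """Monthly cost of an equivalent public cloud footprint, in dollars."""
    compute = instances * price_per_instance_hour * HOURS_PER_MONTH
    storage = storage_pb * GB_PER_PB * price_per_gb_month
    return compute + storage + other_costs


if __name__ == "__main__":
    on_prem = on_premise_used_cost(71_400_000, 0.007, 39.5, 0.019)
    cloud = public_cloud_cost(1_000, 0.70, 30, 0.04, other_costs=250_000)
    print(f"on-premise (used): ${on_prem / 1e6:.2f} M")
    print(f"public cloud:      ${cloud / 1e6:.2f} M")
```

Rerunning the comparison with different utilization or price inputs gives a simple form of the sensitivity analysis suggested on slide 20.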