Operating multi-tenant clusters requires careful capacity planning for the on-time launch of big data projects and applications, within the expected budget and with appropriate SLA guarantees. Making such guarantees with a set of standard hardware configurations is key to operating big data platforms as a hosted service for your organization.
This talk highlights the tools, techniques, and methodology applied on a per-project or per-user basis across three primary multi-tenant deployments in the Apache Hadoop ecosystem, namely MapReduce/YARN and HDFS, HBase, and Storm, chosen for the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes, respectively. We will demo the estimation tools developed for these deployments that can be used for capital planning and forecasting, and for cluster resource and SLA management, including making latency and throughput guarantees to individual users and projects.
As we discuss the tools, we will share the considerations incorporated to arrive at the most appropriate calculations for these three primary deployments. We will discuss the data sources for the calculations, the resource drivers for different use cases, and how to plan for optimum capacity allocation per project with respect to given standard hardware configurations.
Hadoop Summit San Jose 2014: Costing Your Big Data Operations (Sumeet Singh)
As organizations begin to make use of large data sets, approaches to understanding and managing the true costs of big data become an important facet of operations with increasing scale.
Whether an on-premises or cloud-based platform is used for storing, processing, and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run big data operations with their own P&L, full transparency in costs, and metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm, chosen for the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes, respectively.
As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the cost.
Scale 12x: Efficient Multi-tenant Hadoop 2 Workloads with YARN (David Kaiser)
Hadoop is about so much more than batch processing. With the recent release of Hadoop 2, there have been significant changes to how a Hadoop cluster uses resources. YARN, the new resource management component, allows for a more efficient mix of workloads across hardware resources, and enables new applications and new processing paradigms such as stream-processing. This talk will discuss the new design and components of Hadoop 2, and examples of Modern Data Architectures that leverage Hadoop for maximum business efficiency.
YARN - Hadoop Next Generation Compute Platform (Bikas Saha)
The presentation emphasizes the new mental model of YARN as the cluster OS, where one can write and run different applications in Hadoop in a cooperative multi-tenant cluster.
Capacity Management and BigData/Hadoop - Hitchhiker's Guide for the Capacity ... (Renato Bonomini)
Hadoop is a zoo of different types of workloads; even if most companies are simply using Hadoop to store information (HDFS), there are many other applications, to name a few: HDFS, Hive, Pig, Impala, Spark, Solr, Flume.
Each animal in this zoo behaves differently; for example, there are significant differences between the two most common workloads, MapReduce and HBase.
This leads to mainly three points of view for analysis to make sure service levels are achieved:
- Interest in response time for interactive workloads: CPU, memory, network, and I/O utilization levels needed to respond to queries quickly and effectively
- Interest in high throughput for batch workloads: maximize utilization levels; response time is not a concern
- Interest in planning storage capacity (filesystem and HDFS)
This talk focuses on providing guidelines for the capacity planner on how to translate existing techniques and frameworks and adapt them to these new technologies: in most cases, "what's old is new again".
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters (Sumeet Singh)
In this talk, we look at the YARN scheduler choices available today for Apache Hadoop 2 and discuss their pros and cons. We dive deeper into the Capacity Scheduler, providing a comprehensive overview of its various settings with examples from real large-scale Hadoop clusters, to promote a broader understanding of the schedulers' current state and the best practices in place today for queue nomenclature, planning, allocations, and ongoing management. We present detailed cluster, queue, and job behaviors from several different capacity management philosophies.
We then propose practical solutions, without any change to the scheduler or core Hadoop, that allow managing queue creation and capacity allocations while optimizing for cluster utilization and maintaining SLA guarantees. A unified queue nomenclature and admission and capacity re-allocation policies across BUs, applications, and clusters make service automation possible. Transparency in resources consumed allows for defining realistic SLA expectations. Finally, consistent application tagging completes the feedback loop, with SLAs observed through application-level reporting.
PRACE Autumn school 2021 - Big Data with Hadoop and Keras
27-30 September 2021
Fakulteta za strojništvo
Europe/Ljubljana
Data and scripts are available at: https://www.events.prace-ri.eu/event/1226/timetable/
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp... (Sumeet Singh)
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone in to make the platform indispensable for nearly 1,000 active developers, including the challenges that come from scale, security, and multi-tenancy. We will cover the current technology stack that we have built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base.
This is the basis for some talks I've given at the Microsoft Technology Center, the Chicago Mercantile Exchange, and local user groups over the past two years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
Hive - Apache Hadoop Big Data Training by Design Pathshala (Desing Pathshala)
Learn Hadoop and big data analytics: join Design Pathshala training programs on big data and analytics.
This slide deck covers advanced knowledge of Apache Hive.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
Scaling Spark Workloads on YARN - Boulder/Denver July 2015 (Mac Moore)
Hortonworks presentation at the Boulder/Denver Big Data Meetup on July 22nd, 2015. Topic: Scaling Spark Workloads on YARN; Spark as a workload in a multi-tenant Hadoop infrastructure, scaling, cloud deployment, and tuning.
Structuring Spark: DataFrames, Datasets, and Streaming (Databricks)
As Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk I give an overview of some of the exciting new APIs available in Spark 2.0, namely Datasets and Streaming DataFrames/Datasets. Datasets provide an evolution of the RDD API by allowing users to express computation as type-safe lambda functions on domain objects, while still leveraging the powerful optimizations supplied by the Catalyst optimizer and Tungsten execution engine. I will describe the high-level concepts as well as dive into the details of the internal code generation that enables us to provide good performance automatically. Streaming DataFrames/Datasets let developers seamlessly turn their existing structured pipelines into real-time incremental processing engines. I will demonstrate this new API's capabilities and discuss future directions, including easy sessionization and event-time-based windowing.
Building Robust, Adaptive Streaming Apps with Spark Streaming (Databricks)
As the adoption of Spark Streaming increases rapidly, the community has been asking for greater robustness and scalability from Spark Streaming applications in a wider range of operating environments. To fulfill these demands, we have steadily added a number of features in Spark Streaming. We have added backpressure mechanisms that allow Spark Streaming to dynamically adapt to changes in incoming data rates and maintain the stability of the application. In addition, we are extending Spark's Dynamic Allocation to Spark Streaming, so that streaming applications can elastically scale based on processing requirements. In my talk, I am going to explore these mechanisms and explain how developers can write robust, scalable, and adaptive streaming applications using them. Presented by Tathagata "TD" Das from Databricks.
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17 (spark-project)
Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
Beyond SQL: Speeding up Spark with DataFrames (Databricks)
In this talk I describe how you can use Spark SQL DataFrames to speed up Spark programs, even without writing any SQL. By writing programs using the new DataFrame API you can write less code, read less data and let the optimizer do the hard work.
Deep Dive: Memory Management in Apache Spark (Databricks)
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
The story of how to figure out what to measure and how you can benchmark it. This slide deck presents the idea of benchmarking; it does not cover actual commercial or open-source benchmark tools.
From common errors seen in running Spark applications (e.g., OutOfMemory, NoClassFound, disk I/O bottlenecks, History Server crashes, cluster under-utilization) to the advanced settings used to resolve large-scale Spark SQL workloads (such as HDFS block size vs. Parquet block size and how best to run the HDFS Balancer to redistribute file blocks), you will get the full scoop in this information-packed presentation.
Impetus provides expert consulting services around Hadoop implementations, including R&D, assessment, deployment (on private and public clouds), and optimizations for enhanced static shared-data implementations.
This presentation covers advanced Hadoop tuning and optimization.
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop (Hazelcast)
This webinar identifies several shortcomings of Apache Hadoop and presents an alternative approach for building simple and flexible Big Data software stacks quickly, based on next-generation computing paradigms such as in-memory data/compute grids. The focus is on software architectures, but several code examples using Hazelcast will be provided to illustrate the concepts discussed.
We’ll cover these topics:
-Briefly explain why Hadoop is not a universal, or inexpensive, Big Data solution – despite the hype
-Lay out technical requirements for a flexible Big/Fast Data processing stack
-Present solutions thought to be alternatives to Hadoop
-Argue why In-Memory Data/Compute Grids are so attractive in creating future-proof Big/Fast Data applications
-Discuss how well Hazelcast meets the Big/Fast Data requirements vs Hadoop
-Present several code examples using Java and Hazelcast to illustrate concepts discussed
-Live Q&A Session
Presenter:
Jacek Kruszelnicki, President of Numatica Corporation
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10... (Sumeet Singh)
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. Hadoop's scalability, efficiency, built-in reliability, and cost effectiveness have made it an enterprise-wide platform that web-scale cloud operations run on. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone in to make the platform indispensable for nearly 1,000 active developers on a daily basis, including the challenges that come from scale, security, and multi-tenancy that we have dealt with in the last several years of operating one of the largest Hadoop footprints in the world. We will cover the current technology stack that Yahoo has built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base at Yahoo. Throughout the talk, we will highlight relevant use cases from Yahoo's Mobile, Search, Advertising, Personalization, Media, and Communications businesses that may make these considerations more pertinent to your situation.
Keynote Hadoop Summit San Jose 2017: Shaping Data Platform To Create Lasting... (Sumeet Singh)
With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform so special at Yahoo.
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review (Sumeet Singh)
Over the past year, a lot of progress has been made in advancing the Apache Hadoop platform at Yahoo. We underwent a massive infrastructure consolidation to lower the platform TCO. CaffeOnSpark was open-sourced for distributed deep learning on existing infrastructure with a combination of CPU and GPU-based computing. Traditional compute on MapReduce continues to shift to Apache Tez and Apache Spark for lower processing time. Our internal security, multi-tenancy, and scale changes to Apache Storm got pushed to the community in Storm 0.10. Omid was open-sourced for managing transactions reliably on Apache HBase. Multi-tenancy with region groups, splittable META, ZooKeeper-less assignment manager, favored nodes with HDFS block placement, and support for humongous tables have taken Apache HBase scale to new heights. Dependency management in Apache Oozie for combinatorial, conditional, and optional processing gives increased flexibility to our data pipelines teams in maintaining SLAs. Focus on ease of use and onboarding improvements have brought in a whole new class of use cases and users to the platform. In this talk, we will provide a comprehensive overview of the platform technology stack, recent developments, metrics, and share thoughts on where things are headed when it comes to big data at Yahoo.
With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In this talk, Sumeet Singh will present some of the recent innovations, open source contributions, and where things are headed when it comes to Hadoop at Yahoo.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable... (Sumeet Singh)
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.
Sumeet and Mridul explain scaling patterns backed by real scenarios and data to help attendees develop their own architectures and strategies for dealing with the scale challenges that come with real-time big data systems. They also explore the tradeoffs made in catering to a diverse set of daily users and the associated usability challenges that motivated Yahoo to build a self-serve, easy-to-use platform that requires minimal programming experience. Sumeet and Mridul then discuss event-level tracking for debugging and troubleshooting problems that our users may encounter at this scale. Over the course of their talk, they also address building infrastructure and operational intelligence with anomaly detection, alert correlation, and trend analysis based on the monitoring platform.
HUG Meetup 2013: HCatalog / Hive Data Out (Sumeet Singh)
The Yahoo! Hadoop grid makes use of a managed service to pull data into the clusters. However, when it comes to getting data out of the clusters, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts out file locations and the underlying storage format of data for users, along with several other advantages such as sharing of data among MapReduce, Pig, and Hive. In this talk, we will focus on how the ODBC/JDBC interface of HiveServer2 accomplishes the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about files, partitions, and their locations. We will also demo the data-out capabilities and go through other nice properties of the data-out feature.
Presenter(s):
Sumeet Singh, Senior Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!
Hadoop Summit San Jose 2014: Data Discovery on Hadoop (Sumeet Singh)
In the last eight years, the Hadoop grid infrastructure has allowed us to move towards a unified source of truth for all data at Yahoo that now accounts for over 450 petabytes of raw HDFS and 1.1 billion data files. Managing data location, schema knowledge and evolution, fine-grained business rules based access control, and audit and compliance needs have become critical with the increasing scale of operations.
In this talk, we will share our approach in tackling the above challenges with Apache HCatalog, a table and storage management layer for Hadoop. We will explain how to register existing HDFS files into HCatalog, provide broader but controlled access to data through a data discovery tool, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data into HCatalog. In addition, the approach allows ever improving Hive performance to open up easy adhoc access to analyze and visualize data through SQL on Hadoop and popular BI tools.
As we discuss our approach, we will also highlight how it minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage, and integrity.
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop (Sumeet Singh)
Hadoop has allowed us to move towards a unified source of truth for all of the organization's data. Managing data location, schema knowledge and evolution, fine-grained business-rules-based access control, and audit and compliance needs will become critical with increasing scale of operations.
In this talk, we will share an approach in tackling the above challenges. We will explain how to register existing HDFS files, provide broader but controlled access to data through a data discovery tool with schema browse and search functionality, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data. In addition, the approach allows us to open up easy adhoc access to analyze and visualize data through SQL on Hadoop and popular BI tools. As we discuss our approach, we will also highlight how our approach minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage and integrity.
URL: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38768
Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo... (Sumeet Singh)
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo's Hadoop clusters. A key component that enables this efficient operation is data compression.
With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpuses of data are presented.
The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on "Big Data" who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! (Sumeet Singh)
The Hadoop project is an integral part of Yahoo!'s cloud infrastructure and is at the heart of many of Yahoo!'s important business processes. Sumeet Singh, the Head of Products for Cloud Services and Hadoop at Yahoo!, explains how Yahoo! leverages Hadoop and cloud platforms to process and serve Internet-scale data.
Yahoo! operates one of the world's largest private cloud infrastructures. Learn how technologies scale out for building enterprise-wide trusted platforms with tight SLAs.
URL: http://www.saptechnologyservice.com/track1.html
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi... (Sumeet Singh)
Cloud-based architectures of Hadoop have made it attractive for public cloud service providers to offer hosted Hadoop services and charge customers on a pay-for-what-you-use basis. For enterprises that have already adopted Hadoop, the data infrastructure has long been seen as a cost element in their budgets. As a result, enterprises thinking of adopting Hadoop are increasingly debating between on-premise and cloud-based models for their data processing needs.
We lay out a set of criteria and methodical approaches to help enterprises that have not yet adopted Hadoop evaluate their options, and discuss the pros and cons of both models. For enterprises that have already made significant investments or have plans to build a Hadoop-based infrastructure, we present an approach to manage Hadoop as a Service with a P&L, transparency in costs, and metering & billing provisions.
As we discuss these approaches, we will share insights gathered from the exercise conducted on one of the largest Hadoop footprints in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs for usage, measure the resource usage for services, optimize for higher utilization, and benchmark costs.
URL: http://strataconf.com/stratany2013/public/schedule/detail/30824
HBaseCon 2013: Multi-tenant Apache HBase at Yahoo! (Sumeet Singh)
Yahoo! has been using HBase for a long time in isolated instances, most notably for the personalization platform powering its homepage experiences. The introduction of multi-tenancy has lowered the barriers for all Hadoop users to use HBase. We will cover traditional use cases for HBase at Yahoo!, and new use cases as a result in content management, advertising, log processing, analytics and reporting, recommendation graphs, and dimension data stores.
We will then talk about the deployment strategy and the enhancements made to facilitate multi-tenancy: Region Server groups provide a coarse level of isolation among tenants by designating a subset of region servers to serve designated tables, and Namespaces provide logical grouping of resources (region servers, tables) and privileges (quotas, ACLs).
We'll also share our experiences in operating HBase with security enabled and contributions made in this area, and results from performance runs conducted to validate customer expectations in a multi-tenant environment.
URL: http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/hbasecon-2013--multi-tenant-apache-hbase-at-yahoo-video.html
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deployments
1. Capacity Planning in Multi-tenant Hadoop, HBase and Storm Deployments
PRESENTED BY Amrit Lal and Sumeet Singh | April 02, 2014
2014 Hadoop Summit, Amsterdam, Netherlands
2. Introduction
Sumeet Singh, Senior Director, Product Management, Hadoop and Big Data Platforms, Cloud Engineering Group
§ Manages the Hadoop products team at Yahoo!
§ Responsible for Product Management, Strategy and Customer Engagements
§ Managed the Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
§ M.B.A. from UCLA and M.S. from Rensselaer (RPI)
701 First Avenue, Sunnyvale, CA 94089 USA | @sumeetksingh
Amrit Lal, Product Manager, Hadoop and Big Data Platforms, Cloud Engineering Group
§ Product Manager at Yahoo engaged in building high-class and robust Hadoop infrastructure services
§ Eight years of experience across HSBC, Oracle and Google in developing products and platforms for high-growth enterprises
§ M.B.A. from Carnegie Mellon University
701 First Avenue, Sunnyvale, CA 94089 USA | @amritasshwar
3. Agenda
1. The Need for Capacity Planning
2. Big Data Platform Deployment Models
3. Resource Drivers and Data Sources
4. Capacity Models and Tools
5. SLA Management
4. Multi-tenant Apache Hadoop Platform Evolution
[Chart: number of servers (DataNodes) and raw HDFS storage (in PB) by year, 2006-2014]
Milestones: Yahoo! commits to scaling Hadoop for production use; research workloads in Search and Advertising; production (modeling) with machine learning & WebMap; revenue systems with security, multi-tenancy, and SLAs; open sourced with Apache; Hortonworks spinoff for enterprise hardening; next-gen Hadoop (H 0.23 YARN); new services (HBase, Storm, Hive, etc.); increased user base with partitioned namespaces; Apache H2.x (low latency, utilization, HA, etc.)
6. Multi-tenant Apache HBase Growth at Yahoo
Zero to "20" use cases (60,000 regions) in a year
[Chart: number of servers (RegionServers) and data stored (in PB), Q1-13 through Q1-14, reaching 1,140 RegionServers and 33.6 PB]
7. Multi-tenant Apache Storm Growth at Yahoo
Zero to "175" production topologies in a year
[Chart: number of servers (Supervisors) and number of topologies, Q1-13 through Q1-14, reaching 760 Supervisors and 175 topologies; multi-tenancy release marked]
8. Where Does Capacity Planning Fit
Project lifecycle support stages: technology choice, architecture validation, capacity planning, phased environment, production on-boarding
9. Big Data Platform Technology Stack at Yahoo
Layers: Compute Services, Storage Infrastructure, Services
Components: Pig, Hive, Oozie, HCatalog, GDM, HDFS Proxy, YARN, MapReduce, Tez, Spark, Storm, HDFS, HBase, ZooKeeper, Messaging Service, Monitoring, Starling, Support Shop
(The slide highlights the components relevant for capacity planning.)
10. Deployment Model
- Hadoop clusters: NameNode and RM masters; DataNode + NodeManager worker nodes
- HBase clusters: NameNode and HBase Master; DataNode + RegionServer worker nodes
- Storm clusters: Nimbus master; Supervisor worker nodes
- Shared infrastructure: administration, management and monitoring; ZooKeeper pools; HTTP/HDFS/GDM load proxies; Oozie server; HS2/HCat; applications and data (data feeds, data stores)
(The slide highlights the components relevant for capacity planning.)
11. Capacity Drivers That Matter
Drivers and measures:
- Data (Storage): volume of data to be stored and processed
- Memory: container for direct and faster access to stored data
- CPU: cores (and threads) available for processing
- Throughput: number of transactions per second
- Latency: time taken to complete a request or operation (includes processing, disk, and network I/O time)
12. Apache Hadoop Resources (in the order of importance for Hadoop)
- Data (Storage): data stored in HDFS (disk). Measures: frequency, size, retention, # files; replication factor
- Memory: Map and Reduce containers (in H 0.23/2.0). Measures: map memory; reduce memory
- CPU: YARN-2 for Capacity Scheduler; Yahoo is not using it yet. Measure: N/A
- Throughput: data processed per second with concurrent Mappers and Reducers. Measures: total data processed; Maps and Reduces to run (simple or complex DAGs)
- Latency: time taken for the jobs to complete. Measures: individual job run times; time to finish all jobs (when run in parallel), peak usage
13. Working Through a Use Case (ILLUSTRATIVE)
Pig Mail needs to process 30 TB of data every day in about 6 hours so that it can develop algorithms that detect spam more effectively. A Pig script will parse the data in sequential phases to finally materialize the features of the mail that decide whether the mail is SPAM.
[Pig DAG: Stage 1 (node 1) feeds Stage 2 (nodes 2-L and 2-R), which feed Stage 3 (node 3)]
14. Data (Storage)
Step 1: Pig Mail Project Info (User Input)
- Data upload frequency: once daily
- Data added per upload: 1 TB/day
- Data retention (input): 30 days
- Data output: 50 GB
- Data retention (output): 1 day
- Anticipated growth in data volume (3-6 months): 20%
Step 2: # Servers Based on Storage (default values in hdfs-site.xml)
- HDFS replication factor: dfs.replication, default <3>
- HDFS required: (30 + 0.05) x 1.2 x 3 = 108 TB
- Suggested server config (based on total cost): C-xxx/48/4000 (four 4 TB disks)
- Storage available per server: 12 TB out of 16 TB (rest for OS, temp, swap, etc.); dfs.datanode.du.reserved <107374182400> 1 TB
- Servers required: 108 / 12 = 9 servers
Step 3: Namespace Needed (default values in hdfs-site.xml)
- HDFS block size: dfs.blocksize, default <134217728> (128 MB)
- Average file size: 1.5 x 128 MB = 200 MB (assumed)
- Namespace for files: 108 TB / 200 MB = 540,000
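To make the arithmetic repeatable, here is a minimal sketch (not from the deck) of the Step 2 and Step 3 calculations in Python; the growth factor, replication, usable storage per node, and average file size are the illustrative values from the Pig Mail example above, not recommendations.
```python
# Storage-based sizing per the worked example; all constants are the
# slide's illustrative figures.

def hdfs_raw_tb(input_tb, output_tb, growth=0.20, replication=3):
    # (input + output) x growth headroom x replication factor
    return (input_tb + output_tb) * (1 + growth) * replication

def servers_for_storage(raw_tb, usable_tb_per_node=12):
    # 12 TB usable out of 16 TB per node (rest for OS, temp, swap, etc.);
    # the slide rounds 108 / 12 to 9, so we round rather than take a ceiling.
    return round(raw_tb / usable_tb_per_node)

def namespace_objects(raw_tb, avg_file_mb=200):
    # NameNode namespace estimate: total raw bytes / average file size
    return int(raw_tb * 1e6 / avg_file_mb)

raw = hdfs_raw_tb(30, 0.05)                    # ~108 TB
print(round(raw, 1), servers_for_storage(raw)) # 108.2 TB -> 9 servers
print(namespace_objects(raw))                  # ~540,000 namespace objects
```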
15. Memory
Step 1: Cluster/Node Level Info (configured values in yarn-site.xml) – Admins Only
- Max memory on the node for containers: yarn.nodemanager.resource.memory-mb, conf <45056> (44 GB out of 48 GB, rest for the OS)
- Virtual to physical memory ratio: yarn.nodemanager.vmem-pmem-ratio, default <2.1> (2:1 virtual may exceed physical by)
- Min allocatable memory for containers: yarn.scheduler.minimum-allocation-mb, default <512> (0.5 GB)
- Max allocatable memory for containers: yarn.scheduler.maximum-allocation-mb, default <8192> (8 GB)
Step 2: Container Level Info (default values in mapred-site.xml)
- Map task container size: mapreduce.map.memory.mb, default <1536> (1.5 GB)
- Reduce task container size: mapreduce.reduce.memory.mb, default <2048> (2 GB)
- MR AppMaster memory size: yarn.app.mapreduce.am.resource.mb, default <1536> (1.5 GB)
- Map task JVM heap size: mapreduce.map.java.opts, default -Xmx1024m
- Reduce task JVM heap size: mapreduce.reduce.java.opts, default -Xmx1536m
Map and Reduce container sizes are determined by users developing the app based on the memory needs of the tasks.
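One consequence of these settings worth spelling out (a hedged aside, not on the slide): the node memory and container sizes together bound how many tasks can run concurrently per node.
```python
# Concurrent containers per node implied by the values quoted above.
node_mb = 45056    # yarn.nodemanager.resource.memory-mb (44 GB of 48 GB)
map_mb = 1536      # mapreduce.map.memory.mb (1.5 GB)
reduce_mb = 2048   # mapreduce.reduce.memory.mb (2 GB)

print(node_mb // map_mb)     # 29 map containers per node
print(node_mb // reduce_mb)  # 22 reduce containers per node
```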
16. Throughput
Step 1: Estimating Number of Mappers
- Upper bound on input splits: mapreduce.input.fileinputformat.split.maxsize
- Lower bound on input splits: mapreduce.input.fileinputformat.split.minsize
- Number of mappers = number of input splits (e.g., 8,192 maps = 1 TB of data / 128 MB split size)
Step 2A: Estimating Number of Reducers
- Limit on the input size to reducers: mapreduce.reduce.input.limit, default <10737418240> (10 GB)
- Fixed number of reducers: mapreduce.job.reduces
- Number of reducers = min(fixed reducers, total input size / reducer size)
Step 2B: Estimating Number of Reducers (Pig and Hive)
- Pig: min(fixed reducers, pig.exec.reducers.max, total input size / pig.exec.reducers.bytes.per.reducer); defaults: max 999, 1 GB per reducer
- Hive: min(fixed reducers, hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer); defaults: max 999, 1 GB per reducer
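These formulas are easy to check with a short sketch; the 128 MB split size and Pig's 999/1 GB defaults are the values quoted above.
```python
import math

def num_mappers(input_bytes, split_bytes=128 * 1024**2):
    # One mapper per input split.
    return math.ceil(input_bytes / split_bytes)

def num_reducers_pig(input_bytes, fixed=None, reducers_max=999,
                     bytes_per_reducer=1024**3):
    # min(fixed reducers, pig.exec.reducers.max,
    #     total input size / pig.exec.reducers.bytes.per.reducer)
    estimate = math.ceil(input_bytes / bytes_per_reducer)
    return min(c for c in (fixed, reducers_max, estimate) if c is not None)

one_tb = 1024**4
print(num_mappers(one_tb))       # 8192 maps, matching the slide's example
print(num_reducers_pig(one_tb))  # 999, capped by pig.exec.reducers.max
```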
17. Throughput and Latency
Step 1: Sample Run (with a tenth of the data on a sandbox cluster)
Stages | # Map | Map Size | Map Time | # Reduce | Reduce Size | Reduce Time
Stage 1 | 100 | 1.5 GB | 10 min | 50 | 2 GB | 5 min
Stage 2-L | 50 | 1.5 GB | 10 min | 20 | 2 GB | 10 min
Stage 2-R | 30 | 1.5 GB | 5 min | 10 | 2 GB | 5 min
Stage 3 | 70 | 1.5 GB | 5 min | 30 | 2 GB | 5 min
Notes:
§ SLOT_MILLIS_MAPS and SLOT_MILLIS_REDUCES from Job Counters give the time spent
§ TOTAL_LAUNCHED_MAPS and TOTAL_LAUNCHED_REDUCES from Job Counters give # Map and # Reduce
§ Reduce time includes the sort and shuffle time; shuffle time is data per reducer / est. 4 MB/s (bandwidth for data transfer from Map to Reduce)
§ Add 10% for speculative execution (failed/killed task attempts)
Step 2: Mappers and Reducers for SLA and Full Dataset
Stages | Mins | SLA Share | # Map | # Reduce | Map Total | Reduce Total | Total Mem. | # Servers
Stage 1 | 15 / 45 min | 120 / 360 min | 138 = (100 x 11) / 8 | 69 = (50 x 11) / 8 | 207 GB | 138 GB | 345 GB | 8
Project Pig Mail Capacity Ask = MAX(Compute <8 servers>, Storage <9 servers>) = 9 servers
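The jump from the sample run to the 138/69 concurrent tasks deserves one spelled-out step: scale the sample counts by 11 (10x data plus 10% speculative execution) and divide by the number of sequential waves the SLA allows (120 minutes / 15 minutes per Stage 1 pass = 8). A sketch, using the slide's Stage 1 numbers and the 44 GB usable node memory from the earlier Memory slide:
```python
import math

sample_maps, sample_reduces = 100, 50   # Stage 1 of the sample run (1/10 data)
map_gb, reduce_gb = 1.5, 2.0            # container sizes
stage_min = 10 + 5                      # map + reduce time per pass
sla_share_min = 120                     # Stage 1's slice of the 360-min SLA

scale = 10 * 1.10                       # full data (10x) + 10% speculation
waves = sla_share_min // stage_min      # 8 sequential waves fit the SLA

maps = math.ceil(sample_maps * scale / waves)        # 138
reduces = math.ceil(sample_reduces * scale / waves)  # 69
mem_gb = maps * map_gb + reduces * reduce_gb         # 207 + 138 = 345 GB
servers = math.ceil(mem_gb / 44)                     # 8 compute nodes

print(maps, reduces, mem_gb, servers)
print("ask =", max(servers, 9))         # MAX(compute 8, storage 9) = 9 servers
```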
19. Apache HBase Resources (in the order of importance for HBase)
- Throughput: supported frequency of data read or written per second (for a given record size). Measure: number of reads, writes, or scans per second per server
- Latency: time taken for read, write, or scan operations to complete. Measure: read or write time in ms (typically) per record
- Memory: BlockCache; data that needs to be served through cache. Measures: % of data read from cache; MemStore/BlockCache ratio, RegionServer heap
- Data (Storage): total data stored in HDFS (disk). Measure: avg. record size x avg. number of records stored
- CPU: N/A
20. Working Through a Use Case (ILLUSTRATIVE)
Awesome eCommerce needs to process about 200 M records daily, somewhere between 6:00 and 10:00 AM, to update product information. About 50% of the data relates to existing products, whose price may need to be updated by comparing the current price with the new offer price. The remaining 50% of the offers are new products and will be written without price comparison.
There are three separate tables for product, price, and offers, with a 3 KB avg. record size. Writes are on the order of 500 million records and reads 250 million across each of the three tables.
21. Throughput & Latency
21 2014 Hadoop Summit, Amsterdam, Netherlands
Step 1: Project Info – User Input
Active reads/writes per day 4 Hrs.
Avg. writes / day (all three tables) 1,500 M
Avg. reads / day (all three tables) 750 M
Average record size 3 KB
Records cached / warmed on start 50%
Step 2: # Servers Based on Write Throughput
Peak concurrent writes required | 1,500 M x 3 KB / (4 x 3,600 sec) = ~300 MB / sec
Peak write throughput per RegionServer | 45 MB / sec (based on performance benchmarks)
Servers required | 300 / 45 = ~7 RegionServers
Step 3: # Servers Based on Read Throughput
Peak concurrent reads required | 750 M x 3 KB / (4 x 3,600 sec) = ~160 MB / sec
Peak cold random read throughput | 10 MB / sec (based on performance benchmarks)
Peak hot random read throughput | 200 MB / sec (based on performance benchmarks)
RegionServers for cold reads | 160 x 50% / 10 = 8
RegionServers for hot reads | 160 x 50% / 200 = 1
Servers required | MAX (8, 1) = 8 RegionServers
Performance benchmarks were conducted by simulating HBase workloads through YCSB on dedicated servers
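The write and read sizing steps reduce to a peak-throughput calculation. A minimal Python sketch, using this example's benchmark figures (45, 10, and 200 MB/sec are workload-specific, not universal constants):

```python
import math

def peak_mb_per_sec(records, record_kb, window_hours):
    """Peak concurrent throughput when all traffic lands in the active window."""
    return records * record_kb / 1024 / (window_hours * 3600)

write_mb_s = peak_mb_per_sec(1_500e6, 3, 4)   # ~300 MB/sec
write_rs = math.ceil(write_mb_s / 45)         # 45 MB/sec per RS -> 7 RegionServers

read_mb_s = peak_mb_per_sec(750e6, 3, 4)      # ~160 MB/sec
cold_rs = math.ceil(read_mb_s * 0.5 / 10)     # 50% cold reads at 10 MB/sec -> 8
hot_rs = math.ceil(read_mb_s * 0.5 / 200)     # 50% hot reads at 200 MB/sec -> 1
read_rs = max(cold_rs, hot_rs)                # -> 8 RegionServers
print(write_rs, read_rs)
```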
22. Memory
Step 1: RegionServer Info (configured values in hbase-site.xml and hbase-env.sh) – Admins Only
Max memory available per RegionServer | C-xxx/64/4000 <64 GB>
Heap size of the RegionServer JVM | export HBASE_HEAPSIZE=59392 (58 GB); Default: <1000> (1000 MB)
Memory allocated to BlockCache | hfile.block.cache.size = 0.8 (80%); Default: <0.4> (40% of heap)
Memory allocated to MemStore | hbase.regionserver.global.memstore.size = 0.2 (20%); Default: <0.4> (40% of heap)
Step 2: Servers Required to Serve from BlockCache
Total records | 200 M
Average record size | 3 KB
Total data served | 200 M x 3 KB = 0.55 TB
Total data served through BlockCache | 0.55 TB x 50% = 0.28 TB
Loading factor of the (LRU) BlockCache (in HBase 0.94) | 85%
Total BlockCache available per RegionServer | 58 GB x 0.8 x 85% = 40 GB
Servers required | 0.28 TB / 40 GB = 7 RegionServers
BlockCache allocation depends on the mix of read and write access patterns. The remainder of the LRU cache is used by other resident users such as catalog tables, HFile indexes, and bloom filters.
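A short Python sketch of the BlockCache sizing, with the configured values above as assumed inputs:

```python
heap_gb = 58                             # HBASE_HEAPSIZE per RegionServer
usable_cache_gb = heap_gb * 0.8 * 0.85   # cache share x LRU loading factor (~40 GB)

total_data_gb = 200e6 * 3 / 1024 ** 2    # 200 M records x 3 KB ~ 0.55 TB
hot_set_gb = total_data_gb * 0.50        # 50% served through BlockCache

# The worked example rounds to the nearest server; a more conservative
# plan would take math.ceil() here instead.
servers = round(hot_set_gb / usable_cache_gb)
print(servers)                           # -> 7 RegionServers
```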
23. Data
Step 1: RegionServer Info (configured values in hbase-site.xml & hbase-env.sh) – Admins Only
Max memory available per RegionServer | C-xxx/64/4000 (four 4 TB disks) = 64 GB
Heap size of the RegionServer JVM | export HBASE_HEAPSIZE=59392 (58 GB); Default: <1000> (1000 MB)
Region size | hbase.hregion.max.filesize = 10737418240; Default: <10737418240> (10 GB)
Memory allocated to MemStore | hbase.regionserver.global.memstore.size = 0.2 (20%); Default: <0.4> (40% of heap)
MemStore flush size | hbase.hregion.memstore.flush.size = 134217728; Default: <134217728> (128 MB)
HDFS replication factor | dfs.replication = 3; Default: <3>
Step 2: # Servers Based on Data Served
Raw disk space to JVM heap ratio per RegionServer | 10 GB / 128 MB x 3 x 0.2 = 48
Raw disk space available per RegionServer | 48 x 58 GB x 0.2 = 0.56 TB
Total data served through tables | 0.55 TB
Total raw data served | 0.55 TB x 3 = 1.65 TB
Servers required | 1.65 / 0.56 = ~3 servers
Project Awesome eCommerce Ask = MAX (Write <7 RS>, Read <8 RS>, Cached <7 RS>, Data <3 RS>) = 8 RS
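The storage-driven count and the final ask follow the same pattern. An illustrative Python sketch that mirrors the slide's arithmetic (including the second x 0.2 heap factor):

```python
import math

# Configured values from Step 1 above.
region_gb, flush_mb = 10, 128    # region size, MemStore flush size
memstore, replication = 0.2, 3   # global MemStore fraction, dfs.replication
heap_gb = 58                     # HBASE_HEAPSIZE

# Raw disk each GB of heap can sustain (the slide's 48x ratio).
disk_per_heap = region_gb * 1024 / flush_mb * replication * memstore  # 48.0

# Raw disk per RegionServer, applying the heap's MemStore share again
# (x 0.2, mirroring the slide's arithmetic).
raw_tb_per_rs = disk_per_heap * heap_gb * memstore / 1000             # ~0.56 TB

data_rs = math.ceil(0.55 * replication / raw_tb_per_rs)               # 1.65 TB -> 3

# Final ask: the max across all four sizing dimensions.
ask = max(7, 8, 7, data_rs)  # write, read, BlockCache, data -> 8 RegionServers
print(data_rs, ask)
```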
25. Apache Storm Resources
Resource | Drivers | Measure
Throughput | Events processed per second or parallel workers | § # events, # messages / sec § Tuples / sec
Memory | Worker / slot memory for spouts and bolts | § Spout and bolt JVM size § Message and tuple size
CPU | CPU threads needed for workers / executors | § Cores for spout and bolt processes, inter- and intra-worker communication
Latency | Time taken for processing the input stream of events | § Execute / complete latency
Data (Storage) | N/A | § N/A
(In the order of importance for Storm)
26. Working Through a Use Case
Wonder Search wants to index editorial content in near real-time so that users can search it. The editorial content is available in Apache HBase.
Spout: Scans HBase from the last scan time to the current time to pick up new editorial content.
Bolt 1: Builds the index and stores it back in HBase.
Bolt 2: Pushes the index out for serving.
ILLUSTRATIVE
27. Throughput and Latency
Step 1: Supervisor Level Info (configured values in storm.yaml or multitenant-scheduler.yaml) – Admins Only
Incoming (worker) message queue size | topology.receiver.buffer.size, Default: <8>
Outgoing (worker) message queue size | topology.transfer.buffer.size, Default: <1024>
Incoming (executor) tuple queue size | topology.executor.receive.buffer.size, Default: <1024>
Outgoing (executor) tuple queue size | topology.executor.send.buffer.size, Default: <1024>
Slots available per supervisor | supervisor.slots.ports: <24>, hyper-threaded cores for dual hex-core machines
Multi-tenant scheduler (user isolation scheduler) | multitenant.scheduler.user.pools: <users>: <# nodes>, topology.isolate.machines: <number of nodes>
Step 2: # Servers Based on Throughput
Events processed with a single spout per worker | 1,000 messages / sec
Target throughput required | 8,000 messages / sec
Number of spout executors required | 8,000 / 1,000 = 8 (across 8 slots)
Tuples executed across the 1st bolt (5 executors) | 10,000 tuples / sec
Total executors required for the 1st bolt | 8 x 5 = 40 (across 40 slots)
Tuples executed across the 2nd bolt (5 executors) | 15,000 tuples / sec
Total executors required for the 2nd bolt | 8 x 5 = 40 (across 40 slots)
Total slots based on executors | 8 + 40 + 40 = 88 slots
Number of supervisors required | 88 / 24 = ~4 servers
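A minimal Python sketch of the slot math, assuming the single-spout benchmark of 1,000 messages/sec and 5 bolt executors per spout for each bolt:

```python
import math

def slots_needed(target_msg_s, per_spout_msg_s, bolt_execs_per_spout):
    """Total slots for a topology sized from single-spout throughput."""
    spouts = math.ceil(target_msg_s / per_spout_msg_s)      # 8 spout executors
    bolts = sum(spouts * n for n in bolt_execs_per_spout)   # 40 + 40 bolt executors
    return spouts + bolts

slots = slots_needed(8_000, 1_000, [5, 5])  # 8 + 40 + 40 = 88 slots
supervisors = math.ceil(slots / 24)         # 24 slots per supervisor -> 4 servers
print(slots, supervisors)
```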
28. CPU vs. Throughput
Step 1: Track CPU Usage with JVM Tools (jmap / jstack)
Max CPU cores per supervisor | C-xxx/48/4000 (12 physical cores)
CPU usage for 1,000 messages / sec | 4 physical cores (32.12%); includes 1 spout and 5 bolt executors each for bolts 1 and 2, plus CPU usage for inter-messaging (ZeroMQ or Netty)
Executor CPU needs (assuming an equal CPU division between spout and bolt executors) | 4 / (1 + 5 + 5) = 4/11 cores
Total workers | TOPOLOGY_WORKERS, Config#setNumWorkers()
Tasks per component | TOPOLOGY_TASKS, ComponentConfigurationDeclarer#setNumTasks()
Step 2: Extrapolate for Target Throughput (assuming a linear increase)
Target spout executors | 8, TopologyBuilder#setSpout()
Target bolt executors | 40 per bolt, TopologyBuilder#setBolt()
CPU needed for spout executors | 8 x 4/11 = ~3 cores
CPU needed for 1st bolt executors | 40 x 4/11 = ~15 cores
CPU needed for 2nd bolt executors | 40 x 4/11 = ~15 cores
CPU needed for the topology | 3 + 15 + 15 = 33 cores
Total supervisors needed | 33 / 12 = ~3 servers
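The extrapolation is linear in the measured per-executor cost. An illustrative Python sketch (rounding each stage up to whole cores, as the figures above do):

```python
import math

# 4 physical cores handled 1 spout + 10 bolt executors at 1,000 msg/sec,
# assuming an equal CPU split across executors.
cores_per_executor = 4 / 11

spout_cores = math.ceil(8 * cores_per_executor)   # ~3 cores
bolt_cores = math.ceil(40 * cores_per_executor)   # ~15 cores per bolt
topology_cores = spout_cores + 2 * bolt_cores     # 3 + 15 + 15 = 33 cores

supervisors = math.ceil(topology_cores / 12)      # 12 physical cores each -> 3
print(topology_cores, supervisors)
```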
29. Memory vs. Throughput
Step 1: Supervisor Level Info
Max memory available per supervisor node | C-xxx/48/4000 <48 GB> (42 GB usable out of 48 GB; the rest is reserved for the OS)
Step 2: # Servers Based on Memory Needs
Events processed across spout executors | 8,000 messages / sec
Avg. event or message size | 3 MB
Data processed per second across spout executors | 8,000 x 3 MB = 24 GB / sec
Tuples processed per second across 1st bolt executors | 10,000 x 8 = 80,000 tuples / sec
Average tuple size | 100 KB
Data processed per second across 1st bolt executors | 80,000 tuples / sec x 100 KB = 8 GB / sec
Data processed per second across 2nd bolt executors | 15,000 x 8 tuples / sec x 100 KB = 12 GB / sec
Total data processed | 24 GB / sec + 8 GB / sec + 12 GB / sec = 44 GB / sec
Number of supervisors required to process data | 44 / 42 = ~2 servers
Project Wonder Search Ask = MAX (Throughput <4 Servers>, CPU <3 Servers>, Memory <2 Servers>) = 4 Servers
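A short Python sketch of the memory-based count and the final ask (the 42 GB usable figure and the per-stage rates are this example's assumptions):

```python
import math

usable_gb = 42                        # 48 GB supervisor minus the OS reserve

spout_gb = 8_000 * 3 / 1_000          # 8,000 msg/sec x 3 MB = 24 GB/sec
bolt1_gb = 80_000 * 100 / 1_000_000   # 80,000 tuples/sec x 100 KB = 8 GB/sec
bolt2_gb = 120_000 * 100 / 1_000_000  # 120,000 tuples/sec x 100 KB = 12 GB/sec

total_gb = spout_gb + bolt1_gb + bolt2_gb      # 44 GB/sec in flight
memory_sup = math.ceil(total_gb / usable_gb)   # -> 2 servers

# Final ask across the three Storm sizing dimensions.
ask = max(4, 3, memory_sup)  # throughput, CPU, memory -> 4 servers
print(total_gb, memory_sup, ask)
```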
32. Growing with YARN
[Diagram: YARN (Resource Manager) layered over HDFS (File System), hosting MapReduce (Batch), Tez (DAGs), Spark (Iterative), and Storm (Stream), with HBase, Giraph, R, OpenMPI, indexing, and other new services on YARN; some available today, others coming soon on YARN]
33. Near Future for Capacity Planning
Hadoop
§ CPU as a resource
§ Container reuse
§ Long-running jobs
§ Other potential resources such as disk, network, GPUs, etc.
§ Tez as the execution engine
§ Spark-on-YARN etc.
HBase
§ BlockCache implementations: LRU, Slab, Bucket
§ Short-circuit reads
§ Bloom filters and co-processors
§ HBase-on-YARN
Storm
§ Storm-on-YARN
§ More experience with multi-tenancy
34. Acknowledgement
Hadoop Capacity Planning
Nathan Roberts Hadoop Core Architect
Koji Noguchi Software Engineer
Viraj Bhat Software Engineer
Ryota Egashiri Software Engineer
Balaji Narayan Service Engineer
Anish Matthew Service Engineer
Rajiv Chittajallu SE Architect
HBase Capacity Planning
Francis Liu Software Engineer
Dheeraj Kapur Service Engineer
Storm Capacity Planning
Bobby Evans Software Engineer
Dheeraj Kapur Service Engineer