The document discusses capacity planning and performance tuning for Hadoop big data systems. It begins with an agenda that covers why capacity planners need to prepare for Hadoop, an overview of the Hadoop ecosystem, capacity planning and performance tuning of Hadoop, getting started, and the importance of measurement. The document then discusses various components of the Hadoop ecosystem and provides guidance on analyzing different types of workloads and components.
Operating multi-tenant clusters requires careful planning of capacity for on-time launch of big data projects and applications within the expected budget and with appropriate SLA guarantees. Making such guarantees with a set of standard hardware configurations is key to operating big data platforms as a hosted service for your organization.
This talk highlights the tools, techniques, and methodology applied on a per-project or per-user basis across three primary multi-tenant deployments in the Apache Hadoop ecosystem, namely MapReduce/YARN and HDFS, HBase, and Storm, chosen for the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively. We will demo the estimation tools developed for these deployments that can be used for capital planning and forecasting, and for cluster resource and SLA management, including making latency and throughput guarantees to individual users and projects.
As we discuss the tools, we will share the considerations incorporated to arrive at the most appropriate calculations across these three primary deployments. We will discuss the data sources for the calculations, the resource drivers for different use cases, and how to plan optimum capacity allocation per project with respect to given standard hardware configurations.
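To make this concrete, here is a minimal sketch of the kind of per-project calculation such an estimation tool can perform against a standard hardware configuration. The node specs, resource drivers, and all numbers below are illustrative assumptions, not the tool demoed in the talk:

```python
# Hypothetical per-project capacity estimate against a standard hardware
# configuration; every figure here is an illustrative assumption.
import math

STANDARD_DATA_NODE = {"cores": 24, "ram_gb": 128, "disk_tb": 36}

def nodes_for_project(raw_data_tb, replication=3, temp_overhead=0.25,
                      peak_containers=200, container_gb=4):
    """Node count satisfying both the storage and the memory driver."""
    # Storage driver: replicated data plus scratch space for shuffle/temp.
    storage_tb = raw_data_tb * replication * (1 + temp_overhead)
    nodes_storage = math.ceil(storage_tb / STANDARD_DATA_NODE["disk_tb"])
    # Memory driver: host the project's peak concurrent YARN containers.
    nodes_memory = math.ceil(peak_containers * container_gb
                             / STANDARD_DATA_NODE["ram_gb"])
    return max(nodes_storage, nodes_memory)

print(nodes_for_project(raw_data_tb=100))  # 11 nodes: storage-bound here
```

The estimate is the maximum across the per-resource drivers, which is the usual shape of such calculators: whichever driver dominates for a given use case dictates the node count.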
John Sing's Edge 2013 presentation, detailing when, where, and how external storage products and/or system software (i.e. GPFS) can be effectively used in a Hadoop storage environment. Many Hadoop situations absolutely require direct-attached storage; however, there are many situations where shared external storage may make sense in a Hadoop environment. This presentation details how, why, and where, and promotes taking an intelligent, Hadoop-aware approach to deciding between internal storage and external shared storage. Full awareness of Hadoop considerations is essential to selecting either internal or external shared storage in a Hadoop environment.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
This talk gives an introduction to Hadoop 2 and YARN, then explains the changes in MapReduce 2. Finally, Tez and Spark are explained and compared in detail.
The talk was held at the Parallel 2014 conference in Karlsruhe, Germany, on 06.05.2014.
Agenda:
- Introduction to Hadoop 2
- MapReduce 2
- Tez, Hive & Stinger Initiative
- Spark
Top Hadoop Big Data Interview Questions and Answers for Freshers - JanBask Training
Presentation given for the SQLPass community at SQLBits XIV in London. The presentation is an overview of the performance improvements brought to Hive by the Stinger initiative.
MapR-DB is an enterprise-grade, high-performance, in-Hadoop NoSQL ("Not Only SQL") database management system. It is used to add real-time, operational analytics capabilities to Hadoop and now natively supports JSON.
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis.
This talk will give an introduction to the Spark stack, explain how Spark delivers lightning-fast results, and show how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
Apache Hadoop, since its humble beginning as an execution engine for web crawling and building search indexes, has matured into a general-purpose distributed application platform and data store. Large Scale Machine Learning (LSML) techniques and algorithms have proved quite tricky for Hadoop to handle, ever since we started offering Hadoop as a service at Yahoo in 2006. In this talk, I will discuss early experiments of implementing LSML algorithms on Hadoop at Yahoo. I will describe how this changed Hadoop and led to a generalization of the Hadoop platform to accommodate programming paradigms other than MapReduce. I will unveil some of our recent efforts to incorporate diverse LSML runtimes into Hadoop, evolving it to become *THE* LSML platform. I will also make a case for an industry-standard LSML benchmark, based on common deep analytics pipelines that utilize LSML workloads.
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez) - Sudhir Mallem
Interactive SQL POC on Hadoop (Hive, Presto, and Hive-on-Tez) using the storage formats Parquet, ORC, RCFile, and Avro, with Snappy, zlib, and default (gzip) compression.
Scaling Spark Workloads on YARN - Boulder/Denver July 2015 - Mac Moore
Hortonworks presentation at the Boulder/Denver Big Data Meetup on July 22nd, 2015. Topic: Scaling Spark Workloads on YARN, covering Spark as a workload in a multi-tenant Hadoop infrastructure, scaling, cloud deployment, and tuning.
Cost savings and expert system advice with athene ES/1 - Metron
athene® ES/1 provides analysis of current and recent system performance activity. It identifies problems as they are reported and uses expert-system techniques to recommend the courses of action required to restore service levels. Severity-level reporting and tuning hints enable attention to be focused where it is needed. Detailed drill-down facilities are available to analyze problems, plot trends, and report on the most important metrics of all z/OS subsystems.
Hive - Apache Hadoop Big Data training by Design Pathshala
Learn Hadoop and Big Data analytics: join Design Pathshala training programs on big data and analytics.
This slide deck covers advanced knowledge about Apache Hive.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
Hadoop Basics - Apache Hadoop Big Data training by Design Pathshala
Learn Hadoop and Big Data analytics: join Design Pathshala training programs on big data and analytics.
This slide deck covers the basics of Hadoop and Big Data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
Recommendation and graph algorithms in Hadoop and SQL - David Gleich
A talk I gave at ancestry.com on Hadoop, SQL, recommendation, and graph algorithms. It's a tutorial overview; there are better algorithms than those I describe, but these are a simple starting point.
Structuring Spark: DataFrames, Datasets, and Streaming - Databricks
As Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk I give an overview of some of the exciting new APIs available in Spark 2.0, namely Datasets and Streaming DataFrames/Datasets. Datasets provide an evolution of the RDD API by allowing users to express computation as type-safe lambda functions on domain objects, while still leveraging the powerful optimizations supplied by the Catalyst optimizer and Tungsten execution engine. I will describe the high-level concepts as well as dive into the details of the internal code generation that enables us to provide good performance automatically. Streaming DataFrames/Datasets let developers seamlessly turn their existing structured pipelines into real-time incremental processing engines. I will demonstrate this new API's capabilities and discuss future directions, including easy sessionization and event-time-based windowing.
Bigdata Hadoop project in the payment gateway domain - Kamal A
Live Hadoop project in the payment gateway domain for people seeking real-time work experience in big data. Email: Onlinetraining2011@gmail.com
Skypeid: onlinetraining2011
My profile: www.linkedin.com/pub/kamal-a/65/2b2/2b5
Ansible for Drupal infrastructure and deployments - Jeff Geerling
Let's talk Ansible!
Drupal 8 uses YAML. Ansible uses YAML.
Drupal 8 makes it easy to build awesome websites. Ansible makes it easy to build awesome infrastructure.
Let's get together and discuss how you can use (and are already using) Ansible for your infrastructure, continuous integration, deployments, etc., with a focus on things like:
- Ansible for local development environments (e.g. Drupal VM, Vlad)
- Ansible on a cluster of Raspberry Pis (seriously! I'm bringing the Dramble with me)
- Ansible for provisioning and managing hundreds of cloud servers.
Jeff Geerling will be leading the BoF, but hopefully we'll end up with a good discussion about how Ansible can help you solve some pain points in infrastructure management, deployments, and more!
From DrupalCon LA BoF on Ansible and Drupal: https://events.drupal.org/losangeles2015/bofs/ansible-drupal-infrastructure-and-deployments
A talk given at JCConf 2015 on 2015/12/05.
In programming, "immutable objects" are an important design pattern. Likewise, in the virtualization and cloud era, "immutable infrastructure" has become the new state of the art. With the right resources and processes in place, it greatly reduces system complexity and significantly improves stability.
Starting from the underlying concepts and adding some practical implementation advice, this talk gives the audience enough information to evaluate the benefits of this architecture.
Video: https://youtu.be/9j008nd6-A4
DevOps for Humans - Ansible for Drupal Deployment Victory! - Jeff Geerling
Everyone knows it's a Good Idea™ to use a configuration management system (e.g. Puppet, Chef) to manage your Drupal infrastructure. But many people (myself included) have run into a wall of #wtfmoments when trying to learn the vagaries of traditional CM systems and their vendor-specific syntaxes.
In 2012, Ansible was released, enabling normal human beings to manage their servers with an easy, but powerful, CM system that uses YAML (just like Drupal 8!) to define configuration and Jinja2 (very much like Twig!) for templates. Not only that, but Ansible is also an incredibly simple and very flexible Drupal deployment and continuous delivery tool.
Learn how you can use Ansible to manage your infrastructure—including local development environments—and stop letting servers and deployments get in the way of development.
Ataas2016 - Big Data, Hadoop and MapReduce: new-age tools to aid testing and QA - Agile Testing Alliance
Big Data, with its slew of technologies and terms, has been the most talked-about area in the last couple of years. It has evolved into big data science and analytics, and now into IoT and automation. Testers and QA teams need not only to get used to this new-age digital transformation area but also to embrace the technology to their own advantage. We experimented with and successfully used the big data technologies Hadoop and MapReduce in a recent testing engagement. The actual application was implemented using classic technologies like CentOS and C++; the testing team implemented Hadoop and MapReduce to enable quick turnaround for the testing.
How to build and run a big data platform in the 21st century - Ali Dasdan
This tutorial was presented at the IEEE Big Data Conference in 2019. It shows that building and running a big data platform for both real-time streaming and batch data processing, for all kinds of applications involving analytics, data science, reporting, and the like, can today be as easy as following a checklist. We live in a fortunate time: many of the components needed are already available in open source or as a service from commercial vendors. The tutorial shows how to put these components together at multiple levels of sophistication, covering the spectrum from a basic reporting need to a full-fledged operation across geographically distributed regions with business-continuity measures in place. It provides enough information and checklists that it can also serve as a go-to reference while actually building and running such a platform.
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc... - Sarah Aerni
Slides from the Pivotal Open Source Hub Meetup
"Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science!"
As the need for data science as a key differentiator grows in all industries, from large corporations to startups, the need to get to results quickly is enabled by sharing ideas and methods in the community. The data science team at Pivotal leverages and contributes to this community of publicly available and open source technologies as part of their practice. We will share the resources we use by highlighting specific toolkits for building models (e.g. MADlib, R) and visualization (e.g. Gephi and Circos) along with their benefits and limitations by sharing examples from Pivotal's data science engagements. At the end of this session we hope to have answered the questions: Where can I get started with Data Science? Which toolkit is most appropriate for building a model with my dataset? How can I visualize my results to have the greatest impact?
Bio: Sarah Aerni is a member of the Pivotal Data Science team with a focus on healthcare and life science. She has a background in the field of Bioinformatics, developing tools to help biomedical researchers understand their data. She holds a B.S. In Biology with a specialization in Bioinformatics and minor in French Literature from UCSD, and an M.S. and Ph.D in Biomedical Informatics from Stanford University. During her time as a researcher she focused on the interface between machine learning and biology, building computational models enabling research for a broad range of fields in biomedicine. She also co-founded a start-up providing informatics services to researchers and small companies. At Pivotal she works with customers in life science and healthcare building models to derive insight and business value from their data.
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ... - MLconf
Building a Recommender System for Publications using Vector Space Model and Python: In recent years, it has become very common to have access to a large number of publications on similar or related topics. Recommendation systems for publications are needed to locate appropriate published articles among a large number of publications on the same or similar topics. In this talk, I will describe a recommender system framework for PubMed articles. PubMed is a free search engine that primarily accesses the MEDLINE database of references and abstracts on life-sciences and biomedical topics. The proposed recommender system produces two types of recommendations: (i) content-based recommendations and (ii) recommendations based on similarities with other users' search profiles. The first type, content-based recommendation, can efficiently search for material that is similar in context or topic to the input publication. The second mechanism generates recommendations using the search history of users whose search profiles match the current user. The content-based recommendation system uses a Vector Space Model to rank PubMed articles based on the similarity of content items. To implement the second recommendation mechanism, we use Python libraries and frameworks: we find the profile similarity of users and recommend additional publications based on the history of the most similar user. In the talk I will present the background and motivation for these recommendation systems and discuss the implementation of this PubMed recommendation system with examples.
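As a rough illustration of the content-based scoring step described in this abstract, here is a minimal vector-space sketch using TF-IDF and cosine similarity from scikit-learn. The abstracts and query are made-up placeholders, not PubMed data, and this is not the talk's actual implementation:

```python
# Vector Space Model sketch: rank documents by cosine similarity of
# TF-IDF vectors to an input publication. Toy data, illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "gene expression profiling in breast cancer",
    "deep learning for medical image segmentation",
    "breast cancer risk prediction from gene variants",
]
query = ["gene variants and breast cancer risk"]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(abstracts)   # documents -> TF-IDF space
query_vector = vectorizer.transform(query)          # same vocabulary

# Rank articles by cosine similarity to the input publication.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranked = scores.argsort()[::-1]
print([(int(i), round(float(scores[i]), 2)) for i in ranked])
```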
This talk will cover, via live demo & code walk-through, the key lessons we’ve learned while building such real-world software systems over the past few years. We’ll incrementally build a hybrid machine learned model for fraud detection, combining features from natural language processing, topic modeling, time series analysis, link analysis, heuristic rules & anomaly detection. We’ll be looking for fraud signals in public email datasets, using Python & popular open-source libraries for data science and Apache Spark as the compute engine for scalable parallel processing.
General overview of the Big Data Concept.
Presentation of the Hierarchical Linear Subspace Indexing Method to perform exact similarity search in high dimensional data
Data Engineer's Lunch #85: Designing a Modern Data Stack - Anant Corporation
What are the design considerations that go into architecting a modern data warehouse? This presentation will cover some of the requirements analysis, design decisions, and execution challenges of building a modern data lake/data warehouse.
On a business level, everyone wants to get hold of the business value and other organizational advantages that big data has to offer. Analytics has arisen as the primary path to business value from big data. Hadoop is not just a storage platform for big data; it's also a computational and processing platform for business analytics. Hadoop is, however, unsuccessful in fulfilling business requirements when it comes to live data streaming: the initial architecture of Apache Hadoop did not solve the problem of live stream data mining. In summary, the traditional assumption that big data equals Hadoop is false, and focus needs to be given to business value as well. Data warehousing, Hadoop, and stream processing complement each other very well. In this paper, we review a few frameworks and products that enable real-time data streaming by providing modifications to Hadoop.
How to add Artificial Intelligence Capabilities to Existing Software Platforms - Harish Nalagandla
Artificial Intelligence is real and over the next few years, will significantly change the world as we know it. Even though there is some hype around this technology, many companies have put this to practical use, helping businesses delight their customers, increase engagement on their Platforms, and ultimately delivering positive financial results. Many companies have already invested into IT Platforms and it is not practical to invest in bringing up an Artificial Intelligence Platform in parallel. So, the more practical approach is to add Artificial Intelligence capabilities to existing Platforms.
In this keynote session, Harish Nalagandla, Director of Engineering, Enterprise Services Platform, PayPal, will discuss the pillars that form the foundation of Artificial Intelligence and will describe an approach towards how companies with existing systems can add Artificial Intelligence capabilities to their existing Platforms. This session will present a high level blueprint to help guide organizations in their own journey towards adoption of Artificial Intelligence technologies as they make progress towards digital transformation. This is a high level presentation targeted towards technology executives.
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop - Hazelcast
In this webinar
This talk identifies several shortcomings of Apache Hadoop and presents an alternative approach for building simple and flexible Big Data software stacks quickly, based on next generation computing paradigms, such as in-memory data/compute grids. The focus of the talk is on software architectures, but several code examples using Hazelcast will be provided to illustrate the concepts discussed.
We’ll cover these topics:
-Briefly explain why Hadoop is not a universal, or inexpensive, Big Data solution – despite the hype
-Lay out technical requirements for a flexible Big/Fast Data processing stack
-Present solutions thought to be alternatives to Hadoop
-Argue why In-Memory Data/Compute Grids are so attractive in creating future-proof Big/Fast Data applications
-Discuss how well Hazelcast meets the Big/Fast Data requirements vs Hadoop
-Present several code examples using Java and Hazelcast to illustrate concepts discussed
-Live Q&A Session
Presenter:
Jacek Kruszelnicki, President of Numatica Corporation
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations; these goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. vertices with the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
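As a rough illustration of the first technique above, here is a minimal PageRank sketch that skips rank updates for vertices that have already converged. It is a toy heuristic, not the STICD implementation: it assumes no dangling vertices and never re-activates a vertex once it is marked converged, even if its neighbours later drift.

```python
# PageRank with per-vertex convergence skipping (toy sketch).
def pagerank(graph, damping=0.85, tol=1e-8, max_iters=100):
    """graph: dict mapping vertex -> list of out-neighbours (no dangling)."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    # Reverse adjacency: who links to me.
    in_links = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            in_links[v].append(u)
    converged = set()
    for _ in range(max_iters):
        new_rank = {}
        for v in graph:
            if v in converged:        # skip the work for settled vertices
                new_rank[v] = rank[v]
                continue
            contrib = sum(rank[u] / len(graph[u]) for u in in_links[v])
            new_rank[v] = (1 - damping) / n + damping * contrib
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)
        rank = new_rank
        if len(converged) == n:       # all vertices settled: stop early
            break
    return rank

print(pagerank({"a": ["b"], "b": ["c"], "c": ["a"]}))
```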
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity Planner
1. moviri.com
Hitchhiker’s guide for the Capacity Planner
Connecticut Computer Measurement Group
Cromwell CT – April 2015
Renato Bonomini renato.bonomini@moviri.com
Capacity Management and BigData
2. Agenda
● Why as Capacity Planners do we need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
3. Brought to you by…
Renato Bonomini, Lead of US operations for Moviri, @renatobonomini
Mattia Berlusconi, Capacity Management Consultant
Giulia Rumi, Capacity Management Analyst
4. Agenda
● Why as Capacity Planners do we need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
5. Handling large amounts of data? High Performance Computing?
Is it new? Where does it come from? Why do I have to listen to this?
Cray 1, 80 MFLOPS, 1975 [A bunch of engineers on a field trip in Silicon Valley, Renato]
IBM 350, 3.56 Mb, 1956 [Wikipedia]
6. The need for Analytics: the new "machine revolution"
"When will computer hardware match the human brain?"
Hans Moravec, Robotics Institute, Carnegie Mellon University
http://www.transhumanist.com/volume1/moravec.htm
7. 1964: Isaac Asimov on the 2014 World's Fair
"The world of A.D. 2014 will have few routine jobs that cannot be done better by some machine than by any human being. Mankind will therefore have become largely a race of machine tenders."
"When will computer hardware match the human brain?"
February 2015: http://reuvengorsht.com/2015/02/07/machines-replace-middle-management
8. "Divide et Impera" / "Map and Reduce"
• Julius Caesar arrives in Alexandria after defeating the Egyptian army and enters the Ancient Library
• Surprise: there are millions of copies in the library; how many of those are in Latin?
• Caesar arranges a Centuria (80 soldiers) to each inspect a batch of books and report to their Centurion the number of pages written in Latin for their book
• The Centurion writes on a tabula the count from each soldier; when finished, he sums the parts up
All I need to know I learned from Rome
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
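The Centuria story has exactly the shape of the canonical WordCount example of MapReduce: soldiers map, the Centurion reduces. A toy, single-process sketch of the two phases (illustrative only, not Hadoop code):

```python
# Single-process WordCount in the map/reduce shape.
from collections import defaultdict
from itertools import chain

def map_phase(book):
    # Each "soldier" scans one book and emits (key, 1) pairs.
    for word in book.split():
        yield (word, 1)

def reduce_phase(pairs):
    # The "Centurion" sums the parts for each key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

books = ["salve salve civis", "veni vidi vici", "salve vici"]
mapped = chain.from_iterable(map_phase(b) for b in books)
print(reduce_phase(mapped))
# {'salve': 3, 'civis': 1, 'veni': 1, 'vidi': 1, 'vici': 2}
```

In real Hadoop the map outputs are shuffled across the network so that all values for a key land on the same reducer; here the grouping happens in one dictionary.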
9. Message Passing Interface / "Map and Reduce"
Wow, so "Map and Reduce" was a revolution? In one sense; which one?
MPI tutorial, Blaise Barney, Lawrence Livermore National Laboratory (C, Fortran)
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat (Java, Python)
10. What are the revolutions brought by MapReduce and BigData?
1. MapReduce makes these technologies available to a wide audience. We saw that MPI already handled similar use cases, but it was restricted mostly to university research and large R&D facilities.
2. Reliability and commodity hardware are at its base.
3. It moves the needle on how to handle large amounts of data: a database organizes first, then loads; Hadoop loads first, then organizes.
11. Agenda
● Why as Capacity Planners do we need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
12. The most famous open-source implementation of a MapReduce framework is Apache Hadoop
● "Hardware" contains libraries and utilities, stores data, and supports job execution
● HDFS is the fault-tolerant, replicated distributed file system
● YARN (Yet Another Resource Negotiator) supports several programming models that can co-exist in the cluster; MapReduce is only one of them
● The Application layer is composed of several frameworks, among which Pig and Hive are the most used
Hadoop workflow (a toy sketch of this write path follows the table below):
● clients break data into small chunks to be loaded onto different data nodes
● for each data block, the client contacts the namenode, which answers with a sorted list of 3 data nodes (every block is replicated on more than one machine)
● the client writes the block directly to the first datanode, which replicates the data onto the two other nodes
Optimization Techniques within the Hadoop Eco-system: a Survey, Giulia Rumi, Claudia Colella, Danilo Ardagna
LAYERS              | HADOOP 1.X           | HADOOP 2.X
Users               |                      |
Application layer   | Hive/Pig             | Hive/Pig
Programming Models  | Hadoop 1.X MapReduce | MapReduce
Resource Management | (MapReduce)          | YARN
File system         | HDFS                 | HDFS
Hardware            |                      |
(In Hadoop 1.X, MapReduce covered both the programming model and resource management; in 2.X, YARN takes over resource management.)
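A toy sketch of the write workflow described above: a "namenode" hands out three datanodes per block, and the block is then pipelined to each replica. The least-loaded placement policy and the 128 MB block size are simplifying assumptions; real HDFS placement is rack-aware.

```python
# Toy model of HDFS block placement: client asks the namenode for 3
# replica targets per block. Real HDFS is rack-aware; this is not.
import heapq

class MiniNameNode:
    def __init__(self, datanodes):
        self.used = {dn: 0 for dn in datanodes}   # blocks held per datanode

    def allocate(self, block_id, replication=3):
        # Pick the least-loaded datanodes for this block's replicas.
        targets = heapq.nsmallest(replication, self.used, key=self.used.get)
        for dn in targets:
            self.used[dn] += 1
        return targets

nn = MiniNameNode(["dn1", "dn2", "dn3", "dn4"])
file_size = 300 * 2**20                    # a 300 MB file
block_size = 128 * 2**20                   # common 128 MB block size
n_blocks = -(-file_size // block_size)     # ceiling division -> 3 blocks
for b in range(n_blocks):
    print(f"block {b} -> replicas on {nn.allocate(b)}")
```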
14. Geek Fun
"A DBA walks into a NOSQL bar, but turns and leaves because he couldn't find a table" (webtonull)
15. We are going to focus on a few specific "animals" of this zoo
● HDFS (Hadoop Distributed File System) is where the Hadoop cluster stores data
● YARN is the architectural center of Hadoop that allows multiple data processing engines
● MapReduce is a programming paradigm
● Hive provides a warehouse structure and SQL-like access for data in HDFS
● Pig is a high-level data-flow language
● HBase is an open-source, distributed, versioned, column-oriented store that sits on top of HDFS
• Apache Spark is an open-source big data real-time processing framework
• ZooKeeper is an open-source Apache project that provides a centralized infrastructure and services that enable synchronization across a cluster
• Apache Cassandra is an open-source distributed database management system designed to handle large amounts of data across many commodity servers
• Solr is an open-source enterprise search platform from the Apache Lucene project. It provides full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document handling
16. Agenda
● Why as Capacity Planners do we need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
17. Why do you need to get on board soon?
"We'll start the new Hadoop cluster with 500 TB and then we'll see how much we need" (real conversation at a customer)
There are significant resources and areas of improvement:
● Significant investments are being directed towards these initiatives
● They are complex and large, with hundreds of configuration parameters: a little help from an experienced capacity planner can save a lot of money
18. Role of the Capacity Planner and Performance Analyst
● Shouldn't the 'Hadoop user/owner' take care of this? Distributed machine learning is still an active research topic, related to both machine learning and systems. While Hadoop users don't develop systems, they need to know how to choose systems. An important fact is that existing distributed systems and parallel frameworks are not particularly designed for machine learning algorithms.
● Hadoop users can: help to affect how systems are designed; design new algorithms for existing systems
20. Current performance tuning opportunities: Scheduling
● Scheduling is one of the most important tasks in a multi-concurrent-task system: see the research from our colleague Giulia (and others), "Optimization Techniques within the Hadoop Eco-system: a Survey" [DOI: 10.1109/SYNASC.2014.65]
● It illustrates the typical optimization problems:
  - data locality
  - sticky-slot problems
  - poor system utilization because of suboptimal distribution of tasks
  - unbalanced jobs
  - starvation and even fairness (be fair to your users)
● There are hundreds of configuration variables available to the end user: rule-of-thumb vs. optimal configuration can make a big difference
21. How are other configuration opportunities being pursued?
● Other initiatives:
  - Starfish, http://www.cs.duke.edu/starfish/index.html [Towards Automatic Optimization of MapReduce Programs, Shivnath Babu, Duke University, Durham, North Carolina, USA, shivnath@cs.duke.edu]
  - Research from Dominique A. Heger of DHT [Workload Dependent Hadoop MapReduce Application Performance Modeling]
● The common result of most research initiatives is that one size does not fit all. For classic MapReduce, for example, there is not a single behavior: you have to know your workload characterization.
"Hortonworks recommends that you either use the Balanced workload configuration or invest in a pilot Hadoop cluster and plan to evolve as you analyze the workload patterns in your environment"
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/typical-workloads.html
22. Profiling your workload
● You want to know what the limiting factor of each workload is
● Examples are:
  - CPU performance
  - Disk I/O
  - Memory (bandwidth and latency)
  - Network (bandwidth, delay, packet loss)
  - Storage space
● This is nothing new for the wise Capacity Planner!
Courtesy of Intel
23. Agenda
● Why as Capacity Planners do we need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
24. Hadoop is a "zoo" of several different applications
Different points of view for analysis:
• Interest in fast response for "interactive workloads": CPU, memory, network, and I/O utilization levels to respond to queries in a quick and effective way
• Interest in high throughput for "batch workloads": maximize the utilization levels; response time is not the concern
• Interest in storage capacity: understand and plan the file system and HDFS
Different types of workload:
• Most companies are simply using Hadoop to store information (HDFS) for big data sets
• Vendors incorporate many other components: HDFS, Hive, Spark, Solr, Flume, etc.
• For example, there are significant differences between Hadoop and HBase workloads:
  - Hadoop MapReduce is a framework to process large sets of data using distributed and parallel algorithms
  - HBase is much better for real-time read/write/modify access to tabular data
25. Get your feet wet!
3 standard types of analyses: we'll check what's underneath each component to file them under 3 simple analyses we are all friends with:
a. interactive workloads > you are interested in a good response time
b. batch workloads > you are interested in maximizing utilization, optimal concurrency, and the best volume/duration ratio
c. storage > used/free space
For each component, let's make a summary of:
• how it works, so that we can focus on the type of workload
• what the bottlenecks could be, in the order we usually find them
• which technique, (a), (b), or (c), could apply
• what similar 'traditional' technology could be used as an analogy
26. Online vs streaming vs batch – frame the problem as you already know
http://www.hadoop360.com/blog/batch-vs-real-time-data-processing
27. HDFS: Hadoop Distributed File System
What it is:
• where the Hadoop cluster stores data; functions include storage of the file metadata, overseeing the health of datanodes, and coordinating access to data
• 2 main components:
  - NameNode: the master of HDFS, memory- and I/O-intensive
  - DataNode: manages the storage attached to the nodes
• HDFS is an append-only file system; it does not allow data modification
How to get started:
• HDFS is a write-once, read-many (or WORM-ish) filesystem: you can only append to a file, so it keeps growing and growing!
• NameNode: monitor the disk space available to the NameNode (local or remote when diversified storage is used for resilience, as recommended)
• DataNode: I/O is important; disk space is another dimension
28. HDFS: Hadoop Distributed File System / 2
Bottlenecks:
• Disk I/O (volume of IOPS and response time)
• Network bandwidth
• Storage space [you need about 4x the raw size of the data you will store in HDFS; however, on average we have seen compression ratios of up to 10-20x for the text files stored in HDFS, so the actual raw disk space required is only about 30-50% of the original uncompressed size; see the sketch below]
Capacity analysis approach: (a), (b), or (c)
Similar technology: at a high level, manage it as any logical storage device
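A back-of-the-envelope sketch of the storage bullet above, taking the slide's rules of thumb at face value (about 4x raw size for replication plus overhead, 10-20x compression on text). The numbers are the slide's estimates, not fixed laws:

```python
# HDFS raw-disk sizing from the slide's rules of thumb.
def hdfs_raw_disk_needed(uncompressed_tb, compression_ratio=10.0,
                         replication_overhead=4.0):
    """Compress first, then pay ~4x for 3-way replication plus headroom."""
    return uncompressed_tb / compression_ratio * replication_overhead

raw_tb = 100.0
for ratio in (10.0, 20.0):
    need = hdfs_raw_disk_needed(raw_tb, compression_ratio=ratio)
    pct = 100.0 * need / raw_tb
    print(f"{raw_tb:.0f} TB of text at {ratio:.0f}x compression -> "
          f"{need:.0f} TB raw disk ({pct:.0f}% of original)")
# The slide's 30-50% band corresponds to the lower end of the
# compression range; at 20x compression you land below it.
```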
29. YARN: Yet Another Resource Negotiator
What it is:
• YARN is the architectural center of Hadoop; it allows multiple data processing engines, such as interactive SQL, real-time streaming, data science, and batch processing, to handle data stored in a single platform
How to get started:
• Bottlenecks: for every component, disk I/O and network; for the NodeManager (slave), CPU
• Capacity analysis approach: (b)
• Similar technology: job scheduler
30. Map & Reduce
What it is:
• Remember: it is a programming paradigm, not a standalone application. It mainly consists of two phases:
  - In the Map phase, the main work is reading data blocks and splitting them into Map tasks for parallel processing; the results are temporarily stored in memory and on disk
  - In the Reduce phase, the work is concentrating the output for the same key in the same Reduce task, processing it, and emitting the final result
How to get started:
• Bottlenecks: JVM memory metrics; very much workload-dependent! You have to profile your application
31. Pig & Hive
What they are:
• Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce
• Pig is a high-level language for writing queries over large datasets. A query planner compiles queries written in this language (called "Pig Latin") into maps and reduces, which are then executed on a Hadoop cluster. Pig's main features are ease of programming, optimization opportunities, customization, and extensibility
How to get started:
• Possible bottlenecks: memory, disk I/O, network
• Capacity analysis approach: (a) or (b)
• Similar technology: data warehouse
32. HBase
What it is:
• HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets
• HBase runs directly on top of HDFS
• It scales linearly by requiring all tables to have a primary key. The key space is divided into sequential blocks that are then allotted to a region. RegionServers own one or more regions, so the load is spread uniformly across the cluster. HBase can further subdivide a region by splitting it automatically, so that manual data sharding is not necessary
How to get started:
• Possible bottlenecks: memory (be careful of swapping, JVM memory metrics, and GC: GC pauses longer than 60 seconds can cause a RegionServer to go offline), disk I/O (in case data is spooled to disk), network (latency)
• Capacity analysis approach: (a)
• Similar technology: distributed DBMS
33. Agenda
● Why as Capacity Planners do we need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
34. It sounds all good so far, but which metrics do I need? Laundry list – generic metrics
• CPU: utilization (user/sys/wio), load
• Memory: utilization; used (cached, user, sys); swap in/out
• Disk I/O: read/write ops rate; read/write byte rate
• Network: sent/received packets and bits
• Garbage Collection: collection counts and times; overhead (percentage of time spent in GC) is very important (see the sketch below)
• Heap memory: size, used; used after GC (much more valuable, since you can correlate it with workload); Perm Gen/Code Cache/Eden Space 'used'; PS Old/Perm Gen 'used'; Tenured Gen 'used'; PS Eden/Survivor/PS Survivor Space 'used'
• JVM threads: count, daemon count
• JVM files: open/max open files
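A sketch of the GC "overhead" metric called out above: the percentage of wall-clock time a JVM spends in garbage collection between two samples. The field names are illustrative; real values would come from JMX or jstat on the JVM in question:

```python
# GC overhead between two monitoring samples (illustrative field names).
def gc_overhead_pct(sample_prev, sample_curr):
    """Each sample: {'ts': seconds, 'gc_time_ms': cumulative GC millis}."""
    wall_ms = (sample_curr["ts"] - sample_prev["ts"]) * 1000.0
    gc_ms = sample_curr["gc_time_ms"] - sample_prev["gc_time_ms"]
    return 100.0 * gc_ms / wall_ms

prev = {"ts": 0, "gc_time_ms": 1_200}
curr = {"ts": 60, "gc_time_ms": 4_800}
print(f"{gc_overhead_pct(prev, curr):.1f}% of the interval spent in GC")  # 6.0%
```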
35. It sounds all good so far, but which metrics do I need? Laundry list – specific metrics
• HDFS NameNode: storage (total and used capacity); files created/total/deleted
• HDFS DataNode: fs bytes read/written; fs reads/writes from local/remote clients; "map reduce blocks": volume read/written/removed/replicated/verified; "map reduce block operations": copy/read/replace/write, average time/volume
• YARN ResourceManager: active/decommissioned/unhealthy NodeManagers; active applications/users; applications submitted, completed, failed, killed; applications pending, running; containers allocated/released/pending/reserved
• HBase: requests (total/read/write); memory store size and upper limit; flush queue length; compaction queue length
• ZooKeeper: sent/received packets; request latency; outstanding requests; JVM pool size
• Solr: request rate/latency; JVM pool size; added docs rate; query result cache size, hit %, response time; document cache size
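Many of these metrics can be scraped from the Hadoop daemons' built-in JMX-over-HTTP endpoint (/jmx). A minimal sketch for NameNode capacity follows; the host, the port (50070 is the classic NameNode web UI port), and the bean name should be checked against your distribution:

```python
# Pull HDFS capacity from the NameNode's /jmx endpoint (host, port and
# bean name are assumptions to verify against your cluster).
import json
from urllib.request import urlopen

def namenode_capacity(host="namenode.example.com", port=50070):
    url = (f"http://{host}:{port}/jmx"
           "?qry=Hadoop:service=NameNode,name=FSNamesystem")
    beans = json.load(urlopen(url))["beans"]
    fs = beans[0]                     # the FSNamesystem bean
    used_pct = 100.0 * fs["CapacityUsed"] / fs["CapacityTotal"]
    return fs["CapacityTotal"], fs["CapacityUsed"], used_pct

total, used, pct = namenode_capacity()
print(f"HDFS used: {used / 2**40:.1f} of {total / 2**40:.1f} TiB ({pct:.1f}%)")
```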
36. Headquarters
Via Schiaffino 11C
20158 Milan MI
Italy
T +39-024951-7001
USA East
One Boston Place, Floor 26
Boston, MA 02108
USA
T +1-617-936-0212
USA West
425 Broadway Street
Redwood City, CA 94063
USA
T +1-650-226-4274
moviri.com
37. Spark
● Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data within the shell. It is a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline use case.
● Features:
  - everything is in memory
  - data is held in memory as resilient distributed datasets (RDDs)
  - best for cyclic jobs (performance up to 100x better than Hadoop MapReduce on cyclic jobs)
● Possible bottlenecks: memory; network and disk I/O (remote/local files); CPU
● Capacity analysis approach: (a) or (b), depending on the workload
● Similar technology: similar to the generic Hadoop MapReduce case
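A minimal PySpark sketch of why "everything is in memory" helps cyclic jobs: a cached RDD is reused across iterations instead of being re-read from HDFS each pass. The path and the toy thresholding loop are placeholders:

```python
# Iterative reuse of a cached RDD (toy sketch; path is a placeholder).
from pyspark import SparkContext

sc = SparkContext(appName="cyclic-job-sketch")

# Load once, keep the working set in memory across iterations.
points = sc.textFile("hdfs:///data/points.txt").map(float).cache()

for i in range(5):
    threshold = 10.0 * i
    # Each pass reuses the cached RDD instead of re-reading from HDFS.
    above = points.filter(lambda x, t=threshold: x > t).count()
    print(f"values above {threshold}: {above}")

sc.stop()
```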
38. ZooKeeper
● Applications can leverage these services to coordinate distributed processing across large clusters. A very large Hadoop cluster can be supported by multiple ZooKeeper servers.
● Each client machine communicates with one of the ZooKeeper servers to retrieve and update its synchronization information. Often network and memory problems manifest themselves first in ZooKeeper.
● Possible bottlenecks: CPU wio; memory (JVM) latency, with GC pauses longer than 60 seconds able to cause a RegionServer to go offline; network (latency)
● Capacity analysis approach: (a)
● Similar technology: in-memory database
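A sketch of the coordination pattern the slide describes, using the kazoo Python client (an assumption; the slide does not prescribe a client library). The hosts, paths, and payload are placeholders:

```python
# Cluster-membership coordination via ZooKeeper (kazoo client assumed).
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181")
zk.start()

# Advertise this worker: an ephemeral node vanishes if the session dies,
# and a sequence node gets a unique monotonically increasing suffix.
zk.ensure_path("/app/workers")
zk.create("/app/workers/worker-", b"host=worker-7",
          ephemeral=True, sequence=True)

# React to membership changes; the callback fires on every join/leave.
@zk.ChildrenWatch("/app/workers")
def on_members_change(children):
    print("live workers:", sorted(children))

# ... do work; call zk.stop() on shutdown, which also removes the
# ephemeral node and releases the watch.
```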
39. Cassandra
● Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients
● Possible bottlenecks: memory; disk I/O; network
● Capacity analysis approach: (a) and (c)
● Similar technology: distributed DBMS
40. Solr
● Solr is highly scalable: it provides distributed search and index replication. It is one of the most popular enterprise search platforms
● Possible bottlenecks: memory (at the JVM level); CPU; disk I/O
● Capacity analysis approach: (a)
● Similar technology: distributed DBMS
Editor's Notes
Abstract
Hadoop is a zoo of different types of workloads; even if most companies are simply using Hadoop to store information (HDFS), there are many other applications, to name a few: HDFS, Hive, Pig, Impala, Spark, Solr, Flume.
Each animal in this zoo behaves differently; for example, there are significant differences between the two most common workloads, "MapReduce" and "HBase".
This leads to three main points of view for analysis to make sure service levels are achieved:
Interest in response time for "interactive workloads": CPU, memory, network, and I/O utilization levels to respond to queries in a quick and effective way
Interest in high throughput for "batch workloads": maximize the utilization levels, not interested in response time
Interest in planning storage capacity (filesystem and HDFS)
This talk focuses on providing guidelines for the capacity planner to understand how to translate existing techniques and frameworks and adapt them to these new technologies: in most cases, "what's old is new again".
Renato Bonomini, Lead of US operations for Moviri: an engineer at heart and by training at the Politecnico of Milan; my specialties are Digital Signal Processing and, in a previous life, High Performance Computing. Now I help companies achieve alignment between Business and IT using optimization techniques, Capacity Management, and Performance Management.
Mattia Berlusconi holds a Degree in IT Engineering from the Politecnico of Milan. In 2012 he joined the Consulting Department of Moviri, working in the Capacity Management Business Unit as an IT Performance Optimization consultant. He participates in national and international projects focused on designing and implementing capacity management solutions that allow customers to effectively manage the capacity of their on-premises and cloud IT environments. He likes photography and hiking in the mountains.
Giulia Rumi is a member of the Moviri Capacity Management Team. Giulia holds a MS degree in Computer Engineering from Politecnico of Milan with a thesis work focused on energy consumption in mobile devices. She joined Moviri straight out of college in 2015. She plays piano and likes sweets and comics.
Cray 1: 80 MFLOPS
Neptuny/Moviri field trip at the computer museum in San Jose, in front of the Cray 1
IBM 350: 3.56 Mb
Today: iPhone 5s GPU: 76.8 GFLOPS, 16 GB of storage
Applications: Predictive Analytics, Machine Learning
A few applications or analytics that revolutionized the way we live
- Moneyball
- Recommendation engines
analytics: the “new machine revolution”
http://www.transhumanist.com/volume1/moravec.htm “When will computer hardware match the human brain?”
In 1964, Isaac Asimov, wrote about a visit to the World’s Fair of 2014:
“The world of A.D. 2014 will have few routine jobs that cannot be done better by some machine than by any human being. Mankind will therefore have become largely a race of machine tenders.” http://reuvengorsht.com/2015/02/07/machines-replace-middle-management/
Bad news for humankind: it has already happened; think of Uber and other companies delegating middle management to algorithms
http://reuvengorsht.com/2015/02/07/machines-replace-middle-management
Early example of Map Reduce process – or late example of divide et impera principle?
Compare Map and Reduce functions between the Caesar example and the WordCount example typical of M&R
Compare MPI's Broadcast, Gather, Scatter, and Reduce with Map & Reduce's Map and Reduce.
A huge difference is how modern and enterprise-ready M&R is: ask a young developer to code in F77 (not the only choice, but a common one for MPI) or even in C, and compare that to the ease of development in, for example, Java.
3 important points of why M&R made a difference that MPI could not popularize
the MapReduce principle http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf
what is hadoop in the latest version http://www.slideshare.net/cloudera/introduction-to-yarn-and-mapreduce-2
diagram to illustrate architecture ‘hadoop architecture’ [Giulia’s paper and http://wiki.apache.org/hadoop/PoweredByYarn ]
HDFS -> YARN -> {MR2, Impala, Spark, Hbase, MPI, hive, pig}
What we are focusing on: top list (HDFS, YARN, MapReduce, Hive/Pig, Hbase)
HDFS (Hadoop distributed filesystem) is where Hadoop cluster stores data
YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform
MapReduce is a programming paradigm
Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3).
Pig A high-level data-flow language and execution framework for parallel computation.
Hbase is an open-source, distributed, versioned, column-oriented store that sits on top of HDFS.
Other interesting ones: Solr, Spark, ZooKeeper, Impala, Cassandra
Spark: Apache Spark is an open-source big data real-time processing framework built around speed, ease of use, and sophisticated analytics
ZooKeeper: an open-source Apache project that provides a centralized infrastructure and services that enable synchronization across a cluster. It maintains data like configuration information, hierarchical naming space, and so on.
Apache Cassandra: an open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
Solr: an open-source enterprise search platform from the Apache Lucene project. It provides full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document handling.
Why do you need to get on board soon?
Conversation heard at a company: "we'll start the new Hadoop cluster with 500 TB and then we'll see how much we need". There is a significant amount of resources ($$$) involved in these infrastructures.
As a capacity planner, don’t miss the boat!
Role of the Capacity Planner and Performance Analyst
Shouldn’t the ‘hadoop user/owner’ take care of this? Distributed machine learning is still an active research topic, It is related to both machine learning and systems
While hadoop users don’t develop systems, they need to know how to choose systems. An important fact is that existing distributed systems or parallel frameworks are not particularly designed for machine learning algorithms; hadoop users can
help to affect how systems are designed
design new algorithms for existing systems
What's available as guidelines: http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning and http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster
starting point planner: http://hortonworks.com/cluster-sizing-guide/
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
CURRENT PERFORMANCE ISSUES - research being developed; see
“Optimization Techniques within the Hadoop Eco-system: A Survey”, DOI: 10.1109/SYNASC.2014.65
scheduling performance: scheduling is one of the most important tasks in a multi-concurrent-task system; see the paper from our colleague Giulia (and others), “Optimization Techniques within the Hadoop Eco-system: A Survey”
this shows the typical optimization problems:
data locality
sticky slots problems
poor system utilization because of suboptimal distribution of tasks
unbalanced jobs
starvation and fairness issues (be fair to your users)
others: pushing the envelope with existing initiatives:
Starfish: http://www.cs.duke.edu/starfish/index.html [“Towards Automatic Optimization of MapReduce Programs”, Shivnath Babu, Duke University, Durham, North Carolina, USA, shivnath@cs.duke.edu]
documents from DHT: “Workload Dependent Hadoop MapReduce Application Performance Modeling”, Dominique A. Heger, www.cmg.org/wp-content/uploads/2013/07/m_101_61.pdf
“One size does not fit all”: even for classic MapReduce there is not a single behaviour
you have to know your workload characterization
“Hortonworks recommends that you either use the Balanced workload configuration or invest in a pilot Hadoop cluster and plan to evolve as you analyze the workload patterns in your environment.”
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/typical-workloads.html
3 standard types of analyses: we’ll check what’s underneath each component to file it under one of 3 simple analyses we are all friends with:
(a) interactive workload > you are interested in a good response time
(b) batch workloads > you are interested in maximizing utilization, optimal concurrency and best volume/duration ratio
(c) storage > used/free space
for each component, let’s try to make a summary of
how they work so that we can focus on the type of workload
what the bottlenecks could be, in the order we usually find them
what technique (a) (b) or (c) could apply
what similar ‘traditional’ technology could be used as analogy
HDFS has 2 main components
NameNode
it is the master of HDFS that directs the DataNode daemons to perform the low-level I/O tasks
it was a single point of failure for HDFS in Hadoop 1 (MR1); from v2 this is mitigated by HDFS High Availability (an active/standby NameNode pair); as an alternative, MapR has developed a "distributed NameNode," where the HDFS metadata is distributed across the cluster in "Containers"
it maps the blocks onto the datanodes
The function of the NameNode is memory and I/O intensive.
Memory hungry!
Monitor the JVM heap size (see the sketch below)
Monitor the disk space available to the NameNode (local or remote when diversified storage is used for resilience as recommended)
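A minimal sketch of how such monitoring could be automated, assuming the NameNode web UI on the default port 50070 (the hostname is hypothetical): Hadoop daemons expose their JMX beans as JSON under /jmx, including the JVM heap figures.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class NameNodeHeapCheck {
  public static void main(String[] args) throws Exception {
    // Query only the java.lang:type=Memory bean; the response is JSON
    // containing HeapMemoryUsage (used/committed/max), ready to be
    // parsed and fed into your capacity database
    URL url = new URL(
        "http://namenode.example.com:50070/jmx?qry=java.lang:type=Memory");
    try (BufferedReader in =
        new BufferedReader(new InputStreamReader(url.openStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}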
DataNode
usually there is a DataNode for each node in the cluster
it manages storage attached to the nodes
IO is important
disk space another dimension
HDFS is an append-only file system; it does not allow data modification
HDFS is a write-once, read-many (WORM-ish) filesystem: once a file is created, the filesystem API only allows you to append to the file, not to overwrite it >> it keeps growing and growing! (see the sketch at the end of this section)
possible bottlenecks:
Disk IO (volume of IOps and response time)
Network bandwidth
Storage
Capacity analysis approach: (a),(b) or (c)
Similar technology: high level, manage it as any logical storage device
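To make the write-once/append-only contract concrete, here is a minimal sketch against the HDFS FileSystem API (the path is made up, and on older releases append must be enabled via dfs.support.append):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWormExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/worm-demo.log");  // hypothetical path

    // A first writer creates the file...
    try (FSDataOutputStream out = fs.create(p)) {
      out.writeBytes("first record\n");
    }
    // ...later writers can only append to it; there is no API to
    // overwrite bytes in place, so the file only grows
    try (FSDataOutputStream out = fs.append(p)) {
      out.writeBytes("appended record\n");
    }
    fs.close();
  }
}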
YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform. YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost effective, linear-scale storage and processing.
YARN’s original purpose was to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:
a global ResourceManager
a per-application ApplicationMaster
a per-node slave NodeManager
a per-application Container running on a NodeManager
The ResourceManager and the NodeManager form the new generic system for managing applications in a distributed manner. The ResourceManager is the ultimate authority that arbitrates resources among all applications in the system. The ApplicationMaster is a framework-specific entity that negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the component tasks.

The ResourceManager has a scheduler, which is responsible for allocating resources to the various applications running in the cluster, according to constraints such as queue capacities and user limits. The scheduler schedules based on the resource requirements of each application.

Each ApplicationMaster is responsible for negotiating appropriate resource containers from the scheduler, tracking their status, and monitoring their progress. From the system perspective, the ApplicationMaster runs as a normal container. The NodeManager is the per-machine slave, responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager. The NodeManager and the DataNode run together on the same machine.
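For the capacity planner, the ResourceManager can also be queried programmatically. A minimal sketch using the YarnClient API to pull cluster-wide numbers (this is generic API usage, not one of our estimation tools):

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnClusterMetrics;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterMetricsProbe {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());  // reads yarn-site.xml from the classpath
    yarn.start();

    // Cluster-wide view from the ResourceManager
    YarnClusterMetrics metrics = yarn.getYarnClusterMetrics();
    System.out.println("NodeManagers: " + metrics.getNumNodeManagers());

    // One report per application known to the ResourceManager
    List<ApplicationReport> apps = yarn.getApplications();
    System.out.println("Applications: " + apps.size());

    yarn.stop();
  }
}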
Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive’s query language, HiveQL, compiles to MapReduce. It also allows user-defined functions (UDFs). Hive is widely used, and has itself become a "sub-platform" in the Hadoop ecosystem. It is best suited for batch jobs over large sets of append-only data; Apache Hive automatically manages the compilation, optimization, and execution of a HiveQL statement.
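A minimal sketch of that access path through the HiveServer2 JDBC driver; host, table and column names are invented for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // Default HiveServer2 port is 10000; the hostname is hypothetical
    try (Connection conn = DriverManager.getConnection(
            "jdbc:hive2://hiveserver.example.com:10000/default", "user", "");
         Statement stmt = conn.createStatement();
         // This HiveQL aggregate is compiled into MapReduce jobs by Hive
         ResultSet rs = stmt.executeQuery(
            "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}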
Pig is a high-level language for writing queries over large datasets. A query planner compiles queries written in this language (called "Pig Latin") into maps and reduces, which are then executed on a Hadoop cluster. Pig’s main features are ease of programming, optimization opportunities, customization, and extensibility. It abstracts the procedural style of MapReduce in the direction of the declarative style of SQL. A Pig program generally goes through three steps: load, transform, and store. First, the data the program works on is loaded (in Hadoop, the objects are stored in HDFS); then a set of transformations is applied to the loaded data, with the mappers and reducers handled transparently to the user; finally, if needed, the results are stored in a local file or in HDFS.
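The same load-transform-store pattern sketched through Pig’s embedded Java API (PigServer); the input path and field names are hypothetical:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigPipeline {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    // load: read records from HDFS
    pig.registerQuery(
        "logs = LOAD '/data/weblogs' AS (user:chararray, bytes:long);");
    // transform: grouping and aggregation; the mappers and reducers
    // are generated by Pig's query planner, not written by hand
    pig.registerQuery("byUser = GROUP logs BY user;");
    pig.registerQuery(
        "totals = FOREACH byUser GENERATE group, SUM(logs.bytes);");
    // store: write the result back to HDFS
    pig.store("totals", "/data/bytes_per_user");
    pig.shutdown();
  }
}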
possible bottlenecks:
Memory
Disk IO
Network
Capacity analysis approach: (a) or (b)
Similar technology: data warehouse
HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets, e.g. read/write operations that involve all rows but only a small subset of all columns. HBase runs directly on top of HDFS.
HBase scales linearly by requiring all tables to have a primary key. The key space is divided into sequential blocks that are then allotted to a region. RegionServers own one or more regions, so the load is spread uniformly across the cluster. If the keys within a region are frequently accessed, HBase can further subdivide the region by splitting it automatically, so that manual data sharding is not necessary.
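A minimal sketch of that column-oriented access pattern with the HBase 1.x client API; table, column family and qualifier names are hypothetical:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnScanExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
            ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("metrics"))) {
      // Touch every row, but ship only one column over the wire:
      // family "d", qualifier "cpu"
      Scan scan = new Scan();
      scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("cpu"));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(
              Bytes.toString(r.getRow()) + " = " + Bytes.toString(r.value()));
        }
      }
    }
  }
}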
possible bottlenecks:
Memory (be careful of swapping) - it is recommended to discourage swapping on HBase nodes and to enable GC logging to look for large GC pauses in the log; GC pauses longer than 60 seconds can cause a RegionServer to go offline
Disk IO (in case data is spooled to disk)
Network (latency)
Capacity analysis approach: (a)
Similar technology: distributed DBMS
Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data within the shell. It is a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline use case.
everything is in memory
data is held in memory as Resilient Distributed Datasets (RDDs), partitioned across the cluster
best for cyclic/iterative jobs: performance up to 100 times better than classic Hadoop MapReduce (see the sketch below)
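A minimal sketch in Spark’s Java API of why in-memory caching pays off for cyclic jobs (the application name and input path are hypothetical):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeExample {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("iterative-demo"));

    // cache() keeps the RDD partitions in memory after the first pass
    JavaRDD<String> lines = sc.textFile("hdfs:///data/events").cache();

    // Every later pass reuses the in-memory partitions instead of
    // re-reading the input from HDFS, which is where the speedup
    // over classic MapReduce comes from on iterative workloads
    for (int i = 0; i < 10; i++) {
      long errors = lines.filter(s -> s.contains("ERROR")).count();
      System.out.println("pass " + i + ": " + errors);
    }
    sc.stop();
  }
}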
possible bottlenecks:
Memory
Network + Disk IO (remote/local files)
CPU
Capacity analysis approach: (a) or (b) depending on the workload
Similar technology: similar to Hadoop MapReduce generic case