2. CONTENTS
Covered topics:
• What is profiling? How do profilers work?
• What problems can affect performance?
• How to profile a distributed application?
• Gathering, storing, and analyzing stack traces
• Memory analysis
• Use case
• Alternative approaches to profiling
3. WHAT IS A PROFILER?
A profiler is a tool that shows which parts of your application are running slowly.
Examples: VisualVM, YourKit.
4. HOW DO PROFILERS WORK?
• Instrumenting: adding extra bytecode to your methods to record when they are called and how long they execute.
• Sampling: periodically taking dumps of all the threads in order to understand how much CPU time each method takes.
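As a source-level illustration of the instrumenting approach, the sketch below shows what a bytecode rewriter effectively turns a method into: an injected prologue/epilogue that records elapsed time per method. The `record`/`TIMINGS` collector is a hypothetical stand-in, not code from any real profiler:

```java
import java.util.HashMap;
import java.util.Map;

public class InstrumentationSketch {
    // Hypothetical collector: sums elapsed nanoseconds per method name.
    static final Map<String, Long> TIMINGS = new HashMap<>();

    static void record(String method, long elapsedNanos) {
        TIMINGS.merge(method, elapsedNanos, Long::sum);
    }

    // The original method body, wrapped in entry/exit timing the way an
    // instrumenting profiler would rewrite it at the bytecode level.
    static long sumUpTo(int n) {
        long start = System.nanoTime();            // injected prologue
        try {
            long sum = 0;
            for (int i = 1; i <= n; i++) sum += i; // original body
            return sum;
        } finally {
            record("sumUpTo", System.nanoTime() - start); // injected epilogue
        }
    }

    public static void main(String[] args) {
        System.out.println(sumUpTo(100));                   // prints 5050
        System.out.println(TIMINGS.containsKey("sumUpTo")); // prints true
    }
}
```

Real instrumenting profilers do this rewriting on class bytes at load time, so the source never changes.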
5. DIFFICULTIES WITH A CLUSTER
This is a typical MapReduce application running on a Hadoop cluster. All blue boxes are separate JVM processes running on different machines. Question: how can we profile a distributed Java app?
1. How do we attach to a process running on another host?
2. How do we track the appearance of new processes?
3. How do we gather the profiling data?
4. How do we analyze this vast amount of data?
6. WHY DO WE NEED A CLUSTER PROFILER?
Answer: we need a profiler to get more performance.
The Hadoop principle is:
"If you want more performance, add more hardware."
This is a truth. But it is not the only truth.
Another truth is: there are problems common to ALL applications (both distributed and local).
7. PROBLEM 1: SUBOPTIMAL CODE
This is a simple example of two different algorithms solving the same task:

public static void QuickSort(int[] a, int x, int y) {
    int pivot = (x + y) / 2;
    int apivot = a[pivot];
    int i = x;
    int j = y;
    while (i <= j) {
        while (a[i] < apivot) i++;
        while (a[j] > apivot) j--;
        if (i <= j) {
            int temp = a[i];
            a[i] = a[j];
            a[j] = temp;
            i++;
            j--;
        }
    }
    if (x < j)
        QuickSort(a, x, j);
    if (i < y)
        QuickSort(a, i, y);
}

public static void StupidSort(int[] a) {
    for (int i = 0; i < a.length - 1; ++i)
        for (int j = i + 1; j < a.length; ++j)
            if (a[i] > a[j]) {
                int temp = a[i];
                a[i] = a[j];
                a[j] = temp;
            }
}

"Brute-force" sort: O(N^2). Quicksort: O(N*log(N)).
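To see the asymptotic gap in practice, one can time both methods on the same random input. The sketch below repeats the two sorts so it is self-contained; absolute timings depend on the machine, so none are claimed:

```java
import java.util.Arrays;
import java.util.Random;

public class SortBenchmark {
    // Same quicksort as on the slide.
    static void quickSort(int[] a, int x, int y) {
        int apivot = a[(x + y) / 2];
        int i = x, j = y;
        while (i <= j) {
            while (a[i] < apivot) i++;
            while (a[j] > apivot) j--;
            if (i <= j) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++; j--;
            }
        }
        if (x < j) quickSort(a, x, j);
        if (i < y) quickSort(a, i, y);
    }

    // Same O(N^2) "brute-force" sort as on the slide.
    static void stupidSort(int[] a) {
        for (int i = 0; i < a.length - 1; ++i)
            for (int j = i + 1; j < a.length; ++j)
                if (a[i] > a[j]) { int t = a[i]; a[i] = a[j]; a[j] = t; }
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(20_000, 0, 1_000_000).toArray();
        int[] a = data.clone(), b = data.clone();

        long t0 = System.nanoTime();
        quickSort(a, 0, a.length - 1);
        long quickMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        stupidSort(b);
        long stupidMs = (System.nanoTime() - t0) / 1_000_000;

        // Both produce the same sorted array; the O(N^2) version is far slower.
        System.out.println(Arrays.equals(a, b));
        System.out.println("quickSort: " + quickMs + " ms, stupidSort: " + stupidMs + " ms");
    }
}
```

This is exactly the kind of difference a profiler surfaces: the O(N^2) method dominates the CPU samples as the input grows.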
8. PROBLEM 2: BAD CODE/DATA
• Repeatedly doing the same unnecessary actions
Example: re-reading the configuration file or a database table again and again during every operation (although we could cache it in memory).
• Wrong usage of someone else's code/libraries/binaries
Example: sqoop can import from MySQL in two modes: direct mode (using mysqldump and mysqlimport) and JDBC mode. The first one is faster.
• Usage of the wrong libraries
Example: https://powercollections.codeplex.com/workitem/16950
I found that the famous Wintellect OrderedSet works 3 times slower than Microsoft's native SortedSet.
• Absence of indexes in a database
Example: "select * from fact join dim on fact.productid = dim.productid" is slow because the developers forgot to create keys/indexes.
• Bugs in famous libraries/frameworks
Example: http://ihorbobak.com/index.php/2015/06/03/spark-sql-bad-performance/ describes a problem with joining tables A->B->C when they are enumerated in the order A, C, B. This is handled fine by all database servers, but NOT by Spark SQL.
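The first bullet (re-reading a config on every operation) is the easiest to fix: load once and keep it in memory. A minimal sketch, where `loadFromDisk` is a hypothetical stand-in for the expensive file or database read:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class ConfigCache {
    // Counts how many times the expensive load actually happened.
    static final AtomicInteger LOADS = new AtomicInteger();

    // Hypothetical expensive operation (re-reading a file / DB table).
    static String loadFromDisk(String name) {
        LOADS.incrementAndGet();
        return "contents-of-" + name;
    }

    // Cache: each distinct config is loaded once; later calls hit memory.
    static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    static String getConfig(String name) {
        return CACHE.computeIfAbsent(name, ConfigCache::loadFromDisk);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) getConfig("site.xml"); // 1000 lookups
        System.out.println(LOADS.get()); // prints 1: loaded only once
    }
}
```

In a flame graph this shows up directly: the wide bar over the repeated read collapses after the cache is introduced.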
9. PROBLEM 3: HARDWARE TROUBLES
The two most important problems are:
• Disk problems (slow I/O speed)
• Network problems (slow bandwidth, packet loss)
10. CLUSTER PROFILER ARCHITECTURE
The pipeline: a Java process runs "injected" agent code that samples stack traces; the stack traces are sent every 10 seconds over HTTP; a set of Python/Perl scripts then turns them into visualizations in the form of flame graphs.
This is applicable to any Java process: mapper, reducer, etc., and not only to Hadoop: it can be Spark RDD code, Java web app code, etc.
11. HOW DOES THE JAVA AGENT WORK?
• The agent is bound to a Java process by specifying the -javaagent parameter, e.g.
java -javaagent:/path/agent.jar=parameters MainClass
or by overriding _JAVA_OPTIONS like this:
_JAVA_OPTIONS='-javaagent:/path/agent.jar=parameters'
• The agent's jar has a manifest with
Premain-Class: namespace.TheAgentClass
• "TheAgentClass" has a premain() method that executes before your main() and does the following:
– reads the parameters of the agent;
– constructs the profiler instances (based on the parameters);
– creates a ScheduledExecutorService (see java.util.concurrent) that does
scheduleAtFixedRate(worker, 0, 10, TimeUnit.SECONDS)
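A minimal sketch of such an agent class, assuming the jar's manifest names it in `Premain-Class`; the class and parameter names are illustrative, not the actual statsd-jvm-profiler code:

```java
import java.lang.instrument.Instrumentation;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SamplingAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        // 1. Read the agent parameters (e.g. "server=host,port=8086").
        System.out.println("agent args: " + agentArgs);

        // 2. Construct the sampling worker (real profilers collect and
        //    report stack traces here; this stub only counts threads).
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Runnable worker = () ->
                System.out.println("live threads: " + threads.getThreadCount());

        // 3. Schedule it on a daemon thread so the agent never keeps the
        //    JVM alive after the application's main() finishes.
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "profiler");
                    t.setDaemon(true);
                    return t;
                });
        scheduler.scheduleAtFixedRate(worker, 0, 10, TimeUnit.SECONDS);
    }
}
```

Packaged in a jar whose manifest contains `Premain-Class: SamplingAgent`, this runs before the application's main() exactly as described above.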
12. HOW DOES THE JAVA AGENT WORK?
The profiler thread collects stack traces 100 times per second using ThreadMXBean (part of JMX, a technology for monitoring and managing the JVM):
public void profile() {
    profileCount++;
    try {
        for (ThreadInfo thread : getAllRunnableThreads()) {
            if (thread.getStackTrace().length > 0) {
                String traceKey = StackTraceFormatter.formatStackTrace(thread.getStackTrace());
                if (filter.includeStackTrace(traceKey))
                    traces.increment(traceKey, 1);
            }
        }
    } catch (OutOfMemoryError ex) {
        // ... skipping code for handling OOM (just for safety)
    }
    if (profileCount == reportingFrequency) {
        profileCount = 0;
        recordMethodCounts();
    }
}
For more information about JMX, see:
https://docs.oracle.com/javase/tutorial/jmx/index.html
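The helper `getAllRunnableThreads()` is not shown on the slide; one plausible implementation with `ThreadMXBean`, plus a simple trace formatter, might look like this (a sketch, not the profiler's actual code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StackSampler {
    static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();
    static final Map<String, Long> traces = new HashMap<>();

    // Dump all threads with full stacks and keep only RUNNABLE ones.
    static List<ThreadInfo> getAllRunnableThreads() {
        List<ThreadInfo> result = new ArrayList<>();
        for (ThreadInfo info : THREADS.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.RUNNABLE)
                result.add(info);
        }
        return result;
    }

    // Format a stack bottom-up as "outer;inner;innermost", the shape
    // flame-graph tooling expects.
    static String formatStackTrace(StackTraceElement[] stack) {
        StringBuilder sb = new StringBuilder();
        for (int i = stack.length - 1; i >= 0; i--) {
            if (sb.length() > 0) sb.append(';');
            sb.append(stack[i].getClassName()).append('.')
              .append(stack[i].getMethodName());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One sampling tick: count each distinct runnable stack once.
        for (ThreadInfo t : getAllRunnableThreads())
            if (t.getStackTrace().length > 0)
                traces.merge(formatStackTrace(t.getStackTrace()), 1L, Long::sum);
        System.out.println("distinct traces: " + traces.size());
    }
}
```

Running this once per 10 ms tick and periodically flushing `traces` to a backend is the whole sampling loop in miniature.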
13. STATSD + MY CHANGES
I made a modification of the well-known StatsD JVM profiler: https://github.com/etsy/statsd-jvm-profiler
List of my changes:
• Added the jvmName and host tags to each stack trace;
• Optimized performance of the stack trace collection code;
• Improved stability: added catching of OutOfMemoryError;
• Added statistics showing how many lines and characters we pass to the backend;
• Seriously modified influxdb_dump.py: it now extracts data into a set of distinct files: one for each JVM, one for each host, and a total;
• Added extraction of memory information and rendering it as charts in R;
• Added call_tree.py, a script for analyzing method call trees;
• Added some helper scripts.
14. INFLUXDB
What is InfluxDB?
It is a time series, metrics, and analytics database.
Targeted at: gathering metrics (like response times, CPU load), sensor data, events (like exceptions), and real-time analytics.
Key features:
• SQL-like query language;
• HTTP(S) API for data ingestion and queries;
• Built-in support for other data protocols such as collectd;
• Has a CLI and a web interface;
• Tagged data for fast and efficient queries.
16. INFLUXDB QUERY EXAMPLES
Schema exploration examples:
• SHOW MEASUREMENTS
shows the list of measurements
• SHOW SERIES FROM /.*cpu.*/
shows the list of series for each measurement whose name matches the pattern /.*cpu.*/
• SHOW TAG KEYS FROM /.*heap.*/
shows the distinct tag keys from measurements that match the pattern
• SHOW TAG VALUES FROM /.*cpu.*/ WITH KEY = jvmName
shows the distinct values of the jvmName tag from measurements that match the pattern
Data exploration examples:
• SELECT * FROM cpu WHERE host = 'A'
selects series from the "cpu" measurement with tag host='A'
• SELECT percentile(value, 95) FROM response_times
WHERE time > now() - 1d
GROUP BY time(1m)
shows the 95th percentile of response times over the last day in 1-minute intervals
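Such queries can also be issued programmatically through InfluxDB's HTTP API (the /query endpoint on the default port 8086). A small Java sketch that only builds the request URL; the host and database names are placeholders for your own installation:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class InfluxQuery {
    // Build a query URL for InfluxDB 0.9's HTTP API.
    static String queryUrl(String host, String db, String q) {
        try {
            return "http://" + host + ":8086/query?db="
                    + URLEncoder.encode(db, "UTF-8")
                    + "&q=" + URLEncoder.encode(q, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String url = queryUrl("influx-host", "profiler", "SHOW MEASUREMENTS");
        System.out.println(url);
        // Fetching this URL (e.g. with java.net.HttpURLConnection) returns
        // the result set as JSON.
    }
}
```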
17. FLAME GRAPHS
Gathered stack traces (one sample every 10 ms):
A->B->C
A->B->C->D
A->B->C->D
A->B

Stacked by sample time (root frame A at the bottom, one column per sample); in the flame graph, identical adjacent frames are merged into one wider bar:

       D    D
  C    C    C
  B    B    B    B
  A    A    A    A
 0ms  10ms 20ms 30ms
THE WIDTH OF A BAR MATTERS.
Color doesn't matter and is selected just to distinguish bars.
18. FLAME GRAPHS
Flame graphs are a visualization of profiled software, allowing the most frequent code paths to be identified quickly and accurately.
Invented by Brendan Gregg: http://www.brendangregg.com
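Flame-graph tooling such as Brendan Gregg's flamegraph.pl consumes "collapsed" stacks: one line per unique stack, frames joined by ';', followed by a sample count. A sketch that folds the sampled traces from the previous slide into that format:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StackCollapse {
    // Fold a list of sampled stacks (root-first, e.g. "A->B->C") into the
    // collapsed format flamegraph.pl consumes: "A;B;C <count>".
    static Map<String, Integer> collapse(List<String> samples) {
        Map<String, Integer> folded = new LinkedHashMap<>();
        for (String s : samples)
            folded.merge(s.replace("->", ";"), 1, Integer::sum);
        return folded;
    }

    public static void main(String[] args) {
        List<String> samples =
                List.of("A->B->C", "A->B->C->D", "A->B->C->D", "A->B");
        collapse(samples).forEach((stack, count) ->
                System.out.println(stack + " " + count));
        // Output:
        // A;B;C 1
        // A;B;C;D 2
        // A;B 1
    }
}
```

Piping such output into flamegraph.pl produces the SVG; the counts become the bar widths, which is why the width of a bar matters.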
19. SEQUENCE OF ACTIONS
Steps to profile a cluster:
1. Install InfluxDB on a separate machine visible to all machines of the cluster. Create a database and a user.
2. Get the agent's jar file from my blog (or build it from sources) and put it into /var/lib on every worker node.
3. Change the configuration of the cluster: make _JAVA_OPTIONS="-javaagent…" available to all JVM processes.
4. Run your application and get the stack traces into InfluxDB. You may "switch off" _JAVA_OPTIONS after this.
5. Get the SVG files (flame graphs) from InfluxDB with the help of influxdb_dump.py and flamegraph_files.sh and do the analysis.
These steps are described in detail on my blog: http://ihorbobak.com
21. USE CASE WITH A REAL CUSTOMER
The app/inventory/environment:
• Our customer has an app that crawls data from a set of sites, parses it, and puts it into a Hadoop cluster (20 machines with 8 cores, 32 GB RAM, and 1 TB HDD each).
• The app leverages Apache Nutch, Cloudera Hadoop distribution version 5.3, HBase, MongoDB, and other technologies.
• There is a central Java web app (Java/Tomcat) that uses Nutch, which runs the MapReduce jobs.
The problem:
• The cluster crawls just 100 sites per day; the customer asked us: "how can we make it crawl 10 times more on the same hardware?"
22. FIRST FINDINGS
The first question that arose in my head: what exactly works slowly?
At the beginning I quickly found this: the slow parts are the ones that are I/O intensive.
24. FETCHER MAPREDUCE JOB
% of CPU time:
15% - HTML parsing
15% - Hadoop framework initialization code
7% - HDFS initialization code
22% - reducer code (BAD NEWS HERE)
18% - reading Hadoop XML config files
23% - real job
25. DRILL DOWN INTO THE REDUCER
Inside FetcherReducer.run(), the main time consumers are:
• org.apache.hadoop.hbase.catalog.MetaReader.fullScan()
• org.apache.avro.Schema$Parser.parse() (parsing the Avro schema), ending with ZipFile.read(), ZipFile.getEntry(), etc.
• org.apache.hadoop.hbase.client.HConnectionManager.createConnection()
• creating a record writer
26. DRILL DOWN INTO THE RECORD WRITER
This is Gora library code. The most visible function calls on top are:
java.util.zip.*
FileInputStream*
FileOutputStream*
28. INEFFECTIVE MEMORY MANAGEMENT
Most of the Java processes used significantly less memory than they were initially assigned.
Legend:
• init - the initial amount of memory that the JVM requests from the OS during startup;
• used - the amount of memory currently used;
• committed - the amount of memory that is guaranteed to be available for use by the Java virtual machine;
• max - the maximum amount of memory (in bytes) that can be used for memory management.
A memory allocation may fail if it attempts to increase the used memory such that used > committed, even if used <= max would still be true.
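The four numbers in the legend come straight from JMX; a minimal sketch reading them for the heap via MemoryMXBean:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapStats {
    public static void main(String[] args) {
        // The same four numbers the memory charts are built from.
        MemoryUsage heap =
                ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("init:      " + heap.getInit());
        System.out.println("used:      " + heap.getUsed());
        System.out.println("committed: " + heap.getCommitted());
        System.out.println("max:       " + heap.getMax());
        // Invariants: used <= committed, and committed <= max
        // (when max is defined, i.e. not -1).
    }
}
```

Sampling these values periodically from the agent is enough to draw the init/used/committed/max charts for every JVM in the cluster.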
29. PROBLEMS AND NEXT STEPS
1) Gora + HBase
Reasons: bad code in Gora (too many metadata full table scans).
Actions:
• check Gora's configuration, dive into the code to find out why it does full scans;
• try Cassandra instead of HBase.
2) Hadoop framework parts, in particular:
• HDFS initialization in MapReduce jobs (slow communication with the Namenode);
• reading configuration files (done with the Xerces library).
Possible reasons:
• bad I/O speed and bad network speed;
• there may be some parameterization of the XML config parsing that we're not aware of.
Actions:
• fix the hardware issues;
• search for why Hadoop XML config parsing may be so slow;
• check Namenode memory usage.
30. OTHER METHODS OF GETTING STACK TRACES
Another method to get stack traces is Linux's perf_events:
perf record -F 99 -g -p PID
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
Perf monitors:
• hardware events (e.g. level 2 cache misses);
• software events (e.g. CPU migrations);
• tracepoint events (e.g. filesystem I/O, TCP events).
Perf can also do:
• sampling: collecting snapshots at some frequency (by timer);
• dynamic tracing: instrumenting code to create events in any location (using the kprobes or uprobes frameworks).
For more details see: http://www.brendangregg.com/perf.html
31. PERF vs. JAVA AGENT
Advantages of perf over the Java agent:
• low overhead when getting stack traces;
• combines user (Java) calls and kernel calls in one flame graph;
• will catch 100% of Java methods (no matter that the JVM may exclude safepoint checks from hot methods); http://chriskirk.blogspot.com/2013/09/what-is-java-safepoint.html is a good explanation of safepoints.
Disadvantages of perf:
• cannot get Java stack traces out of the box (it is necessary to fix frame-pointer-based stack walking in OpenJDK, as done by Netflix and Twitter);
• doesn't see Java symbols (hex numbers instead; a special agent is needed to add symbols: https://github.com/jrudolph/perf-map-agent);
• permissions to the symbol files must be configured;
• it is necessary to develop a service that will launch perf, get the stack traces, and pass them to a server.
32. PERF vs. JAVA AGENT
And… it happens that Netflix's product is open sourced…
33. CREDITS
Andrew Johnson
Software Engineer at Etsy
Previously: Explorys, Inc.
https://www.linkedin.com/in/ajsquared
Brendan Gregg
Senior Performance Architect at Netflix
Previously: Joyent, Oracle, Sun Microsystems
http://www.brendangregg.com/index.html
34. BLOGS/ARTICLES
Blogs:
• My blog article
http://ihorbobak.com/index.php/2015/08/05/cluster-profiling/
• Etsy's blog about the StatsD JVM Profiler
https://codeascraft.com/2015/01/14/introducing-statsd-jvm-profiler-a-jvm-profiler-for-hadoop/
https://codeascraft.com/2015/05/12/four-months-of-statsd-jvm-profiler-a-retrospective/
• Brendan Gregg's blog
http://www.brendangregg.com/blog/index.html
Source code:
• My modification of the StatsD JVM Profiler
https://github.com/ibobak/statsd-jvm-profiler
• Etsy's original StatsD JVM Profiler
https://github.com/etsy/statsd-jvm-profiler
• Brendan Gregg's FlameGraph
https://github.com/brendangregg/FlameGraph
Manuals:
• InfluxDB docs
https://influxdb.com/docs/v0.9/introduction/overview.html
• Overview of the JMX technology
https://docs.oracle.com/javase/tutorial/jmx/overview/index.html
• JVM Tool Interface
http://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#starting
35. BOOKS / VIDEOS
• Systems Performance: Enterprise and the Cloud
by Brendan Gregg
http://www.amazon.com/Systems-Performance-Enterprise-Brendan-Gregg/dp/0133390098
• Blazing Performance with Flame Graphs
by Brendan Gregg
https://www.youtube.com/watch?v=nZfNehCzGdw
• Linux Profiling at Netflix
by Brendan Gregg
https://www.youtube.com/watch?v=_Ik8oiQvWgo
• Profiling Java in Production
by Kaushik Srenevasan, Twitter University
https://www.youtube.com/watch?v=Yg6_ulhwLw0