Ihor Bobak
Lead Software Engineer, EPAM Systems
AUGUST 27, 2015
Covered topics:
• What is profiling? How do profilers work?
• What problems can affect performance?
• How to profile a distributed application?
• Gathering, storing and analysis of stack traces
• Memory analysis
• Use Case
• Alternative approaches to profiling
A profiler is a tool that shows which parts of your app run slowly.
Examples: VisualVM, YourKit.
• Instrumenting: adding extra bytecode to your
methods for recording when they’re called and
how long they execute.
• Sampling: taking dumps of all the threads
periodically in order to understand how much
CPU time each method takes.
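To make the sampling approach concrete, here is a minimal sketch that takes periodic thread dumps with the standard Thread.getAllStackTraces() call and counts how often each method is on top of a stack. It is not the code of any real profiler: the class name SamplingSketch, the number of samples and the interval are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of a sampling profiler; not taken from any real tool. */
final class SamplingSketch {

    /** Takes `samples` thread dumps, `intervalMs` apart, and counts the top frame of every non-empty stack. */
    static Map<String, Integer> sample(int samples, long intervalMs) {
        Map<String, Integer> hits = new HashMap<>();
        for (int n = 0; n < samples; n++) {
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                if (stack.length == 0) continue;      // thread has no Java frames right now
                StackTraceElement top = stack[0];      // frame that was executing when sampled
                hits.merge(top.getClassName() + "." + top.getMethodName(), 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();    // keep the interrupt flag and stop sampling
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Methods with the most hits are where threads were observed most often.
        sample(20, 10).forEach((method, count) -> System.out.println(count + "\t" + method));
    }
}
```

This sketch counts every thread, including waiting ones, so it approximates wall-clock time rather than CPU time; a profiler that wants CPU time would look only at runnable threads.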
This is a typical MapReduce application running on a Hadoop cluster.
All blue boxes are separate JVM processes running on different
machines. Question: how can we profile a distributed Java app?
1. How to attach to a process
running on another host?
2. How to track the appearance
of new processes?
3. How to gather profiling data?
4. How to analyze this vast
amount of data?
Why do we need a cluster profiler? To get more performance.
The Hadoop principle is:
“If you want more performance, add more hardware”.
This is true, but it is not the whole truth.
There are also problems that affect ALL applications
(both distributed and local).
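public static void QuickSort(int[] a, int x, int y){
    int pivot = x + (y - x) / 2;   // midpoint, written to avoid int overflow
    int apivot = a[pivot];
    int i = x;
    int j = y;
    // Partition: smaller elements go left of the pivot value, larger go right
    while (i <= j){
        while (a[i] < apivot) i++;
        while (a[j] > apivot) j--;
        if (i <= j){
            int temp = a[i];
            a[i] = a[j];
            a[j] = temp;
            i++;
            j--;
        }
    }
    // Sort both partitions recursively
    if (x < j)
        QuickSort(a, x, j);
    if (i < y)
        QuickSort(a, i, y);
}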
public static void StupidSort(int[] a){
    // Compare every pair of positions and swap when out of order
    for (int i = 0; i < a.length - 1; ++i)
        for (int j = i + 1; j < a.length; ++j)
            if (a[i] > a[j]){
                int temp = a[i];
                a[i] = a[j];
                a[j] = temp;
            }
}
Brute-force (“tupo-v-lob”) sort: O(N^2). Quicksort: O(N*log(N)) on average.
This is a simple example of different algorithms solving the same task.
• Repeatedly doing the same unnecessary actions
Example: re-reading the configuration file or a database table again and again during every operation (although we
could cache it in memory).
• Wrong usage of someone’s code/libraries/binaries
Example: Sqoop can import from MySQL in two modes: direct mode (using mysqldump and mysqlimport) and JDBC
mode. The first one is faster.
• Usage of wrong libraries
Example: I found that Wintellect’s well-known OrderedSet works 3 times slower than Microsoft’s native SortedSet.
• Absence of indexes in a database
Example: “select * from fact join dim on fact.productid = dim.productid” is slow because the developers did not
create keys/indexes.
• Bugs in well-known libraries/frameworks
Example: a problem with joining tables A->B->C
when they are enumerated in the order A, C, B. This is handled fine by all database servers, but NOT by Spark SQL.
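The first item in the list above (re-reading a configuration file on every operation) is usually fixed by loading the data once and keeping it in memory. A minimal sketch, assuming a hypothetical CachedConfig class and a loader supplied by the caller; no real library API is implied:

```java
import java.util.Properties;
import java.util.function.Supplier;

/** Hypothetical helper: loads a configuration once and serves later reads from memory. */
final class CachedConfig {
    private final Supplier<Properties> loader;   // the expensive read (file, database table, ...)
    private Properties cached;                   // null until the first call to get()
    private int loads;                           // how many times the loader actually ran

    CachedConfig(Supplier<Properties> loader) {
        this.loader = loader;
    }

    /** Returns the configuration, invoking the loader only on the first call. */
    synchronized Properties get() {
        if (cached == null) {
            cached = loader.get();
            loads++;
        }
        return cached;
    }

    /** Number of times the expensive loader was executed. */
    synchronized int loadCount() {
        return loads;
    }
}
```

In a flame graph, the uncached variant typically shows up as a wide bar of file or XML reading code under every operation; with the cache, that bar shrinks to a single load.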
The two most important problems are:
• Disk problems (slow I/O speed)
• Network problems (low bandwidth, packet loss)
Cluster profiler architecture:
• Java process with “injected” code that samples stack traces
• Stack traces are passed every 10 seconds through HTTP
• A set of Python/Perl scripts produces the visualizations
• Visualization in the form of flame graphs
This is applicable to any Java process: mapper, reducer, etc.
It is not limited to Hadoop: it can be Spark RDD
code, Java web app code, etc.
• The agent is bound to a Java process by specifying the -javaagent parameter, e.g.
java -javaagent:/path/agent.jar=parameters MainClass
or by overriding _JAVA_OPTIONS like this:
_JAVA_OPTIONS='-javaagent:/path/agent.jar=parameters'
• The agent’s jar has a manifest with
Premain-Class: namespace.TheAgentClass
• “TheAgentClass” has a premain() method that executes before your
main() and does the following:
– Reads the parameters of the agent
– Constructs the profiler instances (based on the parameters)
– Creates a ScheduledExecutorService (see java.util.concurrent) that calls
scheduleAtFixedRate(worker, 0, 10, TimeUnit.SECONDS)
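The steps above can be sketched as a small agent class. This is an illustration under assumptions, not the code of the StatsD JVM Profiler: the class name AgentSketch, the "key=value,key=value" parameter format and the placeholder worker are all hypothetical. Only the premain(String, Instrumentation) signature and the executor calls are standard Java.

```java
import java.lang.instrument.Instrumentation;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical agent class. In a real agent jar it would be declared public
 * and named in the Premain-Class attribute of the manifest.
 */
final class AgentSketch {

    /** Parses "key=value,key=value" agent parameters (this format is an assumption). */
    static Map<String, String> parseArgs(String args) {
        Map<String, String> result = new HashMap<>();
        if (args == null || args.isEmpty()) return result;
        for (String pair : args.split(",")) {
            String[] kv = pair.split("=", 2);
            result.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return result;
    }

    /** Called by the JVM before main() when the process is started with -javaagent. */
    public static void premain(String args, Instrumentation instrumentation) {
        Map<String, String> params = parseArgs(args);
        // Placeholder worker; a real agent would report collected data to its backend here.
        Runnable worker = () -> System.err.println("reporting with parameters " + params);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "agent-sketch");
            t.setDaemon(true);   // do not keep the JVM alive after main() returns
            return t;
        });
        scheduler.scheduleAtFixedRate(worker, 0, 10, TimeUnit.SECONDS);
    }
}
```

Running the scheduler on a daemon thread is a deliberate choice in this sketch: the profiled application should be able to exit normally even though the agent's periodic task never finishes on its own.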
The profiler thread collects stack traces 100 times per second using ThreadMXBean (a
part of JMX, a technology for monitoring and managing the JVM)
public void profile() {
    profileCount++;
    try {
        for (ThreadInfo thread : getAllRunnableThreads()) {
            if (thread.getStackTrace().length > 0) {
                String traceKey = StackTraceFormatter.formatStackTrace(thread.getStackTrace());
                if (filter.includeStackTrace(traceKey))
                    traces.increment(traceKey, 1);
            }
        }
    } catch (OutOfMemoryError ex) {
        // ... skipping code for handling OOM (just for safety)
    }
    if (profileCount == reportingFrequency) {
        profileCount = 0;
        recordMethodCounts();
    }
}
For more information about JMX, see “Overview of the JMX Technology” in the reference list at the end.
I modified the well-known StatsD JVM Profiler (originally by Etsy).
List of my changes:
• Added the jvmName and host tags to each stack trace;
• Optimized performance of the stack trace collection code;
• Improved stability: added catching of OutOfMemoryError;
• Added statistics to show how many lines and characters we pass to the backend;
• Seriously modified the extraction script: now it extracts data into a set of distinct
files, one file for each JVM, one for each host, and a total;
• Added extraction of memory information and rendering it with charts in R;
• Added a script for analysis of the method call trees;
• Added some helper scripts.
What is InfluxDB?
It is a time series, metrics, and analytics database.
Targeted at:
gathering metrics (like response times, CPU load), sensor
data, events (like exceptions) and real-time analytics.
Key Features:
• SQL-like query language;
• HTTP(S) API for data ingestion and queries;
• Built-in support for other data protocols such as collectd;
• Has a CLI and web interface;
• Tag data for fast and efficient queries.
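Data model:
• Measurements (analogous to tables), with tag keys:values and timestamps
• Series: measurement name + tag key-values + data values
• Queried with an SQL-like query language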
Schema exploration examples:
• SHOW MEASUREMENTS
shows the list of measurements
• SHOW SERIES FROM /.*cpu.*/
shows the list of series for each measurement whose name matches the
pattern /.*cpu.*/
• SHOW TAG KEYS FROM /.*heap.*/
shows the distinct tag keys of the measurements that match the pattern
• SHOW TAG VALUES FROM /.*cpu.*/ WITH KEY = jvmName
shows the distinct values of the jvmName tag in the measurements that match the pattern
Data exploration examples:
• SELECT * FROM cpu WHERE host = 'A'
selects series for the “cpu” measurement with tag host='A'
• SELECT percentile(value, 95) FROM response_times
WHERE time > now() - 1d
GROUP BY time(1m)
shows the 95th percentile of response times over the last day in 1-minute
intervals
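Queries like these can be sent over the HTTP API listed among the key features. A minimal sketch, assuming an InfluxDB 0.9-style /query endpoint that takes the database name in db and the statement in q. The host name, port and database name are placeholders, and the class InfluxQueryUrl is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds the URL of an InfluxDB HTTP query request. */
final class InfluxQueryUrl {

    /** Joins the base URL with URL-encoded db and q parameters. */
    static String build(String baseUrl, String database, String query) {
        return baseUrl + "/query?db=" + encode(database) + "&q=" + encode(query);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");   // form encoding: a space becomes '+'
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder host and database name; adjust to your installation.
        System.out.println(build("http://influx-host:8086", "profiler", "SHOW MEASUREMENTS"));
    }
}
```

Encoding the statement matters because InfluxQL statements contain spaces, quotes and regular expressions that are not valid in a raw URL.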
Gathered stack traces (sampled at 0, 10, 20 and 30 ms):
A->B->C
A->B->C->D
A->B->C->D
A->B
The width of a bar matters.
Color doesn’t matter and is selected just to distinguish bars.
Flame graphs are a visualization of profiled software, allowing the
most frequent code-paths to be identified quickly and accurately.
Invented by Brendan Gregg.
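Before a flame graph is drawn, identical stacks are collapsed and counted; the count becomes the bar width. The sketch below does this for the four sampled stacks A->B->C, A->B->C->D, A->B->C->D and A->B, producing the "folded" text form (frames joined by semicolons, then a space and a count) that Brendan Gregg's flamegraph.pl script reads. The class name FoldedStacks is hypothetical and this is not the code of the profiler's own scripts.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: collapses sampled stacks into "frame;frame;frame count" entries. */
final class FoldedStacks {

    /** Each sample lists frames from the root (e.g. "A") to the leaf. */
    static Map<String, Integer> fold(List<List<String>> samples) {
        Map<String, Integer> counts = new TreeMap<>();   // sorted, so the output order is stable
        for (List<String> stack : samples) {
            counts.merge(String.join(";", stack), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> samples = Arrays.asList(
                Arrays.asList("A", "B", "C"),
                Arrays.asList("A", "B", "C", "D"),
                Arrays.asList("A", "B", "C", "D"),
                Arrays.asList("A", "B"));
        // One line per distinct stack: frames, a space, and the number of samples.
        fold(samples).forEach((stack, count) -> System.out.println(stack + " " + count));
    }
}
```

In the resulting graph, A and B are 4 samples wide, C is 3 and D is 2, which is why the widest bars point at the code paths seen most often.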
Steps to profile a cluster:
1. Install InfluxDB on a separate machine visible to all machines of the cluster.
Create a database and a user.
2. Get the agent’s jar file from my blog (or build it from sources) and put it into
/var/lib on every worker node.
3. Change the configuration of the cluster: make _JAVA_OPTIONS='-javaagent…'
available to all JVM processes.
4. Run your application and collect the stack traces in InfluxDB. You may
“switch off” the agent after this.
5. Get the SVG files (flame graphs) from InfluxDB with the help of the scripts and do the analysis.
These steps are described in detail on my blog.
The App/Inventory/Environment:
• Our customer has an app that crawls data from a set of sites, parses it
and stores it in a Hadoop cluster (20 machines with 8 cores, 32GB RAM
and 1TB HDD each).
• The app leverages Apache Nutch, Cloudera Hadoop distribution
version 5.3, HBase, MongoDB and other technologies.
• There is a central Java web app (Java/Tomcat) that uses Nutch, which
runs the MapReduce jobs.
The problem:
• The cluster crawls just 100 sites per day; the customer is asking us
“how to make it crawl 10 times more on the same hardware?”
The first question that came to mind: what exactly is slow?
I quickly found that the slow parts are the I/O-intensive ones.
Then I ran I/O monitoring and a series of disk speed tests on the nodes.
This is the result of the IOPS benchmark:
Cluster node:
512 B blocks: 80.9 IO/s, 40.4 KiB/s (331.3 kbit/s)
1 KiB blocks: 97.9 IO/s, 97.9 KiB/s (802.1 kbit/s)
2 KiB blocks: 83.8 IO/s, 167.5 KiB/s ( 1.4 Mbit/s)
4 KiB blocks: 72.3 IO/s, 289.2 KiB/s ( 2.4 Mbit/s)
8 KiB blocks: 69.8 IO/s, 558.7 KiB/s ( 4.6 Mbit/s)
16 KiB blocks: 69.4 IO/s, 1.1 MiB/s ( 9.1 Mbit/s)
32 KiB blocks: 58.2 IO/s, 1.8 MiB/s ( 15.3 Mbit/s)
64 KiB blocks: 54.3 IO/s, 3.4 MiB/s ( 28.5 Mbit/s)
128 KiB blocks: 45.9 IO/s, 5.7 MiB/s ( 48.1 Mbit/s)
256 KiB blocks: 38.7 IO/s, 9.7 MiB/s ( 81.1 Mbit/s)
512 KiB blocks: 29.0 IO/s, 14.5 MiB/s (121.8 Mbit/s)
1 MiB blocks: 18.3 IO/s, 18.3 MiB/s (153.2 Mbit/s)
2 MiB blocks: 10.3 IO/s, 20.7 MiB/s (173.6 Mbit/s)
4 MiB blocks: 5.7 IO/s, 22.8 MiB/s (191.7 Mbit/s)
8 MiB blocks: 4.8 IO/s, 38.8 MiB/s (325.2 Mbit/s)
16 MiB blocks: 2.0 IO/s, 32.6 MiB/s (273.8 Mbit/s)
32 MiB blocks: 0.8 IO/s, 27.0 MiB/s (226.1 Mbit/s)
My local VM:
512 B blocks: 861.1 IO/s, 430.5 KiB/s ( 3.5 Mbit/s)
1 KiB blocks: 1084.7 IO/s, 1.1 MiB/s ( 8.9 Mbit/s)
2 KiB blocks: 836.6 IO/s, 1.6 MiB/s ( 13.7 Mbit/s)
4 KiB blocks: 698.4 IO/s, 2.7 MiB/s ( 22.9 Mbit/s)
8 KiB blocks: 755.7 IO/s, 5.9 MiB/s ( 49.5 Mbit/s)
16 KiB blocks: 909.1 IO/s, 14.2 MiB/s (119.2 Mbit/s)
32 KiB blocks: 784.9 IO/s, 24.5 MiB/s (205.7 Mbit/s)
64 KiB blocks: 747.9 IO/s, 46.7 MiB/s (392.1 Mbit/s)
128 KiB blocks: 593.2 IO/s, 74.2 MiB/s (622.0 Mbit/s)
256 KiB blocks: 441.4 IO/s, 110.4 MiB/s (925.8 Mbit/s)
512 KiB blocks: 423.3 IO/s, 211.6 MiB/s ( 1.8 Gbit/s)
1 MiB blocks: 295.1 IO/s, 295.1 MiB/s ( 2.5 Gbit/s)
2 MiB blocks: 159.1 IO/s, 318.3 MiB/s ( 2.7 Gbit/s)
4 MiB blocks: 103.2 IO/s, 412.6 MiB/s ( 3.5 Gbit/s)
8 MiB blocks: 46.6 IO/s, 372.8 MiB/s ( 3.1 Gbit/s)
16 MiB blocks: 23.4 IO/s, 374.0 MiB/s ( 3.1 Gbit/s)
32 MiB blocks: 11.9 IO/s, 381.9 MiB/s ( 3.2 Gbit/s)
The cluster node is about 10 times slower than a VM running on my development
workstation (the host is Core i7/32GB/1TB, the guest is a 3-core VM with 16GB RAM).
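The benchmark figures can be sanity-checked with simple arithmetic: throughput is the I/O rate multiplied by the block size, and the slowdown is the ratio of the two I/O rates at the same block size. A small sketch using the 4 KiB rows from the results above; the class name IopsCheck is hypothetical.

```java
/** Hypothetical helper for sanity-checking benchmark rows. */
final class IopsCheck {

    /** Throughput in KiB/s for a given I/O rate and block size in KiB. */
    static double throughputKiB(double iops, double blockKiB) {
        return iops * blockKiB;
    }

    /** How many times faster the second system is at the same block size. */
    static double speedup(double slowIops, double fastIops) {
        return fastIops / slowIops;
    }

    public static void main(String[] args) {
        // 4 KiB row: cluster node 72.3 IO/s, local VM 698.4 IO/s
        System.out.println("cluster node KiB/s: " + throughputKiB(72.3, 4));
        System.out.println("local VM speedup:   " + speedup(72.3, 698.4));
    }
}
```

For the 4 KiB row this gives roughly 289.2 KiB/s, which matches the reported throughput, and a ratio of about 9.7, consistent with the "10 times slower" summary.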
Fetcher MapReduce job, % of CPU time:
15% - HTML parsing
15% - Hadoop framework initialization code
7% - HDFS initialization code
22% - reducer code (bad news here)
18% - reading Hadoop XML config files
23% - real job
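Drill down into the reducer (Fetcher Reducer .run()). The flame graph shows
two areas, creating a record writer and parsing the Avro schema, with these calls:
• org.apache.hadoop.hbase.catalog.MetaReader.fullScan()
• org.apache.hadoop.hbase.client.HConnectionManager.createConnection()
• org.apache.avro.Schema$Parser.parse(), ending with ZipFile.getEntry(), etc.
Drill down into the record writer: this is Gora library code.
The most visible function calls on top are FileInputStream* and FileOutputStream*.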
Most of the Java processes used
significantly less memory
than they were initially assigned.
Legend:
• init - the initial amount of memory that the
JVM requests from the OS during startup;
• used - the amount of memory currently used;
• committed - the amount of memory that is
guaranteed to be available for use by the
Java virtual machine;
• max - the maximum amount of
memory (in bytes) that can be used for
memory management.
A memory allocation may fail if it attempts
to increase the used memory such that used
> committed, even if used <= max would still
be true.
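The four values in the legend are exposed by the standard java.lang.management API, so they can be read from inside any JVM. A minimal sketch; the class name HeapSnapshot is hypothetical, while MemoryMXBean and MemoryUsage are part of the JDK.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper that reads the four heap figures through the standard JMX memory bean. */
final class HeapSnapshot {

    /** Returns the current heap usage of this JVM. */
    static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    public static void main(String[] args) {
        MemoryUsage heap = heap();
        // init and max are reported as -1 when the JVM does not define them
        System.out.println("init      = " + heap.getInit());
        System.out.println("used      = " + heap.getUsed());
        System.out.println("committed = " + heap.getCommitted());
        System.out.println("max       = " + heap.getMax());
    }
}
```

Comparing used against max over the lifetime of a job is one way to spot processes that were given far more heap than they need.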
1) Gora + HBase
Reasons: bad code in Gora (too many full table scans of metadata)
Actions:
• check Gora’s configuration and dive into the code to find out why it does full scans
• try Cassandra instead of HBase
2) Hadoop framework parts, in particular:
• HDFS initialization in MapReduce jobs (slow communication with the NameNode)
• Reading configuration files (done with the Xerces library)
Possible reasons:
• Bad I/O speed and bad network speed
• There may be some XML parsing parameters for config files that we’re not aware of
Actions:
• fix the hardware issues
• find out why Hadoop XML config parsing may be so slow
• check NameNode memory usage
Another method to get stack traces is Linux’s perf_events:
perf record -F 99 -g -p PID
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
Perf monitors:
• Hardware events (e.g. level 2 cache misses)
• Software events (e.g. CPU migrations)
• Tracepoint events (e.g. filesystem I/O,
TCP events)
Perf can also do:
• Sampling: collection of snapshots at some
frequency (by timer)
• Dynamic tracing: instrumenting code to
create events in any location (using the
kprobes or uprobes frameworks)
Advantages of perf over the Java agent:
• Low overhead when getting stack traces.
• Combines user calls (Java) and kernel calls in one flame graph.
• Catches all Java methods, even though the JVM may
exclude safepoint checks from hot methods.
Disadvantages of perf:
• Cannot get Java stack traces out of the box (it is necessary to fix frame pointer-based stack
walking in OpenJDK; this was done by Netflix and Twitter).
• Doesn’t see Java symbols (hex numbers instead; a special agent is needed to add
symbols).
• Permissions to the symbol files must be configured.
• It is necessary to develop a service which will launch perf, get the stack traces and
pass them to a server.
And... it turns out that Netflix’s product is open source.
Andrew Johnson
Software Engineer at Etsy
Previously: Explorys, Inc.
Brendan Gregg
Senior Performance Architect at Netflix
Previously: Joyent, Oracle, Sun Microsystems
Blogs:
• My blog article
• Etsy’s blog about JVM Profiler
• Brendan Gregg’s blog
Source code:
• My modification of StatsD JVM Profiler
• Original Etsy’s StatsD JVM Profiler
• Brendan Gregg’s FlameGraph
Manuals:
• InfluxDB Docs
• Overview of the JMX Technology
• JVM Tool Interface
Books / videos:
• Systems Performance: Enterprise and the Cloud, by Brendan Gregg
• Blazing Performance with Flame Graphs, by Brendan Gregg
• Linux profiling at Netflix, by Brendan Gregg
• Profiling Java in Production, by Kaushik Srenevasan, Twitter University
Ihor Bobak
Skype: ibobak


