Java Performance:
Speedup your applications with hardware counters
Sergey Kuksenko
sergey.kuksenko@oracle.com, @kuksenk0
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/pmu-hwc-java
Presented at QCon San Francisco
www.qconsf.com
The following is intended to outline our general product direction. It
is intended for information purposes only, and may not be
incorporated into any contract. It is not a commitment to deliver any
material, code, or functionality, and should not be relied upon in
making purchasing decisions. The development, release, and timing
of any features or functionality described for Oracle’s products
remains at the sole discretion of Oracle.
Copyright © 2016, Oracle and/or its affiliates. All rights reserved. 2/66
Intriguing Introductory Example
Gotcha, here is my hot method
or
On Algorithmic Optimizations (just-O)
3/66
Example 1: Very Hot Code
public int[][] multiply(int[][] A, int[][] B) {
    int size = A.length;
    int[][] R = new int[size][size];
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            int s = 0;
            for (int k = 0; k < size; k++) {
                s += A[i][k] * B[k][j];
            }
            R[i][j] = s;
        }
    }
    return R;
}
$O(N^3)$ (big-O)
4/66
Example 1: "Ok Google"
$O(N^{2.81})$
5/66
Example 1: Very Hot Code
N        multiply   strassen
128      0.0026     0.0023
512      0.48       0.12
2048     28         5.8
8192     3571       282
time, seconds/op
6/66
Example 1: Optimized speedup over baseline
7/66
Try the other way
8/66
Example 1: JMH, -prof perfnorm, N=256
                  per multiplication   per iteration
CPI               1.06
cycles            146 × 10^6           8.7
instructions      137 × 10^6           8.2
L1-loads          68 × 10^6            4
L1-load-misses    33 × 10^6            2
L1-stores         0.56 × 10^6
L1-store-misses   38 × 10^3
LLC-loads         26 × 10^6            1.6
LLC-stores        11.6 × 10^3
∼ 256 KB per matrix
9/66
Example 1: Very Hot Code
public int[][] multiply(int[][] A, int[][] B) {
    int size = A.length;
    int[][] R = new int[size][size];
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            int s = 0;
            for (int k = 0; k < size; k++) {
                s += A[i][k] * B[k][j];
            }
            R[i][j] = s;
        }
    }
    return R;
}
L1-load-misses
10/66
Example 1: Very Hot Code
public int[][] multiplyIKJ(int[][] A, int[][] B) {
    int size = A.length;
    int[][] R = new int[size][size]; // zero-initialized, used as accumulator
    for (int i = 0; i < size; i++) {
        for (int k = 0; k < size; k++) {
            int aik = A[i][k]; // invariant in the inner loop
            // inner loop walks row k of B and row i of R sequentially
            for (int j = 0; j < size; j++) {
                R[i][j] += aik * B[k][j];
            }
        }
    }
    return R;
}
11/66
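The two loop orders should produce the same product, since only the order of the additions changes. A minimal sketch that checks this on random input; the class name `LoopOrderCheck` and the helper `randomMatrix` are illustrative and not part of the talk:

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch: checks that the IJK and IKJ loop orders compute the same product. */
final class LoopOrderCheck {

    // Baseline order: the inner loop reads B[k][j] with k changing (column-wise).
    static int[][] multiplyIJK(int[][] a, int[][] b) {
        int size = a.length;
        int[][] r = new int[size][size];
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                int s = 0;
                for (int k = 0; k < size; k++) {
                    s += a[i][k] * b[k][j];
                }
                r[i][j] = s;
            }
        }
        return r;
    }

    // Reordered: the inner loop reads row k of B and updates row i of R.
    static int[][] multiplyIKJ(int[][] a, int[][] b) {
        int size = a.length;
        int[][] r = new int[size][size];
        for (int i = 0; i < size; i++) {
            for (int k = 0; k < size; k++) {
                int aik = a[i][k];
                for (int j = 0; j < size; j++) {
                    r[i][j] += aik * b[k][j];
                }
            }
        }
        return r;
    }

    // Hypothetical helper: fills a square matrix with small pseudo-random ints.
    static int[][] randomMatrix(int size, long seed) {
        Random rnd = new Random(seed);
        int[][] m = new int[size][size];
        for (int[] row : m) {
            for (int j = 0; j < size; j++) {
                row[j] = rnd.nextInt(100);
            }
        }
        return m;
    }

    public static void main(String[] args) {
        int[][] a = randomMatrix(64, 1L);
        int[][] b = randomMatrix(64, 2L);
        boolean same = Arrays.deepEquals(multiplyIJK(a, b), multiplyIKJ(a, b));
        System.out.println("IJK and IKJ agree: " + same);
    }
}
```

This only checks correctness. Timing the two variants is better left to a harness such as JMH, as the talk does.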
Example 1: Very Hot Code
         IJK                   IKJ
N        multiply   strassen   multiply   strassen
128      0.0026     0.0023     0.0005
512      0.48       0.12       0.03       0.02
2048     28         5.8        2          1.5
8192     3571       282        264        79
time, seconds/op
12/66
Example 1: Optimized speedup over baseline
13/66
Example 1: JMH, -prof perfnorm, N=256
                  IJK                        IKJ
                  multiply      iteration    multiply      iteration
CPI               1.06                       0.51
cycles            146 × 10^6    8.7          9.7 × 10^6    0.6
instructions      137 × 10^6    8.2          19 × 10^6     1.1
L1-loads          68 × 10^6     4            5.4 × 10^6    0.3
L1-load-misses    33 × 10^6     2            1.1 × 10^6    0.1
L1-stores         0.56 × 10^6                2.7 × 10^6    0.2
L1-store-misses   38 × 10^3                  9 × 10^3
LLC-loads         26 × 10^6     1.6          0.3 × 10^6
LLC-stores        11.6 × 10^3                3.5 × 10^3
?
14/66
Example 1: Free beer benefits!
cycles   insts
...
 0.15%    0.17%   <addr>: vmovdqu 0x10(%rax,%rdx,4),%ymm5
 8.83%    9.02%           vpmulld %ymm4,%ymm5,%ymm5
43.60%   42.03%           vpaddd  0x10(%r9,%rdx,4),%ymm5,%ymm5
 6.26%    6.24%           vmovdqu %ymm5,0x10(%r9,%rdx,4)   ;*iastore
19.82%   20.82%           add     $0x8,%edx                ;*iinc
 1.46%    2.75%           cmp     %ecx,%edx
                          jl      <addr>                   ;*if_icmpge
...
Vectorization (SSE/AVX)!
how to get asm: JMH, -prof perfasm
15/66
Chapter 1
Performance Optimization Methodology:
A Very Short Introduction.
Three magic questions and the direction of the journey
16/66
Three magic questions
What? ⇒ Where? ⇒ How?
• What prevents my application from running faster?
(monitoring)
• Where does it hide?
(profiling)
• How do I stop it from hurting performance?
(tuning/optimizing)
17/66
Top-Down Approach
• System Level
– Network, Disk, OS, CPU/Memory
• JVM Level
– GC/Heap, JIT, Classloading
• Application Level
– Algorithms, Synchronization, Threading, API
• Microarchitecture Level
– Caches, Data/Code alignment, CPU Pipeline Stalls
18/66
Methodology in Essence
http://j.mp/PerfMindMap
19/66
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
• Lots of %sys ⇒ . . .
⇓
• Lots of %irq, %soft ⇒ . . .
⇓
• Lots of %iowait ⇒ . . .
⇓
• Lots of %idle ⇒ . . .
⇓
• Lots of %user
20/66
Chapter 2
High CPU Load
and
the main question:
«who/what is to blame?»
21/66
CPU Utilization
• What does ∼100% CPU Utilization mean?
– OS has enough tasks to schedule
Can profiling help?
A traditional profiler shows «WHERE» application time is spent,
but it gives no answer to the question «WHY».
22/66
Who/What is to blame?
Complex CPU microarchitecture:
• Inefficient algorithm ⇒ 100% CPU
• Pipeline stall due to memory load/stores ⇒ 100% CPU
• Pipeline flush due to mispredicted branch ⇒ 100% CPU
• Expensive instructions ⇒ 100% CPU
• Insufficient ILP (Instruction Level Parallelism) ⇒ 100% CPU
• etc. ⇒ 100% CPU
23/66
Chapter 3
Hardware Counters
HWC, PMU - WTF?
24/66
PMU: Performance Monitoring Unit
Performance Monitoring Unit - profiles hardware activity,
built into CPU.
PMU Internals (in less than 21 seconds) :
Hardware counters (HWC) count hardware performance events
(performance monitoring events)
25/66
Events
Vendor’s documentation! (e.g. 2 pages from 32)
26/66
Events: Issues
• Hundreds of events
• Microarchitectural experience is required
• Platform dependent
– vary from CPU vendor to vendor
– may vary when CPU manufacturer introduces new microarchitecture
• How to work with HWC?
27/66
HWC
HWC modes:
• Counting mode
– if (event_happened) ++counter;
– general monitoring (answering «WHAT?» )
• Sampling mode
– if (event_happened)
if (++counter >= threshold) INTERRUPT;
– profiling (answering «WHERE?»)
28/66
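The two modes can be illustrated with a toy model. This is only a sketch, not a real PMU interface: the event stream is a `boolean[]`, array positions stand in for instruction addresses, and the interrupt is assumed to fire when the counter reaches the threshold, after which the counter restarts. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the two HWC modes; not a real PMU interface. */
final class HwcModes {

    /** Counting mode: every event increments the counter. */
    static long count(boolean[] events) {
        long counter = 0;
        for (boolean happened : events) {
            if (happened) {
                counter++;
            }
        }
        return counter;
    }

    /**
     * Sampling mode: an "interrupt" is recorded each time the counter reaches
     * the threshold, then the counter restarts. The returned list holds the
     * positions at which interrupts fired.
     */
    static List<Integer> sample(boolean[] events, int threshold) {
        List<Integer> interrupts = new ArrayList<>();
        int counter = 0;
        for (int pos = 0; pos < events.length; pos++) {
            if (events[pos] && ++counter >= threshold) {
                interrupts.add(pos);
                counter = 0;
            }
        }
        return interrupts;
    }

    public static void main(String[] args) {
        boolean[] events = new boolean[20];
        for (int i = 0; i < events.length; i += 2) {
            events[i] = true; // an event at every even position: 10 events
        }
        System.out.println("count   = " + count(events));     // prints 10
        System.out.println("samples = " + sample(events, 4)); // prints [6, 14]
    }
}
```

Counting gives one total (the «WHAT?»), while sampling gives locations (the «WHERE?») at the cost of seeing only every N-th event.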
HWC: Issues
So many events, so few counters
e.g. “Nehalem”:
– over 900 events
– 7 HWC (3 fixed + 4 programmable)
• «multiple-running» (different events)
– Repeatability
• «multiplexing» (only a few tools are able to do that)
– Steady state
29/66
HWC: Issues (cont.)
Sampling mode:
• «instruction skid»
(hard to correlate event and instruction)
• Uncore events
(hard to bind event and execution thread;
e.g. shared L3 cache)
30/66
HWC: typical usages
• hardware validation
• performance analysis
• run-time tuning (e.g. JRockit, etc.)
• security attacks and defenses
• test code coverage
• etc.
31/66
HWC: tools
• Oracle Solaris Studio Performance Analyzer
http://www.oracle.com/technetwork/server-storage/solarisstudio
• perf/perf_events
http://perf.wiki.kernel.org
• JMH (Java Microbenchmark Harness)
http://openjdk.java.net/projects/code-tools/jmh/
-prof perf
-prof perfnorm = perf, normalized per operation
-prof perfasm = perf + -XX:+PrintAssembly
32/66
HWC: tools (cont.)
• AMD CodeXL
• Intel® Vtune™ Amplifier
• etc. . .
33/66
perf events (e.g.)
• cycles
• instructions
• cache-references
• cache-misses
• branches
• branch-misses
• bus-cycles
• ref-cycles
• dTLB-loads
• dTLB-load-misses
• L1-dcache-loads
• L1-dcache-load-misses
• L1-dcache-stores
• L1-dcache-store-misses
• LLC-loads
• etc...
34/66
Oracle Studio events (e.g.)
• cycles
• insts
• branch-instruction-retired
• branch-misses-retired
• dtlbm
• l1h
• l1m
• l2h
• l2m
• l3h
• l3m
• etc...
35/66
Chapter 4
I’ve got HWC data. What’s next?
or
Introduction to microarchitecture
performance analysis
36/66
Execution time
tme =
cyces
ƒreqency
Optimization is. . .
reducing the (spent) cycle count!
everything else is overclocking
37/66
Microarchitecture Equation
cyces = PthLength ∗ CP = PthLength ∗ 1
PC
• PthLength - number of instructions
• CP - cycles per instruction
• PC - instructions per cycle
38/66
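A small sketch of these relations, fed with the rounded IJK numbers from Example 1 (146 × 10^6 cycles, 137 × 10^6 instructions). The 3 GHz clock frequency is an assumption made only for illustration; the talk does not state the machine's frequency.

```java
/** Sketch of the relations between time, cycles, path length, CPI and IPC. */
final class MicroarchEquation {

    /** Cycles per instruction. */
    static double cpi(double cycles, double instructions) {
        return cycles / instructions;
    }

    /** Instructions per cycle, the reciprocal of CPI. */
    static double ipc(double cycles, double instructions) {
        return instructions / cycles;
    }

    /** Execution time in seconds for a given cycle count and clock frequency. */
    static double seconds(double cycles, double frequencyHz) {
        return cycles / frequencyHz;
    }

    public static void main(String[] args) {
        double cycles = 146e6;        // IJK multiply, N=256 (rounded slide value)
        double instructions = 137e6;  // same measurement
        double frequencyHz = 3.0e9;   // assumption: hypothetical 3 GHz clock

        System.out.printf("CPI  = %.2f%n", cpi(cycles, instructions));
        System.out.printf("IPC  = %.2f%n", ipc(cycles, instructions));
        System.out.printf("time = %.3f s%n", seconds(cycles, frequencyHz));
    }
}
```

Because the inputs are rounded, the computed CPI lands near, but not exactly on, the 1.06 shown on the slide.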
PthLength ∗ CP
• PthLength ∼ algorithm efficiency (the smaller the better)
• CP ∼ CPU efficiency (the smaller the better)
– CP = 4 – bad!
– CP = 1
• Nehalem – just good enough!
• SandyBridge and later – not so good!
– CP = 0.4 – good!
– CP = 0.2 – ideal!
39/66
What to do?
• low CPI
– reduce PathLength → «tune algorithm»
• high CPI ⇒ CPU stalls
– memory stalls → «tune data structures»
– branch stalls → «tune control logic»
– instruction dependency → «break dependency chains»
– long latency ops → «use simpler operations»
– etc. . .
40/66
High CPI: Memory bound
• dTLB misses
• L1,L2,L3,...,LN misses
• NUMA: non-local memory access
• memory bandwidth
• false/true sharing
• cache line split (not in Java world, except. . . )
• store forwarding (unlikely, hard to fix on Java level)
• 4K aliasing
41/66
High CPI: Core bound
• long latency operations: DIV, SQRT
• FP assist: floating-point denormals, NaN, inf
• bad speculation (caused by mispredicted branch)
• port saturation (forget it)
42/66
High CPI: Front-End bound
• iTLB miss
• iCache miss
• branch mispredict
• LSD (loop stream detector)
solvable by HotSpot tweaking
43/66
Example 2
Some «large» standard server-side
Java benchmark
44/66
Example 2: perf stat -d
880851.993237 task-clock (msec)
39,318 context-switches
437 cpu-migrations
7,931 page-faults
2,277,063,113,376 cycles
1,226,299,634,877 instructions # 0.54 insns per cycle
229,500,265,931 branches # 260.544 M/sec
4,620,666,169 branch-misses # 2.01% of all branches
338,169,489,902 L1-dcache-loads # 383.912 M/sec
37,937,596,505 L1-dcache-load-misses # 11.22% of all L1-dcache hits
25,232,434,666 LLC-loads # 28.645 M/sec
7,307,884,874 L1-icache-load-misses # 0.00% of all L1-icache hits
337,730,278,697 dTLB-loads # 382.846 M/sec
6,094,356,801 dTLB-load-misses # 1.81% of all dTLB cache hits
12,210,841,909 iTLB-loads # 13.863 M/sec
431,803,270 iTLB-load-misses # 3.54% of all iTLB cache hits
301.557213044 seconds time elapsed
45/66
Example 2: Processed perf stat -d
CPI                               1.85
cycles                 ×10^9      2277
instructions           ×10^9      1226
L1-dcache-loads        ×10^9      338
L1-dcache-load-misses  ×10^9      38
LLC-loads              ×10^9      25
dTLB-loads             ×10^9      338
dTLB-load-misses       ×10^9      6
Potential Issues?
Cache latencies (L1/L2/L3) = 4/12/36 cycles
Let’s count:
$$4 \times (338 \times 10^9) + 12 \times ((38 - 25) \times 10^9) + 36 \times (25 \times 10^9) \approx 2408 \times 10^9 \text{ cycles}$$
???
Superscalar in Action!
Issue!
46/66
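The "Let's count" arithmetic can be reproduced with a few lines. This sketch charges every load its full latency and treats L1 misses that do not reach the LLC as L2 hits; the class and method names are illustrative.

```java
/** Sketch: naive memory-cost estimate that charges every load its full latency. */
final class NaiveLoadCost {

    // Cache latencies in cycles, as quoted on the slide (L1/L2/L3 = 4/12/36).
    static final long L1_LATENCY = 4;
    static final long L2_LATENCY = 12;
    static final long L3_LATENCY = 36;

    /** All counts are in units of 10^9 events; the result is in 10^9 cycles. */
    static long naiveCycles(long l1Loads, long l1LoadMisses, long llcLoads) {
        long l2Hits = l1LoadMisses - llcLoads; // assumed: L1 misses served by L2
        return L1_LATENCY * l1Loads + L2_LATENCY * l2Hits + L3_LATENCY * llcLoads;
    }

    public static void main(String[] args) {
        long estimate = naiveCycles(338, 38, 25); // 2408
        long measured = 2277;
        System.out.println("naive estimate: " + estimate + " x 10^9 cycles");
        System.out.println("measured:       " + measured + " x 10^9 cycles");
    }
}
```

The estimate for loads alone exceeds the measured total, which is only possible because a superscalar core overlaps memory accesses with each other and with other work.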
Example 2: dTLB misses
TLB = Translation Lookaside Buffer
• a memory cache that stores recent translations of virtual
memory addresses to physical addresses
• each memory access → access to TLB
• a TLB miss may take hundreds of cycles
How to check?
• dtlb_load_misses_miss_causes_a_walk
• dtlb_load_misses_walk_duration
47/66
Example 2: dTLB misses
CPI                               1.85
cycles                 ×10^9      2277
instructions           ×10^9      1226
L1-dcache-loads        ×10^9      338
L1-dcache-load-misses  ×10^9      38
LLC-loads              ×10^9      25
dTLB-loads             ×10^9      338
dTLB-load-misses       ×10^9      6
dTLB-walks-duration    ×10^9      296
dTLB-walks-duration is counted in cycles: 296 × 10^9 is 13% of all cycles (time)
48/66
Example 2: dTLB misses
Fixing it:
• Try to shrink the application working set
(modify your application)
• Enable -XX:+UseLargePages
(modify your execution scripts)
-XX:+UseLargePages gives a 20% boost on the benchmark!
49/66
Example 2: -XX:+UseLargePages
                                  baseline   large pages
CPI                               1.85       1.56
cycles                 ×10^9      2277       2277
instructions           ×10^9      1226       1460
L1-dcache-loads        ×10^9      338        401
L1-dcache-load-misses  ×10^9      38         38
LLC-loads              ×10^9      25         25
dTLB-loads             ×10^9      338        401
dTLB-load-misses       ×10^9      6          0.24
dTLB-walks-duration    ×10^9      296        2.6
50/66
Example 2: normalized per transaction
                                  baseline   large pages
CPI                               1.85       1.56
cycles                 ×10^6      23.4       19.5
instructions           ×10^6      12.5       12.5
L1-dcache-loads        ×10^6      3.45       3.45
L1-dcache-load-misses  ×10^6      0.39       0.33
LLC-loads              ×10^6      0.26       0.21
dTLB-loads             ×10^6      3.45       3.45
dTLB-load-misses       ×10^6      0.06       0.002
dTLB-walks-duration    ×10^6      3.04       0.022
51/66
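The 20% figure can be cross-checked from the two tables. The sketch below assumes, as the tables suggest, that both runs consumed the same number of cycles and that a transaction retires the same number of instructions in both, so the ratio of total instructions approximates the ratio of completed transactions. The class name is illustrative.

```java
/** Sketch: relates the raw and per-transaction tables to the reported speedup. */
final class LargePagesSpeedup {

    /** Plain ratio of two measurements. */
    static double ratio(double numerator, double denominator) {
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Cycles per transaction (x10^6): baseline vs. large pages.
        double byCyclesPerTx = ratio(23.4, 19.5);
        // Total instructions (x10^9): large pages vs. baseline, same cycle budget.
        double byInstructions = ratio(1460, 1226);

        System.out.printf("speedup from cycles per transaction: %.2f%n", byCyclesPerTx);
        System.out.printf("speedup from instruction totals:     %.2f%n", byInstructions);
    }
}
```

Both ratios come out at roughly 1.2, in line with the 20% boost quoted for the benchmark.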
Example 3
«A plague on both your houses»
«++ on both your threads»
52/66
Example 3: False Sharing
• False Sharing:
https://en.wikipedia.org/wiki/False_sharing
• @Contended (JEP 142)
http://openjdk.java.net/jeps/142
53/66
Example 3: False Sharing
@State(Scope.Group)
public static class StateBaseline {
    int field0;
    int field1;
}

@Benchmark
@Group("baseline")
public int justDoIt(StateBaseline s) {
    return s.field0++;
}

@Benchmark
@Group("baseline")
public int doItFromOtherThread(StateBaseline s) {
    return s.field1++;
}
54/66
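The slides show only the baseline state; the padded variant is not shown. As a rough illustration of the idea outside JMH, the sketch below lets two threads increment two counters that are either adjacent or far apart in one array. It departs from the talk's benchmark in several ways, all of them assumptions: it uses `AtomicLongArray` increments instead of plain `field++`, it assumes 16 longs (128 bytes) is enough distance to land on a different cache line, and it does no warmup, so the printed times are indicative at best.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Sketch: two threads bump counters that are either adjacent or far apart. */
final class FalseSharingSketch {

    // Assumption: 16 longs (128 bytes) apart means a different cache line.
    static final int PADDED_INDEX = 16;

    /** Runs both threads to completion; returns elapsed time in nanoseconds. */
    static long run(AtomicLongArray counters, int otherIndex, long iterations) {
        Thread t0 = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                counters.incrementAndGet(0);
            }
        });
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                counters.incrementAndGet(otherIndex);
            }
        });
        long start = System.nanoTime();
        t0.start();
        t1.start();
        try {
            t0.join();
            t1.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long iterations = 10_000_000L;
        long sharing = run(new AtomicLongArray(PADDED_INDEX + 1), 1, iterations);
        long padded = run(new AtomicLongArray(PADDED_INDEX + 1), PADDED_INDEX, iterations);
        System.out.println("adjacent counters: " + sharing / 1_000_000 + " ms");
        System.out.println("padded counters:   " + padded / 1_000_000 + " ms");
    }
}
```

For real measurements, the JMH group benchmark above, together with `@Contended` or explicit padding in the state class, remains the more trustworthy setup.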
Example 3: Measure
                             resource shared
                             between threads   sharing   padded
Same Core (HT)               L1                9.5       4.9
Diff Cores (within socket)   L3 (LLC)          10.6      2.8
Diff Sockets                 nothing           18.2      2.8
average time, ns/op
55/66
Example 3: What do HWC tell us?
                   Same Core          Diff Cores         Diff Sockets
                   sharing  padded    sharing  padded    sharing  padded
CPI                1.3      0.7       1.4      0.4       1.7      0.4
cycles             33130    17536     36012    9163      46484    9608
instructions       26418    25865     26550    25747     26717    25768
L1-loads           12593    9467      9696     8973      9672     9016
L1-load-misses     10       5         12       4         33       3
L1-stores          4317     7838      7433     4069      6935     4074
L1-store-misses    5        2         161      2         55       1
LLC-loads          4        3         58       1         32       1
LLC-load-misses    1        1         53       ≈ 0       35       ≈ 0
LLC-stores         1        1         183      ≈ 0       49       ≈ 0
LLC-store-misses   1        ≈ 0       182      ≈ 0       48       ≈ 0
All values are normalized per 10^3 operations
56/66
Example 3: «on a core and a prayer»
In the single-core (HT) case we have to look at
MACHINE_CLEARS.MEMORY_ORDERING
                   Same Core          Diff Cores         Diff Sockets
                   sharing  padded    sharing  padded    sharing  padded
CPI                1.3      0.7       1.4      0.4       1.7      0.4
cycles             33130    17536     36012    9163      46484    9608
instructions       26418    25865     26550    25747     26717    25768
CLEARS             238      ≈ 0       ≈ 0      ≈ 0       ≈ 0      ≈ 0
57/66
Example 3: Diff Cores
L2_STORE_LOCK_RQSTS - L2 RFOs breakdown
                            Diff Cores         Diff Sockets
                            sharing  padded    sharing  padded
CPI                         1.4      0.4       1.7      0.4
LLC-stores                  183      ≈ 0       49       ≈ 0
LLC-store-misses            182      ≈ 0       48       ≈ 0
L2_STORE_LOCK_RQSTS.MISS    134      ≈ 0       33       ≈ 0
L2_STORE_LOCK_RQSTS.HIT_E   ≈ 0      ≈ 0       ≈ 0      ≈ 0
L2_STORE_LOCK_RQSTS.HIT_M   ≈ 0      ≈ 0       ≈ 0      ≈ 0
L2_STORE_LOCK_RQSTS.ALL     183      ≈ 0       49       ≈ 0
58/66
Question!
To the audience
59/66
Example 3: Diff Cores
                            Diff Cores         Diff Sockets
                            sharing  padded    sharing  padded
CPI                         1.4      0.4       1.7      0.4
LLC-stores                  183      ≈ 0       49       ≈ 0
LLC-store-misses            182      ≈ 0       48       ≈ 0
L2_STORE_LOCK_RQSTS.MISS    134      ≈ 0       33       ≈ 0
L2_STORE_LOCK_RQSTS.ALL     183      ≈ 0       49       ≈ 0
Why 183 > 49 & 134 > 33,
but the same socket case is faster?
60/66
Example 3: Some events count duration
For example:
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO
                        Diff Cores         Diff Sockets
                        sharing  padded    sharing  padded
CPI                     1.4      0.4       1.7      0.4
cycles                  36012    9163      46484    9608
instructions            26550    25747     26717    25768
O_R_O.CYCLES_W_D_RFO    21723    10        29601    56
61/66
Summary
Summary: "Performance is easy"
To achieve high performance:
• You have to know your Application!
• You have to know your Frameworks!
• You have to know your Virtual Machine!
• You have to know your Operating System!
• You have to know your Hardware!
63/66
Enlarge your knowledge with these simple tricks!
Reading list:
• “Computer Architecture: A Quantitative Approach”
John L. Hennessy, David A. Patterson
• CPU vendors' documentation
• http://www.agner.org/optimize/
• http://www.google.com/search?q=Hardware+performance+counter
• etc. . .
64/66
Thanks!
65/66
Q & A ?
66/66
 
10 Tips for Tuning of Pid Looops
10 Tips for Tuning of Pid Looops10 Tips for Tuning of Pid Looops
10 Tips for Tuning of Pid Looops
 
Energy saving policies final
Energy saving policies finalEnergy saving policies final
Energy saving policies final
 
Mathworks CAE simulation suite – case in point from automotive and aerospace.
Mathworks CAE simulation suite – case in point from automotive and aerospace.Mathworks CAE simulation suite – case in point from automotive and aerospace.
Mathworks CAE simulation suite – case in point from automotive and aerospace.
 

More from C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
C4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
C4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
C4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
C4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
C4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
C4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
C4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
C4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
C4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
C4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
C4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
C4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
C4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
C4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
C4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
C4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
C4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
C4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
C4Media
 

More from C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Recently uploaded

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 

Recently uploaded (20)

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 

Speedup Your Java Apps with Hardware Counters

  • 1. Java Performance: Speedup your applications with hardware counters Sergey Kuksenko sergey.kuksenko@oracle.com, @kuksenk0
  • 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site worldwide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/pmu-hwc-java
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4. The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle. Copyright © 2016, Oracle and/or its affiliates. All rights reserved.
  • 5. Intriguing Introductory Example: «Gotcha, here is my hot method» or «On Algorithmic O-ptimizations» (slide annotation on the "O": just-O)
  • 7. Example 1: Very Hot Code
      public int[][] multiply(int[][] A, int[][] B) {
          int size = A.length;
          int[][] R = new int[size][size];
          for (int i = 0; i < size; i++) {
              for (int j = 0; j < size; j++) {
                  int s = 0;
                  for (int k = 0; k < size; k++) {
                      s += A[i][k] * B[k][j];
                  }
                  R[i][j] = s;
              }
          }
          return R;
      }
      Complexity (big-O): O(N^3)
  • 8. Example 1: "Ok Google" O(N^2.81)
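The O(N^2.81) figure is the cost of Strassen's algorithm: it splits each matrix into four blocks and gets by with 7 block multiplications instead of 8, and log2(7) ≈ 2.81. The talk does not show its Strassen code, so the following is only a textbook sketch. The class name `StrassenSketch`, the helper names and the `cutoff` parameter are all illustrative assumptions, and odd sizes simply fall back to the plain triple loop.

```java
final class StrassenSketch {
    /** Plain triple loop; used at or below the cutoff and for odd sizes. */
    static int[][] naive(int[][] a, int[][] b) {
        int n = a.length;
        int[][] r = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                int aik = a[i][k];
                for (int j = 0; j < n; j++) {
                    r[i][j] += aik * b[k][j];
                }
            }
        }
        return r;
    }

    /** Copies the h x h block of m whose top-left corner is (row, col). */
    static int[][] block(int[][] m, int row, int col, int h) {
        int[][] r = new int[h][h];
        for (int i = 0; i < h; i++) {
            System.arraycopy(m[row + i], col, r[i], 0, h);
        }
        return r;
    }

    /** Returns x + sign * y, element-wise (sign is +1 or -1). */
    static int[][] add(int[][] x, int[][] y, int sign) {
        int n = x.length;
        int[][] r = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                r[i][j] = x[i][j] + sign * y[i][j];
            }
        }
        return r;
    }

    /** Strassen recursion for square matrices; cutoff is a tuning assumption. */
    static int[][] multiply(int[][] a, int[][] b, int cutoff) {
        int n = a.length;
        if (n <= Math.max(1, cutoff) || n % 2 != 0) {
            return naive(a, b);
        }
        int h = n / 2;
        int[][] a11 = block(a, 0, 0, h), a12 = block(a, 0, h, h);
        int[][] a21 = block(a, h, 0, h), a22 = block(a, h, h, h);
        int[][] b11 = block(b, 0, 0, h), b12 = block(b, 0, h, h);
        int[][] b21 = block(b, h, 0, h), b22 = block(b, h, h, h);

        // The seven block products of Strassen's scheme.
        int[][] m1 = multiply(add(a11, a22, 1), add(b11, b22, 1), cutoff);
        int[][] m2 = multiply(add(a21, a22, 1), b11, cutoff);
        int[][] m3 = multiply(a11, add(b12, b22, -1), cutoff);
        int[][] m4 = multiply(a22, add(b21, b11, -1), cutoff);
        int[][] m5 = multiply(add(a11, a12, 1), b22, cutoff);
        int[][] m6 = multiply(add(a21, a11, -1), add(b11, b12, 1), cutoff);
        int[][] m7 = multiply(add(a12, a22, -1), add(b21, b22, 1), cutoff);

        // Recombine the products into the four quadrants of the result.
        int[][] r = new int[n][n];
        for (int i = 0; i < h; i++) {
            for (int j = 0; j < h; j++) {
                r[i][j] = m1[i][j] + m4[i][j] - m5[i][j] + m7[i][j];
                r[i][j + h] = m3[i][j] + m5[i][j];
                r[i + h][j] = m2[i][j] + m4[i][j];
                r[i + h][j + h] = m1[i][j] - m2[i][j] + m3[i][j] + m6[i][j];
            }
        }
        return r;
    }
}
```

A call such as `StrassenSketch.multiply(A, B, 64)` would recurse until blocks are 64 x 64 or smaller; the value 64 is a guess, not a measured optimum.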
  • 9. Example 1: Very Hot Code (time; seconds/op)
      N       multiply    strassen
      128     0.0026      0.0023
      512     0.48        0.12
      2048    28          5.8
      8192    3571        282
  • 10. Example 1: Optimized [chart: speedup over baseline]
  • 11. Try the other way
  • 12. Example 1: JMH, -prof perfnorm, N=256
                          per multiplication    per iteration
      CPI                 1.06
      cycles              146 × 10^6            8.7
      instructions        137 × 10^6            8.2
      L1-loads            68 × 10^6             4
      L1-load-misses      33 × 10^6             2
      L1-stores           0.56 × 10^6
      L1-store-misses     38 × 10^3
      LLC-loads           26 × 10^6             1.6
      LLC-stores          11.6 × 10^3
      ∼256 KB per matrix
  • 13. Example 1: Very Hot Code
      public int[][] multiply(int[][] A, int[][] B) {
          int size = A.length;
          int[][] R = new int[size][size];
          for (int i = 0; i < size; i++) {
              for (int j = 0; j < size; j++) {
                  int s = 0;
                  for (int k = 0; k < size; k++) {
                      s += A[i][k] * B[k][j];
                  }
                  R[i][j] = s;
              }
          }
          return R;
      }
      Slide annotation: L1-load-misses
  • 15. Example 1: Very Hot Code
      public int[][] multiplyIKJ(int[][] A, int[][] B) {
          int size = A.length;
          int[][] R = new int[size][size];
          for (int i = 0; i < size; i++) {
              for (int k = 0; k < size; k++) {
                  int aik = A[i][k];
                  for (int j = 0; j < size; j++) {
                      R[i][j] += aik * B[k][j];
                  }
              }
          }
          return R;
      }
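To see that the reordering changes only the memory access pattern and not the result, the two loop orders can be put side by side. This is a minimal sketch: the class name `LoopOrderDemo` and the helper `randomMatrix` are illustrative, and it checks equality only. It is not a benchmark; timing should go through a harness such as JMH.

```java
import java.util.Arrays;
import java.util.Random;

final class LoopOrderDemo {
    /** Baseline order: the inner loop walks B down a column (strided access). */
    static int[][] multiplyIJK(int[][] a, int[][] b) {
        int size = a.length;
        int[][] r = new int[size][size];
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                int s = 0;
                for (int k = 0; k < size; k++) {
                    s += a[i][k] * b[k][j];
                }
                r[i][j] = s;
            }
        }
        return r;
    }

    /** Reordered: the inner loop walks one row of B and one row of R sequentially. */
    static int[][] multiplyIKJ(int[][] a, int[][] b) {
        int size = a.length;
        int[][] r = new int[size][size];
        for (int i = 0; i < size; i++) {
            for (int k = 0; k < size; k++) {
                int aik = a[i][k];
                for (int j = 0; j < size; j++) {
                    r[i][j] += aik * b[k][j];
                }
            }
        }
        return r;
    }

    /** Illustrative helper: a size x size matrix of small pseudo-random ints. */
    static int[][] randomMatrix(int size, long seed) {
        Random random = new Random(seed);
        int[][] m = new int[size][size];
        for (int[] row : m) {
            for (int j = 0; j < size; j++) {
                row[j] = random.nextInt(100);
            }
        }
        return m;
    }

    public static void main(String[] args) {
        int[][] a = randomMatrix(64, 1L);
        int[][] b = randomMatrix(64, 2L);
        boolean same = Arrays.deepEquals(multiplyIJK(a, b), multiplyIKJ(a, b));
        System.out.println("IJK and IKJ agree: " + same);
    }
}
```

Both variants do the same N^3 multiply-adds; only the order in which `B` and `R` are touched differs, which is what the L1-load-miss counters pick up.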
  • 17. Example 1: Very Hot Code (time; seconds/op)
              IJK                     IKJ
      N       multiply    strassen    multiply    strassen
      128     0.0026      0.0023      0.0005
      512     0.48        0.12        0.03        0.02
      2048    28          5.8         2           1.5
      8192    3571        282         264         79
  • 18. Example 1: Optimized [chart: speedup over baseline]
  • 20. Example 1: JMH, -prof perfnorm, N=256
                          IJK                          IKJ
                          multiply       iteration     multiply       iteration
      CPI                 1.06                         0.51
      cycles              146 × 10^6     8.7           9.7 × 10^6     0.6
      instructions        137 × 10^6     8.2           19 × 10^6      1.1
      L1-loads            68 × 10^6      4             5.4 × 10^6     0.3
      L1-load-misses      33 × 10^6      2             1.1 × 10^6     0.1
      L1-stores           0.56 × 10^6                  2.7 × 10^6     0.2
      L1-store-misses     38 × 10^3                    9 × 10^3
      LLC-loads           26 × 10^6      1.6           0.3 × 10^6
      LLC-stores          11.6 × 10^3                  3.5 × 10^3
      Slide annotation: ?
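The "iteration" columns appear to be the per-multiplication counts divided by the number of inner-loop iterations, N^3 = 256^3. The arithmetic can be rechecked with a few lines; the class and method names below are illustrative, and the input values are the rounded numbers from the table, so the results match only to about one decimal place.

```java
final class PerfNormDemo {
    /** Inner-loop iterations of an N x N by N x N matrix multiplication. */
    static long iterations(int n) {
        return (long) n * n * n;
    }

    /** Average number of events per inner-loop iteration. */
    static double perIteration(double eventsPerMultiplication, int n) {
        return eventsPerMultiplication / iterations(n);
    }

    public static void main(String[] args) {
        int n = 256;
        // Rounded counter values taken from the table (IJK, then IKJ).
        System.out.printf("IJK cycles/iteration       = %.2f%n", perIteration(146e6, n));
        System.out.printf("IJK instructions/iteration = %.2f%n", perIteration(137e6, n));
        System.out.printf("IJK L1-load-misses/iter    = %.2f%n", perIteration(33e6, n));
        System.out.printf("IKJ cycles/iteration       = %.2f%n", perIteration(9.7e6, n));
        System.out.printf("IKJ instructions/iteration = %.2f%n", perIteration(19e6, n));
    }
}
```

Roughly two L1 load misses per IJK iteration against about 0.1 for IKJ is the difference the loop reordering buys.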
  • 22. Example 1: Free beer benefits!
      cycles    insts
      ...
       0.15%    0.17%   <addr>: vmovdqu 0x10(%rax,%rdx,4),%ymm5
       8.83%    9.02%           vpmulld %ymm4,%ymm5,%ymm5
      43.60%   42.03%           vpaddd 0x10(%r9,%rdx,4),%ymm5,%ymm5
       6.26%    6.24%           vmovdqu %ymm5,0x10(%r9,%rdx,4)   ;*iastore
      19.82%   20.82%           add $0x8,%edx                    ;*iinc
       1.46%    2.75%           cmp %ecx,%edx
                                jl <addr>                        ;*if_icmpge
      ...
      Vectorization (SSE/AVX)!
      How to get asm: JMH, -prof perfasm
  • 23. Chapter 1 Performance Optimization Methodology: A Very Short Introduction. Three magic questions and the direction of the journey
  • 27. Three magic questions: What? ⇒ Where? ⇒ How?
      • What prevents my application from running faster? (monitoring)
      • Where does it hide? (profiling)
      • How do I stop it from hurting performance? (tuning/optimizing)
  • 28. Top-Down Approach
      • System Level – Network, Disk, OS, CPU/Memory
      • JVM Level – GC/Heap, JIT, Classloading
      • Application Level – Algorithms, Synchronization, Threading, API
      • Microarchitecture Level – Caches, Data/Code alignment, CPU Pipeline Stalls
  • 36. Let’s go (Top-Down): do system monitoring (e.g. mpstat)
      • Lots of %sys ⇒ . . . ⇓
      • Lots of %irq, %soft ⇒ . . . ⇓
      • Lots of %iowait ⇒ . . . ⇓
      • Lots of %idle ⇒ . . . ⇓
      • Lots of %user
  • 37. Chapter 2 High CPU Load and the main question: «who/what is to blame?»
  • 41. CPU Utilization
      • What does ∼100% CPU Utilization mean?
          – OS has enough tasks to schedule
      Can profiling help?
      A traditional profiler shows «WHERE» application time is spent, but it gives no answer to the question «WHY».
  • 42. Who/What is to blame? Complex CPU microarchitecture:
      • Inefficient algorithm ⇒ 100% CPU
      • Pipeline stall due to memory load/stores ⇒ 100% CPU
      • Pipeline flush due to mispredicted branch ⇒ 100% CPU
      • Expensive instructions ⇒ 100% CPU
      • Insufficient ILP (Instruction Level Parallelism) ⇒ 100% CPU
      • etc. ⇒ 100% CPU
  • 45. PMU: Performance Monitoring Unit
      The Performance Monitoring Unit profiles hardware activity and is built into the CPU.
      PMU internals (in less than 21 seconds): hardware counters (HWC) count hardware performance events (performance monitoring events).
  • 46. Events: see the vendor’s documentation! (e.g. 2 pages out of 32)
  • 47. Events: Issues
      • Hundreds of events
      • Microarchitectural experience is required
      • Platform dependent
          – vary from CPU vendor to vendor
          – may vary when a CPU manufacturer introduces a new microarchitecture
      • How to work with HWC?
  • 48. HWC
      HWC modes:
      • Counting mode
          – if (event_happened) ++counter;
          – general monitoring (answering «WHAT?»)
      • Sampling mode
          – if (event_happened) if (++counter >= threshold) INTERRUPT;
          – profiling (answering «WHERE?»)
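The two modes can be mimicked in software to make the difference concrete. This is a toy model, not how a PMU is programmed: the event stream is a boolean array, "INTERRUPT" is modeled as recording the current step, and the counter is reset after each sample. The class name `HwcModesDemo` and both method names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class HwcModesDemo {
    /** Counting mode: increment on every event; answers "how many?". */
    static long count(boolean[] eventAtStep) {
        long counter = 0;
        for (boolean event : eventAtStep) {
            if (event) {
                counter++;
            }
        }
        return counter;
    }

    /**
     * Sampling mode: once `threshold` events have accumulated, record the
     * current position (the stand-in for an interrupt) and start over.
     * The recorded positions answer "where?".
     */
    static List<Integer> sample(boolean[] eventAtStep, int threshold) {
        List<Integer> samples = new ArrayList<>();
        int counter = 0;
        for (int step = 0; step < eventAtStep.length; step++) {
            if (eventAtStep[step] && ++counter >= threshold) {
                samples.add(step);
                counter = 0;
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        boolean[] events = {true, false, true, true, false, true, false, false, true, true};
        System.out.println("total events: " + count(events));
        System.out.println("sampled at steps: " + sample(events, 2));
    }
}
```

A larger threshold means fewer interrupts and lower overhead, at the price of a coarser picture of where the events occur.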
  • 49. HWC: Issues
      So many events, so few counters, e.g. “Nehalem”:
          – over 900 events
          – 7 HWC (3 fixed + 4 programmable)
      • «multiple-running» (different events)
          – Repeatability
      • «multiplexing» (only a few tools are able to do that)
          – Steady state
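Multiplexing means an event occupies a counter only part of the time, so a tool has to extrapolate the raw count to the full measurement window. The sketch below shows the usual linear scaling by enabled time over running time, which is the approach perf takes; the class, method and parameter names are illustrative. The extrapolation is only trustworthy if the workload behaves the same while the event is not being counted, which is why a steady state is required.

```java
final class MultiplexDemo {
    /**
     * Extrapolates a raw count to the whole window.
     * timeEnabledNs: how long the event was requested.
     * timeRunningNs: how long it actually occupied a hardware counter.
     */
    static double scaledCount(long rawCount, long timeEnabledNs, long timeRunningNs) {
        if (timeRunningNs <= 0) {
            throw new IllegalArgumentException("event never ran on a counter");
        }
        return rawCount * ((double) timeEnabledNs / timeRunningNs);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: the event was on a counter for a quarter of the window.
        System.out.println(scaledCount(250L, 1_000L, 250L));
    }
}
```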
  • 50. HWC: Issues (cont.)
      Sampling mode:
      • «instruction skid» (hard to correlate event and instruction)
      • Uncore events (hard to bind event and execution thread; e.g. shared L3 cache)
  • 52. HWC: typical usages
      • hardware validation
      • performance analysis
      • run-time tuning (e.g. JRockit, etc.)
      • security attacks and defenses
      • test code coverage
      • etc.
  • 53. HWC: tools
      • Oracle Solaris Studio Performance Analyzer: http://www.oracle.com/technetwork/server-storage/solarisstudio
      • perf/perf_events: http://perf.wiki.kernel.org
      • JMH (Java Microbenchmark Harness): http://openjdk.java.net/projects/code-tools/jmh/
          – -prof perf
          – -prof perfnorm = perf, normalized per operation
          – -prof perfasm = perf + -XX:+PrintAssembly
  • 54. HWC: tools (cont.)
      • AMD CodeXL
      • Intel® VTune™ Amplifier
      • etc. . .
  • 55. perf events (e.g.) • cycles • instructions • cache-references • cache-misses • branches • branch-misses • bus-cycles • ref-cycles • dTLB-loads • dTLB-load-misses • L1-dcache-loads • L1-dcache-load-misses • L1-dcache-stores • L1-dcache-store-misses • LLC-loads • etc...
  • 56. Oracle Studio events (e.g.) • cycles • insts • branch-instruction-retired • branch-misses-retired • dtlbm • l1h • l1m • l2h • l2m • l3h • l3m • etc...
  • 57. Chapter 4 I’ve got HWC data. What’s next? or Introduction to microarchitecture performance analysis 36/66
  • 58. Execution time: time = cycles / frequency. Optimization is. . . reducing the (spent) cycle count! Everything else is overclocking. 37/66
  • 59. Microarchitecture Equation: cycles = PathLength × CPI = PathLength × (1 / IPC) • PathLength - number of instructions • CPI - cycles per instruction • IPC - instructions per cycle 38/66
  • 60. PthLength ∗ CP • PthLength ∼ algorithm efficiency (the smaller the better) • CP ∼ CPU efficiency (the smaller the better) – CP = 4 – bad! – CP = 1 • Nehalem – just good enough! • SandyBridge and later – not so good! – CP = 0.4 – good! – CP = 0.2 – ideal! 39/66
  • 61. What to do? • low CPI – reduce PathLength → «tune algorithm» • high CPI ⇒ CPU stalls – memory stalls → «tune data structures» – branch stalls → «tune control logic» – instruction dependency → «break dependency chains» – long latency ops → «use simpler operations» – etc. . . 40/66
  • 62. High CPI: Memory bound • dTLB misses • L1,L2,L3,...,LN misses • NUMA: non-local memory access • memory bandwidth • false/true sharing • cache line split (not in Java world, except. . . ) • store forwarding (unlikely, hard to fix on Java level) • 4K aliasing 41/66
  • 63. High CPI: Core bound • long latency operations: DIV, SQRT • FP assist: floating-point denormals, NaN, inf • bad speculation (caused by mispredicted branches) • port saturation (forget it) 42/66
  • 64. High CPI: Front-End bound • iTLB miss • iCache miss • branch mispredict • LSD (loop stream detector) solvable by HotSpot tweaking 43/66
  • 65. Example 2 Some «large» standard server-side Java benchmark 44/66
  • 66. Example 2: perf stat -d
          880851.993237 task-clock (msec)
                 39,318 context-switches
                    437 cpu-migrations
                  7,931 page-faults
      2,277,063,113,376 cycles
      1,226,299,634,877 instructions            #    0.54 insns per cycle
        229,500,265,931 branches                #  260.544 M/sec
          4,620,666,169 branch-misses           #    2.01% of all branches
        338,169,489,902 L1-dcache-loads         #  383.912 M/sec
         37,937,596,505 L1-dcache-load-misses   #   11.22% of all L1-dcache hits
         25,232,434,666 LLC-loads               #   28.645 M/sec
          7,307,884,874 L1-icache-load-misses   #    0.00% of all L1-icache hits
        337,730,278,697 dTLB-loads              #  382.846 M/sec
          6,094,356,801 dTLB-load-misses        #    1.81% of all dTLB cache hits
         12,210,841,909 iTLB-loads              #   13.863 M/sec
            431,803,270 iTLB-load-misses        #    3.54% of all iTLB cache hits
          301.557213044 seconds time elapsed
    45/66
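The headline ratios can be derived from the raw perf stat numbers above. The sketch below is plain arithmetic on those counts; the class and method names are hypothetical. It yields a CPI of about 1.86 (IPC about 0.54, matching the perf annotation), an average clock of roughly 2.59 GHz, and about 2.9 CPUs kept busy over the run.

```java
/** Ratios derived from the perf stat output of Example 2 (illustrative sketch). */
final class PerfStatDerived {

    static double cpi(long cycles, long instructions) {
        return (double) cycles / instructions;
    }

    static double ipc(long cycles, long instructions) {
        return (double) instructions / cycles;
    }

    /** Average clock in GHz: cycles divided by CPU time (task-clock is in milliseconds). */
    static double averageGHz(long cycles, double taskClockMsec) {
        return cycles / (taskClockMsec * 1e6);
    }

    /** CPU time divided by wall-clock time. */
    static double cpusUtilized(double taskClockMsec, double elapsedSeconds) {
        return (taskClockMsec / 1000.0) / elapsedSeconds;
    }

    public static void main(String[] args) {
        long cycles = 2_277_063_113_376L;
        long instructions = 1_226_299_634_877L;
        double taskClockMsec = 880_851.993237;
        double elapsedSeconds = 301.557213044;

        System.out.printf("CPI  = %.2f%n", cpi(cycles, instructions));
        System.out.printf("IPC  = %.2f%n", ipc(cycles, instructions));
        System.out.printf("GHz  = %.2f%n", averageGHz(cycles, taskClockMsec));
        System.out.printf("CPUs = %.2f%n", cpusUtilized(taskClockMsec, elapsedSeconds));
    }
}
```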
  • 67. Example 2: Processed perf stat -d
      CPI                            1.85
      cycles ×10^9                   2277
      instructions ×10^9             1226
      L1-dcache-loads ×10^9           338
      L1-dcache-load-misses ×10^9      38
      LLC-loads ×10^9                  25
      dTLB-loads ×10^9                338
      dTLB-load-misses ×10^9            6
    46/66
  • 68. Example 2: Processed perf stat -d (same counters as on the previous slide) Potential Issues? 46/66
  • 69. Example 2: Processed perf stat -d (same counters) Potential Issues? Cache latencies (L1/L2/L3) = 4/12/36 cycles. Let’s count, charging every load its full latency (L1-dcache-loads at 4, L1-dcache-load-misses minus LLC-loads at 12, LLC-loads at 36): 4 × (338 × 10^9) + 12 × ((38 − 25) × 10^9) + 36 × (25 × 10^9) ≈ 2408 × 10^9 cycles ??? 46/66
  • 70. Example 2: Processed perf stat -d (same counters) Let’s count: 4 × (338 × 10^9) + 12 × ((38 − 25) × 10^9) + 36 × (25 × 10^9) ≈ 2408 × 10^9 cycles ??? That is more than the 2277 × 10^9 cycles actually spent: Superscalar in Action! 46/66
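The back-of-the-envelope sum can be written out as code. This is a sketch of the slide's naive model only: every load is charged its full latency, loads never overlap, and L2 hits are taken to be L1 misses minus LLC loads. The class name is hypothetical; the latencies (4/12/36 cycles) and counts (in units of 10^9) are the ones quoted above.

```java
/** Naive serial cost of all loads in Example 2 (illustrative sketch). */
final class NaiveLoadCost {
    static final int L1_LATENCY = 4;
    static final int L2_LATENCY = 12;
    static final int L3_LATENCY = 36;

    /** All arguments and the result are in units of 10^9. */
    static long naiveCycles(long l1Loads, long l1Misses, long llcLoads) {
        long l2Hits = l1Misses - llcLoads; // assumption from the slide: L1 misses that did not reach the LLC
        return L1_LATENCY * l1Loads + L2_LATENCY * l2Hits + L3_LATENCY * llcLoads;
    }

    public static void main(String[] args) {
        long estimate = naiveCycles(338, 38, 25);
        long measured = 2277;
        System.out.println("naive estimate x10^9: " + estimate);
        System.out.println("measured       x10^9: " + measured);
        // The estimate exceeding the measured total is only possible if loads overlap
        // with each other and with other work.
        System.out.println("loads must overlap: " + (estimate > measured));
    }
}
```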
  • 71. Example 2: Processed perf stat -d (same counters) Issue: dTLB-load-misses (6 × 10^9)! 46/66
  • 73. Example 2: dTLB misses TLB = Translation Lookaside Buffer • a memory cache that stores recent translations of virtual memory addresses to physical addresses • each memory access → access to TLB • a TLB miss may take hundreds of cycles How to check? • dtlb_load_misses_miss_causes_a_walk • dtlb_load_misses_walk_duration 47/66
  • 74. Example 2: dTLB misses
      CPI                            1.85
      cycles ×10^9                   2277
      instructions ×10^9             1226
      L1-dcache-loads ×10^9           338
      L1-dcache-load-misses ×10^9      38
      LLC-loads ×10^9                  25
      dTLB-loads ×10^9                338
      dTLB-load-misses ×10^9            6
      dTLB-walks-duration ×10^9       296   (measured in cycles: 13% of all cycles, i.e. of time)
    48/66
  • 75. Example 2: dTLB misses Fixing it: • Try to shrink the application working set (modify your application) • Enable -XX:+UseLargePages (modify your execution scripts) 49/66
  • 76. Example 2: dTLB misses Fixing it: • Enable -XX:+UseLargePages (modify your execution scripts) 20% boost on the benchmark! 49/66
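After editing the launch scripts it is worth confirming that the flag actually reached the JVM. A minimal sketch, assuming a HotSpot-based JVM where `com.sun.management.HotSpotDiagnosticMXBean` is available; the class name `LargePagesCheck` is hypothetical. It reports the flag value as the running JVM sees it. Whether the operating system had large pages configured and available is a separate question that this check does not answer.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Reads the UseLargePages flag from the running HotSpot JVM (illustrative sketch). */
final class LargePagesCheck {

    /** Returns "true" or "false", the current value of -XX:[+|-]UseLargePages. */
    static String useLargePages() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption("UseLargePages").getValue();
    }

    public static void main(String[] args) {
        System.out.println("UseLargePages = " + useLargePages());
    }
}
```

Run it with the same options as the application, for example `java -XX:+UseLargePages LargePagesCheck`.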
  • 77. Example 2: -XX:+UseLargePages
                                     baseline   large pages
      CPI                                1.85          1.56
      cycles ×10^9                       2277          2277
      instructions ×10^9                 1226          1460
      L1-dcache-loads ×10^9               338           401
      L1-dcache-load-misses ×10^9          38            38
      LLC-loads ×10^9                      25            25
      dTLB-loads ×10^9                    338           401
      dTLB-load-misses ×10^9                6          0.24
      dTLB-walks-duration ×10^9           296           2.6
    50/66
  • 78. Example 2: normalized per transaction
                                     baseline   large pages
      CPI                                1.85          1.56
      cycles ×10^6                       23.4          19.5
      instructions ×10^6                 12.5          12.5
      L1-dcache-loads ×10^6              3.45          3.45
      L1-dcache-load-misses ×10^6        0.39          0.33
      LLC-loads ×10^6                    0.26          0.21
      dTLB-loads ×10^6                   3.45          3.45
      dTLB-load-misses ×10^6             0.06         0.002
      dTLB-walks-duration ×10^6          3.04         0.022
    51/66
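The per-transaction numbers make the effect easy to quantify. The sketch below is plain arithmetic on the table values (cycles and page-walk duration, both ×10^6 per transaction); the class and method names are hypothetical. Cycles per transaction drop from 23.4 to 19.5, a ratio of 1.2 that matches the 20% boost, while the share of cycles spent in page walks falls from about 13% to about 0.1%.

```java
/** Effect of large pages on Example 2, per transaction (illustrative sketch). */
final class LargePagesEffect {

    /** How many times faster the tuned run is, given cycles per transaction. */
    static double speedup(double baselineCycles, double tunedCycles) {
        return baselineCycles / tunedCycles;
    }

    /** Fraction of all cycles attributed to a duration-type event. */
    static double share(double eventCycles, double totalCycles) {
        return eventCycles / totalCycles;
    }

    public static void main(String[] args) {
        System.out.printf("speedup:             %.2fx%n", speedup(23.4, 19.5));
        System.out.printf("walk share baseline: %.1f%%%n", 100 * share(3.04, 23.4));
        System.out.printf("walk share tuned:    %.2f%%%n", 100 * share(0.022, 19.5));
    }
}
```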
  • 79. Example 3 «A plague on both your houses» «++ on both your threads» 52/66
  • 80. Example 3: False Sharing • False Sharing: https://en.wikipedia.org/wiki/False_sharing • @Contended (JEP 142) http://openjdk.java.net/jeps/142 53/66
  • 81. Example 3: False Sharing
      @State(Scope.Group)
      public static class StateBaseline {
          int field0;
          int field1;
      }

      @Benchmark
      @Group("baseline")
      public int justDoIt(StateBaseline s) {
          return s.field0++;
      }

      @Benchmark
      @Group("baseline")
      public int doItFromOtherThread(StateBaseline s) {
          return s.field1++;
      }
    54/66
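The same effect can be poked at without JMH. The sketch below is not the talk's benchmark and is not a rigorous measurement (no warm-up, no thread pinning). It swaps in a different technique, array slots instead of object fields or @Contended, because the distance between array elements is predictable: slots 0 and 1 are 4 bytes apart and most likely share a cache line, while slots 16 and 48 are 128 bytes apart and cannot share one, assuming 64-byte lines. The class name, slot choices, and iteration count are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Two threads incrementing neighbouring vs. distant array slots (illustrative sketch). */
final class FalseSharingSketch {

    /** Starts two threads, each incrementing its own slot, and returns elapsed nanoseconds. */
    static long run(AtomicIntegerArray counters, int slotA, int slotB, int iterations) {
        Thread a = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                counters.incrementAndGet(slotA);
            }
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                counters.incrementAndGet(slotB);
            }
        });
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int iterations = 20_000_000; // hypothetical; large enough to dominate thread start-up
        AtomicIntegerArray counters = new AtomicIntegerArray(64);
        long adjacent = run(counters, 0, 1, iterations);  // likely the same cache line
        long apart = run(counters, 16, 48, iterations);   // different 64-byte cache lines
        System.out.printf("adjacent slots: %d ms%n", adjacent / 1_000_000);
        System.out.printf("distant slots:  %d ms%n", apart / 1_000_000);
    }
}
```

For numbers that can be trusted, the JMH benchmark above with -prof perfnorm remains the better tool; this sketch only makes the mechanism tangible.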
  • 82. Example 3: Measure
                                    resource shared      average time, ns/op
      thread placement              between threads      sharing      padded
      Same Core (HT)                L1                       9.5         4.9
      Diff Cores (within socket)    L3 (LLC)                10.6         2.8
      Diff Sockets                  nothing                 18.2         2.8
    55/66
  • 83. Example 3: What do HWC tell us?
                                 Same Core        Diff Cores       Diff Sockets
                          sharing   padded  sharing   padded  sharing   padded
      CPI                     1.3      0.7      1.4      0.4      1.7      0.4
      cycles                33130    17536    36012     9163    46484     9608
      instructions          26418    25865    26550    25747    26717    25768
      L1-loads              12593     9467     9696     8973     9672     9016
      L1-load-misses           10        5       12        4       33        3
      L1-stores              4317     7838     7433     4069     6935     4074
      L1-store-misses           5        2      161        2       55        1
      LLC-loads                 4        3       58        1       32        1
      LLC-load-misses           1        1       53       ≈0       35       ≈0
      LLC-stores                1        1      183       ≈0       49       ≈0
      LLC-store-misses          1       ≈0      182       ≈0       48       ≈0
      All values are normalized per 10^3 operations
    56/66
  • 84. Example 3: «on a core and a prayer»
      In the case of a single core (HT) we have to look into MACHINE_CLEARS.MEMORY_ORDERING
                                 Same Core        Diff Cores       Diff Sockets
                          sharing   padded  sharing   padded  sharing   padded
      CPI                     1.3      0.7      1.4      0.4      1.7      0.4
      cycles                33130    17536    36012     9163    46484     9608
      instructions          26418    25865    26550    25747    26717    25768
      CLEARS                  238       ≈0       ≈0       ≈0       ≈0       ≈0
    57/66
  • 85. Example 3: Diff Cores
      L2_STORE_LOCK_RQSTS - L2 RFOs breakdown
                                      Diff Cores       Diff Sockets
                                  sharing   padded  sharing   padded
      CPI                             1.4      0.4      1.7      0.4
      LLC-stores                      183       ≈0       49       ≈0
      LLC-store-misses                182       ≈0       48       ≈0
      L2_STORE_LOCK_RQSTS.MISS        134       ≈0       33       ≈0
      L2_STORE_LOCK_RQSTS.HIT_E        ≈0       ≈0       ≈0       ≈0
      L2_STORE_LOCK_RQSTS.HIT_M        ≈0       ≈0       ≈0       ≈0
      L2_STORE_LOCK_RQSTS.ALL         183       ≈0       49       ≈0
    58/66
  • 87. Example 3: Diff Cores
                                      Diff Cores       Diff Sockets
                                  sharing   padded  sharing   padded
      CPI                             1.4      0.4      1.7      0.4
      LLC-stores                      183       ≈0       49       ≈0
      LLC-store-misses                182       ≈0       48       ≈0
      L2_STORE_LOCK_RQSTS.MISS        134       ≈0       33       ≈0
      L2_STORE_LOCK_RQSTS.ALL         183       ≈0       49       ≈0
      Why 183 > 49 & 134 > 33, but the same socket case is faster?
    60/66
  • 88. Example 3: Some events count duration
      For example: OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO
                                      Diff Cores       Diff Sockets
                                  sharing   padded  sharing   padded
      CPI                             1.4      0.4      1.7      0.4
      cycles                        36012     9163    46484     9608
      instructions                  26550    25747    26717    25768
      O_R_O.CYCLES_W_D_RFO          21723       10    29601       56
    61/66
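Because this event counts cycles rather than occurrences, dividing it by the total cycle count gives the fraction of time a demand RFO was outstanding. The sketch below is plain arithmetic on the table values (per 10^3 operations); the class and method names are hypothetical. In the sharing runs the fraction comes out at roughly 60% on different cores and 64% on different sockets, against well under 1% with padding.

```java
/** Share of cycles with a demand RFO outstanding, Example 3 (illustrative sketch). */
final class RfoOutstandingShare {

    /** Fraction of all cycles during which the duration-type event was active. */
    static double share(long eventCycles, long totalCycles) {
        return (double) eventCycles / totalCycles;
    }

    public static void main(String[] args) {
        System.out.printf("diff cores,   sharing: %.1f%%%n", 100 * share(21723, 36012));
        System.out.printf("diff cores,   padded:  %.1f%%%n", 100 * share(10, 9163));
        System.out.printf("diff sockets, sharing: %.1f%%%n", 100 * share(29601, 46484));
        System.out.printf("diff sockets, padded:  %.1f%%%n", 100 * share(56, 9608));
    }
}
```

Fewer RFOs that each stay outstanding longer explain why the cross-socket case is slower despite its lower event counts.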
  • 90. Summary: "Performance is easy" To achieve high performance: • You have to know your Application! • You have to know your Frameworks! • You have to know your Virtual Machine! • You have to know your Operating System! • You have to know your Hardware! 63/66
  • 91. Enlarge your knowledge with these simple tricks! Reading list: • “Computer Architecture: A Quantitative Approach” John L. Hennessy, David A. Patterson • CPU vendors’ documentation • http://www.agner.org/optimize/ • http://www.google.com/search?q=Hardware+performance+counter • etc. . . 64/66
  • 93. Q & A ? 66/66