Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
@kuksenk0#JavaPerfAndHWC
Java Performance:
Speedup your applications with hardware counters
Sergey Kuksenko
@kuksenk0
@kuksenk0#JavaPerfAndHWC
The following is intended to outline our general product
direction. It is intended for informatio...
@kuksenk0#JavaPerfAndHWC
Intriguing Introductory Example
Gotcha, here is my hot method
or
On Algorithmic O⋆
ptimizations
⋆...
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix re...
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix re...
@kuksenk0#JavaPerfAndHWC
Example 1: "Ok Google"
𝑂( 𝑁2.81)
5/64
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
N multiply strassen
32 33 πœ‡π‘ 
64 252 πœ‡π‘ 
128 3019 πœ‡π‘  2119 πœ‡π‘ 
256 56 π‘šπ‘  15 ...
@kuksenk0#JavaPerfAndHWC
Try the other way
7/64
@kuksenk0#JavaPerfAndHWC
Example 1: JMH, -prof perfnorm, N=256⋆
per multiplication per iteration
CPI 1.06
cycles 146 Γ— 106...
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix re...
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix re...
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiplyIKJ(Matrix other) {
int size = data.length;
Matrix...
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
N multIJK strassenIJK multIKJ
32 33 πœ‡π‘  13 πœ‡π‘ 
64 252 πœ‡π‘  80 πœ‡π‘ 
128 3019 πœ‡π‘ ...
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
N multIJK strassenIJK multIKJ strassenIKJ
32 33 πœ‡π‘  13 πœ‡π‘ 
64 252 πœ‡π‘  80 πœ‡π‘ ...
@kuksenk0#JavaPerfAndHWC
Example 1: JMH, -prof perfnorm, N=256
IJK IKJ
multiply iteration multiply iteration
CPI 1.06 0.51...
@kuksenk0#JavaPerfAndHWC
Example 1: Free beer benefits!
cycles insts
...
6.42% 5.54% 0x00007ff6591d20c5: vmovdqu %ymm1 ,0x...
@kuksenk0#JavaPerfAndHWC
Example 1: Free beer benefits!
cycles insts
...
6.42% 5.54% 0x00007ff6591d20c5: vmovdqu %ymm1 ,0x...
@kuksenk0#JavaPerfAndHWC
Chapter 1
Performance Optimization Methodology:
A Very Short Introduction.
Three magic questions ...
@kuksenk0#JavaPerfAndHWC
Three magic questions
15/64
@kuksenk0#JavaPerfAndHWC
Three magic questions
What?
βˆ™ What prevents my application to work faster?
(monitoring)
15/64
@kuksenk0#JavaPerfAndHWC
Three magic questions
What? β‡’ Where?
βˆ™ What prevents my application to work faster?
(monitoring)
...
@kuksenk0#JavaPerfAndHWC
Three magic questions
What? β‡’ Where? β‡’ How?
βˆ™ What prevents my application to work faster?
(monit...
@kuksenk0#JavaPerfAndHWC
Top-Down Approach
βˆ™ System Level
– Network, Disk, OS, CPU/Memory
βˆ™ JVM Level
– GC/Heap, JIT, Clas...
@kuksenk0#JavaPerfAndHWC
Methodology in Essence
17/64
@kuksenk0#JavaPerfAndHWC
Methodology in Essence
http://j.mp/PerfMindMap
17/64
@kuksenk0#JavaPerfAndHWC
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
18/64
@kuksenk0#JavaPerfAndHWC
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
βˆ™ Lots of %sys β‡’ . . .
⇓
18/64
@kuksenk0#JavaPerfAndHWC
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
βˆ™ Lots of %sys β‡’ . . .
⇓
βˆ™ Lots of %irq, %...
@kuksenk0#JavaPerfAndHWC
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
βˆ™ Lots of %sys β‡’ . . .
⇓
βˆ™ Lots of %irq, %...
@kuksenk0#JavaPerfAndHWC
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
βˆ™ Lots of %sys β‡’ . . .
⇓
βˆ™ Lots of %irq, %...
@kuksenk0#JavaPerfAndHWC
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
βˆ™ Lots of %sys β‡’ . . .
⇓
βˆ™ Lots of %irq, %...
@kuksenk0#JavaPerfAndHWC
Chapter 2
High CPU Load
and
the main question:
Β«who/what is to blame?Β»
19/64
@kuksenk0#JavaPerfAndHWC
CPU Utilization
βˆ™ What does ∼100% CPU Utilization mean?
20/64
@kuksenk0#JavaPerfAndHWC
CPU Utilization
βˆ™ What does ∼100% CPU Utilization mean?
– OS has enough tasks to schedule
20/64
@kuksenk0#JavaPerfAndHWC
CPU Utilization
βˆ™ What does ∼100% CPU Utilization mean?
– OS has enough tasks to schedule
Can pro...
@kuksenk0#JavaPerfAndHWC
CPU Utilization
βˆ™ What does ∼100% CPU Utilization mean?
– OS has enough tasks to schedule
Can pro...
@kuksenk0#JavaPerfAndHWC
Who/What is to blame?
Complex CPU microarchitecture:
βˆ™ Inefficient algorithm β‡’ 100% CPU
βˆ™ Pipelin...
@kuksenk0#JavaPerfAndHWC
Chapter 3
Hardware Counters
HWC, PMU - WTF?
22/64
@kuksenk0#JavaPerfAndHWC
PMU: Performance Monitoring Unit
Performance Monitoring Unit - profiles hardware activity,
built ...
@kuksenk0#JavaPerfAndHWC
PMU: Performance Monitoring Unit
Performance Monitoring Unit - profiles hardware activity,
built ...
@kuksenk0#JavaPerfAndHWC
Events
Vendor’s documentation! (e.g. 2 pages from 32)
24/64
@kuksenk0#JavaPerfAndHWC
Events: Issues
βˆ™ Hundreds of events
βˆ™ Microarchitectural experience is required
βˆ™ Platform depend...
@kuksenk0#JavaPerfAndHWC
HWC
HWC modes:
βˆ™ Counting mode
– if (event_happened) ++counter;
– general monitoring (answering Β«...
@kuksenk0#JavaPerfAndHWC
HWC: Issues
So many events, so few counters
e.g. β€œNehalem”:
– over 900 events
– 7 HWC (3 fixed + ...
@kuksenk0#JavaPerfAndHWC
HWC: Issues (cont.)
Sampling mode:
βˆ™ Β«instruction skidΒ»
(hard to correlate event and instruction)...
@kuksenk0#JavaPerfAndHWC
HWC: typical usages
βˆ™ hardware validation
βˆ™ performance analysis
29/64
@kuksenk0#JavaPerfAndHWC
HWC: typical usages
βˆ™ hardware validation
βˆ™ performance analysis
βˆ™ run-time tuning (e.g. JRockit,...
@kuksenk0#JavaPerfAndHWC
HWC: tools
βˆ™ Oracle Solaris Studio Performance Analyzer
(http://www.oracle.com/technetwork/server...
@kuksenk0#JavaPerfAndHWC
HWC: tools (cont.)
βˆ™ AMD CodeXL
βˆ™ Intel Vtune Amplifier XE
βˆ™ etc. . .
31/64
@kuksenk0#JavaPerfAndHWC
perf events (e.g.)
βˆ™ cycles
βˆ™ instrustions
βˆ™ cache-references
βˆ™ cache-misses
βˆ™ branches
βˆ™ branch-...
@kuksenk0#JavaPerfAndHWC
Oracle Studio events (e.g.)
βˆ™ cycles
βˆ™ insts
βˆ™ branch-instruction-retired
βˆ™ branch-misses-retired...
@kuksenk0#JavaPerfAndHWC
Chapter 4
I’ve got HWC data. What’s next?
or
Introduction to microarchitecture
performance analys...
@kuksenk0#JavaPerfAndHWC
Execution time
π‘‘π‘–π‘šπ‘’ = 𝑐𝑦𝑐𝑙𝑒𝑠
𝑓 π‘Ÿπ‘’π‘žπ‘’π‘’π‘›π‘π‘¦
Optimization is. . .
reducing the (spent) cycle count!
⋆
...
@kuksenk0#JavaPerfAndHWC
Microarchitecture Equation
𝑐𝑦𝑐𝑙𝑒𝑠 = 𝑃 π‘Žπ‘‘β„ŽπΏπ‘’π‘›π‘”π‘‘β„Ž * 𝐢𝑃 𝐼 = 𝑃 π‘Žπ‘‘β„ŽπΏπ‘’π‘›π‘”π‘‘β„Ž * 1
𝐼𝑃 𝐢
βˆ™ PathLength - numb...
@kuksenk0#JavaPerfAndHWC
𝑃 π‘Žπ‘‘β„ŽπΏπ‘’π‘›π‘”π‘‘β„Ž * 𝐢𝑃 𝐼
βˆ™ PathLength ∼ algorithm efficiency (the smaller the better)
βˆ™ CPI ∼ CPU effic...
@kuksenk0#JavaPerfAndHWC
What to do?
βˆ™ low CPI
– reduce PathLength β†’ Β«tune algorithmΒ»
βˆ™ high CPI β‡’ stalls
– memory stalls ...
@kuksenk0#JavaPerfAndHWC
High CPI: Memory bound
βˆ™ dTLB misses
βˆ™ L1,L2,L3,...,LN misses
βˆ™ NUMA: non-local memory access
βˆ™ m...
@kuksenk0#JavaPerfAndHWC
High CPI: Core bound
βˆ™ long latency operations: DIV, SQRT
βˆ™ FP assist: floating points denormal, ...
@kuksenk0#JavaPerfAndHWC
High CPI: Front-End bound
βˆ™ iTLB miss
βˆ™ iCache miss
βˆ™ branch mispredict
βˆ™ LSD (loop stream decode...
@kuksenk0#JavaPerfAndHWC @kuksenk0#JavaPerfAndHWC
Exam
p
les
@kuksenk0#JavaPerfAndHWC
Example 2
Some Β«largeΒ» standard server-side
Java benchmark
43/64
@kuksenk0#JavaPerfAndHWC
Example 2: perf stat -d
880851.993237 task -clock (msec)
39,318 context -switches
437 cpu -migrat...
@kuksenk0#JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54
cycles Γ—109
2277
instructions Γ—109
1226
L1-dcache-load...
@kuksenk0#JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54
cycles Γ—109
2277
instructions Γ—109
1226
L1-dcache-load...
@kuksenk0#JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54
cycles Γ—109
2277
instructions Γ—109
1226
L1-dcache-load...
@kuksenk0#JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54
cycles Γ—109
2277
instructions Γ—109
1226
L1-dcache-load...
@kuksenk0#JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54
cycles Γ—109
2277
instructions Γ—109
1226
L1-dcache-load...
@kuksenk0#JavaPerfAndHWC
Example 2: dTLB misses
TLB = Translation Lookaside Buffer
βˆ™ a memory cache that stores recent tra...
@kuksenk0#JavaPerfAndHWC
Example 2: dTLB misses
TLB = Translation Lookaside Buffer
βˆ™ a memory cache that stores recent tra...
@kuksenk0#JavaPerfAndHWC
Example 2: dTLB misses
IPC 0.54
cycles Γ—109
2277
instructions Γ—109
1226
L1-dcache-loads Γ—109
338
...
@kuksenk0#JavaPerfAndHWC
Example 2: dTLB misses
Fixing it:
βˆ™ Enable -XX:+UseLargePages
βˆ™ try to shrink application working...
@kuksenk0#JavaPerfAndHWC
Example 2: -XX:+UseLargePages
baseline large pages
IPC 0.54 0.64
cycles Γ—109
2277 2277
instructio...
@kuksenk0#JavaPerfAndHWC
Example 2: normalized per transaction
baseline large pages
IPC 0.54 0.64
cycles Γ—106
23.4 19.5
in...
@kuksenk0#JavaPerfAndHWC
Example 3
Β«A plague on both your housesΒ»
Β«++ on both your threadsΒ»
51/64
@kuksenk0#JavaPerfAndHWC
Example 3: False Sharing
@State(Scope.Group)
public static class StateBaseline {
int field0;
int ...
@kuksenk0#JavaPerfAndHWC
Example 3: Measure
resource
⋆
sharing padded
Same Core (HT) L1 9.5 4.9
Diff Cores (within socket)...
@kuksenk0#JavaPerfAndHWC
Example 3: What do HWC tell us?
Same Core Diff Cores Diff Sockets
sharing padded sharing padded s...
@kuksenk0#JavaPerfAndHWC
Example 3: Β«on a core and a prayerΒ»
in case of the single core(HT) we have to look into
MACHINE_C...
@kuksenk0#JavaPerfAndHWC
Example 3: Diff Cores
L2_STORE_LOCK_RQSTS - L2 RFOs breakdown
Diff Cores Diff Sockets
sharing pad...
@kuksenk0#JavaPerfAndHWC
Question!
To the audience
57/64
@kuksenk0#JavaPerfAndHWC
Example 3: Diff Cores
Diff Cores Diff Sockets
sharing padded sharing padded
CPI 1.4 0.4 1.7 0.4
L...
@kuksenk0#JavaPerfAndHWC
Example 3: Some events count duration
For example:
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAN...
@kuksenk0#JavaPerfAndHWC @kuksenk0#JavaPerfAndHWC
S
um
m
ary
@kuksenk0#JavaPerfAndHWC
Summary: Performance is easy
To achieve high performance:
βˆ™ You have to know your Application!
βˆ™ ...
@kuksenk0#JavaPerfAndHWC
Summary: Β«Here be dragonsΒ»
To achieve high performance:
βˆ™ You have to know your Application!
βˆ™ Yo...
@kuksenk0#JavaPerfAndHWC
Enlarge your knowledge with these simple tricks!
Reading list:
βˆ™ β€œComputer Architecture: A Quanti...
@kuksenk0#JavaPerfAndHWC @kuksenk0#JavaPerfAndHWC
Th
an
k
you
!
@kuksenk0#JavaPerfAndHWC @kuksenk0#JavaPerfAndHWC
Q
&
A
Upcoming SlideShare
Loading in …5
×

Java Performance: Speedup your application with hardware counters

2,503 views

Published on

A modern CPU looks under the cover to understand what performance. What about hardware for Java application performance? Let's understand Performance Monitoring Unit works, Hardware Counters & using them to speeding up Java apps.

Published in: Software

Java Performance: Speedup your application with hardware counters

  1. 1. @kuksenk0#JavaPerfAndHWC Java Performance: Speedup your applications with hardware counters Sergey Kuksenko @kuksenk0
  2. 2. @kuksenk0#JavaPerfAndHWC The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle. 2/64
  3. 3. @kuksenk0#JavaPerfAndHWC Intriguing Introductory Example Gotcha, here is my hot method or On Algorithmic O⋆ ptimizations ⋆ just-O 3/64
  4. 4. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code public Matrix multiply(Matrix other) { int size = data.length; Matrix result = new Matrix(size); int [][] R = result.data; int [][] A = this.data; int [][] B = other.data; for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { int s = 0; for (int k = 0; k < size; k++) { s += A[i][k] * B[k][j]; } R[i][j] = s; } } return result; } 4/64
  5. 5. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code public Matrix multiply(Matrix other) { int size = data.length; Matrix result = new Matrix(size); int [][] R = result.data; int [][] A = this.data; int [][] B = other.data; for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { int s = 0; for (int k = 0; k < size; k++) { s += A[i][k] * B[k][j]; } R[i][j] = s; } } return result; } 𝑂( 𝑁3)⋆ ⋆ big-O 4/64
  6. 6. @kuksenk0#JavaPerfAndHWC Example 1: "Ok Google" 𝑂( 𝑁2.81) 5/64
  7. 7. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code N multiply strassen 32 33 πœ‡π‘  64 252 πœ‡π‘  128 3019 πœ‡π‘  2119 πœ‡π‘  256 56 π‘šπ‘  15 π‘šπ‘  512 488 π‘šπ‘  111 π‘šπ‘  1024 3270 π‘šπ‘  785 π‘šπ‘  2048 32 𝑠 6 𝑠 4096 309 𝑠 42 𝑠 8192 2611 𝑠 293 𝑠 time/op 6/64
  8. 8. @kuksenk0#JavaPerfAndHWC Try the other way 7/64
  9. 9. @kuksenk0#JavaPerfAndHWC Example 1: JMH, -prof perfnorm, N=256⋆ per multiplication per iteration CPI 1.06 cycles 146 Γ— 106 8.7 instructions 137 Γ— 106 8.2 L1-loads 68 Γ— 106 4 L1-load-misses 33 Γ— 106 2 L1-stores 0.56 Γ— 106 L1-store-misses 38357 LLC-loads 26 Γ— 106 1.6 LLC-stores 11595 ⋆ ∼ 256Kb per matrix 8/64
  10. 10. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code public Matrix multiply(Matrix other) { int size = data.length; Matrix result = new Matrix(size); int [][] R = result.data; int [][] A = this.data; int [][] B = other.data; for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { int s = 0; for (int k = 0; k < size; k++) { s += A[i][k] * B[k][j] ; } R[i][j] = s; } } return result; } L1-load-misses 9/64
  11. 11. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code public Matrix multiply(Matrix other) { int size = data.length; Matrix result = new Matrix(size); int [][] R = result.data; int [][] A = this.data; int [][] B = other.data; for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { int s = 0; for (int k = 0; k < size; k++) { s += A[i][k] * B[k][j] ; } R[i][j] = s; } } return result; } 9/64
  12. 12. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code public Matrix multiplyIKJ(Matrix other) { int size = data.length; Matrix result = new Matrix(size); int [][] R = result.data; int [][] A = this.data; int [][] B = other.data; for (int i = 0; i < size; i++) { for (int k = 0; k < size; k++) { int aik = A[i][k]; for (int j = 0; j < size; j++) { R[i][j] += aik * B[k][j]; } } } return result; } 10/64
  13. 13. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code N multIJK strassenIJK multIKJ 32 33 πœ‡π‘  13 πœ‡π‘  64 252 πœ‡π‘  80 πœ‡π‘  128 3019 πœ‡π‘  2119 πœ‡π‘  464 πœ‡π‘  256 56 π‘šπ‘  15 π‘šπ‘  4 π‘šπ‘  512 488 π‘šπ‘  111 π‘šπ‘  29 π‘šπ‘  1024 3270 π‘šπ‘  785 π‘šπ‘  369 π‘šπ‘  2048 32 𝑠 6 𝑠 3.4 𝑠 4096 309 𝑠 42 𝑠 25 𝑠 8192 2611 𝑠 293 𝑠 210 𝑠 time/op 11/64
  14. 14. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code N multIJK strassenIJK multIKJ strassenIKJ 32 33 πœ‡π‘  13 πœ‡π‘  64 252 πœ‡π‘  80 πœ‡π‘  128 3019 πœ‡π‘  2119 πœ‡π‘  464 πœ‡π‘  256 56 π‘šπ‘  15 π‘šπ‘  4 π‘šπ‘  512 488 π‘šπ‘  111 π‘šπ‘  29 π‘šπ‘  25 π‘šπ‘  1024 3270 π‘šπ‘  785 π‘šπ‘  369 π‘šπ‘  192 π‘šπ‘  2048 32 𝑠 6 𝑠 3.4 𝑠 1.4 𝑠 4096 309 𝑠 42 𝑠 25 𝑠 10 𝑠 8192 2611 𝑠 293 𝑠 210 𝑠 73 𝑠 time/op 11/64
  15. 15. @kuksenk0#JavaPerfAndHWC Example 1: JMH, -prof perfnorm, N=256 IJK IKJ multiply iteration multiply iteration CPI 1.06 0.51 cycles 146 Γ— 106 8.7 9.7 Γ— 106 0.6 instructions 137 Γ— 106 8.2 19 Γ— 106 1.1 L1-loads 68 Γ— 106 4 5.4 Γ— 106 0.3 L1-load-misses 33 Γ— 106 2 1.1 Γ— 106 0.1 L1-stores 0.56 Γ— 106 2.7 Γ— 106 0.2 L1-store-misses 38357 8959 LLC-loads 26 Γ— 106 1.6 0.3 Γ— 106 LLC-stores 11595 3532 12/64
  16. 16. @kuksenk0#JavaPerfAndHWC Example 1: Free beer benefits! cycles insts ... 6.42% 5.54% 0x00007ff6591d20c5: vmovdqu %ymm1 ,0x10(%r14 ,%rbx ,4) ;* iastore 9.49% 11.91% 0x00007ff6591d20cc: add $0x8 ,%ebx ;*iinc 0.15% 0.05% 0x00007ff6591d20cf: cmp %r11d ,%ebx 0x00007ff6591d20d2: jl 0x00007ff6591d20b2 ;* if_icmpge ... 13/64
  17. 17. @kuksenk0#JavaPerfAndHWC Example 1: Free beer benefits! cycles insts ... 6.42% 5.54% 0x00007ff6591d20c5: vmovdqu %ymm1 ,0x10(%r14 ,%rbx ,4) ;* iastore 9.49% 11.91% 0x00007ff6591d20cc: add $0x8 ,%ebx ;*iinc 0.15% 0.05% 0x00007ff6591d20cf: cmp %r11d ,%ebx 0x00007ff6591d20d2: jl 0x00007ff6591d20b2 ;* if_icmpge ... Vectorization (SSE/AVX)! 13/64
  18. 18. @kuksenk0#JavaPerfAndHWC Chapter 1 Performance Optimization Methodology: A Very Short Introduction. Three magic questions and the direction of the journey 14/64
  19. 19. @kuksenk0#JavaPerfAndHWC Three magic questions 15/64
  20. 20. @kuksenk0#JavaPerfAndHWC Three magic questions What? βˆ™ What prevents my application to work faster? (monitoring) 15/64
  21. 21. @kuksenk0#JavaPerfAndHWC Three magic questions What? β‡’ Where? βˆ™ What prevents my application to work faster? (monitoring) βˆ™ Where does it hide? (profiling) 15/64
  22. 22. @kuksenk0#JavaPerfAndHWC Three magic questions What? β‡’ Where? β‡’ How? βˆ™ What prevents my application to work faster? (monitoring) βˆ™ Where does it hide? (profiling) βˆ™ How to stop it messing with performance? (tuning/optimizing) 15/64
  23. 23. @kuksenk0#JavaPerfAndHWC Top-Down Approach βˆ™ System Level – Network, Disk, OS, CPU/Memory βˆ™ JVM Level – GC/Heap, JIT, Classloading βˆ™ Application Level – Algorithms, Synchronization, Threading, API βˆ™ Microarchitecture Level – Caches, Data/Code alignment, CPU Pipeline Stalls 16/64
  24. 24. @kuksenk0#JavaPerfAndHWC Methodology in Essence 17/64
  25. 25. @kuksenk0#JavaPerfAndHWC Methodology in Essence http://j.mp/PerfMindMap 17/64
  26. 26. @kuksenk0#JavaPerfAndHWC Let’s go (Top-Down) Do system monitoring (e.g. mpstat) 18/64
  27. 27. @kuksenk0#JavaPerfAndHWC Let’s go (Top-Down) Do system monitoring (e.g. mpstat) βˆ™ Lots of %sys β‡’ . . . ⇓ 18/64
  28. 28. @kuksenk0#JavaPerfAndHWC Let’s go (Top-Down) Do system monitoring (e.g. mpstat) βˆ™ Lots of %sys β‡’ . . . ⇓ βˆ™ Lots of %irq, %soft β‡’ . . . ⇓ 18/64
  29. 29. @kuksenk0#JavaPerfAndHWC Let’s go (Top-Down) Do system monitoring (e.g. mpstat) βˆ™ Lots of %sys β‡’ . . . ⇓ βˆ™ Lots of %irq, %soft β‡’ . . . ⇓ βˆ™ Lots of %iowait β‡’ . . . ⇓ 18/64
  30. 30. @kuksenk0#JavaPerfAndHWC Let’s go (Top-Down) Do system monitoring (e.g. mpstat) βˆ™ Lots of %sys β‡’ . . . ⇓ βˆ™ Lots of %irq, %soft β‡’ . . . ⇓ βˆ™ Lots of %iowait β‡’ . . . ⇓ βˆ™ Lots of %idle β‡’ . . . ⇓ 18/64
  31. 31. @kuksenk0#JavaPerfAndHWC Let’s go (Top-Down) Do system monitoring (e.g. mpstat) βˆ™ Lots of %sys β‡’ . . . ⇓ βˆ™ Lots of %irq, %soft β‡’ . . . ⇓ βˆ™ Lots of %iowait β‡’ . . . ⇓ βˆ™ Lots of %idle β‡’ . . . ⇓ βˆ™ Lots of %user 18/64
  32. 32. @kuksenk0#JavaPerfAndHWC Chapter 2 High CPU Load and the main question: Β«who/what is to blame?Β» 19/64
  33. 33. @kuksenk0#JavaPerfAndHWC CPU Utilization βˆ™ What does ∼100% CPU Utilization mean? 20/64
  34. 34. @kuksenk0#JavaPerfAndHWC CPU Utilization βˆ™ What does ∼100% CPU Utilization mean? – OS has enough tasks to schedule 20/64
  35. 35. @kuksenk0#JavaPerfAndHWC CPU Utilization βˆ™ What does ∼100% CPU Utilization mean? – OS has enough tasks to schedule Can profiling help? 20/64
  36. 36. @kuksenk0#JavaPerfAndHWC CPU Utilization βˆ™ What does ∼100% CPU Utilization mean? – OS has enough tasks to schedule Can profiling help? Profiler ⋆ shows Β«WHEREΒ» application time is spent, but there’s no answer to the question Β«WHYΒ». ⋆ traditional profiler 20/64
  37. 37. @kuksenk0#JavaPerfAndHWC Who/What is to blame? Complex CPU microarchitecture: βˆ™ Inefficient algorithm β‡’ 100% CPU βˆ™ Pipeline stall due to memory load/stores β‡’ 100% CPU βˆ™ Pipeline flush due to mispredicted branch β‡’ 100% CPU βˆ™ Expensive instructions β‡’ 100% CPU βˆ™ Insufficient ILP ⋆ β‡’ 100% CPU βˆ™ etc. β‡’ 100% CPU ⋆ Instruction Level Parallelism 21/64
  38. 38. @kuksenk0#JavaPerfAndHWC Chapter 3 Hardware Counters HWC, PMU - WTF? 22/64
  39. 39. @kuksenk0#JavaPerfAndHWC PMU: Performance Monitoring Unit Performance Monitoring Unit - profiles hardware activity, built into CPU. 23/64
  40. 40. @kuksenk0#JavaPerfAndHWC PMU: Performance Monitoring Unit Performance Monitoring Unit - profiles hardware activity, built into CPU. PMU Internals (in less than 21 seconds) : Hardware counters (HWC) count hardware performance events (performance monitoring events) 23/64
  41. 41. @kuksenk0#JavaPerfAndHWC Events Vendor’s documentation! (e.g. 2 pages from 32) 24/64
  42. 42. @kuksenk0#JavaPerfAndHWC Events: Issues βˆ™ Hundreds of events βˆ™ Microarchitectural experience is required βˆ™ Platform dependent – vary from CPU vendor to vendor – may vary when CPU manufacturer introduces new microarchitecture βˆ™ How to work with HWC? 25/64
  43. 43. @kuksenk0#JavaPerfAndHWC HWC HWC modes: βˆ™ Counting mode – if (event_happened) ++counter; – general monitoring (answering Β«WHAT?Β» ) βˆ™ Sampling mode – if (event_happened) if (++counter < threshold) INTERRUPT; – profiling (answering Β«WHERE?Β») 26/64
  44. 44. @kuksenk0#JavaPerfAndHWC HWC: Issues So many events, so few counters e.g. β€œNehalem”: – over 900 events – 7 HWC (3 fixed + 4 programmable) βˆ™ Β«multiple-runningΒ» (different events) – Repeatability βˆ™ Β«multiplexingΒ» (only a few tools are able to do that) – Steady state 27/64
  45. 45. @kuksenk0#JavaPerfAndHWC HWC: Issues (cont.) Sampling mode: βˆ™ Β«instruction skidΒ» (hard to correlate event and instruction) βˆ™ Uncore events (hard to bind event and execution thread; e.g. shared L3 cache) 28/64
  46. 46. @kuksenk0#JavaPerfAndHWC HWC: typical usages βˆ™ hardware validation βˆ™ performance analysis 29/64
  47. 47. @kuksenk0#JavaPerfAndHWC HWC: typical usages βˆ™ hardware validation βˆ™ performance analysis βˆ™ run-time tuning (e.g. JRockit, etc.) βˆ™ security attacks and defenses βˆ™ test code coverage βˆ™ etc. 29/64
  48. 48. @kuksenk0#JavaPerfAndHWC HWC: tools βˆ™ Oracle Solaris Studio Performance Analyzer (http://www.oracle.com/technetwork/server-storage/solarisstudio) βˆ™ perf/perf_events (http://perf.wiki.kernel.org) βˆ™ JMH (http://openjdk.java.net/projects/code-tools/jmh/) -prof perf -prof perfnorm = perf, normalized per operation -prof perfasm = perf + -XX:+PrintAssembly 30/64
  49. 49. @kuksenk0#JavaPerfAndHWC HWC: tools (cont.) βˆ™ AMD CodeXL βˆ™ Intel Vtune Amplifier XE βˆ™ etc. . . 31/64
  50. 50. @kuksenk0#JavaPerfAndHWC perf events (e.g.) βˆ™ cycles βˆ™ instrustions βˆ™ cache-references βˆ™ cache-misses βˆ™ branches βˆ™ branch-misses βˆ™ bus-cycles βˆ™ ref-cycles βˆ™ dTLB-loads βˆ™ dTLB-load-misses βˆ™ L1-dcache-loads βˆ™ L1-dcache-load-misses βˆ™ L1-dcache-stores βˆ™ L1-dcache-store-misses βˆ™ LLC-loads βˆ™ etc... 32/64
  51. 51. @kuksenk0#JavaPerfAndHWC Oracle Studio events (e.g.) βˆ™ cycles βˆ™ insts βˆ™ branch-instruction-retired βˆ™ branch-misses-retired βˆ™ dtlbm βˆ™ l1h βˆ™ l1m βˆ™ l2h βˆ™ l2m βˆ™ l3h βˆ™ l3m βˆ™ etc... 33/64
  52. 52. @kuksenk0#JavaPerfAndHWC Chapter 4 I’ve got HWC data. What’s next? or Introduction to microarchitecture performance analysis 34/64
  53. 53. @kuksenk0#JavaPerfAndHWC Execution time π‘‘π‘–π‘šπ‘’ = 𝑐𝑦𝑐𝑙𝑒𝑠 𝑓 π‘Ÿπ‘’π‘žπ‘’π‘’π‘›π‘π‘¦ Optimization is. . . reducing the (spent) cycle count! ⋆ ⋆ everything else is overclocking 35/64
  54. 54. @kuksenk0#JavaPerfAndHWC Microarchitecture Equation 𝑐𝑦𝑐𝑙𝑒𝑠 = 𝑃 π‘Žπ‘‘β„ŽπΏπ‘’π‘›π‘”π‘‘β„Ž * 𝐢𝑃 𝐼 = 𝑃 π‘Žπ‘‘β„ŽπΏπ‘’π‘›π‘”π‘‘β„Ž * 1 𝐼𝑃 𝐢 βˆ™ PathLength - number of instructions βˆ™ CPI - cycles per instruction βˆ™ IPC - instructions per cycle 36/64
  55. 55. @kuksenk0#JavaPerfAndHWC 𝑃 π‘Žπ‘‘β„ŽπΏπ‘’π‘›π‘”π‘‘β„Ž * 𝐢𝑃 𝐼 βˆ™ PathLength ∼ algorithm efficiency (the smaller the better) βˆ™ CPI ∼ CPU efficiency (the smaller the better) – 𝐢𝑃 𝐼 = 4 – bad! – 𝐢𝑃 𝐼 = 1 βˆ™ Nehalem – just good enough! βˆ™ SandyBridge and later – not so good! – 𝐢𝑃 𝐼 = 0.4 – good! – 𝐢𝑃 𝐼 = 0.2 – ideal! 37/64
  56. 56. @kuksenk0#JavaPerfAndHWC What to do? βˆ™ low CPI – reduce PathLength β†’ Β«tune algorithmΒ» βˆ™ high CPI β‡’ stalls – memory stalls β†’ Β«tune data structuresΒ» – branch stalls β†’ Β«tune control logicΒ» – instruction dependency β†’ Β«break dependency chainsΒ» – long latency ops β†’ Β«use more simple operationsΒ» – etc. . . 38/64
  57. 57. @kuksenk0#JavaPerfAndHWC High CPI: Memory bound βˆ™ dTLB misses βˆ™ L1,L2,L3,...,LN misses βˆ™ NUMA: non-local memory access βˆ™ memory bandwidth βˆ™ false/true sharing βˆ™ cache line split (not in Java world, except. . . ) βˆ™ store forwarding (unlikely, hard to fix on Java level) βˆ™ 4k aliasing 39/64
  58. 58. @kuksenk0#JavaPerfAndHWC High CPI: Core bound βˆ™ long latency operations: DIV, SQRT βˆ™ FP assist: floating points denormal, NaN, inf βˆ™ bad speculation (caused by mispredicted branch) βˆ™ port saturation (forget it) 40/64
  59. 59. @kuksenk0#JavaPerfAndHWC High CPI: Front-End bound βˆ™ iTLB miss βˆ™ iCache miss βˆ™ branch mispredict βˆ™ LSD (loop stream decoder) solvable by HotSpot tweaking 41/64
  60. 60. @kuksenk0#JavaPerfAndHWC @kuksenk0#JavaPerfAndHWC Exam p les
  61. 61. @kuksenk0#JavaPerfAndHWC Example 2 Some Β«largeΒ» standard server-side Java benchmark 43/64
  62. 62. @kuksenk0#JavaPerfAndHWC Example 2: perf stat -d 880851.993237 task -clock (msec) 39,318 context -switches 437 cpu -migrations 7,931 page -faults 2 ,277 ,063 ,113 ,376 cycles 1 ,226 ,299 ,634 ,877 instructions # 0.54 insns per cycle 229 ,500 ,265 ,931 branches # 260.544 M/sec 4 ,620 ,666 ,169 branch -misses # 2.01% of all branches 338 ,169 ,489 ,902 L1-dcache -loads # 383.912 M/sec 37 ,937 ,596 ,505 L1-dcache -load -misses # 11.22% of all L1-dcache hits 25 ,232 ,434 ,666 LLC -loads # 28.645 M/sec 7 ,307 ,884 ,874 L1-icache -load -misses # 0.00% of all L1 -icache hits 337 ,730 ,278 ,697 dTLB -loads # 382.846 M/sec 6 ,094 ,356 ,801 dTLB -load -misses # 1.81% of all dTLB cache hits 12 ,210 ,841 ,909 iTLB -loads # 13.863 M/sec 431 ,803 ,270 iTLB -load -misses # 3.54% of all iTLB cache hits 301.557213044 seconds time elapsed 44/64
  63. 63. @kuksenk0#JavaPerfAndHWC Example 2: Processed perf stat -d IPC 0.54 cycles Γ—109 2277 instructions Γ—109 1226 L1-dcache-loads Γ—109 338 L1-dcache-load-misses Γ—109 38 LLC-loads Γ—109 25 dTLB-loads Γ—109 338 dTLB-load-misses Γ—109 6 45/64
  64. 64. @kuksenk0#JavaPerfAndHWC Example 2: Processed perf stat -d IPC 0.54 cycles Γ—109 2277 instructions Γ—109 1226 L1-dcache-loads Γ—109 338 L1-dcache-load-misses Γ—109 38 LLC-loads Γ—109 25 dTLB-loads Γ—109 338 dTLB-load-misses Γ—109 6 Potential Issues? 45/64
  65. 65. @kuksenk0#JavaPerfAndHWC Example 2: Processed perf stat -d IPC 0.54 cycles Γ—109 2277 instructions Γ—109 1226 L1-dcache-loads Γ—109 338 L1-dcache-load-misses Γ—109 38 LLC-loads Γ—109 25 dTLB-loads Γ—109 338 dTLB-load-misses Γ—109 6 Let’s count: 4 Γ— (338 Γ— 109 )+ 12 Γ— ((38 βˆ’ 25) Γ— 109 )+ 36 Γ— (25 Γ— 109 ) β‰ˆ 2408 Γ— 109 cycles ??? Potential Issues? 45/64
  66. 66. @kuksenk0#JavaPerfAndHWC Example 2: Processed perf stat -d IPC 0.54 cycles Γ—109 2277 instructions Γ—109 1226 L1-dcache-loads Γ—109 338 L1-dcache-load-misses Γ—109 38 LLC-loads Γ—109 25 dTLB-loads Γ—109 338 dTLB-load-misses Γ—109 6 Let’s count: 4 Γ— (338 Γ— 109 )+ 12 Γ— ((38 βˆ’ 25) Γ— 109 )+ 36 Γ— (25 Γ— 109 ) β‰ˆ 2408 Γ— 109 cycles ??? Superscalar in Action! 45/64
  67. 67. @kuksenk0#JavaPerfAndHWC Example 2: Processed perf stat -d IPC 0.54 cycles Γ—109 2277 instructions Γ—109 1226 L1-dcache-loads Γ—109 338 L1-dcache-load-misses Γ—109 38 LLC-loads Γ—109 25 dTLB-loads Γ—109 338 dTLB-load-misses Γ—109 6 Issue! 45/64
  68. 68. @kuksenk0#JavaPerfAndHWC Example 2: dTLB misses TLB = Translation Lookaside Buffer βˆ™ a memory cache that stores recent translations of virtual memory addresses to physical addresses βˆ™ each memory access β†’ access to TLB βˆ™ TLB miss may take hundreds cycles 46/64
  69. 69. @kuksenk0#JavaPerfAndHWC Example 2: dTLB misses TLB = Translation Lookaside Buffer βˆ™ a memory cache that stores recent translations of virtual memory addresses to physical addresses βˆ™ each memory access β†’ access to TLB βˆ™ TLB miss may take hundreds cycles How to check? βˆ™ dtlb_load_misses_miss_causes_a_walk βˆ™ dtlb_load_misses_walk_duration 46/64
  70. 70. @kuksenk0#JavaPerfAndHWC Example 2: dTLB misses IPC 0.54 cycles Γ—109 2277 instructions Γ—109 1226 L1-dcache-loads Γ—109 338 L1-dcache-load-misses Γ—109 38 LLC-loads Γ—109 25 dTLB-loads Γ—109 338 dTLB-load-misses Γ—109 6 dTLB-walks-duration ⋆ Γ—109 296 13%of cycles (time) ⋆ in cycles 47/64
  71. 71. @kuksenk0#JavaPerfAndHWC Example 2: dTLB misses Fixing it: βˆ™ Enable -XX:+UseLargePages βˆ™ try to shrink application working set 48/64
  72. 72. @kuksenk0#JavaPerfAndHWC Example 2: -XX:+UseLargePages baseline large pages IPC 0.54 0.64 cycles Γ—109 2277 2277 instructions Γ—109 1226 1460 L1-dcache-loads Γ—109 338 401 L1-dcache-load-misses Γ—109 38 38 LLC-loads Γ—109 25 25 dTLB-loads Γ—109 338 401 dTLB-load-misses Γ—109 6 0.24 dTLB-walks-duration Γ—109 296 2.6 Boost – 20% 49/64
  73. 73. @kuksenk0#JavaPerfAndHWC Example 2: normalized per transaction baseline large pages IPC 0.54 0.64 cycles Γ—106 23.4 19.5 instructions Γ—106 12.5 12.5 L1-dcache-loads Γ—106 3.45 3.45 L1-dcache-load-misses Γ—106 0.39 0.33 LLC-loads Γ—106 0.26 0.21 dTLB-loads Γ—106 3.45 3.45 dTLB-load-misses Γ—106 0.06 0.002 dTLB-walks-duration Γ—106 3.04 0.022 50/64
  74. 74. @kuksenk0#JavaPerfAndHWC Example 3 Β«A plague on both your housesΒ» Β«++ on both your threadsΒ» 51/64
  75. 75. @kuksenk0#JavaPerfAndHWC Example 3: False Sharing @State(Scope.Group) public static class StateBaseline { int field0; int field1; } @Benchmark @Group("baseline") public int justDoIt(StateBaseline s) { return s.field0 ++; } @Benchmark @Group("baseline") public int doItFromOtherThread(StateBaseline s) { return s.field1 ++; } 52/64
  76. 76. @kuksenk0#JavaPerfAndHWC Example 3: Measure resource ⋆ sharing padded Same Core (HT) L1 9.5 4.9 Diff Cores (within socket) L3 (LLC) 10.6 2.8 Diff Sockets nothing 18.2 2.8 average time, ns/op ⋆ shared between threads 53/64
  77. 77. @kuksenk0#JavaPerfAndHWC Example 3: What do HWC tell us? Same Core Diff Cores Diff Sockets sharing padded sharing padded sharing padded CPI 1.3 0.7 1.4 0.4 1.7 0.4 cycles 33130 17536 36012 9163 46484 9608 instructions 26418 25865 26550 25747 26717 25768 L1-loads 12593 9467 9696 8973 9672 9016 L1-load-misses 10 5 12 4 33 3 L1-stores 4317 7838 7433 4069 6935 4074 L1-store-misses 5 2 161 2 55 1 LLΠ‘-loads 4 3 58 1 32 1 LLC-load-misses 1 1 53 β‰ˆ0 35 β‰ˆ0 LLC-stores 1 1 183 β‰ˆ0 49 β‰ˆ0 LLC-store-misses 1 β‰ˆ0 182 β‰ˆ0 48 β‰ˆ0 ⋆ All values are normalized per 103 operations 54/64
  78. 78. @kuksenk0#JavaPerfAndHWC Example 3: Β«on a core and a prayerΒ» in case of the single core(HT) we have to look into MACHINE_CLEARS.MEMORY_ORDERING Same Core Diff Cores Diff Sockets sharing padded sharing padded sharing padded CPI 1.3 0.7 1.4 0.4 1.7 0.4 cycles 33130 17536 36012 9163 46484 9608 instructions 26418 25865 26550 25747 26717 25768 CLEARS 238 β‰ˆ0 β‰ˆ0 β‰ˆ0 β‰ˆ0 β‰ˆ0 55/64
  79. 79. @kuksenk0#JavaPerfAndHWC Example 3: Diff Cores L2_STORE_LOCK_RQSTS - L2 RFOs breakdown Diff Cores Diff Sockets sharing padded sharing padded CPI 1.4 0.4 1.7 0.4 LLC-stores 183 β‰ˆ0 49 β‰ˆ0 LLC-store-misses 182 β‰ˆ0 48 β‰ˆ0 L2_STORE_LOCK_RQSTS.MISS 134 β‰ˆ0 33 β‰ˆ0 L2_STORE_LOCK_RQSTS.HIT_E β‰ˆ0 β‰ˆ0 β‰ˆ0 β‰ˆ0 L2_STORE_LOCK_RQSTS.HIT_M β‰ˆ0 β‰ˆ0 β‰ˆ0 β‰ˆ0 L2_STORE_LOCK_RQSTS.ALL 183 β‰ˆ0 49 β‰ˆ0 56/64
  80. 80. @kuksenk0#JavaPerfAndHWC Question! To the audience 57/64
  81. 81. @kuksenk0#JavaPerfAndHWC Example 3: Diff Cores Diff Cores Diff Sockets sharing padded sharing padded CPI 1.4 0.4 1.7 0.4 LLC-stores 183 β‰ˆ0 49 β‰ˆ0 LLC-store-misses 182 β‰ˆ0 48 β‰ˆ0 L2_STORE_LOCK_RQSTS.MISS 134 β‰ˆ0 33 β‰ˆ0 L2_STORE_LOCK_RQSTS.ALL 183 β‰ˆ0 49 β‰ˆ0 Why 183 > 49 & 134 > 33, but the same socket case is faster? 58/64
  82. 82. @kuksenk0#JavaPerfAndHWC Example 3: Some events count duration For example: OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO Diff Cores Diff Sockets sharing padded sharing padded CPI 1.4 0.4 1.7 0.4 cycles 36012 9163 46484 9608 instructions 26550 25747 26717 25768 O_R_O.CYCLES_W_D_RFO 21723 10 29601 56 59/64
  83. 83. @kuksenk0#JavaPerfAndHWC @kuksenk0#JavaPerfAndHWC S um m ary
  84. 84. @kuksenk0#JavaPerfAndHWC Summary: Performance is easy To achieve high performance: βˆ™ You have to know your Application! βˆ™ You have to know your Frameworks! βˆ™ You have to know your Virtual Machine! βˆ™ You have to know your Operating System! βˆ™ You have to know your Hardware! 61/64
  85. 85. @kuksenk0#JavaPerfAndHWC Summary: Β«Here be dragonsΒ» To achieve high performance: βˆ™ You have to know your Application! βˆ™ You have to know your Frameworks! βˆ™ You have to know your Virtual Machine! βˆ™ You have to know your Operating System! βˆ™ You have to know your Hardware! 61/64
  86. 86. @kuksenk0#JavaPerfAndHWC Enlarge your knowledge with these simple tricks! Reading list: βˆ™ β€œComputer Architecture: A Quantitative Approach” John L. Hennessy, David A. Patterson βˆ™ CPU vendors documentation βˆ™ http://www.agner.org/optimize/ βˆ™ http://www.google.com/search?q=Hardware+ performance+counter βˆ™ etc. . . 62/64
  87. 87. @kuksenk0#JavaPerfAndHWC @kuksenk0#JavaPerfAndHWC Th an k you !
  88. 88. @kuksenk0#JavaPerfAndHWC @kuksenk0#JavaPerfAndHWC Q & A

Γ—