Copyright © 2018, Oracle and/or its afliates. All rights reserved.
“Quantum” Performance Effects:
Beyond The Core
Sergey Kuksenko
Java Platform Group, Oracle
October, 2018
InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
cpu-microarchitecture
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Safe Harbor Statement
The following is intended to outline our general product directon. It is intended for
informaton purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functonality, and should not be relied upon
in making purchasing decisions. The development, release, tming, and pricing of any
features or functonality described for Oracle’s products may change and remains at the
sole discreton of Oracle Corporaton.
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
About me
• Java/JVM Performance Engineer at Oracle, @since 2010
• Java/JVM Performance Engineer, @since 2005
• Java/JVM Engineer, @since 1996
3
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
System Under Test
• Intel
® Core
— i5-5300U [2.3 GHz] 1x2x2
– μarch: Haswell
– launched: Q1’2015s
• OS: Xubuntu 18.04 (64-bits) (4.15.0-36-generic)
• Java 8 (64-bits)
• Java 11 (64-bits)
4
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo code
https://github.com/kuksenko/quantum2
5
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo code
https://github.com/kuksenko/quantum2
• Required: JMH (Java Microbenchmark Harness)
– http://openjdk.java.net/projects/code-tools/jmh/
5
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: How to copy 2 Mbytes.
6
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1
int[] a = new int[512*1024];
int[] b = new int[512*1024];
@Benchmark
public void arraycopy() {
System.arraycopy(a, 0, b, 0, a.length);
}
@Benchmark
public void reversecopy() {
for(int i = a.length - 1; i >= 0; i--) {
b[i] = a[i];
}
}
7
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1
int[] a = new int[512*1024];
int[] b = new int[512*1024];
@Benchmark
public void arraycopy() {
System.arraycopy(a, 0, b, 0, a.length);
}
@Benchmark
public void reversecopy() {
for(int i = a.length - 1; i >= 0; i--) {
b[i] = a[i];
}
}
740 μs
300 μs
??
* Using Java 8
7
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Conclusions?
• Oracle engineers - rubbish!
– I know how to copy faster!
8
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Conclusions?
• Oracle engineers - rubbish!
– I know how to copy faster!
8
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Shared results within team
• What I got:
– arraycopy vs reversecopy: 740 vs 300 μs
9
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Shared results within team
• What I got:
– arraycopy vs reversecopy: 740 vs 300 μs
• What Bob got (on some MacBook Pro):
– arraycopy vs reversecopy: 190 vs 185 μs
9
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Shared results within team
• What I got:
– arraycopy vs reversecopy: 740 vs 300 μs
• What Bob got (on some MacBook Pro):
– arraycopy vs reversecopy: 190 vs 185 μs
• What Alice got (she already migrated to JDK11):
– arraycopy vs reversecopy: 270 vs 280 μs
9
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Shared results within team
• What I got:
– arraycopy vs reversecopy: 740 vs 300 μs
• What Bob got (on some MacBook Pro):
– arraycopy vs reversecopy: 190 vs 185 μs
• What Alice got (she already migrated to JDK11):
– arraycopy vs reversecopy: 270 vs 280 μs
• What if copy less data ”2Mbytes - 32 bytes”:
– arraycopy vs reversecopy: 280 vs 720 μs
9
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
spent a billion on research
• MacOS doesn’t support ”Large Pages”!
– Ubuntu - ”Transparent Huge Pages”
• G1 is default GC since Java 9!
– Java 8 default GC - ”ParallelOld”
10
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
spent a billion on research
• MacOS doesn’t support ”Large Pages”!
– Ubuntu - ”Transparent Huge Pages”
• G1 is default GC since Java 9!
– Java 8 default GC - ”ParallelOld”
Conclusions:
• Large Pages - Rubbish!
• G1 GC - Cool!
10
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
spent a billion on research
• MacOS doesn’t support ”Large Pages”!
– Ubuntu - ”Transparent Huge Pages”
• G1 is default GC since Java 9!
– Java 8 default GC - ”ParallelOld”
Conclusions:
• Large Pages - Rubbish!
• G1 GC - Cool!
10
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
To Be Continued ...
11
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 2: How many data?
12
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 2: The Last Jedi Refactoring
public class MyData {
private byte[] bytes;
private int length;
public MyData(int length) {
this.bytes = new byte[length];
this.length = length;
}
public int length() { return length; }
public byte[] bytes() { return bytes; }
}
13
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 2: dataSize(MyData)
MyData[] data = new MyData[256];
@Setup
public void setup() {
Random rnd = new Random();
Arrays.setAll(data, i -> new MyData(512 * 1024 + rnd.nextInt(64 * 1024)));
}
@Benchmark
public int dataSize() {
int s = 0;
for (MyData a : data) {
s += a.length();
}
return s;
}
14
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 2: dataSize(byte[])
byte[][] data = new byte[256][];
@Setup
public void setup() {
Random rnd = new Random();
Arrays.setAll(data, i -> new byte[512 * 1024 + rnd.nextInt(64 * 1024)]);
}
@Benchmark
public int dataSize() {
int s = 0;
for (byte[] a : data) {
s += a.length;
}
return s;
}
15
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 2: results (Java 8)
DataSize(MyData) 145 ns
DataSize(byte[]) 200 ns
?
16
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 2: results (Java 8)
DataSize(MyData) 145 ns
DataSize(byte[]) 200 ns
What if turn on G1? (-XX:+UseG1GC)
?
16
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 2: results (Java 8)
DataSize(MyData) 145 ns
DataSize(byte[]) 200 ns
What if turn on G1? (-XX:+UseG1GC)
DataSize(MyData) 145 ns
DataSize(byte[]) 13045 ns
?
???
16
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 2: results
What if turn off ”Large Pages”?
ParallelOld GC:
DataSize(MyData) 145 ns
DataSize(byte[]) 250 ns
G1 GC:
DataSize(MyData) 145 ns
DataSize(byte[]) 635 ns
??
17
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 2: Conclusions
Conclusions:
• Large Pages - Rubbish!
• G1 GC - Rubbish!
18
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 2: Conclusions
Conclusions:
• Large Pages - Rubbish!
• G1 GC - Rubbish!
18
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
To Be Continued ...
19
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Why we are here?
20
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Caches, caches everywhere
21
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Caches in numbers (Intel Core i5-5300U)
L1 - 32K, 8-way, latency: 4 cycles
L2 - 256K, 8-way, latency: 12 cycles
L3 - 3M, 12-way, latency: 35(and more) cycles
- cache line - 64 bytes
22
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 3: memory access cost.
23
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 3: walking on memory
Node root;
@Benchmark
@OperationsPerInvocation(COUNT)
public int walk() {
return forward(root, COUNT);
}
public int forward(Node node, int cnt) {
for(int i=0; i < cnt; i++) {
node = node.next;
}
return node.value;
}
24
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 3: walking on memory
2.2 ns
5.2 ns
15.4 ns
35 ns
25
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 3: walking on memory
What about HW prefetching?
26
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 3: different mix
27
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 3: different mix
28
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 3: different mix
2.7x
29
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 3: different mix
4x
30
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 3: different mix
12.6x
31
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 4: to split or not to split?
32
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 4: Good old Unsafe!
Unsafe UNSAFE;
long from; // page alignment
@Param({"-8", "-4", "-2", "0", "2", "4", "8" })
int offset; // offset in bytes
@Benchmark
public long getlong() {
return UNSAFE.getLong(a, from + offset);
}
@Benchmark
public void putlong() {
UNSAFE.putLong(a, from + offset, 42L);
}
33
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 4: Results
offset getlong putlong
-8 5.0 1.8
-4 19.1 17.8
0 5.0 1.8
60 5.2 2.5
64 5.0 1.8
time, ns/op
34
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 4: Results
offset getlong putlong
-8 5.0 1.8
-4 19.1 17.8
0 5.0 1.8
60 5.2 2.5
64 5.0 1.8
time, ns/op
unaligned data:
Page Split!
Line Split!
34
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 4: Misalignment
But wait!
Java doesn’t have misaligned data!
35
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 4: Misalignment
But wait!
Java doesn’t have misaligned data!
There are no misaligned data,
but there are misaligned operations.
35
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 4: Misalignment
Java misaligned access:
• Unsafe/VarHandle
– Buffers
– Offheap
• SIMD instructions (SSE, AVX ...)
– HotSpot intrinsics (System.arraycopy, Arrays.fill ...)
– Automatic vectorization
36
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 4: Arrays.fill
int from; // alignment to page boundary
int size;
int offset;
byte[] a;
@Benchmark
public void fill() {
Arrays.fill(a, from + offset, from + offset + size, (byte)42);
}
37
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 4: Arrays.fill, 512 bytes
38
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 4: Arrays.fill, 512 bytes
39
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 5: upside down
40
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 5: matrix transpose
int size;
double[][] matrix = new double[size][size];
@Benchmark
public void transpose() {
for (int i = 1; i < size; i++) {
for (int j = 0; j < i; j++) {
double tmp = matrix[i][j];
matrix[i][j] = matrix[j][i];
matrix[j][i] = tmp;
}
}
}
41
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 5: results (NxN)
N
N+0
N+1
N+2
N+3 80 μs
42
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 5: results (NxN)
N
N+0
N+1
N+2 94 μs
N+3 80 μs
42
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 5: results (NxN)
N
N+0 88 μs
N+1
N+2 94 μs
N+3 80 μs
42
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 5: results (NxN)
N
N+0 88 μs
N+1 350 μs
N+2 94 μs
N+3 80 μs
42
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 5: results (NxN)
N
253 88 μs
254 350 μs
255 94 μs
256 80 μs
42
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 5: matrix transpose
43
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Cache Associativity
44
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Critical Stride
〈Crtc Strde〉 =
〈Cche Sze〉
〈Assoctty〉
• L1 (32K, 8-way) ⇒ 4K
• L2 (256K, 8-way) ⇒ 32K
• L3 (3M, 12-way) ⇒ 256K
45
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 2: How many data?(cont.)
46
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
”critical stride” hit
Let’s count:
• G1 GC
– all arrays are aligned to 1M (256K, 32K, 4K)
• ParallelOld GC
– 256 arrays ⇒ 254 different ”index sets” в L3
– 256 arrays ⇒ 251 different ”index sets” в L2
– 256 arrays ⇒ 62 different ”index sets” в L1
– number of hits to L1 index sets:
10, 9, 8, 8, 8, 7, 7...
47
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 6: the rich get richer
48
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 6: Walking dead threads
@Benchmark
@Group("pair")
@OperationsPerInvocation(COUNT)
public int bob() {
return forward(root, COUNT);
}
@Benchmark
@Group("pair")
@OperationsPerInvocation(COUNT)
public int alice() {
return forward(root, COUNT);
}
Each thread has it’s own root
and independent data.
49
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 6: 128K per thread
Iteration 1:
bob: 5.246 ns/op
alice: 5.241 ns/op
Iteration 2:
bob: 5.254 ns/op
alice: 5.272 ns/op
Iteration 3:
bob: 5.233 ns/op
alice: 5.244 ns/op
Iteration 4:
bob: 5.244 ns/op
alice: 5.232 ns/op
50
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 6: 1M per thread
Iteration 1:
bob: 14.495 ns/op
alice: 14.614 ns/op
Iteration 2:
bob: 14.289 ns/op
alice: 14.331 ns/op
Iteration 3:
bob: 14.242 ns/op
alice: 14.296 ns/op
Iteration 4:
bob: 14.332 ns/op
alice: 14.332 ns/op
51
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 6: 2M per thread
Iteration 1:
bob: 17.199 ns/op
alice: 48.845 ns/op
Iteration 2:
bob: 46.777 ns/op
alice: 20.850 ns/op
Iteration 3:
bob: 17.046 ns/op
alice: 48.686 ns/op
Iteration 4:
bob: 46.422 ns/op
alice: 20.704 ns/op
Fight for LLC!
52
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 6: 2M per thread
Iteration 1:
bob: 17.199 ns/op
alice: 48.845 ns/op
Iteration 2:
bob: 46.777 ns/op
alice: 20.850 ns/op
Iteration 3:
bob: 17.046 ns/op
alice: 48.686 ns/op
Iteration 4:
bob: 46.422 ns/op
alice: 20.704 ns/op
Fight for LLC!
∼ 1 ≤
〈Tot Workng Set〉
〈LLC sze〉
≤∼ 2.5
52
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 7: Bytes histogram
53
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 7: count bytes frequency
byte[] source; // SIZE == 16 * K;
@Benchmark
public int[] count1() {
int[] table = new int[256];
for (byte v : source) {
table[v & 0xFF]++;
}
return table;
}
54
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 7: count bytes frequency
byte[] source; // SIZE == 16 * K;
@Benchmark
public int[] count1() {
int[] table = new int[256];
for (byte v : source) {
table[v & 0xFF]++;
}
return table;
}
13.7 μs
54
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 7: count bytes frequency
byte[] source; // SIZE == 16 * K;
@Benchmark
public int[] count1() {
int[] table = new int[256];
for (byte v : source) {
table[v & 0xFF]++;
}
return table;
}
13.7 μs
What if the data is unevenly distributed?
54
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Results
55
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Store Buffer
CPU
L1 Cache
4-5
clocks
56
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Store Buffer
CPU
L1 Cache
4-5
clocks
Store
Buffer
56
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Store Forwarding
Store A;
Load B;
Load A;
57
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Store Forwarding
Store A;
Load B;
Load A;
* Store Buffer
Store A;
57
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Store Forwarding
Store A;
Load B;
Load A;
*
Store Buffer
Store A;
No “B” in Store Buffer
Execute!
even before Store
57
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Store Forwarding
Store A;
Load B;
Load A;*
Store Buffer
Store A;
“A” exists in
Store Buffer What to do?
57
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Hit to “Store Buffer”
• Wait until “Store A” reaches L1 (expensive)
• Take value from Store Buffer (a.k.a. “Store Forwarding”)
58
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Let’s do this
@Benchmark
public int[] count2() {
int[] table0 = new int[256];
int[] table1 = new int[256];
for (int i = 0; i < source.length; ) {
table0[source[i++] & 0xFF]++;
table1[source[i++] & 0xFF]++;
}
for (int i = 0; i < 256; i++) {
table0[i] += table1[i];
}
return table0;
}
59
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
... and this
@Benchmark
public int[] count4() {
int[] table0 = new int[256];
int[] table1 = new int[256];
int[] table2 = new int[256];
int[] table3 = new int[256];
for (int i = 0; i < source.length; ) {
table0[source[i++] & 0xFF]++;
table1[source[i++] & 0xFF]++;
table2[source[i++] & 0xFF]++;
table3[source[i++] & 0xFF]++;
}
for (int i = 0; i < 256; i++) {
table0[i] += table1[i] + table2[i] + table3[i];
}
return table0;
}
60
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Results
61
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 8: bytes ⇔ int
62
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 8: bytes ⇔ int
ByteBuffer buf = ByteBuffer.allocateDirect(4);
@Benchmark
public int bytesToInt() {
buf.put(0, b0);
buf.put(1, b1);
buf.put(2, b2);
buf.put(3, b3);
return buf.getInt(0);
}
@Benchmark
public int intToBytes() {
buf.putInt(0, i0);
return buf.get(0) + buf.get(1) +
buf.get(2) + buf.get(3);
}
63
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 8: bytes ⇔ int
ByteBuffer buf = ByteBuffer.allocateDirect(4);
@Benchmark
public int bytesToInt() {
buf.put(0, b0);
buf.put(1, b1);
buf.put(2, b2);
buf.put(3, b3);
return buf.getInt(0);
}
@Benchmark
public int intToBytes() {
buf.putInt(0, i0);
return buf.get(0) + buf.get(1) +
buf.get(2) + buf.get(3);
}
13.2 ns
7.9 ns
63
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 8: Store Forwarding success
int
byte
byte
byte
byte
Store
Load
Load
Load
Load
64
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 8: Store Forwarding fail
int
byte
byte
byte
byte
Store
Store
Store
Store
Load
65
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: back to arraycopy
66
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: looking into asm
loop: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
vmovdqu %ymm0,-0x38(%rsi,%rdx,8)
vmovdqu -0x18(%rdi,%rdx,8),%ymm1
vmovdqu %ymm1,-0x18(%rsi,%rdx,8)
add $0x8,%rdx
jle loop
loop: vmovdqu -0xc(%r8,%rbx,4),%ymm0
vmovdqu %ymm0,-0xc(%r10,%rbx,4)
add $0xfffffff8,%ebx
cmp $0x6,%ebx
jg loop
arraycopy reversecopy
67
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: What about memory layout?
• ParallelOld GC
– AddressOf(a) == 0x76d890628
– AddressOf(b) == 0x76da90638
– AddressOf(b) − AddressOf(a) == 2Mb + 16
• G1 GC
– AddressOf(a) == 0x6c7200000
– AddressOf(b) == 0x6c7500000
– AddressOf(b) − AddressOf(a) == 3Mb
68
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: What about memory layout?
• ParallelOld GC
– AddressOf(a) == 0x76d890628
– AddressOf(b) == 0x76da90638
– AddressOf(b) − AddressOf(a) == 2Mb + 16
• G1 GC
– AddressOf(a) == 0x6c7200000
– AddressOf(b) == 0x6c7500000
– AddressOf(b) − AddressOf(a) == 3Mb
68
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: 4K-aliasing
HW uses 12 lower bits of address to detect Store Buffer conflicts.
• address difference 4K (12 bit)
• “Load” can’t bypass “Store”
• “Store Forwarding” can’t help - different addresses.
HW recovery:
– wait until “Store” is finished
– “clear pipeline” in case of speculation
69
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: arraycopy trace
Load A;
Store B;
Load A + 32;
Store B + 32;
Load A + 64;
Store B + 64;
...
70
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: arraycopy trace
B == A + 2M + 16;
Load A;
Store A + 2M + 16;
Load A + 32;
Store A + 2M + 48;
Load A + 64;
Store A + 2M + 80;
...
71
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: arraycopy trace
B == A + 2M + 16;
address % 4096
Load A; 0
Store A + 2M + 16; 16
Load A + 32; 32
Store A + 2M + 48; 48
Load A + 64; 64
Store A + 2M + 80; 80
...
4K-aliasing
72
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: 1K copying
73
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: 1K copying
Everything fine
73
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: 1K copying
Everything fine
Misaligned access
73
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: 1K copying
Everything fine
Misaligned access
”4K-aliasing”
73
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: too many details
74
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: It’s not the end
turned on ”Large Pages”
addresses difference 1M
75
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Demo 1: All together
Data copying performance depends on
how data located in memory
• Line split
• Page split
• 4K-aliasing
• ”1M & large pages aliasing” (still didn’t find an explanation)
76
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Conclusion
77
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
To read!
• “What Every Programmer Should Know About Memory”
Ulrich Drepper
• “Computer Architecture: A Quantitative Approach”
John L. Hennessy, David A. Patterson
• CPU vendors documentation
• http://www.agner.org/optimize/
• etc.
78
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Thank you!
79
Copyright © 2018, Oracle and/or its afliates. All rights reserved.
Q & A ?
80
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
cpu-microarchitecture

“Quantum” Performance Effects: beyond the Core

  • 1.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. “Quantum” Performance Effects: Beyond The Core Sergey Kuksenko Java Platform Group, Oracle October, 2018
  • 2.
    InfoQ.com: News &Community Site • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ cpu-microarchitecture
  • 3.
    Purpose of QCon -to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Safe Harbor Statement The following is intended to outline our general product directon. It is intended for informaton purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functonality, and should not be relied upon in making purchasing decisions. The development, release, tming, and pricing of any features or functonality described for Oracle’s products may change and remains at the sole discreton of Oracle Corporaton.
  • 5.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. About me • Java/JVM Performance Engineer at Oracle, @since 2010 • Java/JVM Performance Engineer, @since 2005 • Java/JVM Engineer, @since 1996 3
  • 6.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. System Under Test • Intel ® Core — i5-5300U [2.3 GHz] 1x2x2 – μarch: Haswell – launched: Q1’2015s • OS: Xubuntu 18.04 (64-bits) (4.15.0-36-generic) • Java 8 (64-bits) • Java 11 (64-bits) 4
  • 7.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo code https://github.com/kuksenko/quantum2 5
  • 8.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo code https://github.com/kuksenko/quantum2 • Required: JMH (Java Microbenchmark Harness) – http://openjdk.java.net/projects/code-tools/jmh/ 5
  • 9.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: How to copy 2 Mbytes. 6
  • 10.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1 int[] a = new int[512*1024]; int[] b = new int[512*1024]; @Benchmark public void arraycopy() { System.arraycopy(a, 0, b, 0, a.length); } @Benchmark public void reversecopy() { for(int i = a.length - 1; i >= 0; i--) { b[i] = a[i]; } } 7
  • 11.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1 int[] a = new int[512*1024]; int[] b = new int[512*1024]; @Benchmark public void arraycopy() { System.arraycopy(a, 0, b, 0, a.length); } @Benchmark public void reversecopy() { for(int i = a.length - 1; i >= 0; i--) { b[i] = a[i]; } } 740 μs 300 μs ?? * Using Java 8 7
  • 12.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Conclusions? • Oracle engineers - rubbish! – I know how to copy faster! 8
  • 13.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Conclusions? • Oracle engineers - rubbish! – I know how to copy faster! 8
  • 14.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Shared results within team • What I got: – arraycopy vs reversecopy: 740 vs 300 μs 9
  • 15.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Shared results within team • What I got: – arraycopy vs reversecopy: 740 vs 300 μs • What Bob got (on some MacBook Pro): – arraycopy vs reversecopy: 190 vs 185 μs 9
  • 16.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Shared results within team • What I got: – arraycopy vs reversecopy: 740 vs 300 μs • What Bob got (on some MacBook Pro): – arraycopy vs reversecopy: 190 vs 185 μs • What Alice got (she already migrated to JDK11): – arraycopy vs reversecopy: 270 vs 280 μs 9
  • 17.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Shared results within team • What I got: – arraycopy vs reversecopy: 740 vs 300 μs • What Bob got (on some MacBook Pro): – arraycopy vs reversecopy: 190 vs 185 μs • What Alice got (she already migrated to JDK11): – arraycopy vs reversecopy: 270 vs 280 μs • What if copy less data ”2Mbytes - 32 bytes”: – arraycopy vs reversecopy: 280 vs 720 μs 9
  • 18.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. spent a billion on research • MacOS doesn’t support ”Large Pages”! – Ubuntu - ”Transparent Huge Pages” • G1 is default GC since Java 9! – Java 8 default GC - ”ParallelOld” 10
  • 19.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. spent a billion on research • MacOS doesn’t support ”Large Pages”! – Ubuntu - ”Transparent Huge Pages” • G1 is default GC since Java 9! – Java 8 default GC - ”ParallelOld” Conclusions: • Large Pages - Rubbish! • G1 GC - Cool! 10
  • 20.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. spent a billion on research • MacOS doesn’t support ”Large Pages”! – Ubuntu - ”Transparent Huge Pages” • G1 is default GC since Java 9! – Java 8 default GC - ”ParallelOld” Conclusions: • Large Pages - Rubbish! • G1 GC - Cool! 10
  • 21.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. To Be Continued ... 11
  • 22.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 2: How many data? 12
  • 23.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 2: The Last Jedi Refactoring public class MyData { private byte[] bytes; private int length; public MyData(int length) { this.bytes = new byte[length]; this.length = length; } public int length() { return length; } public byte[] bytes() { return bytes; } } 13
  • 24.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 2: dataSize(MyData) MyData[] data = new MyData[256]; @Setup public void setup() { Random rnd = new Random(); Arrays.setAll(data, i -> new MyData(512 * 1024 + rnd.nextInt(64 * 1024))); } @Benchmark public int dataSize() { int s = 0; for (MyData a : data) { s += a.length(); } return s; } 14
  • 25.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 2: dataSize(byte[]) byte[][] data = new byte[256][]; @Setup public void setup() { Random rnd = new Random(); Arrays.setAll(data, i -> new byte[512 * 1024 + rnd.nextInt(64 * 1024)]); } @Benchmark public int dataSize() { int s = 0; for (byte[] a : data) { s += a.length; } return s; } 15
  • 26.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 2: results (Java 8) DataSize(MyData) 145 ns DataSize(byte[]) 200 ns ? 16
  • 27.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 2: results (Java 8) DataSize(MyData) 145 ns DataSize(byte[]) 200 ns What if turn on G1? (-XX:+UseG1GC) ? 16
  • 28.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 2: results (Java 8) DataSize(MyData) 145 ns DataSize(byte[]) 200 ns What if turn on G1? (-XX:+UseG1GC) DataSize(MyData) 145 ns DataSize(byte[]) 13045 ns ? ??? 16
  • 29.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 2: results What if turn off ”Large Pages”? ParallelOld GC: DataSize(MyData) 145 ns DataSize(byte[]) 250 ns G1 GC: DataSize(MyData) 145 ns DataSize(byte[]) 635 ns ?? 17
  • 30.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 2: Conclusions Conclusions: • Large Pages - Rubbish! • G1 GC - Rubbish! 18
  • 31.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 2: Conclusions Conclusions: • Large Pages - Rubbish! • G1 GC - Rubbish! 18
  • 32.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. To Be Continued ... 19
  • 33.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Why we are here? 20
  • 34.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Caches, caches everywhere 21
  • 35.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Caches in numbers (Intel Core i5-5300U) L1 - 32K, 8-way, latency: 4 cycles L2 - 256K, 8-way, latency: 12 cycles L3 - 3M, 12-way, latency: 35(and more) cycles - cache line - 64 bytes 22
  • 36.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 3: memory access cost. 23
  • 37.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 3: walking on memory Node root; @Benchmark @OperationsPerInvocation(COUNT) public int walk() { return forward(root, COUNT); } public int forward(Node node, int cnt) { for(int i=0; i < cnt; i++) { node = node.next; } return node.value; } 24
  • 38.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 3: walking on memory 2.2 ns 5.2 ns 15.4 ns 35 ns 25
  • 39.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 3: walking on memory What about HW prefetching? 26
  • 40.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 3: different mix 27
  • 41.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 3: different mix 28
  • 42.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 3: different mix 2.7x 29
  • 43.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 3: different mix 4x 30
  • 44.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 3: different mix 12.6x 31
  • 45.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 4: to split or not to split? 32
  • 46.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 4: Good old Unsafe! Unsafe UNSAFE; long from; // page alignment @Param({"-8", "-4", "-2", "0", "2", "4", "8" }) int offset; // offset in bytes @Benchmark public long getlong() { return UNSAFE.getLong(a, from + offset); } @Benchmark public void putlong() { UNSAFE.putLong(a, from + offset, 42L); } 33
  • 47.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 4: Results offset getlong putlong -8 5.0 1.8 -4 19.1 17.8 0 5.0 1.8 60 5.2 2.5 64 5.0 1.8 time, ns/op 34
  • 48.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 4: Results offset getlong putlong -8 5.0 1.8 -4 19.1 17.8 0 5.0 1.8 60 5.2 2.5 64 5.0 1.8 time, ns/op unaligned data: Page Split! Line Split! 34
  • 49.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 4: Misalignment But wait! Java doesn’t have misaligned data! 35
  • 50.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 4: Misalignment But wait! Java doesn’t have misaligned data! There are no misaligned data, but there are misaligned operations. 35
  • 51.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 4: Misalignment Java misaligned access: • Unsafe/VarHandle – Buffers – Offheap • SIMD instructions (SSE, AVX ...) – HotSpot intrinsics (System.arraycopy, Arrays.fill ...) – Automatic vectorization 36
  • 52.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 4: Arrays.fill int from; // alignment to page boundary int size; int offset; byte[] a; @Benchmark public void fill() { Arrays.fill(a, from + offset, from + offset + size, (byte)42); } 37
  • 53.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 4: Arrays.fill, 512 bytes 38
  • 54.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 4: Arrays.fill, 512 bytes 39
  • 55.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 5: upside down 40
  • 56.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 5: matrix transpose int size; double[][] matrix = new double[size][size]; @Benchmark public void transpose() { for (int i = 1; i < size; i++) { for (int j = 0; j < i; j++) { double tmp = matrix[i][j]; matrix[i][j] = matrix[j][i]; matrix[j][i] = tmp; } } } 41
  • 57.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 5: results (NxN) N N+0 N+1 N+2 N+3 80 μs 42
  • 58.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 5: results (NxN) N N+0 N+1 N+2 94 μs N+3 80 μs 42
  • 59.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 5: results (NxN) N N+0 88 μs N+1 N+2 94 μs N+3 80 μs 42
  • 60.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 5: results (NxN) N N+0 88 μs N+1 350 μs N+2 94 μs N+3 80 μs 42
  • 61.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 5: results (NxN) N 253 88 μs 254 350 μs 255 94 μs 256 80 μs 42
  • 62.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 5: matrix transpose 43
  • 63.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Cache Associativity 44
  • 64.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Critical Stride 〈Crtc Strde〉 = 〈Cche Sze〉 〈Assoctty〉 • L1 (32K, 8-way) ⇒ 4K • L2 (256K, 8-way) ⇒ 32K • L3 (3M, 12-way) ⇒ 256K 45
  • 65.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 2: How many data?(cont.) 46
  • 66.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. ”critical stride” hit Let’s count: • G1 GC – all arrays are aligned to 1M (256K, 32K, 4K) • ParallelOld GC – 256 arrays ⇒ 254 different ”index sets” в L3 – 256 arrays ⇒ 251 different ”index sets” в L2 – 256 arrays ⇒ 62 different ”index sets” в L1 – number of hits to L1 index sets: 10, 9, 8, 8, 8, 7, 7... 47
  • 67.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 6: the rich get richer 48
  • 68.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 6: Walking dead threads @Benchmark @Group("pair") @OperationsPerInvocation(COUNT) public int bob() { return forward(root, COUNT); } @Benchmark @Group("pair") @OperationsPerInvocation(COUNT) public int alice() { return forward(root, COUNT); } Each thread has it’s own root and independent data. 49
  • 69.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 6: 128K per thread Iteration 1: bob: 5.246 ns/op alice: 5.241 ns/op Iteration 2: bob: 5.254 ns/op alice: 5.272 ns/op Iteration 3: bob: 5.233 ns/op alice: 5.244 ns/op Iteration 4: bob: 5.244 ns/op alice: 5.232 ns/op 50
  • 70.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 6: 1M per thread Iteration 1: bob: 14.495 ns/op alice: 14.614 ns/op Iteration 2: bob: 14.289 ns/op alice: 14.331 ns/op Iteration 3: bob: 14.242 ns/op alice: 14.296 ns/op Iteration 4: bob: 14.332 ns/op alice: 14.332 ns/op 51
  • 71.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 6: 2M per thread Iteration 1: bob: 17.199 ns/op alice: 48.845 ns/op Iteration 2: bob: 46.777 ns/op alice: 20.850 ns/op Iteration 3: bob: 17.046 ns/op alice: 48.686 ns/op Iteration 4: bob: 46.422 ns/op alice: 20.704 ns/op Fight for LLC! 52
  • 72.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 6: 2M per thread Iteration 1: bob: 17.199 ns/op alice: 48.845 ns/op Iteration 2: bob: 46.777 ns/op alice: 20.850 ns/op Iteration 3: bob: 17.046 ns/op alice: 48.686 ns/op Iteration 4: bob: 46.422 ns/op alice: 20.704 ns/op Fight for LLC! ∼ 1 ≤ 〈Tot Workng Set〉 〈LLC sze〉 ≤∼ 2.5 52
  • 73.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 7: Bytes histogram 53
  • 74.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 7: count bytes frequency byte[] source; // SIZE == 16 * K; @Benchmark public int[] count1() { int[] table = new int[256]; for (byte v : source) { table[v & 0xFF]++; } return table; } 54
  • 75.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 7: count bytes frequency byte[] source; // SIZE == 16 * K; @Benchmark public int[] count1() { int[] table = new int[256]; for (byte v : source) { table[v & 0xFF]++; } return table; } 13.7 μs 54
  • 76.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 7: count bytes frequency byte[] source; // SIZE == 16 * K; @Benchmark public int[] count1() { int[] table = new int[256]; for (byte v : source) { table[v & 0xFF]++; } return table; } 13.7 μs What if the data is unevenly distributed? 54
  • 77.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Results 55
  • 78.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Store Buffer CPU L1 Cache 4-5 clocks 56
  • 79.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Store Buffer CPU L1 Cache 4-5 clocks Store Buffer 56
  • 80.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Store Forwarding Store A; Load B; Load A; 57
  • 81.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Store Forwarding Store A; Load B; Load A; * Store Buffer Store A; 57
  • 82.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Store Forwarding Store A; Load B; Load A; * Store Buffer Store A; No “B” in Store Buffer Execute! even before Store 57
  • 83.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Store Forwarding Store A; Load B; Load A;* Store Buffer Store A; “A” exists in Store Buffer What to do? 57
  • 84.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Hit to “Store Buffer” • Wait until “Store A” reaches L1 (expensive) • Take value from Store Buffer (a.k.a. “Store Forwarding”) 58
  • 85.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Let’s do this @Benchmark public int[] count2() { int[] table0 = new int[256]; int[] table1 = new int[256]; for (int i = 0; i < source.length; ) { table0[source[i++] & 0xFF]++; table1[source[i++] & 0xFF]++; } for (int i = 0; i < 256; i++) { table0[i] += table1[i]; } return table0; } 59
  • 86.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. ... and this @Benchmark public int[] count4() { int[] table0 = new int[256]; int[] table1 = new int[256]; int[] table2 = new int[256]; int[] table3 = new int[256]; for (int i = 0; i < source.length; ) { table0[source[i++] & 0xFF]++; table1[source[i++] & 0xFF]++; table2[source[i++] & 0xFF]++; table3[source[i++] & 0xFF]++; } for (int i = 0; i < 256; i++) { table0[i] += table1[i] + table2[i] + table3[i]; } return table0; } 60
  • 87.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Results 61
  • 88.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 8: bytes ⇔ int 62
  • 89.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 8: bytes ⇔ int ByteBuffer buf = ByteBuffer.allocateDirect(4); @Benchmark public int bytesToInt() { buf.put(0, b0); buf.put(1, b1); buf.put(2, b2); buf.put(3, b3); return buf.getInt(0); } @Benchmark public int intToBytes() { buf.putInt(0, i0); return buf.get(0) + buf.get(1) + buf.get(2) + buf.get(3); } 63
  • 90.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 8: bytes ⇔ int ByteBuffer buf = ByteBuffer.allocateDirect(4); @Benchmark public int bytesToInt() { buf.put(0, b0); buf.put(1, b1); buf.put(2, b2); buf.put(3, b3); return buf.getInt(0); } @Benchmark public int intToBytes() { buf.putInt(0, i0); return buf.get(0) + buf.get(1) + buf.get(2) + buf.get(3); } 13.2 ns 7.9 ns 63
  • 91.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 8: Store Forwarding success int byte byte byte byte Store Load Load Load Load 64
  • 92.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 8: Store Forwarding fail int byte byte byte byte Store Store Store Store Load 65
  • 93.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: back to arraycopy 66
  • 94.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: looking into asm loop: vmovdqu -0x38(%rdi,%rdx,8),%ymm0 vmovdqu %ymm0,-0x38(%rsi,%rdx,8) vmovdqu -0x18(%rdi,%rdx,8),%ymm1 vmovdqu %ymm1,-0x18(%rsi,%rdx,8) add $0x8,%rdx jle loop loop: vmovdqu -0xc(%r8,%rbx,4),%ymm0 vmovdqu %ymm0,-0xc(%r10,%rbx,4) add $0xfffffff8,%ebx cmp $0x6,%ebx jg loop arraycopy reversecopy 67
  • 95.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: What about memory layout? • ParallelOld GC – AddressOf(a) == 0x76d890628 – AddressOf(b) == 0x76da90638 – AddressOf(b) − AddressOf(a) == 2Mb + 16 • G1 GC – AddressOf(a) == 0x6c7200000 – AddressOf(b) == 0x6c7500000 – AddressOf(b) − AddressOf(a) == 3Mb 68
  • 96.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: What about memory layout? • ParallelOld GC – AddressOf(a) == 0x76d890628 – AddressOf(b) == 0x76da90638 – AddressOf(b) − AddressOf(a) == 2Mb + 16 • G1 GC – AddressOf(a) == 0x6c7200000 – AddressOf(b) == 0x6c7500000 – AddressOf(b) − AddressOf(a) == 3Mb 68
  • 97.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: 4K-aliasing HW uses 12 lower bits of address to detect Store Buffer conflicts. • address difference 4K (12 bit) • “Load” can’t bypass “Store” • “Store Forwarding” can’t help - different addresses. HW recovery: – wait until “Store” is finished – “clear pipeline” in case of speculation 69
  • 98.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: arraycopy trace Load A; Store B; Load A + 32; Store B + 32; Load A + 64; Store B + 64; ... 70
  • 99.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: arraycopy trace B == A + 2M + 16; Load A; Store A + 2M + 16; Load A + 32; Store A + 2M + 48; Load A + 64; Store A + 2M + 80; ... 71
  • 100.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: arraycopy trace B == A + 2M + 16; address % 4096 Load A; 0 Store A + 2M + 16; 16 Load A + 32; 32 Store A + 2M + 48; 48 Load A + 64; 64 Store A + 2M + 80; 80 ... 4K-aliasing 72
  • 101.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: 1K copying 73
  • 102.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: 1K copying Everything fine 73
  • 103.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: 1K copying Everything fine Misaligned access 73
  • 104.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: 1K copying Everything fine Misaligned access ”4K-aliasing” 73
  • 105.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: too many details 74
  • 106.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: It’s not the end turned on ”Large Pages” addresses difference 1M 75
  • 107.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Demo 1: All together Data copying performance depends on how data located in memory • Line split • Page split • 4K-aliasing • ”1M & large pages aliasing” (still didn’t find an explanation) 76
  • 108.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Conclusion 77
  • 109.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. To read! • “What Every Programmer Should Know About Memory” Ulrich Drepper • “Computer Architecture: A Quantitative Approach” John L. Hennessy, David A. Patterson • CPU vendors documentation • http://www.agner.org/optimize/ • etc. 78
  • 110.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Thank you! 79
  • 111.
    Copyright © 2018,Oracle and/or its afliates. All rights reserved. Q & A ? 80
  • 112.
    Watch the videowith slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ cpu-microarchitecture