Let’s talk about microbenchmarking

Andrey Akinshin
Andrey AkinshinSoftware developer at JetBrains
Let’s talk about microbenchmarking
Andrey Akinshin, JetBrains
DotNext 2016 Helsinki
1/29
BenchmarkDotNet
2/29
Today’s environment
• BenchmarkDotNet v0.10.1
• Haswell Core i7-4702MQ CPU 2.20GHz, Windows 10
• .NET Framework 4.6.2 + clrjit/compatjit-v4.6.1586.0
• Source code: https://git.io/v1RRX
3/29
Today’s environment
• BenchmarkDotNet v0.10.1
• Haswell Core i7-4702MQ CPU 2.20GHz, Windows 10
• .NET Framework 4.6.2 + clrjit/compatjit-v4.6.1586.0
• Source code: https://git.io/v1RRX
Other environments:
C# compiler old csc / Roslyn
CLR CLR2 / CLR4 / CoreCLR / Mono
OS Windows / Linux / MacOS / FreeBSD
JIT LegacyJIT-x86 / LegacyJIT-x64 / RyuJIT-x64
GC MS (different CLRs) / Mono (Boehm/Sgen)
Toolchain JIT / NGen / .NET Native
Hardware ∞ different configurations
. . . . . .
And don’t forget about multiple versions
3/29
Count of iterations
A bad benchmark
// Resolution (Stopwatch) = 466 ns
// Latency (Stopwatch) = 18 ns
var sw = Stopwatch.StartNew();
Foo(); // 100 ns
sw.Stop();
WriteLine(sw.ElapsedMilliseconds);
4/29
Count of iterations
A bad benchmark
// Resolution (Stopwatch) = 466 ns
// Latency (Stopwatch) = 18 ns
var sw = Stopwatch.StartNew();
Foo(); // 100 ns
sw.Stop();
WriteLine(sw.ElapsedMilliseconds);
A better benchmark
var sw = Stopwatch.StartNew();
for (int i = 0; i < N; i++) // (N * 100 + eps) ns
Foo(); // 100 ns
sw.Stop();
var total = sw.ElapsedTicks / Stopwatch.Frequency;
WriteLine(total / N);
4/29
Several launches
Run 01 : 529.8674 ns/op
Run 02 : 532.7541 ns/op
Run 03 : 558.7448 ns/op
Run 04 : 555.6647 ns/op
Run 05 : 539.6401 ns/op
Run 06 : 539.3494 ns/op
Run 07 : 564.3222 ns/op
Run 08 : 551.9544 ns/op
Run 09 : 550.1608 ns/op
Run 10 : 533.0634 ns/op
5/29
Several launches
6/29
A simple case
Central limit theorem to the rescue!
7/29
Complicated cases
8/29
Latencies
Event Latency
1 CPU cycle 0.3 ns
Level 1 cache access 0.9 ns
Level 2 cache access 2.8 ns
Level 3 cache access 12.9 ns
Main memory access 120 ns
Solid-state disk I/O 50-150 µs
Rotational disk I/O 1-10 ms
Internet: SF to NYC 40 ms
Internet: SF to UK 81 ms
Internet: SF to Australia 183 ms
OS virtualization reboot 4 sec
Hardware virtualization reboot 40 sec
Physical system reboot 5 min
© Systems Performance: Enterprise and the Cloud
9/29
Sum of elements
const int N = 1024;
int[,] a = new int[N, N];
[Benchmark]
public double SumIJ()
{
var sum = 0;
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
sum += a[i, j];
return sum;
}
[Benchmark]
public double SumJI()
{
var sum = 0;
for (int j = 0; j < N; j++)
for (int i = 0; i < N; i++)
sum += a[i, j];
return sum;
}
10/29
Sum of elements
const int N = 1024;
int[,] a = new int[N, N];
[Benchmark]
public double SumIJ()
{
var sum = 0;
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
sum += a[i, j];
return sum;
}
[Benchmark]
public double SumJI()
{
var sum = 0;
for (int j = 0; j < N; j++)
for (int i = 0; i < N; i++)
sum += a[i, j];
return sum;
}
CPU cache effect:
SumIJ SumJI
LegacyJIT-x86 ≈1.3ms ≈4.0ms
10/29
Cache-sensitive benchmarks
Let’s run a benchmark several times:
int[] x = new int[128 * 1024 * 1024];
for (int iter = 0; iter < 5; iter++)
{
var sw = Stopwatch.StartNew();
for (int i = 0; i < x.Length; i += 16)
x[i]++;
sw.Stop();
WriteLine(sw.ElapsedMilliseconds);
}
11/29
Cache-sensitive benchmarks
Let’s run a benchmark several times:
int[] x = new int[128 * 1024 * 1024];
for (int iter = 0; iter < 5; iter++)
{
var sw = Stopwatch.StartNew();
for (int i = 0; i < x.Length; i += 16)
x[i]++;
sw.Stop();
WriteLine(sw.ElapsedMilliseconds);
}
176 // not warmed
81 // still not warmed
62 // the steady state
62 // the steady state
62 // the steady state
Warmup is not only about .NET
11/29
Branch prediction
const int N = 32767;
int[] sorted, unsorted; // random numbers [0..255]
private static int Sum(int[] data)
{
int sum = 0;
for (int i = 0; i < N; i++)
if (data[i] >= 128)
sum += data[i];
return sum;
}
[Benchmark]
public int Sorted()
{
return Sum(sorted);
}
[Benchmark]
public int Unsorted()
{
return Sum(unsorted);
}
12/29
Branch prediction
const int N = 32767;
int[] sorted, unsorted; // random numbers [0..255]
private static int Sum(int[] data)
{
int sum = 0;
for (int i = 0; i < N; i++)
if (data[i] >= 128)
sum += data[i];
return sum;
}
[Benchmark]
public int Sorted()
{
return Sum(sorted);
}
[Benchmark]
public int Unsorted()
{
return Sum(unsorted);
}
Sorted Unsorted
LegacyJIT-x86 ≈20µs ≈139µs
12/29
Isolation
A bad benchmark
var sw1 = Stopwatch.StartNew();
Foo();
sw1.Stop();
var sw2 = Stopwatch.StartNew();
Bar();
sw2.Stop();
13/29
Isolation
A bad benchmark
var sw1 = Stopwatch.StartNew();
Foo();
sw1.Stop();
var sw2 = Stopwatch.StartNew();
Bar();
sw2.Stop();
In general case, you should run each benchmark in
his own process. Remember about:
• Interface method dispatch
• Garbage collector and autotuning
• Conditional jitting
13/29
Interface method dispatch
private interface IInc {
double Inc(double x);
}
private class Foo : IInc {
double Inc(double x) => x + 1;
}
private class Bar : IInc {
double Inc(double x) => x + 1;
}
private double Run(IInc inc) {
double sum = 0;
for (int i = 0; i < 1001; i++)
sum += inc.Inc(0);
return sum;
}
// Which method is faster?
[Benchmark]
public double FooFoo() {
var foo1 = new Foo();
var foo2 = new Foo();
return Run(foo1) + Run(foo2);
}
[Benchmark]
public double FooBar() {
var foo = new Foo();
var bar = new Bar();
return Run(foo) + Run(bar);
}
14/29
Interface method dispatch
private interface IInc {
double Inc(double x);
}
private class Foo : IInc {
double Inc(double x) => x + 1;
}
private class Bar : IInc {
double Inc(double x) => x + 1;
}
private double Run(IInc inc) {
double sum = 0;
for (int i = 0; i < 1001; i++)
sum += inc.Inc(0);
return sum;
}
// Which method is faster?
[Benchmark]
public double FooFoo() {
var foo1 = new Foo();
var foo2 = new Foo();
return Run(foo1) + Run(foo2);
}
[Benchmark]
public double FooBar() {
var foo = new Foo();
var bar = new Bar();
return Run(foo) + Run(bar);
}
FooFoo FooBar
LegacyJIT-x64 ≈5.4µs ≈7.1µs
14/29
Tricky inlining
[Benchmark]
int Calc() => WithoutStarg(0x11) + WithStarg(0x12);
int WithoutStarg(int value) => value;
int WithStarg(int value) {
if (value < 0)
value = -value;
return value;
}
15/29
Tricky inlining
[Benchmark]
int Calc() => WithoutStarg(0x11) + WithStarg(0x12);
int WithoutStarg(int value) => value;
int WithStarg(int value) {
if (value < 0)
value = -value;
return value;
}
LegacyJIT-x86 LegacyJIT-x64 RyuJIT-x64
≈1.7ns 0 ≈1.7ns
15/29
Tricky inlining
[Benchmark]
int Calc() => WithoutStarg(0x11) + WithStarg(0x12);
int WithoutStarg(int value) => value;
int WithStarg(int value) {
if (value < 0)
value = -value;
return value;
}
LegacyJIT-x86 LegacyJIT-x64 RyuJIT-x64
≈1.7ns 0 ≈1.7ns
; LegacyJIT-x64 : Inlining succeeded
mov ecx,23h
ret
15/29
Tricky inlining
[Benchmark]
int Calc() => WithoutStarg(0x11) + WithStarg(0x12);
int WithoutStarg(int value) => value;
int WithStarg(int value) {
if (value < 0)
value = -value;
return value;
}
LegacyJIT-x86 LegacyJIT-x64 RyuJIT-x64
≈1.7ns 0 ≈1.7ns
; LegacyJIT-x64 : Inlining succeeded
mov ecx,23h
ret
// RyuJIT-x64 : Inlining failed
// Inline expansion aborted due to opcode
// [06] OP_starg.s in method
// Program:WithStarg(int):int:this
15/29
SIMD
struct MyVector // Copy-pasted from System.Numerics.Vector4
{
public float X, Y, Z, W;
public MyVector(float x, float y, float z, float w)
{
X = x; Y = y; Z = z; W = w;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static MyVector operator *(MyVector left, MyVector right)
{
return new MyVector(left.X * right.X, left.Y * right.Y,
left.Z * right.Z, left.W * right.W);
}
}
Vector4 vector1, vector2, vector3;
MyVector myVector1, myVector2, myVector3;
[Benchmark] void MyMul() => myVector3 = myVector1 * myVector2;
[Benchmark] void BclMul() => vector3 = vector1 * vector2;
16/29
SIMD
struct MyVector // Copy-pasted from System.Numerics.Vector4
{
public float X, Y, Z, W;
public MyVector(float x, float y, float z, float w)
{
X = x; Y = y; Z = z; W = w;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static MyVector operator *(MyVector left, MyVector right)
{
return new MyVector(left.X * right.X, left.Y * right.Y,
left.Z * right.Z, left.W * right.W);
}
}
Vector4 vector1, vector2, vector3;
MyVector myVector1, myVector2, myVector3;
[Benchmark] void MyMul() => myVector3 = myVector1 * myVector2;
[Benchmark] void BclMul() => vector3 = vector1 * vector2;
LegacyJIT-x64 RyuJIT-x64
MyMul ≈12.9ns ≈2.5ns
BclMul ≈12.9ns ≈0.2ns
16/29
How so?
LegacyJIT-x64 RyuJIT-x64
MyMul ≈12.9ns ≈2.5ns
BclMul ≈12.9ns ≈0.2ns
; LegacyJIT-x64
; MyMul, BclMul: Na¨ıve SSE
; ...
movss xmm3,dword ptr [rsp+40h]
mulss xmm3,dword ptr [rsp+30h]
movss xmm2,dword ptr [rsp+44h]
mulss xmm2,dword ptr [rsp+34h]
movss xmm1,dword ptr [rsp+48h]
mulss xmm1,dword ptr [rsp+38h]
movss xmm0,dword ptr [rsp+4Ch]
mulss xmm0,dword ptr [rsp+3Ch]
xor eax,eax
mov qword ptr [rsp],rax
mov qword ptr [rsp+8],rax
lea rax,[rsp]
movss dword ptr [rax],xmm3
movss dword ptr [rax+4],xmm2
; ...
; RyuJIT-x64
; MyMul: Na¨ıve AVX
; ...
vmulss xmm0,xmm0,xmm4
vmulss xmm1,xmm1,xmm5
vmulss xmm2,xmm2,xmm6
vmulss xmm3,xmm3,xmm7
; ...
; BclMul: Smart AVX intrinsic
vmovupd xmm0,xmmword ptr [rcx+8]
vmovupd xmm1,xmmword ptr [rcx+18h]
vmulps xmm0,xmm0,xmm1
vmovupd xmmword ptr [rcx+28h],xmm0
17/29
Let’s calculate some square roots
[Benchmark]
double Sqrt13() =>
Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + /* ... */
+ Math.Sqrt(13);
VS
[Benchmark]
double Sqrt14() =>
Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + /* ... */
+ Math.Sqrt(13) + Math.Sqrt(14);
18/29
Let’s calculate some square roots
[Benchmark]
double Sqrt13() =>
Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + /* ... */
+ Math.Sqrt(13);
VS
[Benchmark]
double Sqrt14() =>
Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + /* ... */
+ Math.Sqrt(13) + Math.Sqrt(14);
RyuJIT-x64∗
Sqrt13 ≈91ns
Sqrt14 0 ns
∗Can be changed in future versions, see github.com/dotnet/coreclr/issues/987
18/29
How so?
RyuJIT-x64, Sqrt13
vsqrtsd xmm0,xmm0,mmword ptr [7FF94F9E4D28h]
vsqrtsd xmm1,xmm0,mmword ptr [7FF94F9E4D30h]
vaddsd xmm0,xmm0,xmm1
vsqrtsd xmm1,xmm0,mmword ptr [7FF94F9E4D38h]
vaddsd xmm0,xmm0,xmm1
vsqrtsd xmm1,xmm0,mmword ptr [7FF94F9E4D40h]
vaddsd xmm0,xmm0,xmm1
; A lot of vqrtsd and vaddsd instructions
; ...
vsqrtsd xmm1,xmm0,mmword ptr [7FF94F9E4D88h]
vaddsd xmm0,xmm0,xmm1
ret
RyuJIT-x64, Sqrt14
vmovsd xmm0,qword ptr [7FF94F9C4C80h] ; Const
ret
19/29
How so?
Big expression tree
* stmtExpr void (top level) (IL 0x000... ???)
| /--* mathFN double sqrt
| | --* dconst double 13.000000000000000
| /--* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 12.000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 11.000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 10.000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 9.0000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 8.0000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 7.0000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 6.0000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 5.0000000000000000
// ...
20/29
How so?
Constant folding in action
N001 [000001] dconst 1.0000000000000000 => $c0 {DblCns[1.000000]}
N002 [000002] mathFN => $c0 {DblCns[1.000000]}
N003 [000003] dconst 2.0000000000000000 => $c1 {DblCns[2.000000]}
N004 [000004] mathFN => $c2 {DblCns[1.414214]}
N005 [000005] + => $c3 {DblCns[2.414214]}
N006 [000006] dconst 3.0000000000000000 => $c4 {DblCns[3.000000]}
N007 [000007] mathFN => $c5 {DblCns[1.732051]}
N008 [000008] + => $c6 {DblCns[4.146264]}
N009 [000009] dconst 4.0000000000000000 => $c7 {DblCns[4.000000]}
N010 [000010] mathFN => $c1 {DblCns[2.000000]}
N011 [000011] + => $c8 {DblCns[6.146264]}
N012 [000012] dconst 5.0000000000000000 => $c9 {DblCns[5.000000]}
N013 [000013] mathFN => $ca {DblCns[2.236068]}
N014 [000014] + => $cb {DblCns[8.382332]}
N015 [000015] dconst 6.0000000000000000 => $cc {DblCns[6.000000]}
N016 [000016] mathFN => $cd {DblCns[2.449490]}
N017 [000017] + => $ce {DblCns[10.831822]}
N018 [000018] dconst 7.0000000000000000 => $cf {DblCns[7.000000]}
N019 [000019] mathFN => $d0 {DblCns[2.645751]}
N020 [000020] + => $d1 {DblCns[13.477573]}
...
21/29
Concurrency
22/29
True sharing
23/29
False sharing
24/29
False sharing in action
// It's an extremely na¨ıve benchmark
// Don't try this at home
int[] x = new int[1024];
void Inc(int p) {
for (int i = 0; i < 10000001; i++)
x[p]++;
}
void Run(int step) {
var sw = Stopwatch.StartNew();
Task.WaitAll(
Task.Factory.StartNew(() => Inc(0 * step)),
Task.Factory.StartNew(() => Inc(1 * step)),
Task.Factory.StartNew(() => Inc(2 * step)),
Task.Factory.StartNew(() => Inc(3 * step)));
WriteLine(sw.ElapsedMilliseconds);
}
25/29
False sharing in action
// It's an extremely na¨ıve benchmark
// Don't try this at home
int[] x = new int[1024];
void Inc(int p) {
for (int i = 0; i < 10000001; i++)
x[p]++;
}
void Run(int step) {
var sw = Stopwatch.StartNew();
Task.WaitAll(
Task.Factory.StartNew(() => Inc(0 * step)),
Task.Factory.StartNew(() => Inc(1 * step)),
Task.Factory.StartNew(() => Inc(2 * step)),
Task.Factory.StartNew(() => Inc(3 * step)));
WriteLine(sw.ElapsedMilliseconds);
}
Run(1) Run(256)
≈400ms ≈150ms
25/29
Conclusion: Benchmarking is hard
Anon et al., “A Measure of Transaction Processing Power”
There are lies, damn lies and then there are performance
measures.
26/29
Some good books
27/29
Questions?
Andrey Akinshin
http://aakinshin.net
https://github.com/AndreyAkinshin
https://twitter.com/andrey_akinshin
andrey.akinshin@gmail.com
28/29
1 of 41

Recommended

Ch01 basic concepts_nosoluiton by
Ch01 basic concepts_nosoluitonCh01 basic concepts_nosoluiton
Ch01 basic concepts_nosoluitonshin
865 views35 slides
.NET 2015: Будущее рядом by
.NET 2015: Будущее рядом.NET 2015: Будущее рядом
.NET 2015: Будущее рядомAndrey Akinshin
538 views66 slides
Secure and privacy-preserving data transmission and processing using homomorp... by
Secure and privacy-preserving data transmission and processing using homomorp...Secure and privacy-preserving data transmission and processing using homomorp...
Secure and privacy-preserving data transmission and processing using homomorp...DefCamp
470 views102 slides
Bartosz Milewski, “Re-discovering Monads in C++” by
Bartosz Milewski, “Re-discovering Monads in C++”Bartosz Milewski, “Re-discovering Monads in C++”
Bartosz Milewski, “Re-discovering Monads in C++”Platonov Sergey
2.3K views29 slides
Introduction to Homomorphic Encryption by
Introduction to Homomorphic EncryptionIntroduction to Homomorphic Encryption
Introduction to Homomorphic EncryptionChristoph Matthies
13.5K views69 slides
A verifiable random function with short proofs and keys by
A verifiable random function with short proofs and keysA verifiable random function with short proofs and keys
A verifiable random function with short proofs and keysAleksandr Yampolskiy
767 views28 slides

More Related Content

What's hot

How to add an optimization for C# to RyuJIT by
How to add an optimization for C# to RyuJITHow to add an optimization for C# to RyuJIT
How to add an optimization for C# to RyuJITEgor Bogatov
2.7K views45 slides
20140531 serebryany lecture02_find_scary_cpp_bugs by
20140531 serebryany lecture02_find_scary_cpp_bugs20140531 serebryany lecture02_find_scary_cpp_bugs
20140531 serebryany lecture02_find_scary_cpp_bugsComputer Science Club
885 views55 slides
20140531 serebryany lecture01_fantastic_cpp_bugs by
20140531 serebryany lecture01_fantastic_cpp_bugs20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugsComputer Science Club
476 views43 slides
Advance C++notes by
Advance C++notesAdvance C++notes
Advance C++notesRajiv Gupta
1.6K views78 slides
Abstracting Vector Architectures in Library Generators: Case Study Convolutio... by
Abstracting Vector Architectures in Library Generators: Case Study Convolutio...Abstracting Vector Architectures in Library Generators: Case Study Convolutio...
Abstracting Vector Architectures in Library Generators: Case Study Convolutio...ETH Zurich
2.4K views26 slides
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ... by
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...Alex Pruden
790 views48 slides

What's hot(20)

How to add an optimization for C# to RyuJIT by Egor Bogatov
How to add an optimization for C# to RyuJITHow to add an optimization for C# to RyuJIT
How to add an optimization for C# to RyuJIT
Egor Bogatov2.7K views
Advance C++notes by Rajiv Gupta
Advance C++notesAdvance C++notes
Advance C++notes
Rajiv Gupta1.6K views
Abstracting Vector Architectures in Library Generators: Case Study Convolutio... by ETH Zurich
Abstracting Vector Architectures in Library Generators: Case Study Convolutio...Abstracting Vector Architectures in Library Generators: Case Study Convolutio...
Abstracting Vector Architectures in Library Generators: Case Study Convolutio...
ETH Zurich2.4K views
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ... by Alex Pruden
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
Alex Pruden790 views
ZK Study Club: Sumcheck Arguments and Their Applications by Alex Pruden
ZK Study Club: Sumcheck Arguments and Their ApplicationsZK Study Club: Sumcheck Arguments and Their Applications
ZK Study Club: Sumcheck Arguments and Their Applications
Alex Pruden190 views
To Swift 2...and Beyond! by Scott Gardner
To Swift 2...and Beyond!To Swift 2...and Beyond!
To Swift 2...and Beyond!
Scott Gardner32.9K views
zkStudy Club: Subquadratic SNARGs in the Random Oracle Model by Alex Pruden
zkStudy Club: Subquadratic SNARGs in the Random Oracle ModelzkStudy Club: Subquadratic SNARGs in the Random Oracle Model
zkStudy Club: Subquadratic SNARGs in the Random Oracle Model
Alex Pruden131 views
Microsoft Word Hw#1 by kkkseld
Microsoft Word   Hw#1Microsoft Word   Hw#1
Microsoft Word Hw#1
kkkseld599 views
Some examples of the 64-bit code errors by PVS-Studio
Some examples of the 64-bit code errorsSome examples of the 64-bit code errors
Some examples of the 64-bit code errors
PVS-Studio311 views
Advance features of C++ by vidyamittal
Advance features of C++Advance features of C++
Advance features of C++
vidyamittal1.6K views
PVS-Studio team experience: checking various open source projects, or mistake... by Andrey Karpov
PVS-Studio team experience: checking various open source projects, or mistake...PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...
Andrey Karpov337 views
How To Crack RSA Netrek Binary Verification System by Jay Corrales
How To Crack RSA Netrek Binary Verification SystemHow To Crack RSA Netrek Binary Verification System
How To Crack RSA Netrek Binary Verification System
Jay Corrales85 views
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321 by Teddy Hsiung
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321ExperiencesSharingOnEmbeddedSystemDevelopment_20160321
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321
Teddy Hsiung140 views
Конверсия управляемых языков в неуправляемые by Platonov Sergey
Конверсия управляемых языков в неуправляемыеКонверсия управляемых языков в неуправляемые
Конверсия управляемых языков в неуправляемые
Platonov Sergey973 views

Similar to Let’s talk about microbenchmarking

Tema3_Introduction_to_CUDA_C.pdf by
Tema3_Introduction_to_CUDA_C.pdfTema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdfpepe464163
1 view69 slides
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl... by
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...Andrey Karpov
931 views39 slides
I dont know what is wrong with this roulette program I cant seem.pdf by
I dont know what is wrong with this roulette program I cant seem.pdfI dont know what is wrong with this roulette program I cant seem.pdf
I dont know what is wrong with this roulette program I cant seem.pdfarchanaemporium
3 views23 slides
Whats new in_csharp4 by
Whats new in_csharp4Whats new in_csharp4
Whats new in_csharp4Abed Bukhari
942 views26 slides
JVM code reading -- C2 by
JVM code reading -- C2JVM code reading -- C2
JVM code reading -- C2ytoshima
2.1K views106 slides
PVS-Studio, a solution for resource intensive applications development by
PVS-Studio, a solution for resource intensive applications developmentPVS-Studio, a solution for resource intensive applications development
PVS-Studio, a solution for resource intensive applications developmentOOO "Program Verification Systems"
1.4K views37 slides

Similar to Let’s talk about microbenchmarking(20)

Tema3_Introduction_to_CUDA_C.pdf by pepe464163
Tema3_Introduction_to_CUDA_C.pdfTema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdf
pepe4641631 view
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl... by Andrey Karpov
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...
Andrey Karpov931 views
I dont know what is wrong with this roulette program I cant seem.pdf by archanaemporium
I dont know what is wrong with this roulette program I cant seem.pdfI dont know what is wrong with this roulette program I cant seem.pdf
I dont know what is wrong with this roulette program I cant seem.pdf
archanaemporium3 views
Whats new in_csharp4 by Abed Bukhari
Whats new in_csharp4Whats new in_csharp4
Whats new in_csharp4
Abed Bukhari942 views
JVM code reading -- C2 by ytoshima
JVM code reading -- C2JVM code reading -- C2
JVM code reading -- C2
ytoshima2.1K views
#include stdafx.h using namespace std; #include stdlib.h.docx by ajoy21
#include stdafx.h using namespace std; #include stdlib.h.docx#include stdafx.h using namespace std; #include stdlib.h.docx
#include stdafx.h using namespace std; #include stdlib.h.docx
ajoy213 views
ぐだ生 Java入門第三回(文字コードの話)(Keynote版) by Makoto Yamazaki
ぐだ生 Java入門第三回(文字コードの話)(Keynote版)ぐだ生 Java入門第三回(文字コードの話)(Keynote版)
ぐだ生 Java入門第三回(文字コードの話)(Keynote版)
Makoto Yamazaki1.2K views
Covering a function using a Dynamic Symbolic Execution approach by Jonathan Salwan
Covering a function using a Dynamic Symbolic Execution approach Covering a function using a Dynamic Symbolic Execution approach
Covering a function using a Dynamic Symbolic Execution approach
Jonathan Salwan888 views
LSFMM 2019 BPF Observability by Brendan Gregg
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
Brendan Gregg8.3K views
COSCUP2023 RSA256 Verilator.pdf by Yodalee
COSCUP2023 RSA256 Verilator.pdfCOSCUP2023 RSA256 Verilator.pdf
COSCUP2023 RSA256 Verilator.pdf
Yodalee228 views
54602399 c-examples-51-to-108-programe-ee01083101 by premrings
54602399 c-examples-51-to-108-programe-ee0108310154602399 c-examples-51-to-108-programe-ee01083101
54602399 c-examples-51-to-108-programe-ee01083101
premrings2.7K views
Score (smart contract for icon) by Doyun Hwang
Score (smart contract for icon) Score (smart contract for icon)
Score (smart contract for icon)
Doyun Hwang148 views
とある断片の超動的言語 by Kiyotaka Oku
とある断片の超動的言語とある断片の超動的言語
とある断片の超動的言語
Kiyotaka Oku522 views

More from Andrey Akinshin

Поговорим про performance-тестирование by
Поговорим про performance-тестированиеПоговорим про performance-тестирование
Поговорим про performance-тестированиеAndrey Akinshin
368 views105 slides
Сложности performance-тестирования by
Сложности performance-тестированияСложности performance-тестирования
Сложности performance-тестированияAndrey Akinshin
277 views91 slides
Сложности микробенчмаркинга by
Сложности микробенчмаркингаСложности микробенчмаркинга
Сложности микробенчмаркингаAndrey Akinshin
521 views66 slides
Поговорим про память by
Поговорим про памятьПоговорим про память
Поговорим про памятьAndrey Akinshin
332 views116 slides
Кроссплатформенный .NET и как там дела с Mono и CoreCLR by
Кроссплатформенный .NET и как там дела с Mono и CoreCLRКроссплатформенный .NET и как там дела с Mono и CoreCLR
Кроссплатформенный .NET и как там дела с Mono и CoreCLRAndrey Akinshin
322 views32 slides
Теория и практика .NET-бенчмаркинга (25.01.2017, Москва) by
 Теория и практика .NET-бенчмаркинга (25.01.2017, Москва) Теория и практика .NET-бенчмаркинга (25.01.2017, Москва)
Теория и практика .NET-бенчмаркинга (25.01.2017, Москва)Andrey Akinshin
3.6K views153 slides

More from Andrey Akinshin(20)

Поговорим про performance-тестирование by Andrey Akinshin
Поговорим про performance-тестированиеПоговорим про performance-тестирование
Поговорим про performance-тестирование
Andrey Akinshin368 views
Сложности performance-тестирования by Andrey Akinshin
Сложности performance-тестированияСложности performance-тестирования
Сложности performance-тестирования
Andrey Akinshin277 views
Сложности микробенчмаркинга by Andrey Akinshin
Сложности микробенчмаркингаСложности микробенчмаркинга
Сложности микробенчмаркинга
Andrey Akinshin521 views
Поговорим про память by Andrey Akinshin
Поговорим про памятьПоговорим про память
Поговорим про память
Andrey Akinshin332 views
Кроссплатформенный .NET и как там дела с Mono и CoreCLR by Andrey Akinshin
Кроссплатформенный .NET и как там дела с Mono и CoreCLRКроссплатформенный .NET и как там дела с Mono и CoreCLR
Кроссплатформенный .NET и как там дела с Mono и CoreCLR
Andrey Akinshin322 views
Теория и практика .NET-бенчмаркинга (25.01.2017, Москва) by Andrey Akinshin
 Теория и практика .NET-бенчмаркинга (25.01.2017, Москва) Теория и практика .NET-бенчмаркинга (25.01.2017, Москва)
Теория и практика .NET-бенчмаркинга (25.01.2017, Москва)
Andrey Akinshin3.6K views
Продолжаем говорить про арифметику by Andrey Akinshin
Продолжаем говорить про арифметикуПродолжаем говорить про арифметику
Продолжаем говорить про арифметику
Andrey Akinshin676 views
Теория и практика .NET-бенчмаркинга (02.11.2016, Екатеринбург) by Andrey Akinshin
Теория и практика .NET-бенчмаркинга (02.11.2016, Екатеринбург)Теория и практика .NET-бенчмаркинга (02.11.2016, Екатеринбург)
Теория и практика .NET-бенчмаркинга (02.11.2016, Екатеринбург)
Andrey Akinshin1.4K views
Поговорим про арифметику by Andrey Akinshin
Поговорим про арифметикуПоговорим про арифметику
Поговорим про арифметику
Andrey Akinshin120.8K views
Подружили CLR и JVM в Project Rider by Andrey Akinshin
Подружили CLR и JVM в Project RiderПодружили CLR и JVM в Project Rider
Подружили CLR и JVM в Project Rider
Andrey Akinshin794 views
Что нам готовит грядущий C#7? by Andrey Akinshin
Что нам готовит грядущий C#7?Что нам готовит грядущий C#7?
Что нам готовит грядущий C#7?
Andrey Akinshin376 views
Продолжаем говорить о микрооптимизациях .NET-приложений by Andrey Akinshin
Продолжаем говорить о микрооптимизациях .NET-приложенийПродолжаем говорить о микрооптимизациях .NET-приложений
Продолжаем говорить о микрооптимизациях .NET-приложений
Andrey Akinshin665 views
Распространённые ошибки оценки производительности .NET-приложений by Andrey Akinshin
Распространённые ошибки оценки производительности .NET-приложенийРаспространённые ошибки оценки производительности .NET-приложений
Распространённые ошибки оценки производительности .NET-приложений
Andrey Akinshin1K views
Поговорим о микрооптимизациях .NET-приложений by Andrey Akinshin
Поговорим о микрооптимизациях .NET-приложенийПоговорим о микрооптимизациях .NET-приложений
Поговорим о микрооптимизациях .NET-приложений
Andrey Akinshin855 views
Практические приёмы оптимизации .NET-приложений by Andrey Akinshin
Практические приёмы оптимизации .NET-приложенийПрактические приёмы оптимизации .NET-приложений
Практические приёмы оптимизации .NET-приложений
Andrey Akinshin2.4K views
Поговорим о различных версиях .NET by Andrey Akinshin
Поговорим о различных версиях .NETПоговорим о различных версиях .NET
Поговорим о различных версиях .NET
Andrey Akinshin726 views
Низкоуровневые оптимизации .NET-приложений by Andrey Akinshin
Низкоуровневые оптимизации .NET-приложенийНизкоуровневые оптимизации .NET-приложений
Низкоуровневые оптимизации .NET-приложений
Andrey Akinshin3.8K views
Основы работы с Git by Andrey Akinshin
Основы работы с GitОсновы работы с Git
Основы работы с Git
Andrey Akinshin3.8K views
Сборка мусора в .NET by Andrey Akinshin
Сборка мусора в .NETСборка мусора в .NET
Сборка мусора в .NET
Andrey Akinshin3.9K views
Об особенностях использования значимых типов в .NET by Andrey Akinshin
Об особенностях использования значимых типов в .NETОб особенностях использования значимых типов в .NET
Об особенностях использования значимых типов в .NET
Andrey Akinshin3.5K views

Recently uploaded

Vertical User Stories by
Vertical User StoriesVertical User Stories
Vertical User StoriesMoisés Armani Ramírez
14 views16 slides
Voice Logger - Telephony Integration Solution at Aegis by
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at AegisNirmal Sharma
39 views1 slide
PRODUCT PRESENTATION.pptx by
PRODUCT PRESENTATION.pptxPRODUCT PRESENTATION.pptx
PRODUCT PRESENTATION.pptxangelicacueva6
15 views1 slide
Mini-Track: AI and ML in Network Operations Applications by
Mini-Track: AI and ML in Network Operations ApplicationsMini-Track: AI and ML in Network Operations Applications
Mini-Track: AI and ML in Network Operations ApplicationsNetwork Automation Forum
10 views24 slides
Melek BEN MAHMOUD.pdf by
Melek BEN MAHMOUD.pdfMelek BEN MAHMOUD.pdf
Melek BEN MAHMOUD.pdfMelekBenMahmoud
14 views1 slide
Scaling Knowledge Graph Architectures with AI by
Scaling Knowledge Graph Architectures with AIScaling Knowledge Graph Architectures with AI
Scaling Knowledge Graph Architectures with AIEnterprise Knowledge
38 views15 slides

Recently uploaded(20)

Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma39 views
SAP Automation Using Bar Code and FIORI.pdf by Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf
Serverless computing with Google Cloud (2023-24) by wesley chun
Serverless computing with Google Cloud (2023-24)Serverless computing with Google Cloud (2023-24)
Serverless computing with Google Cloud (2023-24)
wesley chun11 views
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson92 views
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf by Dr. Jimmy Schwarzkopf
STKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdfSTKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdf
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf
Data Integrity for Banking and Financial Services by Precisely
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial Services
Precisely25 views
6g - REPORT.pdf by Liveplex
6g - REPORT.pdf6g - REPORT.pdf
6g - REPORT.pdf
Liveplex10 views
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by sugiuralab
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors
sugiuralab21 views

Let’s talk about microbenchmarking

  • 1. Let’s talk about microbenchmarking Andrey Akinshin, JetBrains DotNext 2016 Helsinki 1/29
  • 3. Today’s environment • BenchmarkDotNet v0.10.1 • Haswell Core i7-4702MQ CPU 2.20GHz, Windows 10 • .NET Framework 4.6.2 + clrjit/compatjit-v4.6.1586.0 • Source code: https://git.io/v1RRX 3/29
  • 4. Today’s environment • BenchmarkDotNet v0.10.1 • Haswell Core i7-4702MQ CPU 2.20GHz, Windows 10 • .NET Framework 4.6.2 + clrjit/compatjit-v4.6.1586.0 • Source code: https://git.io/v1RRX Other environments: C# compiler old csc / Roslyn CLR CLR2 / CLR4 / CoreCLR / Mono OS Windows / Linux / MacOS / FreeBSD JIT LegacyJIT-x86 / LegacyJIT-x64 / RyuJIT-x64 GC MS (different CLRs) / Mono (Boehm/Sgen) Toolchain JIT / NGen / .NET Native Hardware ∞ different configurations . . . . . . And don’t forget about multiple versions 3/29
  • 5. Count of iterations A bad benchmark // Resolution (Stopwatch) = 466 ns // Latency (Stopwatch) = 18 ns var sw = Stopwatch.StartNew(); Foo(); // 100 ns sw.Stop(); WriteLine(sw.ElapsedMilliseconds); 4/29
  • 6. Count of iterations A bad benchmark // Resolution (Stopwatch) = 466 ns // Latency (Stopwatch) = 18 ns var sw = Stopwatch.StartNew(); Foo(); // 100 ns sw.Stop(); WriteLine(sw.ElapsedMilliseconds); A better benchmark var sw = Stopwatch.StartNew(); for (int i = 0; i < N; i++) // (N * 100 + eps) ns Foo(); // 100 ns sw.Stop(); var total = sw.ElapsedTicks / Stopwatch.Frequency; WriteLine(total / N); 4/29
  • 7. Several launches Run 01 : 529.8674 ns/op Run 02 : 532.7541 ns/op Run 03 : 558.7448 ns/op Run 04 : 555.6647 ns/op Run 05 : 539.6401 ns/op Run 06 : 539.3494 ns/op Run 07 : 564.3222 ns/op Run 08 : 551.9544 ns/op Run 09 : 550.1608 ns/op Run 10 : 533.0634 ns/op 5/29
  • 9. A simple case Central limit theorem to the rescue! 7/29
  • 11. Latencies Event Latency 1 CPU cycle 0.3 ns Level 1 cache access 0.9 ns Level 2 cache access 2.8 ns Level 3 cache access 12.9 ns Main memory access 120 ns Solid-state disk I/O 50-150 µs Rotational disk I/O 1-10 ms Internet: SF to NYC 40 ms Internet: SF to UK 81 ms Internet: SF to Australia 183 ms OS virtualization reboot 4 sec Hardware virtualization reboot 40 sec Physical system reboot 5 min © Systems Performance: Enterprise and the Cloud 9/29
  • 12. Sum of elements const int N = 1024; int[,] a = new int[N, N]; [Benchmark] public double SumIJ() { var sum = 0; for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) sum += a[i, j]; return sum; } [Benchmark] public double SumJI() { var sum = 0; for (int j = 0; j < N; j++) for (int i = 0; i < N; i++) sum += a[i, j]; return sum; } 10/29
  • 13. Sum of elements const int N = 1024; int[,] a = new int[N, N]; [Benchmark] public double SumIJ() { var sum = 0; for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) sum += a[i, j]; return sum; } [Benchmark] public double SumJI() { var sum = 0; for (int j = 0; j < N; j++) for (int i = 0; i < N; i++) sum += a[i, j]; return sum; } CPU cache effect: SumIJ SumJI LegacyJIT-x86 ≈1.3ms ≈4.0ms 10/29
  • 14. Cache-sensitive benchmarks Let’s run a benchmark several times: int[] x = new int[128 * 1024 * 1024]; for (int iter = 0; iter < 5; iter++) { var sw = Stopwatch.StartNew(); for (int i = 0; i < x.Length; i += 16) x[i]++; sw.Stop(); WriteLine(sw.ElapsedMilliseconds); } 11/29
  • 15. Cache-sensitive benchmarks Let’s run a benchmark several times: int[] x = new int[128 * 1024 * 1024]; for (int iter = 0; iter < 5; iter++) { var sw = Stopwatch.StartNew(); for (int i = 0; i < x.Length; i += 16) x[i]++; sw.Stop(); WriteLine(sw.ElapsedMilliseconds); } 176 // not warmed 81 // still not warmed 62 // the steady state 62 // the steady state 62 // the steady state Warmup is not only about .NET 11/29
  • 16. Branch prediction const int N = 32767; int[] sorted, unsorted; // random numbers [0..255] private static int Sum(int[] data) { int sum = 0; for (int i = 0; i < N; i++) if (data[i] >= 128) sum += data[i]; return sum; } [Benchmark] public int Sorted() { return Sum(sorted); } [Benchmark] public int Unsorted() { return Sum(unsorted); } 12/29
  • 17. Branch prediction const int N = 32767; int[] sorted, unsorted; // random numbers [0..255] private static int Sum(int[] data) { int sum = 0; for (int i = 0; i < N; i++) if (data[i] >= 128) sum += data[i]; return sum; } [Benchmark] public int Sorted() { return Sum(sorted); } [Benchmark] public int Unsorted() { return Sum(unsorted); } Sorted Unsorted LegacyJIT-x86 ≈20µs ≈139µs 12/29
  • 18. Isolation A bad benchmark var sw1 = Stopwatch.StartNew(); Foo(); sw1.Stop(); var sw2 = Stopwatch.StartNew(); Bar(); sw2.Stop(); 13/29
  • 19. Isolation A bad benchmark var sw1 = Stopwatch.StartNew(); Foo(); sw1.Stop(); var sw2 = Stopwatch.StartNew(); Bar(); sw2.Stop(); In general case, you should run each benchmark in his own process. Remember about: • Interface method dispatch • Garbage collector and autotuning • Conditional jitting 13/29
  • 20. Interface method dispatch private interface IInc { double Inc(double x); } private class Foo : IInc { double Inc(double x) => x + 1; } private class Bar : IInc { double Inc(double x) => x + 1; } private double Run(IInc inc) { double sum = 0; for (int i = 0; i < 1001; i++) sum += inc.Inc(0); return sum; } // Which method is faster? [Benchmark] public double FooFoo() { var foo1 = new Foo(); var foo2 = new Foo(); return Run(foo1) + Run(foo2); } [Benchmark] public double FooBar() { var foo = new Foo(); var bar = new Bar(); return Run(foo) + Run(bar); } 14/29
  • 21. Interface method dispatch private interface IInc { double Inc(double x); } private class Foo : IInc { double Inc(double x) => x + 1; } private class Bar : IInc { double Inc(double x) => x + 1; } private double Run(IInc inc) { double sum = 0; for (int i = 0; i < 1001; i++) sum += inc.Inc(0); return sum; } // Which method is faster? [Benchmark] public double FooFoo() { var foo1 = new Foo(); var foo2 = new Foo(); return Run(foo1) + Run(foo2); } [Benchmark] public double FooBar() { var foo = new Foo(); var bar = new Bar(); return Run(foo) + Run(bar); } FooFoo FooBar LegacyJIT-x64 ≈5.4µs ≈7.1µs 14/29
  • 22. Tricky inlining [Benchmark] int Calc() => WithoutStarg(0x11) + WithStarg(0x12); int WithoutStarg(int value) => value; int WithStarg(int value) { if (value < 0) value = -value; return value; } 15/29
  • 23. Tricky inlining [Benchmark] int Calc() => WithoutStarg(0x11) + WithStarg(0x12); int WithoutStarg(int value) => value; int WithStarg(int value) { if (value < 0) value = -value; return value; } LegacyJIT-x86 LegacyJIT-x64 RyuJIT-x64 ≈1.7ns 0 ≈1.7ns 15/29
  • 24. Tricky inlining [Benchmark] int Calc() => WithoutStarg(0x11) + WithStarg(0x12); int WithoutStarg(int value) => value; int WithStarg(int value) { if (value < 0) value = -value; return value; } LegacyJIT-x86 LegacyJIT-x64 RyuJIT-x64 ≈1.7ns 0 ≈1.7ns ; LegacyJIT-x64 : Inlining succeeded mov ecx,23h ret 15/29
  • 25. Tricky inlining [Benchmark] int Calc() => WithoutStarg(0x11) + WithStarg(0x12); int WithoutStarg(int value) => value; int WithStarg(int value) { if (value < 0) value = -value; return value; } LegacyJIT-x86 LegacyJIT-x64 RyuJIT-x64 ≈1.7ns 0 ≈1.7ns ; LegacyJIT-x64 : Inlining succeeded mov ecx,23h ret // RyuJIT-x64 : Inlining failed // Inline expansion aborted due to opcode // [06] OP_starg.s in method // Program:WithStarg(int):int:this 15/29
  • 26. SIMD struct MyVector // Copy-pasted from System.Numerics.Vector4 { public float X, Y, Z, W; public MyVector(float x, float y, float z, float w) { X = x; Y = y; Z = z; W = w; } [MethodImpl(MethodImplOptions.AggressiveInlining)] public static MyVector operator *(MyVector left, MyVector right) { return new MyVector(left.X * right.X, left.Y * right.Y, left.Z * right.Z, left.W * right.W); } } Vector4 vector1, vector2, vector3; MyVector myVector1, myVector2, myVector3; [Benchmark] void MyMul() => myVector3 = myVector1 * myVector2; [Benchmark] void BclMul() => vector3 = vector1 * vector2; 16/29
  • 27. SIMD struct MyVector // Copy-pasted from System.Numerics.Vector4 { public float X, Y, Z, W; public MyVector(float x, float y, float z, float w) { X = x; Y = y; Z = z; W = w; } [MethodImpl(MethodImplOptions.AggressiveInlining)] public static MyVector operator *(MyVector left, MyVector right) { return new MyVector(left.X * right.X, left.Y * right.Y, left.Z * right.Z, left.W * right.W); } } Vector4 vector1, vector2, vector3; MyVector myVector1, myVector2, myVector3; [Benchmark] void MyMul() => myVector3 = myVector1 * myVector2; [Benchmark] void BclMul() => vector3 = vector1 * vector2; LegacyJIT-x64 RyuJIT-x64 MyMul ≈12.9ns ≈2.5ns BclMul ≈12.9ns ≈0.2ns 16/29
  • 28. How so? LegacyJIT-x64 RyuJIT-x64 MyMul ≈12.9ns ≈2.5ns BclMul ≈12.9ns ≈0.2ns ; LegacyJIT-x64 ; MyMul, BclMul: Na¨ıve SSE ; ... movss xmm3,dword ptr [rsp+40h] mulss xmm3,dword ptr [rsp+30h] movss xmm2,dword ptr [rsp+44h] mulss xmm2,dword ptr [rsp+34h] movss xmm1,dword ptr [rsp+48h] mulss xmm1,dword ptr [rsp+38h] movss xmm0,dword ptr [rsp+4Ch] mulss xmm0,dword ptr [rsp+3Ch] xor eax,eax mov qword ptr [rsp],rax mov qword ptr [rsp+8],rax lea rax,[rsp] movss dword ptr [rax],xmm3 movss dword ptr [rax+4],xmm2 ; ... ; RyuJIT-x64 ; MyMul: Na¨ıve AVX ; ... vmulss xmm0,xmm0,xmm4 vmulss xmm1,xmm1,xmm5 vmulss xmm2,xmm2,xmm6 vmulss xmm3,xmm3,xmm7 ; ... ; BclMul: Smart AVX intrinsic vmovupd xmm0,xmmword ptr [rcx+8] vmovupd xmm1,xmmword ptr [rcx+18h] vmulps xmm0,xmm0,xmm1 vmovupd xmmword ptr [rcx+28h],xmm0 17/29
  • 29. Let’s calculate some square roots [Benchmark] double Sqrt13() => Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + /* ... */ + Math.Sqrt(13); VS [Benchmark] double Sqrt14() => Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + /* ... */ + Math.Sqrt(13) + Math.Sqrt(14); 18/29
  • 30. Let’s calculate some square roots [Benchmark] double Sqrt13() => Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + /* ... */ + Math.Sqrt(13); VS [Benchmark] double Sqrt14() => Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + /* ... */ + Math.Sqrt(13) + Math.Sqrt(14); RyuJIT-x64∗ Sqrt13 ≈91ns Sqrt14 0 ns ∗Can be changed in future versions, see github.com/dotnet/coreclr/issues/987 18/29
  • 31. How so? RyuJIT-x64, Sqrt13 vsqrtsd xmm0,xmm0,mmword ptr [7FF94F9E4D28h] vsqrtsd xmm1,xmm0,mmword ptr [7FF94F9E4D30h] vaddsd xmm0,xmm0,xmm1 vsqrtsd xmm1,xmm0,mmword ptr [7FF94F9E4D38h] vaddsd xmm0,xmm0,xmm1 vsqrtsd xmm1,xmm0,mmword ptr [7FF94F9E4D40h] vaddsd xmm0,xmm0,xmm1 ; A lot of vqrtsd and vaddsd instructions ; ... vsqrtsd xmm1,xmm0,mmword ptr [7FF94F9E4D88h] vaddsd xmm0,xmm0,xmm1 ret RyuJIT-x64, Sqrt14 vmovsd xmm0,qword ptr [7FF94F9C4C80h] ; Const ret 19/29
  • 32. How so? Big expression tree * stmtExpr void (top level) (IL 0x000... ???) | /--* mathFN double sqrt | | --* dconst double 13.000000000000000 | /--* + double | | | /--* mathFN double sqrt | | | | --* dconst double 12.000000000000000 | | --* + double | | | /--* mathFN double sqrt | | | | --* dconst double 11.000000000000000 | | --* + double | | | /--* mathFN double sqrt | | | | --* dconst double 10.000000000000000 | | --* + double | | | /--* mathFN double sqrt | | | | --* dconst double 9.0000000000000000 | | --* + double | | | /--* mathFN double sqrt | | | | --* dconst double 8.0000000000000000 | | --* + double | | | /--* mathFN double sqrt | | | | --* dconst double 7.0000000000000000 | | --* + double | | | /--* mathFN double sqrt | | | | --* dconst double 6.0000000000000000 | | --* + double | | | /--* mathFN double sqrt | | | | --* dconst double 5.0000000000000000 // ... 20/29
  • 33. How so? Constant folding in action N001 [000001] dconst 1.0000000000000000 => $c0 {DblCns[1.000000]} N002 [000002] mathFN => $c0 {DblCns[1.000000]} N003 [000003] dconst 2.0000000000000000 => $c1 {DblCns[2.000000]} N004 [000004] mathFN => $c2 {DblCns[1.414214]} N005 [000005] + => $c3 {DblCns[2.414214]} N006 [000006] dconst 3.0000000000000000 => $c4 {DblCns[3.000000]} N007 [000007] mathFN => $c5 {DblCns[1.732051]} N008 [000008] + => $c6 {DblCns[4.146264]} N009 [000009] dconst 4.0000000000000000 => $c7 {DblCns[4.000000]} N010 [000010] mathFN => $c1 {DblCns[2.000000]} N011 [000011] + => $c8 {DblCns[6.146264]} N012 [000012] dconst 5.0000000000000000 => $c9 {DblCns[5.000000]} N013 [000013] mathFN => $ca {DblCns[2.236068]} N014 [000014] + => $cb {DblCns[8.382332]} N015 [000015] dconst 6.0000000000000000 => $cc {DblCns[6.000000]} N016 [000016] mathFN => $cd {DblCns[2.449490]} N017 [000017] + => $ce {DblCns[10.831822]} N018 [000018] dconst 7.0000000000000000 => $cf {DblCns[7.000000]} N019 [000019] mathFN => $d0 {DblCns[2.645751]} N020 [000020] + => $d1 {DblCns[13.477573]} ... 21/29
  • 37. False sharing in action // It's an extremely na¨ıve benchmark // Don't try this at home int[] x = new int[1024]; void Inc(int p) { for (int i = 0; i < 10000001; i++) x[p]++; } void Run(int step) { var sw = Stopwatch.StartNew(); Task.WaitAll( Task.Factory.StartNew(() => Inc(0 * step)), Task.Factory.StartNew(() => Inc(1 * step)), Task.Factory.StartNew(() => Inc(2 * step)), Task.Factory.StartNew(() => Inc(3 * step))); WriteLine(sw.ElapsedMilliseconds); } 25/29
  • 38. False sharing in action // It's an extremely na¨ıve benchmark // Don't try this at home int[] x = new int[1024]; void Inc(int p) { for (int i = 0; i < 10000001; i++) x[p]++; } void Run(int step) { var sw = Stopwatch.StartNew(); Task.WaitAll( Task.Factory.StartNew(() => Inc(0 * step)), Task.Factory.StartNew(() => Inc(1 * step)), Task.Factory.StartNew(() => Inc(2 * step)), Task.Factory.StartNew(() => Inc(3 * step))); WriteLine(sw.ElapsedMilliseconds); } Run(1) Run(256) ≈400ms ≈150ms 25/29
  • 39. Conclusion: Benchmarking is hard Anon et al., “A Measure of Transaction Processing Power” There are lies, damn lies and then there are performance measures. 26/29