WHERETHEWILD
THINGS ARE
Benchmarking and
Micro-Optimisations
Matt Warren
@matthewwarren
http://mattwarren.org/
Premature Optimization
“We should forget about small efficiencies, say
about 97% of the time: premature
optimization is the root of all evil.Yet we
should not pass up our opportunities in that
critical 3%.“
- Donald Knuth
ProfilingTools
• ANTS Performance Profiler - Redgate
• dotTrace & dotMemory - Jet Brains
• PerfView - Microsoft (free)
• Visual Studio Profiling Tools (Ultimate, Premium or Professional)
• MiniProfiler - Stack Overflow (free)
BenchmarkDotNet
Why do you need a
benchmarking library?
static void Profile(int iterations, Action action)
{
action(); // warm up
GC.Collect(); // clean up
var watch = new Stopwatch();
watch.Start();
for (int i = 0; i < iterations; i++)
{
action();
}
watch.Stop();
Console.WriteLine("Time Elapsed {0} ms", watch.ElapsedMilliseconds);
}
Benchmarking small code samples in C#, can this implementation be improved?
http://stackoverflow.com/q/1047218/4500
private static T Result;
static void Profile<T>(int iterations, Func<T> func)
{
func(); // warm up
GC.Collect(); // clean up
var watch = new Stopwatch();
watch.Start();
for (int i = 0; i < iterations; i++)
{
Result = func();
}
watch.Stop();
Console.WriteLine("Time Elapsed {0} ms", watch.ElapsedMilliseconds);
}
Benchmarking small code samples in C#, can this implementation be improved?
http://stackoverflow.com/q/1047218/4500
BenchmarkDotNet project
Andrey Akinshin (the ‘Boss’)
@andrey_akinshin
http://aakinshin.net/en/blog/
Matt Warren (me)
Adam Sitnik (.NET Core guru)
@SitnikAdam
http://adamsitnik.com/
.NET
Foundation
Goals of BenchmarkDotNet
Benchmarking library that is:
•Accurate
•Easy-to-use
•Helpful
Benchmarking library that is:
•Accurate
•Easy-to-use
•Helpful
Stopwatch under the hood http://aakinshin.net/en/blog/dotnet/stopwatch/
LegacyJIT-x86 and first method call http://aakinshin.net/en/blog/dotnet/legacyjitx86-and-first-method-call/
Goals of BenchmarkDotNet
Proper docs!
benchmarkdotnet.org/
What BenchmarkDotNet doesn’t do
•Multi-threaded benchmarks
•Integrate with C.I builds
•Unit test runner integration
•Anything else?
http://github.com/dotnet/BenchmarkDotNet/issues/
“Other Benchmarking tools are available”
• NBench
• https://github.com/petabridge/NBench
• Microsoft Xunit performance
• http://github.com/Microsoft/xunit-performance/
• Lambda Micro Benchmarking (“Clash of the Lambdas”)
• https://github.com/biboudis/LambdaMicrobenchmarking
• Etimo.Benchmarks
• http://etimo.se/blog/etimo-benchmarks-lightweight-net-benchmark-tool/
• MeasureIt
• https://blogs.msdn.microsoft.com/vancem/2009/02/06/measureit-update-tool-for-
doing-microbenchmarks-for-net/
How it works
An invocation of the target method is an operation.
A bunch of operations is an iteration.
Iteration types:
• Pilot:The best operation count will be chosen.
• IdleWarmup, IdleTarget: BenchmarkDotNet overhead will be evaluated.
• MainWarmup:Warmup of the main method.
• MainTarget: Main measurements.
• Result = MainTarget – AverageOverhead
http://benchmarkdotnet.org/HowItWorks.htm
What happens under the covers?
Image credit Albert Rodríguez @UncleFirefox
DEMO
‘Hello World’ Benchmark
Scale of benchmarks
•millisecond - ms
• One thousandth of one second, single webapp request
•microsecond - us or µs
• One millionth of one second, several in-memory operations
•nanosecond - ns
• One billionth of one second, single operations
Who ‘times’ the timers?
[Benchmark]
public long StopwatchLatency()
{
return Stopwatch.GetTimestamp();
}
[Benchmark]
public long StopwatchGranularity()
{
// Loop until Stopwatch.GetTimestamp()
// gives us a different value
long lastTimestamp =
Stopwatch.GetTimestamp();
while (Stopwatch.GetTimestamp() ==
lastTimestamp)
{
}
return lastTimestamp;
}
[Benchmark]
public long DateTimeLatency()
{
return DateTime.Now.Ticks;
}
[Benchmark]
public long DateTimeGranularity()
{
// Loop until DateTime.Now
// gives us a different value
long lastTimestamp = DateTime.Now.Ticks;
while (DateTime.Now.Ticks == lastTimestamp)
{
}
return lastTimestamp;
}
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
Job-FIDMNL : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
Method | Mean | StdDev | Allocated |
--------------------- |---------------- |------------ |---------- |
StopwatchLatency | ?? ns | ?? ns | ?? B |
StopwatchGranularity | ?? ns | ?? ns | ?? B |
DateTimeLatency | ?? ns | ?? ns | ?? B |
DateTimeGranularity | ?? ns | ?? ns | ?? B |
Who ‘times’ the timers?
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
Job-FIDMNL : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
Method | Mean | StdDev | Allocated |
--------------------- |---------------- |------------ |---------- |
StopwatchLatency | 12.9960 ns | 0.1609 ns | 0 B |
StopwatchGranularity | 374.3049 ns | 2.4388 ns | 0 B |
DateTimeLatency | 682.2320 ns | 8.9341 ns | 32 B |
DateTimeGranularity | 996,025.6492 ns | 413.9175 ns | 47.34 kB |
Who ‘times’ the timers?
Loop-the-Loop
”Avoid foreach loop on everything except raw arrays?”
[Benchmark(Baseline = true)]
public int ForLoopArray()
{
var counter = 0;
for (int i = 0; i < anArray.Length; i++)
counter += anArray[i];
return counter;
}
[Benchmark]
public int ForEachArray()
{
var counter = 0;
foreach (var i in anArray)
counter += i;
return counter;
}
[Benchmark]
public int ForLoopList()
{
var counter = 0;
for (int i = 0; i < aList.Count; i++)
counter += aList[i];
return counter;
}
[Benchmark]
public int ForEachList()
{
var counter = 0;
foreach (var i in aList)
counter += i;
return counter;
}
Loop-the-Loop
”Avoid foreach loop on everything except raw arrays?”
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
Method | Mean | StdDev | Scaled | Scaled-StdDev |
--------------- |-------------- |------------ |------- |-------------- |
ForLoopArray | ?? ns | | ?? | |
ForEachArray | ?? ns | | ?? | |
ForLoopList | ?? ns | | ?? | |
ForEachList | ?? ns | | ?? | |
Loop-the-Loop
”Avoid foreach loop on everything except raw arrays?”
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
Method | Mean | StdDev | Scaled | Scaled-StdDev |
--------------- |-------------- |------------ |------- |-------------- |
ForLoopArray | 383.8279 ns | 2.9472 ns | 1.00 | 0.00 |
ForEachArray | 392.5611 ns | 4.1286 ns | 1.02 | 0.01 |
ForLoopList | 2,315.9658 ns | 12.1001 ns | 6.03 | 0.05 |
ForEachList | 2,663.5771 ns | 21.9822 ns | 6.94 | 0.08 |
Loop-the-Loop – ‘for loop’ - Arrays
Loop-the-Loop – ‘for loop’ - Lists
Abstractions - IDictionary v Dictionary
Dictionary<string, string> dictionary =
new Dictionary<string, string>();
IDictionary<string, string> iDictionary =
(IDictionary<string, string>)dictionary;
[Benchmark]
public Dictionary<string, string> DictionaryEnumeration()
{
foreach (var item in dictionary) { ; }
return dictionary;
}
[Benchmark]
public IDictionary<string, string> IDictionaryEnumeration()
{
foreach (var item in iDictionary) { ; }
return iDictionary;
}
Abstractions - IDictionary v Dictionary
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
Method | Mean | StdErr | StdDev | Gen 0 | Allocated |
----------------------- |----------- |---------- |---------- |------- |---------- |
DictionaryEnumeration | ?? ns | ?? ns | ?? ns | ?? | ?? B |
IDictionaryEnumeration | ?? ns | ?? ns | ?? ns | ?? | ?? B |
// * Diagnostic Output - MemoryDiagnoser *
Note: the Gen 0/1/2 Measurements are per 1k Operations
Abstractions - IDictionary v Dictionary
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
Method | Mean | StdErr | StdDev | Gen 0 | Allocated |
----------------------- |----------- |---------- |---------- |------- |---------- |
DictionaryEnumeration | 24.0353 ns | 0.2403 ns | 0.9307 ns | - | 0 B |
IDictionaryEnumeration | 41.6301 ns | 0.4479 ns | 2.1944 ns | 0.0086 | 32 B |
// * Diagnostic Output - MemoryDiagnoser *
Note: the Gen 0/1/2 Measurements are per 1k Operations
Abstractions - IDictionary v Dictionary
Dictionary<string, string> dictionary =
new Dictionary<string, string>();
IDictionary<string, string> iDictionary =
(IDictionary<string, string>)dictionary;
// struct – so doesn't allocate
Dictionary<string, string>.Enumerator enumerator =
dictionary.GetEnumerator();
// interface - allocates 56 B (64-bit) and 32 B (32-bit)
IEnumerator<KeyValuePair<string, string>> enumerator =
iDictionary.GetEnumerator();
Low-level increments
[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job]
public class Program
{
private double a, b, c, d;
[Benchmark(OperationsPerInvoke = 4)]
public void MethodA()
{
a++; b++; c++; d++;
}
[Benchmark(OperationsPerInvoke = 4)]
public void MethodB()
{
a++; a++; a++; a++;
}
}
Low-level increments
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1590.0;compatjit-v4.6.1590.0
LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1590.0
Runtime=Clr Allocated=0 B
Method | Job | Jit | Platform | Mean | StdErr | StdDev |
----------- |------------- |---------- |--------- |---------- |---------- |---------- |
Parallel | LegacyJitX64 | LegacyJit | X64 | ?? ns | ?? ns | ?? ns |
Sequential | LegacyJitX64 | LegacyJit | X64 | ?? ns | ?? ns | ?? ns |
Parallel | LegacyJitX86 | LegacyJit | X86 | ?? ns | ?? ns | ?? ns |
Sequential | LegacyJitX86 | LegacyJit | X86 | ?? ns | ?? ns | ?? ns |
Parallel | RyuJitX64 | RyuJit | X64 | ?? ns | ?? ns | ?? ns |
Sequential | RyuJitX64 | RyuJit | X64 | ?? ns | ?? ns | ?? ns |
MethodA = Parallel, MethodB() = Sequential
Low-level increments
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1590.0;compatjit-v4.6.1590.0
LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1590.0
Runtime=Clr Allocated=0 B
Method | Job | Jit | Platform | Mean | StdErr | StdDev |
----------- |------------- |---------- |--------- |---------- |---------- |---------- |
Parallel | LegacyJitX64 | LegacyJit | X64 | 0.3420 ns | 0.0015 ns | 0.0057 ns |
Sequential | LegacyJitX64 | LegacyJit | X64 | 2.2038 ns | 0.0014 ns | 0.0051 ns |
Parallel | LegacyJitX86 | LegacyJit | X86 | 0.3276 ns | 0.0005 ns | 0.0020 ns |
Sequential | LegacyJitX86 | LegacyJit | X86 | 2.5229 ns | 0.0048 ns | 0.0187 ns |
Parallel | RyuJitX64 | RyuJit | X64 | 0.3686 ns | 0.0037 ns | 0.0144 ns |
Sequential | RyuJitX64 | RyuJit | X64 | 0.8959 ns | 0.0023 ns | 0.0090 ns |
MethodA = Parallel, MethodB() = Sequential
http://en.wikipedia.org/wiki/Instruction-level_parallelism
Search - Linear v Binary
private static int LinearSearch(
Data[] set, int key)
{
for (int i = 0; i < set.Length; i++)
{
var c = set[i].Key - key;
if (c == 0)
{
return i;
}
if (c > 0)
{
return ~i;
}
}
return ~set.Length;
}
private static int BinarySearch(
Data[] set, int key)
{
int i = 0;
int up = set.Length - 1;
while (i <= up)
{
int mid = (up - i) / 2 + i;
int c = set[mid].Key - key;
if (c == 0)
{
return mid;
}
if (c < 0)
i = mid + 1;
else
up = mid - 1;
}
return ~i;
}
Search - Linear v Binary
private readonly Data[][] dataSet;
private Data[] currentSet;
private int currentMid;
private int currentMax;
[Params(1, 2, 3, 4, 5, 7, 10, 12, 15)]
public int Size
{
set
{
currentSet = dataSet[value];
currentMax = value - 1;
currentMid = value / 2;
}
}
Linear
Search
v
Binary
Search
Linear
Search
v
Binary
Search
readonly fields
public struct Int256
{
private readonly long bits0, bits1,
bits2, bits3;
public Int256(long bits0, long bits1,
long bits2, long bits3)
{
this.bits0 = bits0; this.bits1 = bits1;
this.bits2 = bits2; this.bits3 = bits3;
}
public long Bits0 { get { return bits0; } }
public long Bits1 { get { return bits1; } }
public long Bits2 { get { return bits2; } }
public long Bits3 { get { return bits3; } }
}
private readonly Int256 readOnlyField =
new Int256(1L, 5L, 10L, 100L);
private Int256 field =
new Int256(1L, 5L, 10L, 100L);
[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job]
public class Program
{
[Benchmark]
public long GetValue()
{
return field.Bits0 + field.Bits1 +
field.Bits2 + field.Bits3;
}
[Benchmark]
public long GetReadOnlyValue()
{
return readOnlyField.Bits0 +
readOnlyField.Bits1 +
readOnlyField.Bits2 +
readOnlyField.Bits3;
}
}
readonly fields
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1590.0;compatjit-v4.6.1590.0
LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1590.0
Runtime=Clr Allocated=0 B
Method | Job | Jit | Platform | Mean | StdErr | StdDev |
----------------- |------------- |---------- |--------- |---------- |---------- |---------- |
GetValue | LegacyJitX64 | LegacyJit | X64 | ?? ns | ?? ns | ?? ns |
GetReadOnlyValue | LegacyJitX64 | LegacyJit | X64 | ?? ns | ?? ns | ?? ns |
GetValue | LegacyJitX86 | LegacyJit | X86 | ?? ns | ?? ns | ?? ns |
GetReadOnlyValue | LegacyJitX86 | LegacyJit | X86 | ?? ns | ?? ns | ?? ns |
GetValue | RyuJitX64 | RyuJit | X64 | ?? ns | ?? ns | ?? ns |
GetReadOnlyValue | RyuJitX64 | RyuJit | X64 | ?? ns | ?? ns | ?? ns |
readonly fields
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1590.0;compatjit-v4.6.1590.0
LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1590.0
Runtime=Clr Allocated=0 B
Method | Job | Jit | Platform | Mean | StdErr | StdDev |
----------------- |------------- |---------- |--------- |---------- |---------- |---------- |
GetValue | LegacyJitX64 | LegacyJit | X64 | 0.7893 ns | 0.0078 ns | 0.0291 ns |
GetReadOnlyValue | LegacyJitX64 | LegacyJit | X64 | 9.5362 ns | 0.0251 ns | 0.0971 ns |
GetValue | LegacyJitX86 | LegacyJit | X86 | 1.4625 ns | 0.0506 ns | 0.1959 ns |
GetReadOnlyValue | LegacyJitX86 | LegacyJit | X86 | 1.9743 ns | 0.0641 ns | 0.2481 ns |
GetValue | RyuJitX64 | RyuJit | X64 | 0.3852 ns | 0.0183 ns | 0.0710 ns |
GetReadOnlyValue | RyuJitX64 | RyuJit | X64 | 9.6406 ns | 0.0803 ns | 0.3109 ns |
https://codeblog.jonskeet.uk/2014/07/16/micro-optimization-the-surprising-inefficiency-of-readonly-fields/
MOAR Benchmarks!!
Analysing Optimisations in the Wire Serialiser
• http://mattwarren.org/2016/08/23/Analysing-Optimisations-in-the-Wire-Serialiser/
Optimising LINQ
• http://mattwarren.org/2016/09/29/Optimising-LINQ/
Why is reflection slow?
• http://mattwarren.org/2016/12/14/Why-is-Reflection-slow/
Why Exceptions should be Exceptional
• http://mattwarren.org/2016/12/20/Why-Exceptions-should-be-Exceptional/
Resources
QUESTIONS?

Where the wild things are - Benchmarking and Micro-Optimisations

  • 1.
  • 2.
  • 3.
    Premature Optimization “We shouldforget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.Yet we should not pass up our opportunities in that critical 3%.“ - Donald Knuth
  • 4.
    ProfilingTools • ANTS PerformanceProfiler - Redgate • dotTrace & dotMemory - Jet Brains • PerfView - Microsoft (free) • Visual Studio Profiling Tools (Ultimate, Premium or Professional) • MiniProfiler - Stack Overflow (free)
  • 8.
  • 9.
    Why do youneed a benchmarking library?
  • 10.
    static void Profile(intiterations, Action action) { action(); // warm up GC.Collect(); // clean up var watch = new Stopwatch(); watch.Start(); for (int i = 0; i < iterations; i++) { action(); } watch.Stop(); Console.WriteLine("Time Elapsed {0} ms", watch.ElapsedMilliseconds); } Benchmarking small code samples in C#, can this implementation be improved? http://stackoverflow.com/q/1047218/4500
  • 11.
    private static TResult; static void Profile<T>(int iterations, Func<T> func) { func(); // warm up GC.Collect(); // clean up var watch = new Stopwatch(); watch.Start(); for (int i = 0; i < iterations; i++) { Result = func(); } watch.Stop(); Console.WriteLine("Time Elapsed {0} ms", watch.ElapsedMilliseconds); } Benchmarking small code samples in C#, can this implementation be improved? http://stackoverflow.com/q/1047218/4500
  • 12.
    BenchmarkDotNet project Andrey Akinshin(the ‘Boss’) @andrey_akinshin http://aakinshin.net/en/blog/ Matt Warren (me) Adam Sitnik (.NET Core guru) @SitnikAdam http://adamsitnik.com/
  • 13.
  • 14.
    Goals of BenchmarkDotNet Benchmarkinglibrary that is: •Accurate •Easy-to-use •Helpful
  • 15.
    Benchmarking library thatis: •Accurate •Easy-to-use •Helpful Stopwatch under the hood http://aakinshin.net/en/blog/dotnet/stopwatch/ LegacyJIT-x86 and first method call http://aakinshin.net/en/blog/dotnet/legacyjitx86-and-first-method-call/ Goals of BenchmarkDotNet
  • 16.
  • 17.
    What BenchmarkDotNet doesn’tdo •Multi-threaded benchmarks •Integrate with C.I builds •Unit test runner integration •Anything else? http://github.com/dotnet/BenchmarkDotNet/issues/
  • 18.
    “Other Benchmarking toolsare available” • NBench • https://github.com/petabridge/NBench • Microsoft Xunit performance • http://github.com/Microsoft/xunit-performance/ • Lambda Micro Benchmarking (“Clash of the Lambdas”) • https://github.com/biboudis/LambdaMicrobenchmarking • Etimo.Benchmarks • http://etimo.se/blog/etimo-benchmarks-lightweight-net-benchmark-tool/ • MeasureIt • https://blogs.msdn.microsoft.com/vancem/2009/02/06/measureit-update-tool-for- doing-microbenchmarks-for-net/
  • 19.
    How it works Aninvocation of the target method is an operation. A bunch of operations is an iteration. Iteration types: • Pilot:The best operation count will be chosen. • IdleWarmup, IdleTarget: BenchmarkDotNet overhead will be evaluated. • MainWarmup:Warmup of the main method. • MainTarget: Main measurements. • Result = MainTarget – AverageOverhead http://benchmarkdotnet.org/HowItWorks.htm
  • 20.
    What happens underthe covers? Image credit Albert Rodríguez @UncleFirefox
  • 21.
  • 22.
    Scale of benchmarks •millisecond- ms • One thousandth of one second, single webapp request •microsecond - us or µs • One millionth of one second, several in-memory operations •nanosecond - ns • One billionth of one second, single operations
  • 23.
    Who ‘times’ thetimers? [Benchmark] public long StopwatchLatency() { return Stopwatch.GetTimestamp(); } [Benchmark] public long StopwatchGranularity() { // Loop until Stopwatch.GetTimestamp() // gives us a different value long lastTimestamp = Stopwatch.GetTimestamp(); while (Stopwatch.GetTimestamp() == lastTimestamp) { } return lastTimestamp; } [Benchmark] public long DateTimeLatency() { return DateTime.Now.Ticks; } [Benchmark] public long DateTimeGranularity() { // Loop until DateTime.Now // gives us a different value long lastTimestamp = DateTime.Now.Ticks; while (DateTime.Now.Ticks == lastTimestamp) { } return lastTimestamp; }
  • 24.
    BenchmarkDotNet=v0.10.1, OS=Microsoft WindowsNT 6.1.7601 Service Pack 1 Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8 Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC [Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 Job-FIDMNL : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 Method | Mean | StdDev | Allocated | --------------------- |---------------- |------------ |---------- | StopwatchLatency | ?? ns | ?? ns | ?? B | StopwatchGranularity | ?? ns | ?? ns | ?? B | DateTimeLatency | ?? ns | ?? ns | ?? B | DateTimeGranularity | ?? ns | ?? ns | ?? B | Who ‘times’ the timers?
  • 25.
    BenchmarkDotNet=v0.10.1, OS=Microsoft WindowsNT 6.1.7601 Service Pack 1 Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8 Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC [Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 Job-FIDMNL : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 Method | Mean | StdDev | Allocated | --------------------- |---------------- |------------ |---------- | StopwatchLatency | 12.9960 ns | 0.1609 ns | 0 B | StopwatchGranularity | 374.3049 ns | 2.4388 ns | 0 B | DateTimeLatency | 682.2320 ns | 8.9341 ns | 32 B | DateTimeGranularity | 996,025.6492 ns | 413.9175 ns | 47.34 kB | Who ‘times’ the timers?
  • 26.
    Loop-the-Loop ”Avoid foreach loopon everything except raw arrays?” [Benchmark(Baseline = true)] public int ForLoopArray() { var counter = 0; for (int i = 0; i < anArray.Length; i++) counter += anArray[i]; return counter; } [Benchmark] public int ForEachArray() { var counter = 0; foreach (var i in anArray) counter += i; return counter; } [Benchmark] public int ForLoopList() { var counter = 0; for (int i = 0; i < aList.Count; i++) counter += aList[i]; return counter; } [Benchmark] public int ForEachList() { var counter = 0; foreach (var i in aList) counter += i; return counter; }
  • 27.
    Loop-the-Loop ”Avoid foreach loopon everything except raw arrays?” BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1 Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8 Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC [Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 Method | Mean | StdDev | Scaled | Scaled-StdDev | --------------- |-------------- |------------ |------- |-------------- | ForLoopArray | ?? ns | | ?? | | ForEachArray | ?? ns | | ?? | | ForLoopList | ?? ns | | ?? | | ForEachList | ?? ns | | ?? | |
  • 28.
    Loop-the-Loop ”Avoid foreach loopon everything except raw arrays?” BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1 Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8 Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC [Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 Method | Mean | StdDev | Scaled | Scaled-StdDev | --------------- |-------------- |------------ |------- |-------------- | ForLoopArray | 383.8279 ns | 2.9472 ns | 1.00 | 0.00 | ForEachArray | 392.5611 ns | 4.1286 ns | 1.02 | 0.01 | ForLoopList | 2,315.9658 ns | 12.1001 ns | 6.03 | 0.05 | ForEachList | 2,663.5771 ns | 21.9822 ns | 6.94 | 0.08 |
  • 29.
    Loop-the-Loop – ‘forloop’ - Arrays
  • 30.
    Loop-the-Loop – ‘forloop’ - Lists
  • 31.
    Abstractions - IDictionaryv Dictionary Dictionary<string, string> dictionary = new Dictionary<string, string>(); IDictionary<string, string> iDictionary = (IDictionary<string, string>)dictionary; [Benchmark] public Dictionary<string, string> DictionaryEnumeration() { foreach (var item in dictionary) { ; } return dictionary; } [Benchmark] public IDictionary<string, string> IDictionaryEnumeration() { foreach (var item in iDictionary) { ; } return iDictionary; }
  • 32.
    Abstractions - IDictionaryv Dictionary BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1 Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8 Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC [Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 Method | Mean | StdErr | StdDev | Gen 0 | Allocated | ----------------------- |----------- |---------- |---------- |------- |---------- | DictionaryEnumeration | ?? ns | ?? ns | ?? ns | ?? | ?? B | IDictionaryEnumeration | ?? ns | ?? ns | ?? ns | ?? | ?? B | // * Diagnostic Output - MemoryDiagnoser * Note: the Gen 0/1/2 Measurements are per 1k Operations
  • 33.
    Abstractions - IDictionaryv Dictionary BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1 Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8 Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC [Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 Method | Mean | StdErr | StdDev | Gen 0 | Allocated | ----------------------- |----------- |---------- |---------- |------- |---------- | DictionaryEnumeration | 24.0353 ns | 0.2403 ns | 0.9307 ns | - | 0 B | IDictionaryEnumeration | 41.6301 ns | 0.4479 ns | 2.1944 ns | 0.0086 | 32 B | // * Diagnostic Output - MemoryDiagnoser * Note: the Gen 0/1/2 Measurements are per 1k Operations
  • 34.
    Abstractions - IDictionaryv Dictionary Dictionary<string, string> dictionary = new Dictionary<string, string>(); IDictionary<string, string> iDictionary = (IDictionary<string, string>)dictionary; // struct – so doesn't allocate Dictionary<string, string>.Enumerator enumerator = dictionary.GetEnumerator(); // interface - allocates 56 B (64-bit) and 32 B (32-bit) IEnumerator<KeyValuePair<string, string>> enumerator = iDictionary.GetEnumerator();
  • 35.
    Low-level increments [LegacyJitX86Job, LegacyJitX64Job,RyuJitX64Job] public class Program { private double a, b, c, d; [Benchmark(OperationsPerInvoke = 4)] public void MethodA() { a++; b++; c++; d++; } [Benchmark(OperationsPerInvoke = 4)] public void MethodB() { a++; a++; a++; a++; } }
  • 36.
    Low-level increments BenchmarkDotNet=v0.10.1, OS=MicrosoftWindows NT 6.1.7601 Service Pack 1 Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8 Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC [Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1590.0;compatjit-v4.6.1590.0 LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1590.0 Runtime=Clr Allocated=0 B Method | Job | Jit | Platform | Mean | StdErr | StdDev | ----------- |------------- |---------- |--------- |---------- |---------- |---------- | Parallel | LegacyJitX64 | LegacyJit | X64 | ?? ns | ?? ns | ?? ns | Sequential | LegacyJitX64 | LegacyJit | X64 | ?? ns | ?? ns | ?? ns | Parallel | LegacyJitX86 | LegacyJit | X86 | ?? ns | ?? ns | ?? ns | Sequential | LegacyJitX86 | LegacyJit | X86 | ?? ns | ?? ns | ?? ns | Parallel | RyuJitX64 | RyuJit | X64 | ?? ns | ?? ns | ?? ns | Sequential | RyuJitX64 | RyuJit | X64 | ?? ns | ?? ns | ?? ns | MethodA = Parallel, MethodB() = Sequential
  • 37.
    Low-level increments BenchmarkDotNet=v0.10.1, OS=MicrosoftWindows NT 6.1.7601 Service Pack 1 Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8 Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC [Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1590.0;compatjit-v4.6.1590.0 LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1590.0 Runtime=Clr Allocated=0 B Method | Job | Jit | Platform | Mean | StdErr | StdDev | ----------- |------------- |---------- |--------- |---------- |---------- |---------- | Parallel | LegacyJitX64 | LegacyJit | X64 | 0.3420 ns | 0.0015 ns | 0.0057 ns | Sequential | LegacyJitX64 | LegacyJit | X64 | 2.2038 ns | 0.0014 ns | 0.0051 ns | Parallel | LegacyJitX86 | LegacyJit | X86 | 0.3276 ns | 0.0005 ns | 0.0020 ns | Sequential | LegacyJitX86 | LegacyJit | X86 | 2.5229 ns | 0.0048 ns | 0.0187 ns | Parallel | RyuJitX64 | RyuJit | X64 | 0.3686 ns | 0.0037 ns | 0.0144 ns | Sequential | RyuJitX64 | RyuJit | X64 | 0.8959 ns | 0.0023 ns | 0.0090 ns | MethodA = Parallel, MethodB() = Sequential http://en.wikipedia.org/wiki/Instruction-level_parallelism
  • 38.
    Search - Linearv Binary private static int LinearSearch( Data[] set, int key) { for (int i = 0; i < set.Length; i++) { var c = set[i].Key - key; if (c == 0) { return i; } if (c > 0) { return ~i; } } return ~set.Length; } private static int BinarySearch( Data[] set, int key) { int i = 0; int up = set.Length - 1; while (i <= up) { int mid = (up - i) / 2 + i; int c = set[mid].Key - key; if (c == 0) { return mid; } if (c < 0) i = mid + 1; else up = mid - 1; } return ~i; }
  • 39.
    Search - Linearv Binary private readonly Data[][] dataSet; private Data[] currentSet; private int currentMid; private int currentMax; [Params(1, 2, 3, 4, 5, 7, 10, 12, 15)] public int Size { set { currentSet = dataSet[value]; currentMax = value - 1; currentMid = value / 2; } }
  • 40.
  • 41.
  • 42.
    readonly fields public structInt256 { private readonly long bits0, bits1, bits2, bits3; public Int256(long bits0, long bits1, long bits2, long bits3) { this.bits0 = bits0; this.bits1 = bits1; this.bits2 = bits2; this.bits3 = bits3; } public long Bits0 { get { return bits0; } } public long Bits1 { get { return bits1; } } public long Bits2 { get { return bits2; } } public long Bits3 { get { return bits3; } } } private readonly Int256 readOnlyField = new Int256(1L, 5L, 10L, 100L); private Int256 field = new Int256(1L, 5L, 10L, 100L); [LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job] public class Program { [Benchmark] public long GetValue() { return field.Bits0 + field.Bits1 + field.Bits2 + field.Bits3; } [Benchmark] public long GetReadOnlyValue() { return readOnlyField.Bits0 + readOnlyField.Bits1 + readOnlyField.Bits2 + readOnlyField.Bits3; } }
  • 43.
    readonly fields BenchmarkDotNet=v0.10.1, OS=MicrosoftWindows NT 6.1.7601 Service Pack 1 Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8 Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC [Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1590.0;compatjit-v4.6.1590.0 LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1590.0 Runtime=Clr Allocated=0 B Method | Job | Jit | Platform | Mean | StdErr | StdDev | ----------------- |------------- |---------- |--------- |---------- |---------- |---------- | GetValue | LegacyJitX64 | LegacyJit | X64 | ?? ns | ?? ns | ?? ns | GetReadOnlyValue | LegacyJitX64 | LegacyJit | X64 | ?? ns | ?? ns | ?? ns | GetValue | LegacyJitX86 | LegacyJit | X86 | ?? ns | ?? ns | ?? ns | GetReadOnlyValue | LegacyJitX86 | LegacyJit | X86 | ?? ns | ?? ns | ?? ns | GetValue | RyuJitX64 | RyuJit | X64 | ?? ns | ?? ns | ?? ns | GetReadOnlyValue | RyuJitX64 | RyuJit | X64 | ?? ns | ?? ns | ?? ns |
  • 44.
    readonly fields BenchmarkDotNet=v0.10.1, OS=MicrosoftWindows NT 6.1.7601 Service Pack 1 Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8 Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC [Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1590.0;compatjit-v4.6.1590.0 LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0 RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1590.0 Runtime=Clr Allocated=0 B Method | Job | Jit | Platform | Mean | StdErr | StdDev | ----------------- |------------- |---------- |--------- |---------- |---------- |---------- | GetValue | LegacyJitX64 | LegacyJit | X64 | 0.7893 ns | 0.0078 ns | 0.0291 ns | GetReadOnlyValue | LegacyJitX64 | LegacyJit | X64 | 9.5362 ns | 0.0251 ns | 0.0971 ns | GetValue | LegacyJitX86 | LegacyJit | X86 | 1.4625 ns | 0.0506 ns | 0.1959 ns | GetReadOnlyValue | LegacyJitX86 | LegacyJit | X86 | 1.9743 ns | 0.0641 ns | 0.2481 ns | GetValue | RyuJitX64 | RyuJit | X64 | 0.3852 ns | 0.0183 ns | 0.0710 ns | GetReadOnlyValue | RyuJitX64 | RyuJit | X64 | 9.6406 ns | 0.0803 ns | 0.3109 ns | https://codeblog.jonskeet.uk/2014/07/16/micro-optimization-the-surprising-inefficiency-of-readonly-fields/
  • 45.
    MOAR Benchmarks!! Analysing Optimisationsin the Wire Serialiser • http://mattwarren.org/2016/08/23/Analysing-Optimisations-in-the-Wire-Serialiser/ Optimising LINQ • http://mattwarren.org/2016/09/29/Optimising-LINQ/ Why is reflection slow? • http://mattwarren.org/2016/12/14/Why-is-Reflection-slow/ Why Exceptions should be Exceptional • http://mattwarren.org/2016/12/20/Why-Exceptions-should-be-Exceptional/
  • 46.
  • 47.

Editor's Notes

  • #15 This is what we aim for This is why we wanted to build BenchmarkDotNet
  • #23 Aim is to make a framework that can accurately measure milli, micro and nano benchmarks. But in reality the main use-cases are probably milli/micro benchmarks, so these must work above all else!!  (Even lower down!!!) picoseconds: 1…1000 ps One trillionth of one second, pipelining
  • #27 Avoid foreach loop on everything except raw arrays