Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov

Yevhen Tatarynov
Software developer with 15 years of experience in commercial
software and database development (.NET / MS SQL / Delphi)
PhD in math, specializing in the theoretical foundations of
computer science and cybernetics
I was involved in projects performing complex mathematical calculations and processing large
amounts of data. For now my role senior software developer in infrastructure team, Covent IT.
Point of professional interest:
application performance optimization and analysis
writing C# code similar in performance to C++
advanced debugging

Agenda
What is the
challenge?
Measurements Is it an issue?
QA
Intermediate
results
Summary
Can it be
faster?

WinForms .NET Applications
Read *.bin, *.txt data
ﬁles
“Process bits”, extract
and use full data, pack
into new format
Write results in text
and binary ﬁles
Use .NET Framework
4.0
Run on Windows 10
x64
It works correctly

Environment
Windows 10 x64 19042.804
CPU – Intel(R) Core(TM) i3-2330M CPU @ 2.20GHz
SSD – Kingston SKC400S37256G (Read 550MB/s / Write 540 MB/s)

Non-Functional Requirements
We need to process files > 1GB (size of processed files can increase significantly)
Application will run on personal laptops
Processing data should be very fast to obtain a result in an appropriate
amount of time

Disappointing Forecast
File size
~(363+236)MB
Execution time
Peak memory
load
Total memory
allocation
~(415+178)MB
> (1 + 0.25|0.5)GB
~ 11.5 min.
~45.5 min
~89.28 MB
~93.26 MB
~132,584 MB
~374,011 MB

What Is the Challenge?
Goals Values
Reduce execution time
Don’t increase total
memory allocation
Processing ﬁnished in
appropriate time
Don’t break the
application output

Choose Metrics
Execution time – main metric
Total memory allocation – try not to increase
Memory load peak – low priority

Measurement Tools
DateTime.Now StopWatch DotNet
Benchmark
Perfview Visual Studio
Performance Proﬁler
R# dotTrace RedGate
Performance
monitor

dotTrace TimeLine Sample & Snapshot Execution Time

dotTrace TimeLine Snapshot Execution Time

Sequential Runs Impact
Execution time HDD Execution time SSD
1st
2nd
Diff
%
2,806,777 ms (46 m 47 s 777 ms) 2,742,178 ms (45 m 42 s 178 ms)
2,744,085 ms (45 m 44 s 085 ms) 2,601,642 ms (43 m 21 s 642 ms)
62,692 ms (01 m 03 s 692 ms) 140,536 ms (02 m 20 s 536 ms)
2,23 % 5,12 %

Experiment Results Breakage Factors
Drive read / write speed, cache, type (HDD, SSD)
CPU base frequency, cache, burn time, turbo boost, power supply schema
Anti-malware software
Scheduled process, system updates
System ﬁle cache, ﬁle fragmentation
Total system load (CPU, RAM, Drive)

Experiment Results Breakage Factors
Disable anti-malware software, system schedules, system updates
Set high-performance power supply schema if you are using a laptop
Restart OS to end redundant processes and clear caches
Wait until it's fully loaded
Run application on SSD drive types
Run application twice on the same input data
In the analysis, we will use the results from the 2nd execution

Potential Improvements
Used .ToArray() - slow
Concat use foreach and extra
memory to iterate input params
Each time we produce new byte array.
Redundant memory trafﬁc.
var b = new byte[];
for (int i = 0; i < N; i++)
{
byte[] a = new byte[GetLen(i)];
/* fill a with values */
b = b.Concat(a).ToArray();
}
return b;

Linq.Concat
static IEnumerable<TSource> Concat<TSource>(this IEnumerable<TSource> first,
IEnumerable<TSource> second) {
if (first == null) throw Error.ArgumentNull("first");
if (second == null) throw Error.ArgumentNull("second");
return ConcatIterator<TSource>(first, second);
}
static IEnumerable<TSource> ConcatIterator<TSource>(IEnumerable<TSource> first,
IEnumerable<TSource> second){
foreach (TSource element in first) yield return element;
foreach (TSource element in second) yield return element;

.ToArray()
static TSource[] ToArray<TSource>(this IEnumerable<TSource> source) {
if (source == null) throw Error.ArgumentNull("source");
return new Buffer<TSource>(source).ToArray();
}
internal TElement[] ToArray() {
if (count == 0) return new TElement[0];
if (items.Length == count) return items;
TElement[] result = new TElement[count];
Array.Copy(items, 0, result, 0, count);
return result;

Solution
var b = new byte[maxN]; var bN=0;
for (int i = 0; i < N; i++)
{
var aN = GetLen(i);
byte[] a = new
byte[GetLen(i)];
Buffer.BlockCopy(aN,0,b,bN,a);
bN += aN;
}
return b;
Use one array to store
array concatenation.
To copy array into result,
use Buffer.BlockCopy -
faster than Array.Copy .

Buffer.BlockCopy
public static void BlockCopy
(Array src, int srcOffset, Array
dst, int dstOffset,
int count);
Copies a speciﬁed number of bytes from a
source array starting at a particular offset to a
destination array starting at a particular offset.
● src – Array The source buffer.
● srcOffset - Int32 The
zero-based byte offset into src.
● dst – Array The destination
buffer.
● dstOffset - Int32 The
zero-based byte offset into
dst.
● count - Int32 The number of
bytes to copy.

Comparison
var b = new byte[];
for (int i = 0; i < N; i++)
{
byte[] a = new byte[GetLen(i)];
b = b.Concat(a).ToArray();
}
return b;
var b = new byte[maxN]; var bN=0;
for (int i = 0; i < N; i++)
{
var aN = GetLen(i);
byte[] a = new byte[aN];
Buffer.BlockCopy(aN,0,b,bN,aN);
bN += aN;
}
return b;

#1 Performance Summary
Execution time
1st 374,011
120,333
88,61 %
-38 m 25 s 228 ms
Memory (MB)
2nd
Diff
%
43 m 21 s 642 ms
4 m 56 s 414 ms
-253,678
68,83 %
х 3,11
х 9,25

#2 Is GetBits an Issue?
/*
Returns last N (N > 0) bits from X in
byte array
*/
byte[] GetBits(uint x, byte n)

var a = new byte[n];
var D = (uint)Math.Pow(2, n-1);
for (var i = 0; i < n; i++)
{
a[i] = (byte)(x / D);
x -= a[i] * D;
D /= 2;
}
return a;
Math.Pow use ﬂoating point
operation to calculate result
Convert int param to double
Convert double result to uint
Use * and / operation

Optimizing Software in C++ by Agner Fog.
Execution time (clock cycles)
Operation Min Max
Min
floating point mul
floating point div
int add
int sub
int shift
int mul
int div
3
40
14
1
1
1
3
4
6
80
45
1
1
1
4
8
Conversion of signed integers to
floating point is fast only when the
SSE2 instruction set is enabled.
Conversion of unsigned integers to
floating point is faster only when the
AVX512
instruction set is enabled.
A conversion from floating point to
integer without SSE2 typically takes
40 clock cycles.

Solution
var i = n - 1;
while (x! = 0)
{
a[i] = (byte)(x & 1);
x = x >> 1;
i--;
}
return a;
Use binary representation on int
numbers
Don’t use Math.Pow
Use only integer operands to avoid
converting
Use bitwise operation >> and &

Comparison
var D = (uint)Math.Pow(2, n-1);
for (var i = 0; i < n; i++)
{
a[i] = (byte)(x / D);
x -= a[i] * D;
D /= 2;
}
return a;
var i = n - 1;
while (x!=0)
{
a[i] = (byte)(x & 1);
x = x >> 1;
i--;
}
return a;

Execution time
Old 120,333
120,333
0,00 %
-1 m 02 s 992 ms
Memory (MB)
New
Diff
%
4 m 56 s 414 ms
3 m 53 s 422 ms
0
21,25 %
х 1,00
х 1,27

#3 Is AnyPathHasIllegalCharacters an Issue?
GetFileName call indirrectly
AnyPathHasIllegalCharacters
GetFileName used in functionality
to sort file names (in format
Name_ddddd.txt) located in
specific folder. Files have to be
sorted in order ddddd name part.
The number of files > 13000
(Name_1.txt .. Name_13000.txt).

Bubble sort O(N2)
for (var i = 0; i < FileNameArray.Length - 1; i++)
for (var j = i + 1; j < FileNameArray.Length; j++)
{
var S1 = Path.GetFileName(FileNameArray[i]);
var i_N = int.Parse(S1.Substring(4,S1.LastIndexOf('.’)-4));
S1 = Path.GetFileName(FileNameArray[j]);
var j_N =int.Parse(S1.Substring(4,S1.LastIndexOf('.')-4));
if (i_N > j_N)
{
S1 = FileNameArray[i];
FileNameArray[i] = FileNameArray[j];
FileNameArray[j] = S1;
}
}
Redundant GetFileName
calls
Redundant Parse calls
Redundant Substring calls
Redundant LastIndex calls

Solution
Array.Sort(FileNameArray, new NumericComparer());
public class NumericComparer : IComparer
/* */
public int Compare(string x, string y)
{
var result = x.Length.CompareTo(y.Length);
if (result == 0)
return x.CompareTo(y);
return result;
}
}
Quick sort O(N log(N))
Don’t call GetFileName
Don’t call Substring calls
Don’t call Parse calls
Don’t call LastIndex calls

Comparison
for (var i = 0; i < FileNameArray.Length - 1; i++)
for (var j = i + 1; j < FileNameArray.Length; j++)
{
var S1 = Path.GetFileName(FileNameArray[i]);
var i_N = int.Parse(S1.Substring(4,S1.LastIndexOf('.’)-4));
S1 = Path.GetFileName(FileNameArray[j]);
var j_N = int.Parse(S1.Substring(4,S1.LastIndexOf('.')-4));
if (i_N > j_N)
{
S1 = FileNameArray[i];
FileNameArray[i] = FileNameArray[j];
FileNameArray[j] = S1;
}
}
Array.Sort(FileNameArray, new NumericComparer());
public class NumericComparer : IComparer
/* */
public int Compare(string x, string y)
{
var result = x.Length.CompareTo(y.Length);
if (result == 0)
return x.CompareTo(y);
return result;
}
}

Execution time
Old 120,333
103,775
13,76 %
-1 m 42 s 331 ms
Memory (MB)
New
Diff
%
3 m 53 s 422 ms
2 m 11 s 091 ms
16,558
43,84 %
х 1,16
х 1,78

#4 Is FileStream.get_Length an Issue?
In both cases, the binary ﬁle read and
the basis on read data is the
calculated number of binary chains.

using(var br = new BinaryReader(…))
{
while (br.BaseStream.Position <= br.BaseStream.Length - 4)
{
counter++;
br.ReadUInt32();
br.ReadUInt32();
var n = br.ReadUInt32();
for (int i = 0; i < n; i++) br.ReadUInt32();
}
}
Redundant Length calls
Redundant subtraction
Redundant call
ReadUInt32

Solution
{
var length = br.BaseStream.Length - 4;
while (br.BaseStream.Position <= length)
{
counter++;
br.ReadUInt64();
}
}
Store Length in local
variable
Call ReadUInt64 instead
ReadUInt32

Comparison
{
while (br.BaseStream.Position <= br.BaseStream.Length - 4)
{
counter++;
br.ReadUInt32();
br.ReadUInt32();
}
}
{
var length = br.BaseStream.Length - 4;
while (br.BaseStream.Position <= length)
{
counter++;
br.ReadUInt64();
}
}

Execution time
Old 103,775
103,775
0.00 %
-31 s 786 ms
Memory (MB)
New
Diff
%
2 m 11 s 091 ms
1 m 39 s 305 ms
0
24.25 %
х 1.00
х 1.32

#5 Is Math.Log an Issue?
/*
Calculate the number of bits to
store number X. For 0 it should
return 1,
00101110b it should return 6.
*/
byte NumberOfBits(uint X)

{
if (X > 0)
return (byte)Math.Ceiling(Math.Log(X+1,2));
return 1;
}
Math.Log use ﬂoating point
operation to calculate logarithm
Convert double result to byte
Convert uint to double param

Solution
{
var x = (int)X;
byte counter = 0;
do {
counter++;
x = x >> 1;
}
while (x != 0);
return counter;
}
Just calculate the number of bits
in X to avoid ﬂoating point
arithmetic with bitwise operation
Avoid type conversion
Avoid using Math.Log

Comparison
{
if (X > 0)
return (byte)Math.Ceiling(Math.Log(X+1,2));
return 1;
}
{
var x = (int)X;
byte counter = 0;
do {
counter++;
x = x >> 1;
}
while (x != 0);
return counter;
}

Execution time
Old 103,775
103,775
0.00 %
-29 s 138 ms
Memory (MB)
New
Diff
%
1 m 39 s 305 ms
1 m 10 s 167 ms
0
29.34 %
х 1.00
х 1.42

Performance Summary
Execution time
Old 374,011
103,775
72.25 %
- 42 m 11 s 575 ms
Memory (MB)
New
Diff
%
43 m 21 s 642 ms
1 m 10 s 167 ms
270,236
97.30 %
х 3.60
х 37.08

#6 packbits Is A Heavy Function.
Can It Be Faster?
/*
Calculate statistics made from ushort array
blocks and byte array of bits and pack them
into a new format.
*/
byte[] PackBits(ushort[] blocks,
byte numberOfBits,
bool CalcStatistics)

Potential Improvement #1
byte[] PackBits(ushort[] blocks, byte numberOfbits, bool сalcStatistics)
var stat = 0; var packedBits = new byte[maxN];
foreach (var block in blocks)
{
byte bits = NumberOfBits(block);
if (bits <= numberOfbits) {
/*branch1*/ }
else {
/*branch2*/
}
}
return b;
/*branch1*/
/*GetBits return byte array with numberOfbits for block*/
if (сalcStatistics) stat += numberOfbits - 1;
Buffer.BlockCopy(GetBits(block, numberOfbits), 0, packedBits,
n,numberOfbits);
n += numberOfbits;
index += numberOfbits;
/*branch2*/
if (сalcStatistics) stat += 5*numberOfbits;
n += tmp;
index += tmp;
Buffer.BlockCopy(GetBits(block, bits),0,packedBits,n, bits);
n += bits;
index += bits;

Unused Calculation Resuts
/*branch 1*/
Buffer.BlockCopy(GetBits(block, nbit), 0, b, n, nbit);
n = n + nbit;
/*branch 2*/
n = n + tmp;
index += tmp;
Buffer.BlockCopy(GetBits(block, bits), 0, b, n, bits);
n = n + bits;
index += bits;
Result of index calculation not used. Just remove it.

{
/*branch1*/ }
else {
/*branch2*/
}
}
return b;
/*branch1*/
Buffer.BlockCopy(GetBits(block, numberOfbits), 0, packedBits,
n,numberOfbits);
n += numberOfbits;
/*branch2*/
n += tmp;
index += tmp;
Buffer.BlockCopy(GetBits(block, bits),0,packedBits,n, bits);
n += bits;
index += bits;

Avoid Redundant Copy
Write bits inside GetBits to existing
array instead of producing a new one
byte[] GetBits(uint X,byte N,byte[] dest
,int offset)
var i = offset + N – 1;
while (X!=0)
{
dest[i] = (byte)(X & 1);
X = X >> 1;
i--;
}
return a;
Avoid using Buffer.BlockCopy

{
/*branch1*/ }
else {
/*branch2*/
}
}
return b;
/*branch1*/
GetBits (block, b , numberOfBits, n);
n += numberOfbits;
/*branch2*/
n += tmp;
index += tmp;
GetBits (block, b , bits, n);
n += bits;
index += bits;

Foreach vs. For to Iterate Array
It seems that the compiler is smart enough to make foreach a for
statement on JIT Asm level. But we prefer the simple for.
C.M(UInt16[]) /*foreach statement*/
L0000: mov rax, rdx
L0002: xor eax, eax
L0005: mov ecx, [rdx+0x8]
L0007: test ecx, ecx
L0009: jle L0018
L000c: movsxd r8, eax
L0012: movzx r8d, word [rdx+r8*2+0x10]
L0014: inc eax
L0016: cmp ecx, eax
L0018: jg L000c
L0020: ret
C.M(UInt16[]) /*for statement*/
L0000: xor eax, eax
L0002: mov ecx, [rdx+0x8]xor eax, eax
L0005: test ecx, ecx
L0007: jle L0018
L0009: movsxd r8, eax
L000c: movzx r8d, word [rdx+r8*2+0x10]
L0012: inc eax
L0014: cmp ecx, eax
L0016: jg L0009
L0018: ret

for (int i=0; i< blocks.length; i++)
{
/*branch1*/ }
else {
/*branch2*/
}
}
return b;
/*branch1*/
n += numberOfbits;
/*branch2*/
n += tmp;
index += tmp;
n += bits;
index += bits;

maxN Is Known
byte[] PackBits(ushort[] blocks, byte numberOfbits, bool сalcStatistics, int maxN);
int PackedBitsLength(ushort[] blocks, byte numberOfbits, bool сalcStatistics);
Upper bound maxN calculated before call PackBits; we can pass its function as a parameter
Often, PackBits calls just to get a result array length n variable in PackBits. So we can split it
in two functions.

byte[] PackBits(ushort[] blocks, byte numberOfbits, bool сalcStatistics, int maxN)
{
/*branch1*/ }
else {
/*branch2*/
}
}
return b;
/*branch1*/
n += numberOfbits;
/*branch2*/
n += tmp;
index += tmp;
n += bits;
index += bits;

We Don’t Need an If Statement
{
if (bits <= nbit) {/*branch 1*/
} else{/*branch 2*/
}
}
return b;
Statistics are calculated only
when we need to produce an
array.
So we can make a calculation
without the condition and
avoid using a branch predictor.
In PackedBitsLength, we can
remove these rows.

PackBits
byte[] PackBits(ushort[] blocks, byte numberOfbits, bool сalcStatistics, int maxN)
for(int i=0;i< blocks.length; i++)
{
/*branch1*/ }
else {
/*branch2*/
}
}
return b;
/*branch1*/
n += numberOfbits;
/*branch2*/
n += tmp;
index += tmp;
n += bits;
index += bits;

Execution time
Old 103,775
47,239
54.48 %
-19 s 974 ms
Memory (MB)
New
Diff
%
1 m 10 s 167 ms
50 s 193 ms
56,536
28.47 %
х 2.20
х 1.40

#7 Heavy function PackBits2.
CAN IT BE FASTER?
/*
Calculate statistics, make from uint list
bloks, byte array of bits and pack it in new
format.
*/
byte[] PackBits2(List<unit> blocks,
byte numberOfBits,
bool CalcStatistics)

Execution time
Old 47,239
10,510
77.75 %
-13 s 562 ms
Memory (MB)
New
Diff
%
50 s 193 ms
36 s 631 ms
26,729
27.02 %
х 4.49
х 1.37

#8 WriteBits - Can It Be Faster?
/* Write to the binary file the
given number of bits from UInt
and store bits which do not fit
into 8 bits in static array */
WriteBits(BinaryWriter bw, uint x
byte numberOfBits)

void WriteBits(BinaryWriter bw, uint x, byte numberOfBits)
byte[] bits = GetBits(x, numberOfBits);
byte[] tmpB = new byte[buffB.Length + bits.Length];
Buffer.BlockCopy(buffB, 0, tmpB, 0, buffB.Length);
Buffer.BlockCopy(bits, 0, tmpB, buffB.Length, bits.Length);
buffB = tmpB; // static byte array to store not writing bytes
for (int i = 0; i < buffB.Length / 8; i++)
{ int j = i * 8;
іbw.Write((byte)(buffB[j]*128+buffB[j+1]*64+buffB[j+2]*32+buffB[j+3]*16+buffB[j+4]*8+buffB[j+5]*4+buffB[j+6]*2+buffB[j+7]))
;
}
int L = (buffB.Length / 8) * 8;
for (int i = L; i < buffB.Length; i++) buffB[i - L] = buffB[i];
Array.Resize(ref buffB, buffB.Length - L);

Buffer.BlockCopy Can Be Inefﬁcient
numberOfBits parameter can’t be more then 32 because we operate with Int32 and Uint32 numbers,
so the bit tail can’t be more than 8 bits. We need to store no more than 40 bits (in a 40-byte array).
Thus, copying an array with Buffer.BlockCopy is not so efﬁcient. Replace it with a simple copy element
in the loop.

for (int i = 0; i < buffB.Length; i++) tmpB[i] = buffB[i];
for (int i = 0; i < bits.Length; i++) tmpB[i+ BitsBuffLength] = a[i];
for (int i = 0; i < buffB.Length / 8; i++)
{ int j = i * 8;
bw.Write((byte)(buffB[j]*128+buffB[j+1]*64+buffB[j+2]*32+buffB[j+3]*16+buffB[j+4]*8+buffB[j+5]*4+buffB[j+6]*2+buffB[j+7]));
}
int L = (buffB.Length / 8) * 8;

Optimizing Software in C++ By Agner Fog.

for (int i = 0; i < buffB.Length >> 3; i++)
{ int j = i << 3;
bw.Write((byte)(buffB[j]<<7+buffB[j+1]<<6+buffB[j+2]<<5+buffB[j+3]<<4+buffB[j+4]<<3+buffB[j+5]<<2+buffB[j+6]<<1+buffB[j+7]));
}
int L = (buffB.Length >> 8) << 8;

Buffer Array to Store Bits Tail
We don’t need to create a new array each time, and instead just reuse the existing buffer length of 8.
Each time, we will store its current bits tail length so we can avoid using Array.Resize()

WriteBits
for (int i = 0; i < buffB.Length >> 3; i++)
{ int j = i << 3;
bw.Write((byte)(buffB[j]<<7+buffB[j+1]<<6+buffB[j+2]<<5+buffB[j+3]<<4+buffB[j+4]<<3+buffB[j+5]<<2+buffB[j+6]<<1+buffB[j+7]));
}
BitsBuffLength = Bits_ tmpB.Length - L;

Execution time
Old 10,510
9,754
7.19 %
-4 s 080 ms
Memory (MB)
New
Diff
%
36 s 631 ms
32 s 551 ms
756
11.14 %
х 1.08
х 1.13

#9 ScaleGrad. - Can It Be Faster?
/*
Return index of number x
by ordered scale
*/
int ScaleGrad(int x)

Avoid compare int and double
values
Scale is a sorted array, so we
can use binary search; it’s more
efﬁcient and less dependent
on input data
static double[] Scale;
…
/* 600+ lines of code */
…
{
for(int i=0; i<Scale.Length && Scale[i]<=x; i++)
return i - 1;
}

Comparison
static double[] Scale;
/* 600+ lins of code */
{
for(int i=0;(i<Scale.Length)&&(Scale[i]<= x);i++);
return i - 1;
}
static int[] Scale;
/* 600+ lins of code */
var left = 1; var right = Scale.Length -1;
var mid =(left + right)>>1;//(left+right)/2
do {
mid = left + ((right - left)>>1);
if ( x < Scale[mid]) right = mid - 1;
else left = mid + 1;
} while (right >= left);
return mid;

Execution time
Old 9,754
9,754
0.00 %
-3 s 707 ms
Memory (MB)
New
Diff
%
32 s 551 ms
28 s 844 ms
0
11.39 %
х 1.00
х 1.13

#10 ReadData - Can It Be Faster?
ReadInt32( )
Used in main processing
function, so we shouldn’t touch
it right now.
ReadUInt32( )
In both cases, the binary ﬁle
read and the basis on read
data is the calculated number
of binary chains.

Don’t allocate and collect
unused data in a list
Read UInt64 values instead of
UIn32. It has cut the loop
length in half, but we have to
check if the loop length is not
odd
var list = new List<int>();
var lenght = br.BaseStream.Length – 4;
while (br.BaseStream.Position <= length){
br.ReadUInt64();
list.Add(n);
for (int i=0; i < n; i++)
br.ReadUInt32();
}

Comparison
var list = new List<int>();
while (br.BaseStream.Position <= length) {
br.ReadUInt64();
list.Add(n);
for(int i=0; i < n; i++) br.ReadUInt32();
}
while (br.BaseStream.Position <= length) {
br.ReadUInt64();
for(int i=0; i < n>>1; i++)
br.ReadUInt64();
if ((n & 1) == 1) br.ReadUInt32();
}

Execution time
Old 9,754
9,690
0.66 %
-4 s 796 ms
Memory (MB)
New
Diff
%
28 s 844 ms
24 s 048 ms
64
16.63 %
х 1.01
х 1.13

#11 List Usage - Can It Be Faster?
A lot of generic list usage (create
new list instances, add elements,
iterate).
It can be a convention or common
approach, but its usage is costly.
Even then, you just get element by
index.
var list = new List();
var item = list[i];

List<Point> Foo(List<Point> Points)
{
var R = new List<Point>();
foreach (var P in Points)
R.Add(new Point(P.X, ProcessPoint(P.Y)));
return R;
}

Favor Arrays Over Lists
Use arrays instead of lists if
possible. It allows simple array
indexing as opposed to add or [
] list functions
Use the capacity in list
constructor if it is known; it
allows you to add elements
without an internal array resize
public T this[int index] {
get {
if ((uint) index >= (uint)_size)
ThrowHelper.ThrowArgumentOutOfRangeException();
Contract.EndContractBlock();
return _items[index];
}
set {
if ((uint) index >= (uint)_size)
ThrowHelper.ThrowArgumentOutOfRangeException();
Contract.EndContractBlock();
_items[index] = value;
version++;
}
}

{
R.Add(new Point(P.X, ProcessPoint(P.Y)));
return R;
}

Favor Arrays Over Lists
Use simple for instead foreach to avoid :
- virtual GetEnumerator(), which
produces boxing
- Instance methods call get_Current( )
and the somewhat complex
MoveNext( )
…
callvirt instance valuetype GetEnumerator()
…
// loop start
…
call instance !0 valuetype get_Current()
…
call instance bool valuetype MoveNext()
…
// end loop

Comparison
{
R.Add(new Point(P.X,
ProcessPoint(P.Y)));
return R;
}
Point[] Foo(List<Point> Points)
{
var n = Points.Count;
var R = new Point[n];
for(var i = 0; i < n; i++)
R[i] = new Point(Points[i].X,
ProcessPoint(Points[i].Y));
return R;
}

Execution time
Old 9,690
7,842
19.07 %
-5 s 715 ms
Memory (MB)
New
Diff
%
24 s 048 ms
18 s 333 ms
1,848
23.76 %
х 1.24
х 1.31

#12 WriteBits - Can It Be Faster?
WriteBits is near the top again, and it has 64.56 % own time, so let’s try to optimize it.

void WriteBits(BinaryWriter bw, uint x, byte numberOfbits)
byte[] bits = GetBits(x, numberOfbits);
for (int i = 0; i < BitsBuffLength; i++) tmpB[i] = buffB[i];
for (int i = 0; i < bits.Length; i++) tmpB[i+ BitsBuffLength] = bits[i];
for (int i = 0; i < buffB.Length >> 3; i++) {
int j = i << 3;
bw.Write((byte)(buffB[j]<<7+buffB[j+1]<<6+buffB[j+2]<<5+buffB[j+3]<<4
+buffB[j+4]<<3+buffB[j+5]<<2+buffB[j+6]<<1+buffB[j+7]));
}
for (int i = L; i < buffB.Length; i++) {buffB[i - L] = buffB[i];}
BitsBuffLength = tmpB.Length - L;

Don't Forget to Use New Features
We forgot that GetBits can ﬁll arrays with offset.
Using it here can prevent redundant array copying.
byte[] tmpB = new byte[buffB.Length+ bits.Length];
for (int i = 0; i < BitsBuffLength; i++)
tmpB[i] = buffB[i];
for (int i = 0; i < bits.Length; i++)
tmpB[i+ BitsBuffLength] = bits[i];
byte[] tmpB = new byte[BitsBuffLength + numberOfBits];
GetBits(x, numberOfBits, tmpB, BitsBuffLength);
for (int i = 0; i < BitsBuffLength; i++)
tmpB[i] = buffB[i];

void WriteBits(BinaryWriter bw, uint x, byte numberOfbits)
for (int i = 0; i < bits.Length; i++) tmpB[i+ BitsBuffLength] = bits[i];
for (int i = 0; i < buffB.Length >> 3; i++) {
int j = i << 3;
bw.Write((byte)(buffB[j]<<7+buffB[j+1]<<6+buffB[j+2]<<5+buffB[j+3]<<4
+buffB[j+4]<<3+buffB[j+5]<<2+buffB[j+6]<<1+buffB[j+7]));
}

Even Small Operations Can Have
Signiﬁcant Impacts
Each iteration we calculate i << 3 and
offset for array buffB, we can counter
with step 8 and calculate offset for 8
needed elements
It’s not critical, but we change +
operation to |

Comparison
for (int i=0; i < buffB.Length >> 3; i++) {
int j = i << 3;
bw.Write(
(byte)(buffB[j]<<7+buffB[j+1]<<6
+buffB[j+2]<<5+buffB[j+3]<<4
+buffB[j+4]<<3+buffB[j+5]<<2
+buffB[j+6]<<1+buffB[j+7]));
}
for (int i=0; i<(buffB.Length >> 3) << 3; i+=8) {
bw.Write(
(byte)(buffB[i]<<7|buffB[i+1]<<6
|buffB[i+2]<<5|buffB[i+3]<<4
|buffB[i+4]<<3|buffB[i+5]<<2
|buffB[i+6]<<1|buffB[i+7]));
}

Execution time
Old 7,842
5,525
29.55 %
-1 s 966 ms
Memory (MB)
New
Diff
%
18 s 333 ms
16 s 367 ms
2,317
10.72 %
х 1.42
х 1.12

#13 Redundant Calls
uint statatistics1 …;
…
uint x = …;
…
/* byte NumberOfBits(unit x) */
statatistics1 += NumberOfBits(x);
statatistics2 += 28 - NumberOfBits(x);
…
uint x = …;
…
/* byte NumberOfBits(unit x) */
uint xn = NumberOfBits(x);
statatistics1 += xn;
statatistics2 += 28 - xn;

Execution time
Old 5,525
4,424
19.93 %
-1 s 331 ms
Memory (MB)
New
Diff
%
16 s 367 ms
15 s 036 ms
1,101
8.13 %
х 1.25
х 1.09

Execution time
Old 103,775
4,424
95.74 %
45 s 131 ms
Memory (MB)
New
Diff
%
1 m 10 s 167 ms
15 s 036 ms
99,341
78.57 %
х 23.46
х 4.67

Performance Summary
Old
New
Diff
%
Execution time Memory (MB) Peak (MB)
43 m 21 s 642 ms
15 s 036 ms
43 m 06 s 602 ms
99.42 %
x 173.02
374,011
4,424
369,577
98.82 %
х 84.54
93,26
33,67
59,59
63,89 %
х 2.77

What was the application doing
for 43 minutes?

LINKS
Use dotTrace Command-Line Proﬁler Hashtable and dictionary collection types
.NET Performance Optimization &
Proﬁling with JetBrains dotTrace
Why GC run when using a struct as a
generic dictionary
Matt Ellis. Writing Allocation Free Code
in C#
Konrad Kokosa. High-performance code
design patterns in C#
Maarten Balliauw. Let’s refresh our
memory! Memory management in .NET
Sasha Goldshtein. Pro .NET Performance:
Optimize Your C# Applications
Ben Watson. Writing High-Performance
.NET Code, 2nd Edition
Writing Faster Managed Code: Know
What Things Cost

Maarten Balliauw
LINKS
Sasha Goldshtein
Yevhen Tatarynov GitHub
Ling.Concat
Linq.Concat Implementation
Buffer.BlockCopy Generic List implementation
Optimizing software in C++
Denis Reznik video Array.Sort

Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov

Recommended

Recommended

More Related Content

Similar to Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov

Similar to Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov (20)

More from Fwdays

More from Fwdays (20)

Recently uploaded

Recently uploaded (20)

Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov