Mårten Rånge
WCOM AB

@marten_range
Concurrency
Examples for .NET
Responsive
Performance
Scalable algorithms
Three pillars of Concurrency
 Scalability (CPU)
  Parallel.For

 Responsiveness
  Task/Future
  async/await

 Consistency
  lock/synchronized
  Interlocked.*
  Mutex/Event/Semaphore
  Monitor
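The consistency primitives listed above differ mainly in weight. As a hedged sketch (the SharedCounter class and its method names are mine, not from the deck), here is a shared counter kept consistent first with lock, then with the lighter Interlocked:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Sketch: two ways to keep a shared counter consistent under Parallel.For.
// SharedCounter is an illustrative name, not from the deck.
static class SharedCounter
{
    static int _count;
    static readonly object _gate = new object();

    public static int CountWithLock(int n)
    {
        _count = 0;
        Parallel.For(0, n, _ =>
        {
            lock (_gate) { _count += 1; }            // mutual exclusion
        });
        return _count;
    }

    public static int CountWithInterlocked(int n)
    {
        _count = 0;
        Parallel.For(0, n, _ =>
        {
            Interlocked.Increment(ref _count);        // lock-free atomic add
        });
        return _count;
    }
}
```

Without either primitive the increments race and the final count comes up short; with either one the result is exact, but Interlocked avoids taking a lock per element.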
Scalability
Which is fastest?
var ints = new int[InnerLoop];
var random = new Random();
for (var inner = 0; inner < InnerLoop; ++inner)
{
  ints[inner] = random.Next();
}
// -----------------------------------------------
var ints = new int[InnerLoop];
var random = new Random();
Parallel.For(
  0,
  InnerLoop,
  i => ints[i] = random.Next()
);
SHARED STATE – Race condition
var ints = new int[InnerLoop];
var random = new Random();
for (var inner = 0; inner < InnerLoop; ++inner)
{
  ints[inner] = random.Next();
}
// -----------------------------------------------
var ints = new int[InnerLoop];
var random = new Random();
Parallel.For(
  0,
  InnerLoop,
  i => ints[i] = random.Next()   // all threads mutate the same Random
);
SHARED STATE – Poor performance
var ints = new int[InnerLoop];
var random = new Random();
for (var inner = 0; inner < InnerLoop; ++inner)
{
  ints[inner] = random.Next();
}
// -----------------------------------------------
var ints = new int[InnerLoop];
var random = new Random();
Parallel.For(
  0,
  InnerLoop,
  i => ints[i] = random.Next()   // the shared Random's cache line ping-pongs
);
Then and now

Metric          VAX-11/750 ('80)   Today                 Improvement
MHz             6                  3300                  550x
Memory MB       2                  16384                 8192x
Memory MB/s     13                 R ~10000 / W ~2500    770x / 190x
Memory nsec     225                70                    3x
Memory cycles   1.4                210                   -150x
299,792,458 m/s
Speed of light is too slow
 0.09 m/cycle
 99% – latency mitigation
 1% – computation
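The arithmetic behind the 0.09 m figure: at the 3300 MHz from the table, light covers roughly 299,792,458 / 3.3×10⁹ ≈ 0.09 m per clock cycle. A minimal sketch of the computation (the LightSpeed helper is illustrative, not from the deck):

```csharp
// Back-of-envelope: how far light travels in one clock cycle.
// LightSpeed is an illustrative helper name.
static class LightSpeed
{
    public const double SpeedOfLight = 299_792_458.0;  // m/s

    // Distance (m) light covers in one cycle at the given clock rate.
    public static double MetersPerCycle(double clockHz)
        => SpeedOfLight / clockHz;
}
```

At 3.3 GHz this is about nine centimetres: a signal cannot even cross the board in one cycle, which is why latency, not bandwidth, dominates the design.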
2 Core CPU
[Diagram: CPU1 and CPU2, each with a private L1 and L2 cache, a shared L3 cache, and RAM below]
2 Core CPU – L1 Cache
[Diagram, built up over several slides: CPU1 executes new Random() and new int[InnerLoop]; both cores then read and write the single shared Random object, so its cache line bounces between the two L1 caches on every update]
4 Core CPU – L1 Cache
[Diagram: CPU1–CPU4, each with its own L1 cache; the shared Random object from new Random() and the new int[InnerLoop] array are contended by all four cores]
2x4 Core CPU
[Diagram: two packages of four cores each (CPU1–CPU8); every core has private L1 and L2 caches, each package shares an L3 cache, with RAM below]
Solution 1 – Locks
var ints = new int[InnerLoop];
var random = new Random();
Parallel.For(
  0,
  InnerLoop,
  i => { lock (ints) { ints[i] = random.Next(); } }
);
Solution 2 – No sharing
var ints = new int[InnerLoop];
Parallel.For(
  0,
  InnerLoop,
  () => new Random(),
  (i, pls, random) => { ints[i] = random.Next(); return random; },
  random => {}
);
Parallel.For adds overhead
[Diagram: Parallel.For partitions the range as a tree – Level0 splits into Level1, Level1 into Level2 – until each leaf handles a single element ints[0]…ints[7]; every element costs a delegate invocation]
Solution 3 – Less overhead
var ints = new int[InnerLoop];
Parallel.For(
  0,
  InnerLoop / Modulus,
  () => new Random(),
  (i, pls, random) =>
  {
    var begin = i * Modulus;
    var end   = begin + Modulus;
    for (var iter = begin; iter < end; ++iter)
    {
      ints[iter] = random.Next();
    }
    return random;
  },
  random => {}
);
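A hedged alternative to the hand-rolled Modulus chunking: the framework's own range partitioner picks chunk sizes automatically, keeping the one-delegate-per-chunk, one-Random-per-thread shape of Solution 3. The ChunkedFill wrapper below is my naming, not the deck's:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Sketch: let Partitioner.Create choose the chunk size instead of a
// hand-rolled Modulus. ChunkedFill is an illustrative wrapper name.
static class ChunkedFill
{
    public static int[] Fill(int innerLoop)
    {
        var ints = new int[innerLoop];
        Parallel.ForEach(
            Partitioner.Create(0, innerLoop),   // yields (from, to) ranges
            () => new Random(),                  // one Random per thread
            (range, pls, random) =>
            {
                for (var i = range.Item1; i < range.Item2; ++i)
                {
                    ints[i] = random.Next();
                }
                return random;
            },
            random => {});
        return ints;
    }
}
```

Same idea, less code to get wrong at the chunk boundaries.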
Solution 4 – Independent runs
var tasks = Enumerable.Range (0, 8).Select (
  i => Task.Factory.StartNew (
    () =>
    {
      var ints = new int[InnerLoop];
      var random = new Random ();
      while (counter.CountDown ())
      {
        for (var inner = 0; inner < InnerLoop; ++inner)
        {
          ints[inner] = random.Next();
        }
      }
    },
    TaskCreationOptions.LongRunning))
  .ToArray ();
Task.WaitAll (tasks);
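The `counter` used above is not defined on the slide; a minimal sketch of such a count-down gate, assuming it hands out one run per call until the runs are exhausted (the RunCounter name and shape are my assumption, built on Interlocked.Decrement so the tasks need no lock):

```csharp
using System.Threading;

// Sketch of the undefined `counter` from the slide: each CountDown call
// claims one remaining run; it returns false once all runs are taken.
// RunCounter is an assumed name, not from the deck.
sealed class RunCounter
{
    int _remaining;

    public RunCounter(int runs) { _remaining = runs; }

    // Atomically claim a run; safe to call from many tasks at once.
    public bool CountDown() => Interlocked.Decrement(ref _remaining) >= 0;
}
```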
Parallel.For
 Only for CPU-bound problems

Sharing is bad
 Kills performance
 Race conditions
 Deadlocks

Cache locality
 RAM is a misnomer
 Class design
 Avoid GC

Natural concurrency
 Avoid Parallel.For

Act like an engineer
 Measure before and after
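"Measure before and after" can be as simple as wrapping the candidate code in a Stopwatch; a minimal sketch (the Measure helper is illustrative, not from the deck):

```csharp
using System;
using System.Diagnostics;

// Sketch: time a candidate optimisation before and after changing it.
// Measure is an illustrative helper name.
static class Measure
{
    public static double MeasureMs(Action action)
    {
        var sw = Stopwatch.StartNew();  // high-resolution timer
        action();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

Run both variants under the same conditions several times and compare the numbers; never trust intuition about cache behaviour.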
One more thing…
http://tinyurl.com/wcom-cpuscalability
Mårten Rånge
WCOM AB

@marten_range

Concurrency scalability