This document discusses strategies for task and data parallelism in .NET. It begins with an overview of tasks and how they provide a cheaper alternative to threads for parallelizing work. Various APIs for parallelism are covered, including Parallel Loops, PLINQ, and task continuations. Best practices are provided around uneven work distribution, dependency management, minimizing synchronization, and leveraging lock-free and dataflow patterns. The document concludes with tips on profiling, SIMD, and GPU parallelism.
2. Agenda
• Multicore machines have been a cheap commodity for >10 years
• Adoption of concurrent programming is still slow
• Patterns and best practices are scarce
• We discuss the APIs first…
• …and then turn to examples, best practices, and tips
3. TPL Evolution
• 2008: Incubated for 3 years as “Parallel Extensions for .NET”
• 2010: Released in full glory with .NET 4.0
• 2012: DataFlow in .NET 4.5 (NuGet), augmented with language support (await, async methods)
• The future: GPU parallelism? SIMD support? Language-level parallelism?
4. Tasks
• A task is a unit of work
– May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)
– Much more than threads, and yet much cheaper
Task<string> t = Task.Factory.StartNew(
    () => { return DnaSimulation(…); });
t.ContinueWith(r => Show(r.Exception),
    TaskContinuationOptions.OnlyOnFaulted);
t.ContinueWith(r => Show(r.Result),
    TaskContinuationOptions.OnlyOnRanToCompletion);
DisplayProgress();

try { //The C# 5.0 version
    var task = Task.Run(DnaSimulation);
    DisplayProgress();
    Show(await task);
} catch (Exception ex) {
    Show(ex);
}
5. Parallel Loops
• Ideal for parallelizing work over a collection of data
• Easy porting of for and foreach loops
– Beware of inter-iteration dependencies!
Parallel.For(0, 100, i => {
...
});
Parallel.ForEach(urls, url => {
webClient.Post(url, options, data);
});
6. Parallel LINQ
• Mind-bogglingly easy parallelization of LINQ queries
• Can introduce ordering into the pipeline, or preserve the order of the original elements
var query = from monster in monsters.AsParallel()
            where monster.IsAttacking
            let newMonster = SimulateMovement(monster)
            orderby newMonster.XP
            select newMonster;
query.ForAll(monster => Move(monster));
8. Recursive Parallelism Extraction
• Divide-and-conquer algorithms are often parallelized through the recursive call
– Be careful with the parallelization threshold and watch out for dependencies
void FFT(float[] src, float[] dst, int n, int r, int s) {
    if (n == 1) {
        dst[r] = src[r];
    } else {
        FFT(src, dst, n/2, r, s*2);
        FFT(src, dst, n/2, r+s, s*2);
        //Combine the two halves in O(n) time
    }
}

//Parallelizing the two recursive calls:
Parallel.Invoke(
    () => FFT(src, dst, n/2, r, s*2),
    () => FFT(src, dst, n/2, r+s, s*2)
);
10. Symmetric Data Processing
• For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution
• Inter-iteration dependencies complicate things (think in-place blur)
Parallel.For(0, image.Rows, i => {
for (int j = 0; j < image.Cols; ++j) {
destImage.SetPixel(i, j, PixelBlur(image, i, j));
}
});
11. Uneven Work Distribution
• With non-uniform data items, use custom partitioning or manual distribution
– Primes: 7 is easier to check than 10,320,647
var work = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(n => Task.Run(() =>
        CountPrimes(start + chunk*n, start + chunk*(n+1))));
Task.WaitAll(work.ToArray());

versus

Parallel.ForEach(Partitioner.Create(Start, End, chunkSize),
    chunk => CountPrimes(chunk.Item1, chunk.Item2)
);
15. Synchronization > Aggregation
• Excessive synchronization brings parallel code to its knees
– Try to avoid shared state
– Aggregate thread- or task-local state and merge

Parallel.ForEach(
    Partitioner.Create(Start, End, ChunkSize),
    () => new List<int>(),            //initial local state
    (range, pls, localPrimes) => {    //aggregator
        for (int i = range.Item1; i < range.Item2; ++i)
            if (IsPrime(i)) localPrimes.Add(i);
        return localPrimes;
    },
    localPrimes => { lock (primes)    //combiner
        primes.AddRange(localPrimes);
    });
17. Creative Synchronization
• We implement a collection of stock prices, initialized with 10^5 name/price pairs
– 10^7 reads/s, 10^6 “update” writes/s, 10^3 “add” writes/day
– Many reader threads, many writer threads

GET(key):
    if safe contains key then return safe[key]
    lock { return unsafe[key] }

PUT(key, value):
    if safe contains key then safe[key] = value
    else lock { unsafe[key] = value }
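The GET/PUT pseudocode above can be sketched in C# roughly as follows. The class and member names here are assumptions, not from the talk: the "safe" dictionary holds the ~10^5 keys known at initialization and is never structurally modified afterwards, so it can be read without a lock; prices live in mutable cells so updates never touch the dictionary itself, and prices are stored as int cents because int writes are atomic in the CLR. Only the rare "add" path (new keys) takes the lock.

```csharp
using System;
using System.Collections.Generic;

class StockPrices {
    sealed class PriceCell { public int Cents; }

    // Fixed key set after construction, so lock-free reads are safe
    readonly Dictionary<string, PriceCell> _safe = new Dictionary<string, PriceCell>();
    // Keys added after startup; rare, so a lock is acceptable here
    readonly Dictionary<string, int> _unsafe = new Dictionary<string, int>();
    readonly object _lock = new object(); // guards _unsafe only

    public StockPrices(IEnumerable<KeyValuePair<string, int>> initial) {
        foreach (var kv in initial)
            _safe[kv.Key] = new PriceCell { Cents = kv.Value };
    }

    public int Get(string key) {
        PriceCell cell;
        if (_safe.TryGetValue(key, out cell)) return cell.Cents; // hot path, no lock
        lock (_lock) return _unsafe[key];                        // keys added later
    }

    public void Put(string key, int cents) {
        PriceCell cell;
        if (_safe.TryGetValue(key, out cell)) { cell.Cents = cents; return; } // no lock
        lock (_lock) _unsafe[key] = cents;                       // rare "add" write
    }
}
```

The trick is that the common case (keys present at startup) never synchronizes; the lock only pays for the thousand-per-day additions.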
18. Lock-Free Patterns (1)
• Try to avoid Windows synchronization and use hardware synchronization
– Primitive operations such as Interlocked.Increment, Interlocked.CompareExchange
– The retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms
int InterlockedMultiply(ref int x, int y) {
    int t, r;
    do {
        t = x;     //old value
        r = t * y; //new value
    } while (Interlocked.CompareExchange(ref x, r, t) != t); //t is the comparand
    return r;
}
19. Lock-Free Patterns (2)
• User-mode spinlocks (the SpinLock class) can replace locks that you acquire very often and that protect tiny computations
class __DontUseMe__SpinLock {
private volatile int _lck;
public void Enter() {
while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0);
}
public void Exit() {
_lck = 0;
}
}
20. Miscellaneous Tips (1)
• Don’t mix several concurrency frameworks in the same process
• Some parallel work is best organized in pipelines – TPL DataFlow
BroadcastBlock<Uri> → TransformBlock<Uri, byte[]> → TransformBlock<byte[], string> → ActionBlock<string>
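The pipeline above can be sketched with TPL DataFlow (NuGet package System.Threading.Tasks.Dataflow). The two transform stages here are stand-ins, not from the talk: a real pipeline would download each Uri and parse the bytes, but faking them with a UTF-8 round trip keeps the sketch self-contained.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text;
using System.Threading.Tasks.Dataflow; // NuGet: System.Threading.Tasks.Dataflow

class DataflowPipeline {
    public static ConcurrentQueue<string> Run(Uri[] uris) {
        var results = new ConcurrentQueue<string>();

        var broadcast = new BroadcastBlock<Uri>(u => u);
        var download = new TransformBlock<Uri, byte[]>(
            u => Encoding.UTF8.GetBytes(u.AbsoluteUri)); // stand-in for an HTTP fetch
        var parse = new TransformBlock<byte[], string>(
            bytes => Encoding.UTF8.GetString(bytes));    // stand-in for parsing
        var sink = new ActionBlock<string>(s => results.Enqueue(s));

        // Link the blocks and let completion flow down the chain
        var opts = new DataflowLinkOptions { PropagateCompletion = true };
        broadcast.LinkTo(download, opts);
        download.LinkTo(parse, opts);
        parse.LinkTo(sink, opts);

        foreach (var u in uris) broadcast.Post(u);
        broadcast.Complete();       // no more input
        sink.Completion.Wait();     // wait for the pipeline to drain
        return results;
    }
}
```

Each block runs its delegate on thread-pool tasks, so the stages overlap automatically once more than one item is in flight.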
21. Miscellaneous Tips (2)
• Some parallel work can be offloaded to the GPU – C++ AMP
void vadd_exp(float* x, float* y, float* z, int n) {
    array_view<const float, 1> avX(n, x), avY(n, y);
    array_view<float, 1> avZ(n, z);
    avZ.discard_data();
    parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
        avZ[i] = avX[i] + fast_math::exp(avY[i]);
    });
    avZ.synchronize();
}
22. Miscellaneous Tips (3)
• Invest in SIMD parallelization of heavy math or data-parallel algorithms
– Already available on Mono (Mono.Simd)
• Make sure to take cache effects into account, especially on MP systems
START:
    movups xmm0, [esi+4*ecx]   ; load 4 floats
    addps  xmm0, [edi+4*ecx]   ; add 4 floats at once
    movups [ebx+4*ecx], xmm0   ; store 4 floats
    sub    ecx, 4
    jns    START
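The SSE loop above (adding two float arrays four lanes at a time) can be sketched in managed code with the portable Vector<T> API from System.Numerics, the successor to Mono.Simd; the JIT compiles the vector add to instructions like addps where the hardware supports them. The helper name is an assumption for illustration.

```csharp
using System;
using System.Numerics; // NuGet: System.Numerics.Vectors on older frameworks

static class SimdAdd {
    public static void Add(float[] a, float[] b, float[] dst) {
        int i = 0, w = Vector<float>.Count;  // e.g. 4 lanes with SSE, 8 with AVX
        for (; i <= a.Length - w; i += w) {
            // One vector load per operand, one vector add, one vector store
            var v = new Vector<float>(a, i) + new Vector<float>(b, i);
            v.CopyTo(dst, i);
        }
        for (; i < a.Length; ++i)            // scalar tail for the leftovers
            dst[i] = a[i] + b[i];
    }
}
```

Unlike the hand-written assembly, this version also handles array lengths that are not a multiple of the vector width.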
23. Summary
• Avoid shared state and synchronization
• Parallelize judiciously and apply thresholds
• Measure and understand performance gains or losses
• Concurrency and parallelism are still hard
• A body of best practices, tips, patterns, and examples is being built