This presentation begins by reviewing the Task Parallel Library APIs introduced in .NET 4.0 and expanded in .NET 4.5: the Task class, Parallel.For and Parallel.ForEach, and Parallel LINQ. We then look at patterns and practices for extracting concurrency and managing dependencies, with real examples such as Levenshtein's edit-distance algorithm, the Fast Fourier Transform, and others.
2. www.devconnections.com
GARBAGE COLLECTION PERFORMANCE TIPS
AGENDA
Multicore machines have been a cheap commodity for >10 years
Adoption of concurrent programming is still slow
Patterns and best practices are scarce
We discuss the APIs first…
…and then turn to examples, best practices, and tips
TPL EVOLUTION
2008 – Incubated for 3 years as "Parallel Extensions for .NET"
2010 – Released in full glory with .NET 4.0
2012 – DataFlow in .NET 4.5 (NuGet); augmented with language support (await, async methods)
The Future
TASKS
A task is a unit of work
May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)
Much more than threads, and yet much cheaper
Task<string> t = Task.Factory.StartNew(
    () => { return DnaSimulation(…); });
t.ContinueWith(r => Show(r.Exception),
    TaskContinuationOptions.OnlyOnFaulted);
t.ContinueWith(r => Show(r.Result),
    TaskContinuationOptions.OnlyOnRanToCompletion);
DisplayProgress();

// The C# 5.0 version
try {
    var task = Task.Run(DnaSimulation);
    DisplayProgress();
    Show(await task);
}
catch (Exception ex) {
    Show(ex);
}
PARALLEL LOOPS
Ideal for parallelizing work over a collection of data
Easy porting of for and foreach loops
Beware of inter-iteration dependencies!
Parallel.For(0, 100, i => {
    ...
});
Parallel.ForEach(urls, url => {
    webClient.Post(url, options, data);
});
PARALLEL LINQ
Mind-bogglingly easy parallelization of LINQ queries
Can introduce ordering into the pipeline, or preserve the order of the original elements
var query = from monster in monsters.AsParallel()
            where monster.IsAttacking
            let newMonster = SimulateMovement(monster)
            orderby newMonster.XP
            select newMonster;
query.ForAll(monster => Move(monster));
RECURSIVE PARALLELISM EXTRACTION
Divide-and-conquer algorithms are often parallelized through the recursive call
Be careful with the parallelization threshold, and watch out for dependencies
void FFT(float[] src, float[] dst, int n, int r, int s) {
    if (n == 1) {
        dst[r] = src[r];
    } else {
        FFT(src, dst, n/2, r, s*2);
        FFT(src, dst, n/2, r+s, s*2);
        // Combine the two halves in O(n) time
    }
}

// The recursive calls, parallelized:
Parallel.Invoke(
    () => FFT(src, dst, n/2, r, s*2),
    () => FFT(src, dst, n/2, r+s, s*2)
);
SYMMETRIC DATA PROCESSING
For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution
Inter-iteration dependencies complicate things (think in-place blur)
Parallel.For(0, image.Rows, i => {
    for (int j = 0; j < image.Cols; ++j) {
        destImage.SetPixel(i, j, PixelBlur(image, i, j));
    }
});
UNEVEN WORK DISTRIBUTION
With non-uniform data items, use custom partitioning or manual distribution
Primes: 7 is easier to check than 10,320,647
var work = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(n => Task.Run(() =>
        CountPrimes(start + chunk*n, start + chunk*(n+1))));
Task.WaitAll(work.ToArray());

vs.

Parallel.ForEach(Partitioner.Create(Start, End, chunkSize),
    chunk => CountPrimes(chunk.Item1, chunk.Item2));
COMPLEX DEPENDENCY MANAGEMENT
Must extract all dependencies and incorporate them into the algorithm
Typical scenarios: 1D loops, dynamic algorithms
Edit distance: each task depends on two predecessors; a wavefront computation
C = x[i-1] == y[j-1] ? 0 : 1;
D[i, j] = min(
    D[i-1, j] + 1,
    D[i, j-1] + 1,
    D[i-1, j-1] + C);

(The wavefront sweeps the table from corner (0,0) to corner (m,n).)
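The recurrence above can be parallelized along anti-diagonals: every cell on the diagonal i + j = d depends only on diagonals d-1 and d-2, so all cells of one diagonal are independent. A minimal C# sketch of that wavefront (the class and method names are mine; production code would process blocks of cells per diagonal to amortize the Parallel.For overhead):

```csharp
using System;
using System.Threading.Tasks;

static class Wavefront {
    // Levenshtein edit distance, computed diagonal by diagonal.
    public static int EditDistance(string x, string y) {
        int m = x.Length, n = y.Length;
        var D = new int[m + 1, n + 1];
        for (int i = 0; i <= m; ++i) D[i, 0] = i;   // delete all of x
        for (int j = 0; j <= n; ++j) D[0, j] = j;   // insert all of y

        // Diagonal d holds the cells (i, j) with i + j == d; it depends
        // only on diagonals d-1 and d-2, which are already complete.
        for (int d = 2; d <= m + n; ++d) {
            int lo = Math.Max(1, d - n), hi = Math.Min(m, d - 1);
            if (lo > hi) continue;
            Parallel.For(lo, hi + 1, i => {
                int j = d - i;
                int c = x[i - 1] == y[j - 1] ? 0 : 1;
                D[i, j] = Math.Min(Math.Min(D[i - 1, j] + 1,
                                            D[i, j - 1] + 1),
                                   D[i - 1, j - 1] + c);
            });
        }
        return D[m, n];
    }
}
```

For example, EditDistance("kitten", "sitting") evaluates to 3.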
SYNCHRONIZATION > AGGREGATION
Excessive synchronization brings parallel code to its knees
Try to avoid shared state, or minimize access to it
Aggregate thread- or task-local state and merge it later
Parallel.ForEach(
    Partitioner.Create(Start, End, ChunkSize),
    () => new List<int>(),                 // initial local state
    (range, pls, localPrimes) => {         // aggregator
        for (int i = range.Item1; i < range.Item2; ++i)
            if (IsPrime(i)) localPrimes.Add(i);
        return localPrimes;
    },
    localPrimes => {                       // combiner
        lock (primes) primes.AddRange(localPrimes);
    });
CREATIVE SYNCHRONIZATION
We implement a collection of stock prices, initialized with 10⁵ name/price pairs
10⁷ reads/s, 10⁶ "update" writes/s, 10³ "add" writes/day
Many reader threads, many writer threads
GET(key):
    if safe contains key then return safe[key]
    else lock { return unsafe[key] }

PUT(key, value):
    if safe contains key then safe[key] = value
    else lock { unsafe[key] = value }
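A C# sketch of this two-tier map (all names here are mine, and how the price cell is stored is my assumption): the "safe" dictionary is never mutated after construction, so lookups and updates of the 10⁵ pre-loaded symbols never take a lock; only the ~10³/day additions of new symbols hit the locked "unsafe" dictionary. Prices live in a mutable box so an update is a single Volatile write rather than a dictionary mutation:

```csharp
using System.Collections.Generic;
using System.Threading;

class StockPrices {
    class Box { public double Price; }            // mutable price cell
    readonly Dictionary<string, Box> _safe;       // read-only after init
    readonly Dictionary<string, Box> _unsafe = new Dictionary<string, Box>();
    readonly object _lock = new object();

    public StockPrices(IDictionary<string, double> initial) {
        _safe = new Dictionary<string, Box>();
        foreach (var kv in initial)
            _safe[kv.Key] = new Box { Price = kv.Value };
    }

    public bool TryGet(string symbol, out double price) {
        Box box;
        if (_safe.TryGetValue(symbol, out box)) { // lock-free fast path
            price = Volatile.Read(ref box.Price);
            return true;
        }
        lock (_lock) {                            // rare slow path
            if (_unsafe.TryGetValue(symbol, out box)) {
                price = box.Price;
                return true;
            }
        }
        price = 0;
        return false;
    }

    public void Put(string symbol, double price) {
        Box box;
        if (_safe.TryGetValue(symbol, out box)) { // update: no lock
            Volatile.Write(ref box.Price, price);
            return;
        }
        lock (_lock) {                            // add: locked, rare
            if (_unsafe.TryGetValue(symbol, out box)) box.Price = price;
            else _unsafe[symbol] = new Box { Price = price };
        }
    }
}
```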
LOCK-FREE PATTERNS (1)
Try to avoid Windows synchronization and use hardware synchronization instead
Primitive operations such as Interlocked.Increment and Interlocked.CompareExchange
A retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms
int InterlockedMultiply(ref int x, int y) {
    int t, r;
    do {
        t = x;
        r = t * y;
    } while (Interlocked.CompareExchange(ref x, r, t) != t);
    return r;
}

(CompareExchange(ref location, NewValue, Comparand) stores NewValue only if location equals Comparand, and returns the OldValue it observed.)
LOCK-FREE PATTERNS (2)
User-mode spinlocks (the SpinLock class) can replace locks that you acquire very often and that protect tiny computations
class __DontUseMe__SpinLock {
    private int _lck;
    public void Enter() {
        while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0) ;
    }
    public void Exit() {
        Thread.MemoryBarrier(); // flush protected writes before releasing
        _lck = 0;
    }
}
MISCELLANEOUS TIPS (1)
Don’t mix several concurrency frameworks in the same process
Some parallel work is best organized in pipelines – TPL DataFlow
BroadcastBlock<Uri> → TransformBlock<Uri, byte[]> → TransformBlock<byte[], string> → ActionBlock<string>
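The four-stage pipeline above can be wired up with a few LinkTo calls. A runnable sketch (the class and stage names are mine, and the download stage is simulated so the example needs no network; a real pipeline would fetch the Uri with HttpClient inside the first TransformBlock):

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet: System.Threading.Tasks.Dataflow

static class CrawlerPipeline {
    // Posts each Uri into the pipeline and returns the decoded pages.
    public static async Task<List<string>> RunAsync(IEnumerable<Uri> uris) {
        var results = new List<string>();

        var broadcast = new BroadcastBlock<Uri>(uri => uri);
        var download  = new TransformBlock<Uri, byte[]>(          // simulated download
            uri => Encoding.UTF8.GetBytes("<html>" + uri + "</html>"));
        var decode    = new TransformBlock<byte[], string>(
            bytes => Encoding.UTF8.GetString(bytes));
        var sink      = new ActionBlock<string>(html => results.Add(html));

        var opts = new DataflowLinkOptions { PropagateCompletion = true };
        broadcast.LinkTo(download, opts);
        download.LinkTo(decode, opts);
        decode.LinkTo(sink, opts);

        foreach (var uri in uris) broadcast.Post(uri);
        broadcast.Complete();          // completion flows down the links
        await sink.Completion;
        return results;
    }
}
```

PropagateCompletion lets a single Complete() on the head block drain and shut down the whole pipeline in order.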
MISCELLANEOUS TIPS (2)
Some parallel work can be offloaded to the GPU – C++ AMP
void vadd_exp(float* x, float* y, float* z, int n) {
    array_view<const float, 1> avX(n, x), avY(n, y);
    array_view<float, 1> avZ(n, z);
    avZ.discard_data();
    parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
        avZ[i] = avX[i] + fast_math::exp(avY[i]);
    });
    avZ.synchronize();
}
MISCELLANEOUS TIPS (3)
Invest in SIMD parallelization of heavy math or data-parallel algorithms
Make sure to take cache effects into account, especially on MP systems
START:
    movups xmm0, [esi+4*ecx]   ; load 4 floats from source A
    addps  xmm0, [edi+4*ecx]   ; add 4 floats from source B
    movups [ebx+4*ecx], xmm0   ; store 4 results
    sub    ecx, 4
    jns    START
SUMMARY
Avoid shared state and synchronization
Parallelize judiciously and apply thresholds
Measure and understand performance gains or losses
Concurrency and parallelism are still hard
A body of best practices, tips, patterns, and examples is being built