This document discusses strategies for task and data parallelism in .NET. It begins with an overview of tasks and how they provide a cheaper alternative to threads for parallelizing work. Various APIs for parallelism are covered, including Parallel Loops, PLINQ, and task continuations. Best practices are provided around uneven work distribution, dependency management, minimizing synchronization, and leveraging lock-free and dataflow patterns. The document concludes with tips on profiling, SIMD, and GPU parallelism.
2. Agenda
• Multicore machines have been a cheap commodity for >10 years
• Adoption of concurrent programming is still slow
• Patterns and best practices are scarce
• We discuss the APIs first…
• …and then turn to examples, best practices, and tips
3. TPL Evolution
• 2008: Incubated for 3 years as “Parallel Extensions for .NET”
• 2010: Released in full glory with .NET 4.0
• 2012: DataFlow in .NET 4.5 (NuGet), augmented with language support (await, async methods)
• The future: GPU parallelism? SIMD support? Language-level parallelism?
4. Tasks
• A task is a unit of work
– May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)
– Much more than threads, and yet much cheaper
Task<string> t = Task.Factory.StartNew(
    () => { return DnaSimulation(…); });
t.ContinueWith(r => Show(r.Exception),
    TaskContinuationOptions.OnlyOnFaulted);
t.ContinueWith(r => Show(r.Result),
    TaskContinuationOptions.OnlyOnRanToCompletion);
DisplayProgress();

try { //The C# 5.0 version
    var task = Task.Run(DnaSimulation);
    DisplayProgress();
    Show(await task);
} catch (Exception ex) {
    Show(ex);
}
5. Parallel Loops
• Ideal for parallelizing work over a collection of data
• Easy porting of for and foreach loops
– Beware of inter-iteration dependencies!
Parallel.For(0, 100, i => {
...
});
Parallel.ForEach(urls, url => {
webClient.Post(url, options, data);
});
6. Parallel LINQ
• Mind-bogglingly easy parallelization of LINQ queries
• Can introduce ordering into the pipeline, or preserve the order of the original elements
var query = from monster in monsters.AsParallel()
            where monster.IsAttacking
            let newMonster = SimulateMovement(monster)
            orderby newMonster.XP
            select newMonster;
query.ForAll(monster => Move(monster));
8. Recursive Parallelism Extraction
• Divide-and-conquer algorithms are often parallelized through the recursive call
– Be careful with the parallelization threshold and watch out for dependencies
void FFT(float[] src, float[] dst, int n, int r, int s) {
    if (n == 1) {
        dst[r] = src[r];
    } else {
        FFT(src, dst, n/2, r, s*2);
        FFT(src, dst, n/2, r+s, s*2);
        //Combine the two halves in O(n) time
    }
}

//Parallelizing the two recursive calls:
Parallel.Invoke(
    () => FFT(src, dst, n/2, r, s*2),
    () => FFT(src, dst, n/2, r+s, s*2)
);
10. Symmetric Data Processing
• For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution
• Inter-iteration dependencies complicate things (think in-place blur)
Parallel.For(0, image.Rows, i => {
for (int j = 0; j < image.Cols; ++j) {
destImage.SetPixel(i, j, PixelBlur(image, i, j));
}
});
11. Uneven Work Distribution
• With non-uniform data items, use custom partitioning or manual distribution
– Primes: 7 is easier to check than 10,320,647
var work = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(n => Task.Run(() =>
        CountPrimes(start + chunk*n, start + chunk*(n+1))));
Task.WaitAll(work.ToArray());

versus

Parallel.ForEach(Partitioner.Create(Start, End, chunkSize),
    chunk => CountPrimes(chunk.Item1, chunk.Item2)
);
15. Synchronization > Aggregation
• Excessive synchronization brings parallel code to its knees
– Try to avoid shared state
– Aggregate thread- or task-local state and merge

Parallel.ForEach(
    Partitioner.Create(Start, End, ChunkSize),
    () => new List<int>(),            //initial local state
    (range, pls, localPrimes) => {    //aggregator
        for (int i = range.Item1; i < range.Item2; ++i)
            if (IsPrime(i)) localPrimes.Add(i);
        return localPrimes;
    },
    localPrimes => { lock (primes)    //combiner
        primes.AddRange(localPrimes);
    });
17. Creative Synchronization
• We implement a collection of stock prices, initialized with 10^5 name/price pairs
– 10^7 reads/s, 10^6 “update” writes/s, 10^3 “add” writes/day
– Many reader threads, many writer threads

GET(key):
    if safe contains key then return safe[key]
    lock { return unsafe[key] }

PUT(key, value):
    if safe contains key then safe[key] = value
    else lock { unsafe[key] = value }
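The GET/PUT pseudocode above can be sketched in C# roughly as follows. The class and member names here are assumptions, not from the talk: the "safe" dictionary holds the ~10^5 keys known at initialization and is never structurally modified afterwards, so it can be read without a lock; prices live in mutable cells so updates never touch the dictionary itself, and prices are stored as int cents because int writes are atomic in the CLR. Only the rare "add" path (new keys) takes the lock.

```csharp
using System;
using System.Collections.Generic;

class StockPrices {
    sealed class PriceCell { public int Cents; }

    // Fixed key set after construction, so lock-free reads are safe
    readonly Dictionary<string, PriceCell> _safe = new Dictionary<string, PriceCell>();
    // Keys added after startup; rare, so a lock is acceptable here
    readonly Dictionary<string, int> _unsafe = new Dictionary<string, int>();
    readonly object _lock = new object(); // guards _unsafe only

    public StockPrices(IEnumerable<KeyValuePair<string, int>> initial) {
        foreach (var kv in initial)
            _safe[kv.Key] = new PriceCell { Cents = kv.Value };
    }

    public int Get(string key) {
        PriceCell cell;
        if (_safe.TryGetValue(key, out cell)) return cell.Cents; // hot path, no lock
        lock (_lock) return _unsafe[key];                        // keys added later
    }

    public void Put(string key, int cents) {
        PriceCell cell;
        if (_safe.TryGetValue(key, out cell)) { cell.Cents = cents; return; } // no lock
        lock (_lock) _unsafe[key] = cents;                       // rare "add" write
    }
}
```

The trick is that the common case (keys present at startup) never synchronizes; the lock only pays for the thousand-per-day additions.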
18. Lock-Free Patterns (1)
• Try to avoid Windows synchronization and use hardware synchronization
– Primitive operations such as Interlocked.Increment, Interlocked.CompareExchange
– The retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms
int InterlockedMultiply(ref int x, int y) {
    int t, r;
    do {
        t = x;     //old value
        r = t * y; //new value
    } while (Interlocked.CompareExchange(ref x, r, t) != t); //t is the comparand
    return r;
}
19. Lock-Free Patterns (2)
• User-mode spinlocks (the SpinLock class) can replace locks that you acquire very often and that protect tiny computations
class __DontUseMe__SpinLock {
private volatile int _lck;
public void Enter() {
while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0);
}
public void Exit() {
_lck = 0;
}
}
20. Miscellaneous Tips (1)
• Don’t mix several concurrency frameworks in the same process
• Some parallel work is best organized in pipelines – TPL DataFlow
BroadcastBlock<Uri> → TransformBlock<Uri, byte[]> → TransformBlock<byte[], string> → ActionBlock<string>
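The pipeline above can be sketched with TPL DataFlow (NuGet package System.Threading.Tasks.Dataflow). The two transform stages here are stand-ins, not from the talk: a real pipeline would download each Uri and parse the bytes, but faking them with a UTF-8 round trip keeps the sketch self-contained.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text;
using System.Threading.Tasks.Dataflow; // NuGet: System.Threading.Tasks.Dataflow

class DataflowPipeline {
    public static ConcurrentQueue<string> Run(Uri[] uris) {
        var results = new ConcurrentQueue<string>();

        var broadcast = new BroadcastBlock<Uri>(u => u);
        var download = new TransformBlock<Uri, byte[]>(
            u => Encoding.UTF8.GetBytes(u.AbsoluteUri)); // stand-in for an HTTP fetch
        var parse = new TransformBlock<byte[], string>(
            bytes => Encoding.UTF8.GetString(bytes));    // stand-in for parsing
        var sink = new ActionBlock<string>(s => results.Enqueue(s));

        // Link the blocks and let completion flow down the chain
        var opts = new DataflowLinkOptions { PropagateCompletion = true };
        broadcast.LinkTo(download, opts);
        download.LinkTo(parse, opts);
        parse.LinkTo(sink, opts);

        foreach (var u in uris) broadcast.Post(u);
        broadcast.Complete();       // no more input
        sink.Completion.Wait();     // wait for the pipeline to drain
        return results;
    }
}
```

Each block runs its delegate on thread-pool tasks, so the stages overlap automatically once more than one item is in flight.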
21. Miscellaneous Tips (2)
• Some parallel work can be offloaded to the GPU – C++ AMP
void vadd_exp(float* x, float* y, float* z, int n) {
    array_view<const float, 1> avX(n, x), avY(n, y);
    array_view<float, 1> avZ(n, z);
    avZ.discard_data();
    parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
        avZ[i] = avX[i] + fast_math::exp(avY[i]);
    });
    avZ.synchronize();
}
22. Miscellaneous Tips (3)
• Invest in SIMD parallelization of heavy math or data-parallel algorithms
– Already available on Mono (Mono.Simd)
• Make sure to take cache effects into account, especially on MP systems
START:
    movups xmm0, [esi+4*ecx]   ; load 4 floats
    addps  xmm0, [edi+4*ecx]   ; add 4 floats at once
    movups [ebx+4*ecx], xmm0   ; store 4 floats
    sub    ecx, 4
    jns    START
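The SSE loop above (adding two float arrays four lanes at a time) can be sketched in managed code with the portable Vector<T> API from System.Numerics, the successor to Mono.Simd; the JIT compiles the vector add to instructions like addps where the hardware supports them. The helper name is an assumption for illustration.

```csharp
using System;
using System.Numerics; // NuGet: System.Numerics.Vectors on older frameworks

static class SimdAdd {
    public static void Add(float[] a, float[] b, float[] dst) {
        int i = 0, w = Vector<float>.Count;  // e.g. 4 lanes with SSE, 8 with AVX
        for (; i <= a.Length - w; i += w) {
            // One vector load per operand, one vector add, one vector store
            var v = new Vector<float>(a, i) + new Vector<float>(b, i);
            v.CopyTo(dst, i);
        }
        for (; i < a.Length; ++i)            // scalar tail for the leftovers
            dst[i] = a[i] + b[i];
    }
}
```

Unlike the hand-written assembly, this version also handles array lengths that are not a multiple of the vector width.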
23. Summary
• Avoid shared state and synchronization
• Parallelize judiciously and apply thresholds
• Measure and understand performance gains or losses
• Concurrency and parallelism are still hard
• A body of best practices, tips, patterns, and examples is being built