
Task and Data Parallelism


Presentation from DevWeek 2014 on task and data parallelism. This session explains the TPL APIs and then covers various scenarios for extracting concurrency, reducing synchronization, putting thresholds on parallelization, and other topics.


  1. Task and Data Parallelism (Sasha Goldshtein, CTO, Sela Group)
  2. Agenda
     • Multicore machines have been a cheap commodity for >10 years
     • Adoption of concurrent programming is still slow
     • Patterns and best practices are scarce
     • We discuss the APIs first…
     • …and then turn to examples, best practices, and tips
  3. TPL Evolution
     • 2008: Incubated for 3 years as "Parallel Extensions for .NET"
     • 2010: Released in full glory with .NET 4.0
     • 2012: DataFlow in .NET 4.5 (NuGet), augmented with language support (await, async methods)
     • The future: GPU parallelism? SIMD support? Language-level parallelism?
  4. Tasks
     • A task is a unit of work
       – May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)
       – Much more than threads, and yet much cheaper
     Task<string> t = Task.Factory.StartNew(() => { return DnaSimulation(…); });
     t.ContinueWith(r => Show(r.Exception),
         TaskContinuationOptions.OnlyOnFaulted);
     t.ContinueWith(r => Show(r.Result),
         TaskContinuationOptions.OnlyOnRanToCompletion);
     DisplayProgress();

     try {
         //The C# 5.0 version
         var task = Task.Run(DnaSimulation);
         DisplayProgress();
         Show(await task);
     } catch (Exception ex) {
         Show(ex);
     }
  5. Parallel Loops
     • Ideal for parallelizing work over a collection of data
     • Easy porting of for and foreach loops
       – Beware of inter-iteration dependencies!
     Parallel.For(0, 100, i => { ... });
     Parallel.ForEach(urls, url => {
         webClient.Post(url, options, data);
     });
  6. Parallel LINQ
     • Mind-bogglingly easy parallelization of LINQ queries
     • Can introduce ordering into the pipeline, or preserve the order of the original elements
     var query = from monster in monsters.AsParallel()
                 where monster.IsAttacking
                 let newMonster = SimulateMovement(monster)
                 orderby newMonster.XP
                 select newMonster;
     query.ForAll(monster => Move(monster));
  7. Measuring Concurrency
     • The Visual Studio Concurrency Visualizer to the rescue
  8. Recursive Parallelism Extraction
     • Divide-and-conquer algorithms are often parallelized through the recursive call
       – Be careful with the parallelization threshold, and watch out for dependencies
     void FFT(float[] src, float[] dst, int n, int r, int s) {
         if (n == 1) {
             dst[r] = src[r];
         } else {
             FFT(src, dst, n/2, r, s*2);
             FFT(src, dst, n/2, r+s, s*2);
             //Combine the two halves in O(n) time
         }
     }
     //Parallelizing the two recursive calls:
     Parallel.Invoke(
         () => FFT(src, dst, n/2, r, s*2),
         () => FFT(src, dst, n/2, r+s, s*2)
     );
  9. DEMO: Recursive parallel QuickSort
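A sketch of what the demo's recursive parallel QuickSort might look like, following the slide's advice to apply a parallelization threshold. The cutoff value and helper names are assumptions, not the demo's actual code; below the threshold it falls back to sequential Array.Sort.

```csharp
using System;
using System.Threading.Tasks;

static class Demo {
    const int Threshold = 4096; // assumed cutoff; tune by measurement

    // Sorts a[lo..hi) — hi is exclusive.
    public static void ParallelQuickSort(int[] a, int lo, int hi) {
        if (hi - lo <= 1) return;
        if (hi - lo < Threshold) {
            Array.Sort(a, lo, hi - lo); // sequential below the threshold
            return;
        }
        int p = Partition(a, lo, hi);
        Parallel.Invoke( // recurse on the two halves in parallel
            () => ParallelQuickSort(a, lo, p),
            () => ParallelQuickSort(a, p + 1, hi));
    }

    static int Partition(int[] a, int lo, int hi) {
        int pivot = a[hi - 1], i = lo; // Lomuto partition around the last element
        for (int j = lo; j < hi - 1; ++j)
            if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; ++i; }
        int tmp = a[i]; a[i] = a[hi - 1]; a[hi - 1] = tmp;
        return i;
    }
}
```

Without the threshold, the leaf calls would spawn far more tasks than cores and the scheduling overhead would dominate the sort itself.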
  10. Symmetric Data Processing
      • For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution
      • Inter-iteration dependencies complicate things (think in-place blur)
      Parallel.For(0, image.Rows, i => {
          for (int j = 0; j < image.Cols; ++j) {
              destImage.SetPixel(i, j, PixelBlur(image, i, j));
          }
      });
  11. Uneven Work Distribution
      • With non-uniform data items, use custom partitioning or manual distribution
        – Primes: 7 is easier to check than 10,320,647
      var work = Enumerable.Range(0, Environment.ProcessorCount)
          .Select(n => Task.Run(() =>
              CountPrimes(start + chunk*n, start + chunk*(n+1))));
      Task.WaitAll(work.ToArray());
      …versus…
      Parallel.ForEach(Partitioner.Create(Start, End, chunkSize),
          chunk => CountPrimes(chunk.Item1, chunk.Item2));
  12. DEMO: Uneven workload distribution
  13. Complex Dependency Management
      • Must extract all dependencies and incorporate them into the algorithm
        – Typical scenarios: 1D loops, dynamic algorithms
        – Edit distance: each task depends on 2 predecessors; the computation proceeds as a wavefront from (0,0) to (m,n)
      C = x[i-1] == y[j-1] ? 0 : 1;
      D[i, j] = min(
          D[i-1, j] + 1,
          D[i, j-1] + 1,
          D[i-1, j-1] + C);
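The wavefront idea above can be sketched as follows: cells on the same anti-diagonal (i + j constant) have no dependencies on each other, so each diagonal can be computed in parallel once the previous diagonals are done. This is a minimal illustration, not the session's demo code; a per-cell Parallel.For is far too fine-grained for real speedups, where you would process blocks of each diagonal instead.

```csharp
using System;
using System.Threading.Tasks;

static class Wavefront {
    public static int EditDistance(string x, string y) {
        int m = x.Length, n = y.Length;
        var D = new int[m + 1, n + 1];
        for (int i = 0; i <= m; ++i) D[i, 0] = i; // delete all of x's prefix
        for (int j = 0; j <= n; ++j) D[0, j] = j; // insert all of y's prefix
        // Sweep anti-diagonals d = i + j; cells within one diagonal are independent.
        for (int d = 2; d <= m + n; ++d) {
            int lo = Math.Max(1, d - n), hi = Math.Min(m, d - 1);
            Parallel.For(lo, hi + 1, i => {
                int j = d - i;
                int c = x[i - 1] == y[j - 1] ? 0 : 1;
                D[i, j] = Math.Min(Math.Min(D[i - 1, j] + 1, D[i, j - 1] + 1),
                                   D[i - 1, j - 1] + c);
            });
        }
        return D[m, n];
    }
}
```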
  14. DEMO: Dependency management
  15. Synchronization > Aggregation
      • Excessive synchronization brings parallel code to its knees
        – Try to avoid shared state
        – Aggregate thread- or task-local state and merge
      Parallel.ForEach(
          Partitioner.Create(Start, End, ChunkSize),
          () => new List<int>(), //initial local state
          (range, pls, localPrimes) => { //aggregator
              for (int i = range.Item1; i < range.Item2; ++i)
                  if (IsPrime(i)) localPrimes.Add(i);
              return localPrimes;
          },
          localPrimes => { //combiner
              lock (primes)
                  primes.AddRange(localPrimes);
          });
  16. DEMO: Aggregation
  17. Creative Synchronization
      • We implement a collection of stock prices, initialized with 10^5 name/price pairs
        – 10^7 reads/s, 10^6 "update" writes/s, 10^3 "add" writes/day
        – Many reader threads, many writer threads
      GET(key):
          if safe contains key then return safe[key]
          lock { return unsafe[key] }
      PUT(key, value):
          if safe contains key then safe[key] = value
          lock { unsafe[key] = value }
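One way to realize the GET/PUT pseudocode above, sketched under stated assumptions: the "safe" dictionary's key set is frozen after construction, so concurrent lock-free reads of it are safe, and updates to existing keys only overwrite a boxed value; only the rare "add" path touches the locked "unsafe" dictionary. The class and member names are invented for illustration. (Note: writes to a double field are not guaranteed atomic on 32-bit platforms; a real implementation would need to account for that.)

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;

class StockPrices {
    // Key set frozen at construction → reads never race with structural changes.
    private readonly Dictionary<string, StrongBox<double>> _safe;
    // New symbols (10^3/day) land here, under a lock.
    private readonly Dictionary<string, double> _unsafe = new Dictionary<string, double>();
    private readonly object _sync = new object();

    public StockPrices(IDictionary<string, double> initial) {
        _safe = initial.ToDictionary(kv => kv.Key, kv => new StrongBox<double>(kv.Value));
    }

    public double Get(string key) {
        StrongBox<double> box;
        if (_safe.TryGetValue(key, out box)) return box.Value; // lock-free read
        lock (_sync) return _unsafe[key];
    }

    public void Put(string key, double value) {
        StrongBox<double> box;
        if (_safe.TryGetValue(key, out box)) { box.Value = value; return; } // lock-free update
        lock (_sync) _unsafe[key] = value; // rare "add" write
    }
}
```

The design matches the workload: the 10^7 reads/s and 10^6 update writes/s never take a lock, and only the thousand-per-day additions contend on `_sync`.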
  18. Lock-Free Patterns (1)
      • Try to avoid Windows synchronization and use hardware synchronization
        – Primitive operations such as Interlocked.Increment, Interlocked.CompareExchange
        – The retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms
      int InterlockedMultiply(ref int x, int y) {
          int t, r;
          do {
              t = x;
              r = t * y;
          //CompareExchange(ref location, newValue, comparand)
          //stores newValue only if location still equals comparand (the old value)
          } while (Interlocked.CompareExchange(ref x, r, t) != t);
          return r;
      }
  19. Lock-Free Patterns (2)
      • User-mode spinlocks (the SpinLock class) can replace locks you acquire very often, which protect tiny computations
      class __DontUseMe__SpinLock {
          private volatile int _lck;
          public void Enter() {
              while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0);
          }
          public void Exit() {
              _lck = 0;
          }
      }
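The do-not-use class above illustrates the mechanism; in practice you would use the BCL's System.Threading.SpinLock, which adds backoff and correctness niceties. A minimal usage sketch with the required lockTaken pattern (the counter workload is invented for illustration):

```csharp
using System.Threading;
using System.Threading.Tasks;

class Program {
    static void Main() {
        var spinLock = new SpinLock(enableThreadOwnerTracking: false);
        int counter = 0;
        Parallel.For(0, 100000, i => {
            bool lockTaken = false;
            try {
                spinLock.Enter(ref lockTaken);
                ++counter; // tiny critical section: spinning beats blocking here
            } finally {
                if (lockTaken) spinLock.Exit();
            }
        });
        System.Console.WriteLine(counter); // 100000
    }
}
```

Note that SpinLock is a mutable struct: never copy it (e.g. into a readonly field or a local per iteration), or each copy spins on its own state.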
  20. Miscellaneous Tips (1)
      • Don't mix several concurrency frameworks in the same process
      • Some parallel work is best organized in pipelines: TPL DataFlow
        BroadcastBlock<Uri> → TransformBlock<Uri, byte[]> → TransformBlock<byte[], string> → ActionBlock<string>
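The pipeline shape above can be wired up as follows. The block types and stage order come from the slide; the stage delegates are placeholders (the first transform stands in for an actual download so the sketch runs offline).

```csharp
using System;
using System.Text;
using System.Threading.Tasks.Dataflow;

class Pipeline {
    static void Main() {
        var broadcast = new BroadcastBlock<Uri>(uri => uri);
        var fetch = new TransformBlock<Uri, byte[]>(
            uri => Encoding.UTF8.GetBytes(uri.ToString())); // stand-in for a real download
        var decode = new TransformBlock<byte[], string>(
            bytes => Encoding.UTF8.GetString(bytes));
        var sink = new ActionBlock<string>(text => Console.WriteLine(text));

        // Link the stages; completion flows downstream automatically.
        var opts = new DataflowLinkOptions { PropagateCompletion = true };
        broadcast.LinkTo(fetch, opts);
        fetch.LinkTo(decode, opts);
        decode.LinkTo(sink, opts);

        broadcast.Post(new Uri("http://example.org/"));
        broadcast.Complete();
        sink.Completion.Wait(); // drain the pipeline
    }
}
```

Each block runs its delegate on thread-pool tasks, so the stages overlap naturally; BroadcastBlock lets additional consumers (say, a logger) observe every Uri without disturbing the main path.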
  21. Miscellaneous Tips (2)
      • Some parallel work can be offloaded to the GPU: C++ AMP
      void vadd_exp(float* x, float* y, float* z, int n) {
          array_view<const float, 1> avX(n, x), avY(n, y);
          array_view<float, 1> avZ(n, z);
          avZ.discard_data();
          parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
              avZ[i] = avX[i] + fast_math::exp(avY[i]);
          });
          avZ.synchronize();
      }
  22. Miscellaneous Tips (3)
      • Invest in SIMD parallelization of heavy math or data-parallel algorithms
        – Already available on Mono (Mono.Simd)
      • Make sure to take cache effects into account, especially on MP systems
      START: movups xmm0, [esi+4*ecx]
             addps  xmm0, [edi+4*ecx]
             movups [ebx+4*ecx], xmm0
             sub    ecx, 4
             jns    START
  23. Summary
      • Avoid shared state and synchronization
      • Parallelize judiciously and apply thresholds
      • Measure and understand performance gains or losses
      • Concurrency and parallelism are still hard
      • A body of best practices, tips, patterns, examples is being built
  24. Additional References
  25. THANK YOU! Sasha Goldshtein, CTO, Sela Group (@goldshtn)