Task and Data Parallelism: Real-World Examples

This presentation begins by reviewing the Task Parallel Library APIs, introduced in .NET 4.0 and expanded in .NET 4.5 -- the Task class, Parallel.For and Parallel.ForEach, and even Parallel LINQ. Then, we look at patterns and practices for extracting concurrency and managing dependencies, with real examples like Levenshtein's edit distance algorithm, the Fast Fourier Transform, and others.

Transcript

  • 1. Task and Data Parallelism: Real-World Examples
    Sasha Goldshtein, CTO, Sela Group (@goldshtn, blog.sashag.net)
  • 2. (www.devconnections.com) AGENDA
    Multicore machines have been a cheap commodity for >10 years
    Adoption of concurrent programming is still slow
    Patterns and best practices are scarce
    We discuss the APIs first…
    …and then turn to examples, best practices, and tips
  • 3. TPL EVOLUTION
    2008: Incubated for 3 years as "Parallel Extensions for .NET"
    2010: Released in full glory with .NET 4.0
    2012: DataFlow in .NET 4.5 (NuGet), augmented with language support (await, async methods)
  • 4. TASKS
    A task is a unit of work
    May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)
    Much more than threads, and yet much cheaper

    ```csharp
    Task<string> t = Task.Factory.StartNew(
        () => { return DnaSimulation(…); });
    t.ContinueWith(r => Show(r.Exception),
        TaskContinuationOptions.OnlyOnFaulted);
    t.ContinueWith(r => Show(r.Result),
        TaskContinuationOptions.OnlyOnRanToCompletion);
    DisplayProgress();

    try { //The C# 5.0 version
        var task = Task.Run(DnaSimulation);
        DisplayProgress();
        Show(await task);
    } catch (Exception ex) {
        Show(ex);
    }
    ```
  • 5. PARALLEL LOOPS
    Ideal for parallelizing work over a collection of data
    Easy porting of for and foreach loops
    Beware of inter-iteration dependencies!

    ```csharp
    Parallel.For(0, 100, i => {
        ...
    });
    Parallel.ForEach(urls, url => {
        webClient.Post(url, options, data);
    });
    ```
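To make the inter-iteration dependency warning concrete, here is a minimal self-contained sketch (the names and data are mine, not from the slides): summing an array through a shared variable races, while the `Parallel.For` overload with thread-local state stays correct.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ParallelSum
{
    static void Main()
    {
        int[] data = Enumerable.Range(1, 1_000_000).ToArray();

        // WRONG: every iteration races on the shared variable 'racySum',
        // because += is a read-modify-write and is not atomic.
        long racySum = 0;
        Parallel.For(0, data.Length, i => { racySum += data[i]; });

        // RIGHT: aggregate into per-task local state, merge once at the end.
        long safeSum = 0;
        Parallel.For(0, data.Length,
            () => 0L,                                      // initial local state
            (i, state, local) => local + data[i],          // body: no shared writes
            local => Interlocked.Add(ref safeSum, local)); // combiner

        Console.WriteLine(safeSum); // 500000500000
    }
}
```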
  • 6. PARALLEL LINQ
    Mind-bogglingly easy parallelization of LINQ queries
    Can introduce ordering into the pipeline, or preserve the order of the original elements

    ```csharp
    var query = from monster in monsters.AsParallel()
                where monster.IsAttacking
                let newMonster = SimulateMovement(monster)
                orderby newMonster.XP
                select newMonster;
    query.ForAll(monster => Move(monster));
    ```
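As a runnable counterpart to the slide's query (my toy data, not the monsters example), the sketch below shows the order-preservation point: `AsOrdered` keeps the source order even though the work is partitioned across cores.

```csharp
using System;
using System.Linq;

class PlinqDemo
{
    static void Main()
    {
        // Squares of even numbers; without AsOrdered, the parallel
        // pipeline is free to deliver results in any order.
        var squares = Enumerable.Range(1, 10)
            .AsParallel().AsOrdered()
            .Where(n => n % 2 == 0)
            .Select(n => n * n)
            .ToArray();

        Console.WriteLine(string.Join(",", squares)); // 4,16,36,64,100
    }
}
```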
  • 7. MEASURING CONCURRENCY
    Visual Studio Concurrency Visualizer to the rescue
  • 8. RECURSIVE PARALLELISM EXTRACTION
    Divide-and-conquer algorithms are often parallelized through the recursive call
    Be careful with the parallelization threshold, and watch out for dependencies

    ```csharp
    void FFT(float[] src, float[] dst, int n, int r, int s) {
        if (n == 1) {
            dst[r] = src[r];
        } else {
            FFT(src, dst, n/2, r, s*2);
            FFT(src, dst, n/2, r+s, s*2);
            //Combine the two halves in O(n) time
        }
    }

    //Parallelizing the two independent recursive calls:
    Parallel.Invoke(
        () => FFT(src, dst, n/2, r, s*2),
        () => FFT(src, dst, n/2, r+s, s*2)
    );
    ```
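The parallelization threshold the slide warns about can be sketched on a simpler divide-and-conquer problem than the FFT (my example, not the speaker's): below some size, recursing in parallel costs more in task overhead than it saves, so the recursion falls back to a sequential loop.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class RecursiveParallel
{
    // Below this size, forking in parallel costs more than it saves;
    // the value is illustrative and should be tuned by measurement.
    const int Threshold = 4096;

    static long SumRange(int[] a, int lo, int hi)
    {
        if (hi - lo <= Threshold)
        {
            long s = 0;                      // sequential base case
            for (int i = lo; i < hi; ++i) s += a[i];
            return s;
        }
        int mid = lo + (hi - lo) / 2;
        long left = 0, right = 0;
        // The two halves are independent, so run them in parallel.
        Parallel.Invoke(
            () => left = SumRange(a, lo, mid),
            () => right = SumRange(a, mid, hi));
        return left + right;
    }

    static void Main()
    {
        int[] data = Enumerable.Range(1, 100_000).ToArray();
        Console.WriteLine(SumRange(data, 0, data.Length)); // 5000050000
    }
}
```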
  • 9. SYMMETRIC DATA PROCESSING
    For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution
    Inter-iteration dependencies complicate things (think of an in-place blur)

    ```csharp
    Parallel.For(0, image.Rows, i => {
        for (int j = 0; j < image.Cols; ++j) {
            destImage.SetPixel(i, j, PixelBlur(image, i, j));
        }
    });
    ```
  • 10. UNEVEN WORK DISTRIBUTION
    With non-uniform data items, use custom partitioning or manual distribution
    Primes: 7 is easier to check than 10,320,647

    ```csharp
    //Manual distribution into fixed chunks, one per core:
    var work = Enumerable.Range(0, Environment.ProcessorCount)
        .Select(n => Task.Run(() =>
            CountPrimes(start + chunk*n, start + chunk*(n+1))));
    Task.WaitAll(work.ToArray());

    //vs. a custom partitioner that hands out small ranges on demand:
    Parallel.ForEach(Partitioner.Create(Start, End, chunkSize),
        chunk => CountPrimes(chunk.Item1, chunk.Item2));
    ```
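A runnable version of the partitioner approach (my `IsPrime` and chunk size, standing in for the slide's `CountPrimes`): small on-demand ranges keep all cores busy even though large candidates take far longer to test than small ones.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class ChunkedPrimes
{
    static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int d = 2; d * d <= n; ++d)
            if (n % d == 0) return false;
        return true;
    }

    static void Main()
    {
        int count = 0;
        // Partitioner.Create(from, to, rangeSize) yields Tuple<int,int>
        // ranges; idle workers grab the next chunk, balancing uneven work.
        Parallel.ForEach(Partitioner.Create(2, 100_000, 1000), range =>
        {
            int local = 0; // aggregate locally, touch shared state once
            for (int i = range.Item1; i < range.Item2; ++i)
                if (IsPrime(i)) ++local;
            Interlocked.Add(ref count, local);
        });
        Console.WriteLine(count); // 9592 primes below 100,000
    }
}
```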
  • 11. COMPLEX DEPENDENCY MANAGEMENT
    Must extract all dependencies and incorporate them into the algorithm
    Typical scenarios: 1D loops, dynamic algorithms
    Edit distance: each task depends on 2 predecessors, so parallelize as a wavefront computation sweeping the table from (0,0) to (m,n)

    ```csharp
    C = x[i-1] == y[i-1] ? 0 : 1;
    D[i, j] = min(D[i-1, j] + 1,
                  D[i, j-1] + 1,
                  D[i-1, j-1] + C);
    ```
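The wavefront idea can be sketched end-to-end (my code, under the slide's recurrence): cells on one anti-diagonal only read the two preceding diagonals, so each diagonal can be filled with a parallel loop. On strings this short the parallelism is pure illustration; real code would apply a threshold and blocking.

```csharp
using System;
using System.Threading.Tasks;

class WavefrontEditDistance
{
    static int Distance(string x, string y)
    {
        int m = x.Length, n = y.Length;
        var D = new int[m + 1, n + 1];
        for (int i = 0; i <= m; ++i) D[i, 0] = i; // delete all of x's prefix
        for (int j = 0; j <= n; ++j) D[0, j] = j; // insert all of y's prefix

        for (int d = 2; d <= m + n; ++d) // d = i + j indexes the anti-diagonal
        {
            int iMin = Math.Max(1, d - n), iMax = Math.Min(m, d - 1);
            if (iMin > iMax) continue;
            // All cells on this diagonal are mutually independent.
            Parallel.For(iMin, iMax + 1, i =>
            {
                int j = d - i;
                int c = x[i - 1] == y[j - 1] ? 0 : 1;
                D[i, j] = Math.Min(Math.Min(D[i - 1, j] + 1,
                                            D[i, j - 1] + 1),
                                   D[i - 1, j - 1] + c);
            });
        }
        return D[m, n];
    }

    static void Main()
    {
        Console.WriteLine(Distance("kitten", "sitting")); // 3
    }
}
```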
  • 12. SYNCHRONIZATION > AGGREGATION
    Excessive synchronization brings parallel code to its knees
    Try to avoid shared state, or minimize access to it
    Aggregate thread- or task-local state and merge it later

    ```csharp
    Parallel.ForEach(
        Partitioner.Create(Start, End, ChunkSize),
        () => new List<int>(), //initial local state
        (range, pls, localPrimes) => { //aggregator
            for (int i = range.Item1; i < range.Item2; ++i)
                if (IsPrime(i)) localPrimes.Add(i);
            return localPrimes;
        },
        localPrimes => { //combiner
            lock (primes)
                primes.AddRange(localPrimes);
        });
    ```
  • 13. CREATIVE SYNCHRONIZATION
    We implement a collection of stock prices, initialized with 10^5 name/price pairs
    10^7 reads/s, 10^6 "update" writes/s, 10^3 "add" writes/day
    Many reader threads, many writer threads

    ```
    GET(key):
        if safe contains key then return safe[key]
        lock { return unsafe[key] }

    PUT(key, value):
        if safe contains key then safe[key] = value
        lock { unsafe[key] = value }
    ```
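One way to realize this pseudocode in C# (my naming and representation, not the speaker's code): the "safe" dictionary is built once at startup and never changes shape, so the hot read/update path needs no lock as long as price cells are written atomically; symbols added later go to the lock-protected "unsafe" dictionary, which is touched rarely.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class StockPrices
{
    class PriceCell { public long Bits; } // double stored as atomic long bits

    readonly Dictionary<string, PriceCell> _safe;
    readonly Dictionary<string, double> _unsafe = new Dictionary<string, double>();
    readonly object _lock = new object();

    public StockPrices(IEnumerable<string> initialSymbols)
    {
        _safe = new Dictionary<string, PriceCell>();
        foreach (var s in initialSymbols) _safe[s] = new PriceCell();
    }

    public double Get(string key)
    {
        // Lock-free: the dictionary's shape is frozen after construction.
        if (_safe.TryGetValue(key, out var cell))
            return BitConverter.Int64BitsToDouble(Volatile.Read(ref cell.Bits));
        lock (_lock) return _unsafe[key];
    }

    public void Put(string key, double value)
    {
        if (_safe.TryGetValue(key, out var cell))
        {
            Volatile.Write(ref cell.Bits, BitConverter.DoubleToInt64Bits(value));
            return;
        }
        lock (_lock) _unsafe[key] = value; // rare "add" path takes the lock
    }
}

class Demo
{
    static void Main()
    {
        var prices = new StockPrices(new[] { "MSFT", "GOOG" });
        prices.Put("MSFT", 42.5);  // hot path, no lock
        prices.Put("TSLA", 99.0);  // new symbol, locked slow path
        Console.WriteLine(prices.Get("MSFT") == 42.5
                       && prices.Get("TSLA") == 99.0); // True
    }
}
```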
  • 14. LOCK-FREE PATTERNS (1)
    Try to avoid Windows synchronization and use hardware synchronization
    Primitive operations such as Interlocked.Increment, Interlocked.CompareExchange
    The retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms

    ```csharp
    //CompareExchange(ref location, newValue, comparand) returns the old value
    int InterlockedMultiply(ref int x, int y) {
        int t, r;
        do {
            t = x;
            r = t * y;
        } while (Interlocked.CompareExchange(ref x, r, t) != t);
        return r;
    }
    ```
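The retry pattern is easiest to verify on addition (my harness, same pattern as the slide's `InterlockedMultiply`): each thread reads, computes, and publishes with `CompareExchange`; if another thread won the race, it loops and tries again, so no increment is ever lost.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class CasRetry
{
    static int _value;

    static int InterlockedAdd(ref int x, int y)
    {
        int t, r;
        do { t = x; r = t + y; }                               // read, compute
        while (Interlocked.CompareExchange(ref x, r, t) != t); // publish or retry
        return r;
    }

    static void Main()
    {
        var tasks = new Task[4];
        for (int k = 0; k < 4; ++k)
            tasks[k] = Task.Run(() =>
            {
                for (int i = 0; i < 100_000; ++i)
                    InterlockedAdd(ref _value, 1);
            });
        Task.WaitAll(tasks);
        Console.WriteLine(_value); // 400000: no lost updates
    }
}
```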
  • 15. LOCK-FREE PATTERNS (2)
    User-mode spinlocks (the SpinLock class) can replace locks that you acquire very often and that protect tiny computations

    ```csharp
    class __DontUseMe__SpinLock {
        private int _lck;
        public void Enter() {
            while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0);
        }
        public void Exit() {
            Thread.MemoryBarrier(); //release: order protected writes first
            _lck = 0;
        }
    }
    ```
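As the class name above hints, production code should use the BCL's `System.Threading.SpinLock` rather than a hand-rolled one. A minimal usage sketch (my example), protecting exactly the kind of tiny computation the slide describes:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class SpinLockDemo
{
    // SpinLock is a mutable struct; keep it in a field and never copy it.
    static SpinLock _lock = new SpinLock(enableThreadOwnerTracking: false);
    static long _counter;

    static void Main()
    {
        Parallel.For(0, 400_000, _ =>
        {
            bool taken = false;
            try
            {
                _lock.Enter(ref taken); // spin briefly instead of blocking
                ++_counter;             // the tiny protected computation
            }
            finally
            {
                if (taken) _lock.Exit();
            }
        });
        Console.WriteLine(_counter); // 400000
    }
}
```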
  • 16. MISCELLANEOUS TIPS (1)
    Don't mix several concurrency frameworks in the same process
    Some parallel work is best organized in pipelines: TPL DataFlow
    Example pipeline: BroadcastBlock<Uri> → TransformBlock<Uri, byte[]> → TransformBlock<byte[], string> → ActionBlock<string>
  • 17. MISCELLANEOUS TIPS (2)
    Some parallel work can be offloaded to the GPU: C++ AMP

    ```cpp
    void vadd_exp(float* x, float* y, float* z, int n) {
        array_view<const float,1> avX(n, x), avY(n, y);
        array_view<float,1> avZ(n, z);
        avZ.discard_data();
        parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
            avZ[i] = avX[i] + fast_math::exp(avY[i]);
        });
        avZ.synchronize();
    }
    ```
  • 18. MISCELLANEOUS TIPS (3)
    Invest in SIMD parallelization of heavy math or data-parallel algorithms
    Make sure to take cache effects into account, especially on MP systems

    ```asm
    START: movups xmm0, [esi+4*ecx]
           addps  xmm0, [edi+4*ecx]
           movups [ebx+4*ecx], xmm0
           sub    ecx, 4
           jns    START
    ```
  • 19. SUMMARY
    Avoid shared state and synchronization
    Parallelize judiciously and apply thresholds
    Measure and understand performance gains or losses
    Concurrency and parallelism are still hard
    A body of best practices, tips, patterns, and examples is being built
  • 20. ADDITIONAL REFERENCES
  • 21. THANK YOU!
    Sasha Goldshtein (@goldshtn), sashag@sela.co.il, blog.sashag.net