Visual Studio 2010
Using the Parallel Computing Platform
Phil Pennington
philpenn@microsoft.com
Agenda
What's new with Windows?
Parallel Computing Tools in Visual Studio
Using .NET Parallel Extensions
First, An Example: Monte Carlo Approximation of Pi
S = 4*r*r
C = Pi*r*r
Pi = 4*(C/S)
For each point P(x, y):
d(P) = SQRT((x * x) + (y * y))
if (d < r) then P(x, y) is in C
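The sampling loop above translates naturally to a parallel .NET loop. A minimal sketch using `Parallel.For` with per-thread hit counts (class name, seeding, and sample count are illustrative, not from the deck):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class MonteCarloPi
{
    // Estimate Pi by sampling random points in the unit square (r = 1)
    // and counting how many land inside the quarter circle.
    public static double Estimate(int samples)
    {
        long hits = 0;
        var rng = new ThreadLocal<Random>(
            () => new Random(Guid.NewGuid().GetHashCode()));

        Parallel.For(0, samples,
            () => 0L,                               // per-thread hit count
            (i, loop, localHits) =>
            {
                double x = rng.Value.NextDouble();
                double y = rng.Value.NextDouble();
                // d(P) < r  <=>  x*x + y*y < 1, so skip the square root.
                return x * x + y * y < 1.0 ? localHits + 1 : localHits;
            },
            localHits => Interlocked.Add(ref hits, localHits));

        return 4.0 * hits / samples;                // Pi = 4 * (C / S)
    }

    public static void Main()
    {
        Console.WriteLine(Estimate(1000000));       // roughly 3.14
    }
}
```

The per-thread accumulator avoids contending on a shared counter for every sample; the single `Interlocked.Add` per worker merges the results at the end.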
Windows and Maximum Processors
Before Win7/R2, the maximum number of Logical Processors (LPs) was dictated by the processor's integral word size
LP state (e.g. idle, affinity) is represented in a word-sized bitmask
32-bit Windows: 32 LPs
64-bit Windows: 64 LPs
(Diagram: a 32-bit idle-processor mask, bits 0 through 31, each bit marking one LP as busy or idle.)
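That word-sized state can be illustrated with ordinary bit operations. A hypothetical sketch of a 32-bit idle mask (not the kernel's actual data structure; names are invented for illustration):

```csharp
using System;

public static class IdleMask
{
    // Hypothetical sketch: one machine word tracks per-LP idle state.
    // Bit i == 1 means logical processor i is idle, so a 32-bit word
    // caps the system at 32 LPs.
    public static bool IsIdle(uint mask, int lp)
    {
        return (mask & (1u << lp)) != 0;
    }

    public static int IdleCount(uint mask)
    {
        int n = 0;
        while (mask != 0) { mask &= mask - 1; n++; } // clear lowest set bit
        return n;
    }

    public static void Main()
    {
        uint mask = 0;
        mask |= 1u << 0;    // LP 0 idle
        mask |= 1u << 16;   // LP 16 idle
        mask |= 1u << 31;   // LP 31 idle
        Console.WriteLine(IsIdle(mask, 16));   // True
        Console.WriteLine(IdleCount(mask));    // 3
    }
}
```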
Processor Groups
New with Windows 7 and Windows Server 2008 R2
(Diagram: hierarchy of group, NUMA node, socket, core, and logical processor.)
Processor Groups
Example: 2 groups, 4 NUMA nodes, 8 sockets, 32 cores, 128 LPs
(Diagram: each group contains two NUMA nodes, each node two sockets, each socket four cores, and each core four LPs.)
Many-Core Topology APIs: Discovery

Many-Core Topology APIs: Resource Localization

Many-Core Topology APIs: Memory Management
User Mode Scheduling: Architectural Perspective
(Diagram: scheduler threads S1 and S2, one per CPU, run your scheduler logic; worker threads W1 through W4 move between the UMS scheduler's ready list, a wait list, and the kernel-managed UMS completion list, driven by yield, blocked, and created events.)
Task Scheduling with a UMS Scheduler
Maximize Quantum, Minimize Blocking Effects
Tasks are run by worker threads, which the scheduler controls
(Diagram: timelines of worker threads WT0 through WT3 without UMS, using signal-and-wait with a dead zone, and with UMS, using UMS yield.)
Load-Balancing, Work-Stealing Scheduler
Static scheduling vs. dynamic scheduling
(Diagram: the same work distributed across CPU0 through CPU3 under each scheme.)
Dynamic scheduling improves performance by distributing work efficiently at runtime.
Demos
The Platform
- Topology
- Schedulers
Agenda
What's new with Windows?
Parallel Computing Tools in Visual Studio
Using .NET Parallel Extensions
Visual Studio 2010: .NET Developer Tools, Programming Models, Runtimes
(Diagram: tools such as the debugger and profiler sit alongside the structured-parallelism programming models: Parallel LINQ (PLINQ), the Task Parallel Library (TPL), and data structures. These are layered over the .NET Parallel Extensions, comprising a task scheduler and resource manager, and the .NET runtime's thread pools, all delivered as a managed library.)
Thread-Pool Scheduler in .NET 4.0
The global queue (FIFO) is shared by the legacy ThreadPool API and TPL
Local work queues and a work-stealing scheduler (TPL only)
(Diagram: threads 1 through N each run a dispatch loop with a local LIFO queue; tasks T1 through T8 are enqueued to and dequeued from the global FIFO queue or a local queue, and idle threads steal work from other threads' local queues.)
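The global-versus-local split is mostly invisible to callers, but `TaskCreationOptions.PreferFairness` lets a task opt back into the global FIFO queue. A small sketch (the console messages are illustrative; queue placement is a scheduler hint you cannot observe directly from this code):

```csharp
using System;
using System.Threading.Tasks;

public static class QueueDemo
{
    public static void Main()
    {
        // Started from a non-pool thread, this task goes to the
        // shared global FIFO queue.
        Task outer = Task.Factory.StartNew(() =>
        {
            // Created on a pool thread, this task goes to that thread's
            // local LIFO queue, keeping hot data nearby and making it
            // available for stealing by idle workers.
            Task local = Task.Factory.StartNew(
                () => Console.WriteLine("local-queue task"));

            // PreferFairness hints the scheduler to use the global FIFO
            // queue instead, restoring rough submission order.
            Task fair = Task.Factory.StartNew(
                () => Console.WriteLine("global-queue task"),
                TaskCreationOptions.PreferFairness);

            Task.WaitAll(local, fair);
        });
        outer.Wait();
    }
}
```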
Task Parallel Library (TPL): Task Concepts
Common functionality: waiting, cancellation, continuations, parent/child relationships
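Those four pieces of common functionality can be shown in a few lines. A sketch against the .NET 4 TPL APIs (task bodies and values are illustrative):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TaskConcepts
{
    // Continuation: the second task runs after the first completes.
    public static int DoublePipeline()
    {
        Task<int> t = Task<int>.Factory.StartNew(() => 21)
                               .ContinueWith(prev => prev.Result * 2);
        return t.Result;                  // Result waits for completion
    }

    // Cancellation: cooperative, via a shared token.
    public static bool RunAndCancel()
    {
        var cts = new CancellationTokenSource();
        Task worker = Task.Factory.StartNew(() =>
        {
            while (true) cts.Token.ThrowIfCancellationRequested();
        }, cts.Token);
        cts.Cancel();
        try { worker.Wait(); }
        catch (AggregateException) { /* expected: task was canceled */ }
        return worker.IsCanceled;
    }

    public static void Main()
    {
        // Parent/child: AttachedToParent makes the parent wait for its child.
        Task parent = Task.Factory.StartNew(() =>
            Task.Factory.StartNew(() => Console.WriteLine("child ran"),
                                  TaskCreationOptions.AttachedToParent));
        parent.Wait();                    // returns only after the child

        Console.WriteLine(DoublePipeline());  // 42
        Console.WriteLine(RunAndCancel());    // True
    }
}
```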
Primitives and Structures
Thread-safe, scalable collections: IProducerConsumerCollection<T>, ConcurrentQueue<T>, ConcurrentStack<T>, ConcurrentBag<T>, ConcurrentDictionary<TKey,TValue>
Phases and work exchange: Barrier, BlockingCollection<T>, CountdownEvent
Partitioning: {Orderable}Partitioner<T>, Partitioner.Create
Exception handling: AggregateException
Initialization: Lazy<T>, LazyInitializer.EnsureInitialized<T>, ThreadLocal<T>
Locks: ManualResetEventSlim, SemaphoreSlim, SpinLock, SpinWait
Cancellation: CancellationToken{Source}
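As one example from the list, `BlockingCollection<T>` wraps a `ConcurrentQueue<T>` by default and adds blocking and bounding semantics for producer/consumer pipelines. A sketch (item count and capacity are illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ProducerConsumer
{
    public static long Run()
    {
        // Bounded to 4 items: Add blocks when the buffer is full,
        // throttling the producer to the consumer's pace.
        using (var queue = new BlockingCollection<int>(boundedCapacity: 4))
        {
            var producer = Task.Factory.StartNew(() =>
            {
                for (int i = 1; i <= 10; i++) queue.Add(i);
                queue.CompleteAdding();       // signal end of stream
            });

            long sum = 0;
            var consumer = Task.Factory.StartNew(() =>
            {
                // Blocks waiting for items; the loop ends once
                // CompleteAdding has been called and the queue drains.
                foreach (int item in queue.GetConsumingEnumerable())
                    sum += item;
            });

            Task.WaitAll(producer, consumer);
            return sum;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Run());             // 55 (1 + 2 + ... + 10)
    }
}
```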
Parallel Debugging
Two new debugger tool windows
Support both native and managed code
"Parallel Tasks"
"Parallel Stacks"
Parallel Tasks
- What threads are executing my tasks?
- Where are my tasks running (location, call stack)?
- Which tasks are blocked?
- How many tasks are waiting to run?

Parallel Stacks
- Multiple call stacks in a single view
- Task-specific view (task status)
- Easy navigation to any executing method
- Rich UI (zooming, panning, bird's eye view, flagging, tooltips)
(Screenshot callouts: zoom control, bird's eye view.)
Parallel Profiling
CPU Utilization
(Profiler view: CPU use over time, broken out into your process, other processes, and idle time, plotted against the number of cores.)
Threads
Measure time for interesting segments
Hide uninteresting threads
Zoom in and out
Detailed thread analysis (one channel per thread)
(View callouts: active legend, usage hints, call stacks.)
Cores
Each logical core in a swim lane
One color per thread
Migration visualization
Cross-core migration details
Demo
Libraries
Languages
Debuggers
Profilers
Agenda
What's new with Windows?
Parallel Computing Tools in Visual Studio
Using .NET Parallel Extensions
Thinking Parallel - "Task" vs. "Data" Parallelism

Task Parallelism:
Parallel.Invoke(
    () => { Console.WriteLine("Begin first task..."); },
    () => { Console.WriteLine("Begin second task..."); },
    () => { Console.WriteLine("Begin third task..."); });

Data Parallelism:
IEnumerable<int> numbers = Enumerable.Range(2, 100 - 3);
var myQuery =
    from n in numbers.AsParallel()
    where Enumerable.Range(2, (int)Math.Sqrt(n)).All(i => n % i > 0)
    select n;
int[] primes = myQuery.ToArray();
Using Parallel Computing Platform - NHDNUG

Slides from Phil Pennington's talk on Using Parallel Computing with Visual Studio 2010 and .NET 4.0, originally presented at the North Houston .NET Users Group (facebook.com/nhdnug).

  • Let's use this slide for an "Architectural Perspective" of UMS. S1 and S2 are the first threads created within a UMS solution. These are "Scheduler Threads" or "Primary Threads"; they represent cores (physical CPUs) from a Scheduler perspective. They begin as normal threads, but you would typically first establish processor affinity using the new CreateRemoteThreadEx API and then use another new API, EnterUmsSchedulingMode, to specify that the new thread is a Scheduler thread. You pass in a callback function pointer (e.g. UMSSchedulerProc) to begin executing instructions on the Scheduler thread. A UMS worker thread is created by calling CreateRemoteThreadEx with the PROC_THREAD_ATTRIBUTE_UMS_THREAD attribute and specifying a UMS thread context and a completion list. The OS places these threads into the Completion List, and your Scheduler logic takes over, typically placing the new threads onto the Scheduler's Ready List.

    The first thing a Scheduler should do is move its associated Worker threads onto the Scheduler's Ready List. Then it can begin executing your custom scheduler logic. Each Scheduler thread should then pop a Worker thread off the Ready List and run it on the associated core. When this occurs, the Scheduler thread context is essentially lost: the Worker thread now owns the core and is executing. The Scheduler thread will not regain the core until a yield event occurs.

    The first thing that could happen is that this thread yields. Yield is again a Scheduler callback mechanism, and perhaps the single most important function of UMS. It's within the Yield that you will implement your own synchronization primitives and scheduling logic. Ideally, the yielding thread provides some contextual information to the scheduler (maybe it wants to wait on some specific application event). Your Scheduler looks at this Yield request and its associated context and makes a scheduling decision; maybe it places the Worker thread on a Wait list for that specific event or event type. Now your Scheduler has to decide what to run next, perhaps the next Worker thread from the Ready List, and we're back running again. Note that no kernel context switch was necessary. Maybe that wait-event handling took 200 cycles in user mode; it might have cost 10 times that with a kernel context switch.

    Let's now assume that this worker performs a system call. At this point, we switch the worker thread to its kernel-mode context and the thread continues to run within the kernel. If it does not block (in other words, if it doesn't use one of the kernel synchronization primitives), it just continues to run, returns to user mode, and keeps doing work.

    Let's assume instead that the thread does block; maybe a page fault occurred. Now our Scheduler thread regains control of the processor via a callback from the kernel, which tells your Scheduler that a worker thread is blocked and the reason for the block. This is the point where we integrate kernel synchronization with user synchronization, and where you get to decide what to run next. The Scheduler looks at the state of its affairs and perhaps decides to run the next Worker thread from the Ready List.

    Let's assume that later in time Worker 3 unblocks. The kernel will place the unblocked Worker thread into the UMS Completion List. At the next Yield event we get another scheduling opportunity; maybe this Yield contains information that affects the state of our Wait list. The first thing the Scheduler should do, however, is manage the Completion List and move any unblocked threads to the Ready List. Next, our Scheduler must make a priority decision. Maybe our waiting thread gets to run again and the yielding thread is placed on the Ready List. And we're done.
  • UMS is an enabler for: finer-grained parallelism, more deterministic behavior, and better cache locality. UMS allows your Scheduler to boost performance in certain situations: apps that do a lot of blocking, for example.
  • Think Tasks, not Threads. Threads represent execution flow, not work; they are hard-coded, carry significant system overhead, and offer minimal intrinsic parallel constructs. QueueUserWorkItem() is handy for fire-and-forget, but what about waiting, canceling, continuing, composing, exceptions, dataflow, integration, and debugging?
  • Now, let's consider the tools architecture from a .NET developer's perspective. Let's start with the .NET runtime and the .NET Parallel Extensions library; in a moment, we'll look at how a developer uses the extensions within an application. The .NET Parallel Extensions provide the benefits of concurrent task scheduling without you having to build a custom scheduler that is appropriately reentrant, thread-safe, and non-blocking. The Parallel Extensions library contains a Task Scheduler and a Resource Manager component that integrate with the underlying .NET runtime. The Resource Manager manages access to system resources such as the collection of available CPUs, and the Scheduler leverages thread pools for task scheduling. The Parallel Extensions also support multiple programming models. The Task Parallel Library (TPL) is an easy and convenient way to express fine-grained parallelism within your applications; the TPL provides patterns for task execution, synchronization, and data sharing. PLINQ (Parallel LINQ) enables parallel query execution not only on SQL data but also on XML or collection data. The Parallel Extensions also include data structures that are "scheduler aware," enabling you to optimally specify task-scheduling requests and custom scheduler policies. Again, Visual Studio 2010 includes new tools for parallel application development and testing: a new parallel debugger and a new parallel application profiler. Let's take a brief look at a simple .NET parallel application along with the Visual Studio 2010 debugger and Parallel Performance Analyzer. Feature areas (pure .NET libraries): Task Parallel Library, Parallel LINQ, synchronization primitives and thread-safe data structures, and an enhanced ThreadPool.
Thinking Parallel – How to Partition Work?
Several partitioning schemes built in:
Chunk: works with any IEnumerable<T>; a single enumerator is shared, with chunks handed out on demand
Range: works only with IList<T>; input divided into contiguous regions, one per partition
Stripe: works only with IList<T>; elements handed out round-robin to each partition
Hash: works with any IEnumerable<T>; elements assigned to partitions based on hash code
Custom partitioning available through Partitioner<T>
Partitioner.Create available for tighter control over the built-in partitioning schemes
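The range scheme in particular pays off for indexable data. A sketch using `Partitioner.Create` over an array (array size and the summing workload are illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class RangePartitioning
{
    public static double SumOnes(int n)
    {
        double[] data = new double[n];
        for (int i = 0; i < n; i++) data[i] = 1.0;

        double total = 0;
        object gate = new object();

        // Partitioner.Create(0, n) hands each worker a contiguous
        // [from, to) range, so the hot loop is a plain sequential scan
        // rather than one shared-enumerator call per element.
        Parallel.ForEach(Partitioner.Create(0, data.Length), range =>
        {
            double local = 0;
            for (int i = range.Item1; i < range.Item2; i++)
                local += data[i];
            lock (gate) total += local;       // merge once per range
        });

        return total;
    }

    public static void Main()
    {
        Console.WriteLine(SumOnes(1000000));  // 1000000
    }
}
```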
Thinking Parallel – How to Execute Tasks?

Thinking Parallel – How to Collate Results?

Demos
Partition
Execute
Collate
Resources
Native APIs/runtimes (Visual C++ 10): tasks, loops, collections, and Agents
http://msdn.microsoft.com/en-us/library/dd504870(VS.100).aspx
Tools (in the VS2010 IDE): debugger and profiler
http://msdn.microsoft.com/en-us/library/dd460685(VS.100).aspx
Managed APIs/runtimes (.NET 4): tasks, loops, collections, and PLINQ
http://msdn.microsoft.com/en-us/library/dd460693(VS.100).aspx
General VS2010 Parallel Computing Developer Center
http://msdn.microsoft.com/en-us/concurrency/default.aspx
