Parallel Programming

All about Parallel Programming



  1. Learning and Development presents: OPEN TALK SERIES. A series of illuminating talks and interactions that open our minds to new ideas and concepts; that make us look for newer or better ways of doing what we did; or point us to exciting things we have never done before. A range of topics on Technology, Business, Fun and Life. Be part of the learning experience at Aditi. Join the talks. It's free. Free as in freedom at work, not free beer. Speak at these events, or bring an expert/friend to talk. Mail LEAD with topic and availability.
  2. Parallel Programming. Sundararajan Subramanian, Aditi Technologies.
  3. Introduction to Parallel Computing • The challenge: provide the abstractions, programming paradigms, and algorithms needed to effectively design, implement, and maintain applications that exploit the parallelism provided by the underlying hardware in order to solve modern problems.
  4. Single-core CPU chip. (Figure: a CPU die with a single core.)
  5. Multi-core architectures. (Figure: a multi-core CPU chip with Core 1 through Core 4.)
  6. Multi-core CPU chip • The cores fit on a single processor socket • Also called CMP (Chip Multi-Processor). (Figure: cores 1 through 4 side by side on one chip.)
  7. The cores run in parallel. (Figure: threads 1 through 4 each executing on cores 1 through 4.)
  8. Within each core, threads are time-sliced (just like on a uniprocessor). (Figure: several threads multiplexed onto each of cores 1 through 4.)
  9. Instruction-level parallelism • Parallelism at the machine-instruction level • The processor can re-order and pipeline instructions, split them into micro-instructions, do aggressive branch prediction, etc. • Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years.
  10. Instruction-level parallelism • In the first loop both increments touch a[0], so the second depends on the first and they must execute in order; in the second loop a[0]++ and a[1]++ are independent, so the processor can overlap them:

    for (int i = 0; i < 1000; i++) { a[0]++; a[0]++; }   // dependent increments, little ILP
    for (int i = 0; i < 1000; i++) { a[0]++; a[1]++; }   // independent increments, more ILP
  11. Thread-level parallelism (TLP) • This is parallelism on a coarser scale • A server can serve each client in a separate thread (web server, database server) • A computer game can do AI, graphics, and physics in three separate threads • Single-core superscalar processors cannot fully exploit TLP • Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP.
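As an illustrative sketch (not from the deck), coarse-grained TLP in C# can look like this: two independent activities run as separate tasks, and on a multi-core machine the scheduler can place them on different cores. `Task.Run` and the `SumRange` helper are my assumptions here; `Task.Run` is a .NET 4.5+ convenience that postdates the VS2010-era APIs shown later in the deck.

```csharp
using System;
using System.Threading.Tasks;

class TlpDemo
{
    static void Main()
    {
        // Two coarse-grained, independent activities run as separate threads;
        // on a multi-core CPU they can execute on different cores in parallel.
        Task<long> first  = Task.Run(() => SumRange(0, 50_000_000));
        Task<long> second = Task.Run(() => SumRange(50_000_000, 100_000_000));

        long total = first.Result + second.Result; // blocks until both finish
        Console.WriteLine(total);
    }

    // A CPU-bound chunk of work: sum the integers in [from, to).
    static long SumRange(int from, int to)
    {
        long sum = 0;
        for (int i = from; i < to; i++) sum += i;
        return sum;
    }
}
```

The result is the same as a sequential sum of 0..99,999,999; only the wall-clock time changes.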
  12. A technique complementary to multi-core: simultaneous multithreading • Problem addressed: the processor pipeline can get stalled – waiting for the result of a long floating-point (or integer) operation – waiting for data to arrive from memory, while other execution units sit unused. (Figure: processor pipeline with L1 D-cache, D-TLB, L2 cache and control, schedulers, uop queues, rename/alloc, BTB, trace cache, uCode ROM, decoder, bus, BTB and I-TLB. Source: Intel)
  13. Simultaneous multithreading (SMT) • Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core • Weaves together multiple “threads” on the same core • Example: if one thread is waiting for a floating-point operation to complete, another thread can use the integer units.
  14. Without SMT, only a single thread can run at any given time. (Figure: the pipeline occupied by thread 1, a floating-point operation.)
  15. Without SMT, only a single thread can run at any given time. (Figure: the pipeline occupied by thread 2, an integer operation.)
  16. SMT processor: both threads can run concurrently. (Figure: thread 1 using the floating-point units while thread 2 uses the integer units.)
  17. But: they can’t simultaneously use the same functional unit. (Figure: threads 1 and 2 both targeting the single integer unit; this scenario is impossible with SMT on a single core.)
  18. SMT is not a “true” parallel processor • Enables better threading (e.g. up to 30% improvement) • The OS and applications perceive each simultaneous thread as a separate “virtual processor” • The chip has only a single copy of each resource • Compare to multi-core: each core has its own copy of resources.
  19. Multi-core: threads can run on separate cores. (Figure: two full pipelines, thread 1 on one core and thread 2 on the other.)
  20. Multi-core: threads can run on separate cores. (Figure: the same two cores, now running threads 3 and 4.)
  21. Combining multi-core and SMT • Cores can be SMT-enabled (or not) • The different combinations: – Single-core, non-SMT: standard uniprocessor – Single-core, with SMT – Multi-core, non-SMT – Multi-core, with SMT: our fish machines • The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads • Intel calls them “hyper-threads”.
  22. SMT dual-core: all four threads can run concurrently. (Figure: two SMT-enabled cores, threads 1 and 2 on one core, threads 3 and 4 on the other.)
  23. Designs with private L2 caches. (Figure, left: CORE0 and CORE1 each with a private L1 and L2 cache above shared memory; both L1 and L2 are private. Examples: AMD Opteron, AMD Athlon, Intel Pentium D. Figure, right: the same design with an added L3 cache layer. Example: Intel Itanium 2.)
  24. Private vs. shared caches? • Advantages/disadvantages?
  25. Private vs. shared caches • Advantages of private: – closer to the core, so faster access – reduces contention • Advantages of shared: – threads on different cores can share the same cache data – more cache space is available if a single (or a few) high-performance thread runs on the system.
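One practical consequence of private per-core caches that follows from this slide is false sharing: two threads writing to different variables that happen to sit on the same cache line force that line to bounce between the cores' private caches. The C# sketch below is my own illustration, not from the deck; the 64-byte cache-line size, the 128-byte padding distance, and the expected timing gap are assumptions about the host CPU.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharingDemo
{
    const int Iterations = 10_000_000;

    // Adjacent array slots typically share a 64-byte cache line, so two
    // cores incrementing counters[0] and counters[1] ping-pong the line
    // between their private caches.
    static long[] counters = new long[2];

    // Slots 0 and 16 are 128 bytes apart, so they land on different
    // cache lines and avoid that contention.
    static long[] padded = new long[32];

    // Each task owns exactly one slot, so the final counts are exact
    // even without atomic operations.
    static void Run(long[] slots, int a, int b)
    {
        var t1 = Task.Run(() => { for (int i = 0; i < Iterations; i++) slots[a]++; });
        var t2 = Task.Run(() => { for (int i = 0; i < Iterations; i++) slots[b]++; });
        Task.WaitAll(t1, t2);
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        Run(counters, 0, 1);
        Console.WriteLine($"shared line: {sw.ElapsedMilliseconds} ms, sum {counters[0] + counters[1]}");

        sw.Restart();
        Run(padded, 0, 16);
        Console.WriteLine($"padded:      {sw.ElapsedMilliseconds} ms, sum {padded[0] + padded[16]}");
    }
}
```

Both runs compute the same sums; on typical multi-core hardware the padded version tends to finish noticeably faster, though the exact ratio depends on the machine.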
  26. Parallel Architectures • Use multiple: – datapaths – memory units – processing units.
  27. Parallel Architectures • SIMD: single instruction stream, multiple data streams. (Figure: one control unit driving several processing units through an interconnect.)
  28. Parallel Architectures • MIMD: multiple instruction streams, multiple data streams. (Figure: several combined processing/control units connected by an interconnect.)
  29. Parallelism in Visual Studio 2010. (Figure: the parallel stack.) • Integrated tooling: Parallel Debugger tool windows, Concurrency Profiler, Concurrency Analysis • Managed programming models: PLINQ and the Task Parallel Library, with data structures, on a Concurrency Runtime (ThreadPool, Task Scheduler, Resource Manager) • Native programming models: the Parallel Pattern Library and Agents Library, with data structures, on a Concurrency Runtime (Task Scheduler, Resource Manager) • Everything runs on operating system threads. (Key: Tools, Native Library, Managed Library.)
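To make the managed side of that stack concrete, here is a minimal PLINQ sketch of my own (not from the deck): `AsParallel()` hands the query to PLINQ, which partitions the input range across worker threads, yet the result matches the sequential LINQ version.

```csharp
using System;
using System.Linq;

class PlinqDemo
{
    static void Main()
    {
        // AsParallel() turns the LINQ-to-Objects query into a PLINQ query;
        // the Select and Sum then run partitioned across worker threads.
        long parallelSum = Enumerable.Range(1, 1000)
                                     .AsParallel()
                                     .Select(n => (long)n * n)
                                     .Sum();

        Console.WriteLine(parallelSum); // sum of squares 1..1000
    }
}
```

Because summation is associative, the partitioned partial sums combine into the same total the sequential query would produce.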
  30. Multi-threading today • Divide the total number of activities across n processors • In the case of 2 processors, divide it by 2.
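A hypothetical sketch of that static "divide by n" scheme in C# (my own example, not the deck's): the input is split into equal chunks, one task per processor, and the partial results are combined at the end.

```csharp
using System;
using System.Threading.Tasks;

class StaticPartitionDemo
{
    static void Main()
    {
        int[] data = new int[1000];
        for (int i = 0; i < data.Length; i++) data[i] = i + 1; // 1..1000

        int procs = 2;                        // e.g. a dual-core machine
        int chunk = data.Length / procs;
        long[] partial = new long[procs];

        // Static partitioning: each of the n processors gets 1/n of the work.
        var tasks = new Task[procs];
        for (int p = 0; p < procs; p++)
        {
            int start = p * chunk;
            int end = (p == procs - 1) ? data.Length : start + chunk;
            int slot = p; // capture a per-iteration copy for the closure
            tasks[p] = Task.Run(() =>
            {
                long sum = 0;
                for (int i = start; i < end; i++) sum += data[i];
                partial[slot] = sum;
            });
        }
        Task.WaitAll(tasks);

        long total = 0;
        foreach (long s in partial) total += s;
        Console.WriteLine(total); // 1 + 2 + ... + 1000
    }
}
```

This static split works well when every chunk costs about the same; the work-stealing scheduler on the next slides exists precisely because real workloads are often less uniform.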
  31. User-mode scheduler. (Figure: the CLR thread pool: a single global queue feeding worker threads 1 through p, with work items queued by the program thread.)
  32. User-mode scheduler for tasks. (Figure: the CLR thread pool with work-stealing: a global queue plus a local queue per worker thread; tasks 1 through 6 are spread across the global and local queues, and idle workers steal tasks from other workers' local queues.)
  33. DEMO
  34. Task-based programming summary:

    // ThreadPool
    ThreadPool.QueueUserWorkItem(…);

    // System.Threading.Tasks: starting
    Task.Factory.StartNew(…);

    // Parent/child tasks
    var p = new Task(() => { var t = new Task(…); });

    // Continue / Wait / Cancel
    Task t = …;
    Task p = t.ContinueWith(…);
    t.Wait(2000);
    t.Cancel();

    // Tasks with results
    Task<int> f = new Task<int>(() => C());
    …
    int result = f.Result;
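The fragments above are elided slide snippets; a minimal runnable C# version of the start/continue/result pattern might look like this (the `6 * 7` computation and the message format are placeholders of mine):

```csharp
using System;
using System.Threading.Tasks;

class TaskDemo
{
    static void Main()
    {
        // Start a task that computes a value on the thread pool...
        Task<int> f = Task.Factory.StartNew(() => 6 * 7);

        // ...chain a continuation that runs when it completes...
        Task<string> done = f.ContinueWith(t => $"answer = {t.Result}");

        // ...and block for the final result.
        Console.WriteLine(done.Result);
    }
}
```

Reading `Result` (like `Wait`) blocks the calling thread until the task finishes, so in real code continuations are usually preferred over blocking.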
  35. Coordination Data Structures (1 of 3) • Concurrent collections: BlockingCollection<T>, ConcurrentBag<T>, ConcurrentDictionary<TKey,TValue>, ConcurrentLinkedList<T>, ConcurrentQueue<T>, ConcurrentStack<T>, IProducerConsumerCollection<T>, Partitioner, Partitioner<T>, OrderablePartitioner<T>. (Figure: producers block when the collection is full; consumers block when it is empty.)
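A sketch of the producer/consumer pattern from that figure, using `BlockingCollection<T>`. The bound of 4 and the 1..10 payload are arbitrary choices of mine; `using var` is modern C# shorthand that postdates the VS2010 era the deck targets.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumerDemo
{
    static void Main()
    {
        // Bounded collection: Add blocks when full, Take blocks when empty,
        // exactly the P/C behavior in the slide's diagram.
        using var queue = new BlockingCollection<int>(boundedCapacity: 4);

        var producer = Task.Run(() =>
        {
            for (int i = 1; i <= 10; i++) queue.Add(i);
            queue.CompleteAdding(); // tells consumers no more items are coming
        });

        // GetConsumingEnumerable blocks on an empty collection and ends
        // once the producer calls CompleteAdding and the queue drains.
        long sum = 0;
        foreach (int item in queue.GetConsumingEnumerable())
            sum += item;

        producer.Wait();
        Console.WriteLine(sum);
    }
}
```

The bounded capacity throttles the producer automatically, which is the main reason to reach for `BlockingCollection<T>` over a raw `ConcurrentQueue<T>`.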
  36. Coordination Data Structures (2 of 3) • Synchronization primitives: Barrier, CountdownEvent, ManualResetEventSlim, SemaphoreSlim, SpinLock, SpinWait. (Figure: a Barrier loop with a postPhaseAction, and a CountdownEvent.)
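A small sketch of the Barrier-with-postPhaseAction loop from the figure (the participant and phase counts are arbitrary values of mine): each participant signals at the end of a phase, and the post-phase action runs exactly once per phase after everyone has arrived.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class BarrierDemo
{
    static void Main()
    {
        int participants = 3;
        int phasesDone = 0;

        // The post-phase action runs once each time all participants have
        // called SignalAndWait; the others are still blocked, so the
        // increment needs no extra locking.
        var barrier = new Barrier(participants, b => phasesDone++);

        var tasks = new Task[participants];
        for (int p = 0; p < participants; p++)
        {
            tasks[p] = Task.Run(() =>
            {
                for (int phase = 0; phase < 5; phase++)
                    barrier.SignalAndWait(); // wait for everyone to finish the phase
            });
        }
        Task.WaitAll(tasks);

        Console.WriteLine(phasesDone);
    }
}
```

`CountdownEvent` covers the simpler one-shot case: initialize with a count, have workers call `Signal()`, and `Wait()` until the count reaches zero.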
  37. Coordination Data Structures (3 of 3) • Initialization primitives: Lazy<T>, LazyVariable<T>, LazyInitializer, ThreadLocal<T> • Cancellation primitives: CancellationToken, CancellationTokenSource, ICancelableOperation. (Figure: a CancellationTokenSource on one side of a thread boundary passing its token into MyMethod( ), Foo(…, CancellationToken ct), Bar(…, CancellationToken ct), and ManualResetEventSlim.Wait(ct).)
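A hypothetical sketch of mine combining both groups: `Lazy<T>` defers its factory until first access, and a `CancellationTokenSource` hands its token across the thread boundary so the work can be cancelled cooperatively. The `21 * 2` factory is a placeholder, and `using var` is newer C# syntax than the deck's era.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class InitAndCancelDemo
{
    static void Main()
    {
        // Lazy<T>: the factory runs once, on the first access to Value.
        var lazy = new Lazy<int>(() => 21 * 2);
        Console.WriteLine(lazy.Value);

        // The source stays on this side of the thread boundary; only the
        // token crosses into the worker, which polls it cooperatively.
        using var cts = new CancellationTokenSource();
        var task = Task.Run(() =>
        {
            while (true)
                cts.Token.ThrowIfCancellationRequested();
        }, cts.Token);

        cts.Cancel();
        try { task.Wait(); }
        catch (AggregateException) { Console.WriteLine("cancelled"); }
    }
}
```

Note that `LazyVariable<T>` and `ICancelableOperation` appeared in early previews of these libraries; the shipped .NET 4 surface kept `Lazy<T>`, `LazyInitializer`, `ThreadLocal<T>`, and the two cancellation types.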
