Parallel Programming

All about Parallel Programming

Transcript

  • 1. Learning and Development presents the OPEN TALK SERIES: a series of illuminating talks and interactions that open our minds to new ideas and concepts, that make us look for newer or better ways of doing what we did, or point us to exciting things we have never done before. A range of topics on Technology, Business, Fun and Life. Be part of the learning experience at Aditi. Join the talks. It's free. Free as in freedom at work, not free beer. Speak at these events, or bring an expert/friend to talk. Mail LEAD with topic and availability.
  • 2. Parallel Programming. Sundararajan Subramanian, Aditi Technologies.
  • 3. Introduction to parallel computing. The challenge: provide the abstractions, programming paradigms, and algorithms needed to effectively design, implement, and maintain applications that exploit the parallelism provided by the underlying hardware in order to solve modern problems.
  • 4. Single-core CPU chip. [Diagram: a chip containing a single core.]
  • 5. Multi-core architectures. [Diagram: a multi-core CPU chip with Core 1, Core 2, Core 3, and Core 4.]
  • 6. Multi-core CPU chip. The cores fit on a single processor socket. Also called CMP (Chip Multi-Processor). [Diagram: cores 1 through 4 on one chip.]
  • 7. The cores run in parallel. [Diagram: thread 1 through thread 4, one running on each core.]
  • 8. Within each core, threads are time-sliced (just like on a uniprocessor). [Diagram: several threads multiplexed onto each of the four cores.]
  • 9. Instruction-level parallelism. Parallelism at the machine-instruction level: the processor can re-order and pipeline instructions, split them into micro-instructions, do aggressive branch prediction, etc. Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years.
  • 10. Instruction-level parallelism: example. In the first loop both increments target a[0], so the second depends on the first and they must execute in order; in the second loop a[0]++ and a[1]++ are independent, so the processor can issue them in parallel. for (int i = 0; i < 1000; i++) { a[0]++; a[0]++; } versus for (int i = 0; i < 1000; i++) { a[0]++; a[1]++; }
  • 11. Thread-level parallelism (TLP). This is parallelism on a coarser scale: a server can serve each client in a separate thread (web server, database server), and a computer game can do AI, graphics, and physics in three separate threads. Single-core superscalar processors cannot fully exploit TLP; multi-core architectures are the next step in processor evolution, explicitly exploiting TLP. A C# sketch of the game example follows below.
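As a rough C# illustration of the game example on this slide (the method names and trivial workloads are invented for illustration), each coarse-grained activity gets its own thread, and on a multi-core machine the OS can schedule them on separate cores:

```csharp
using System;
using System.Threading;

class TlpSketch
{
    // Three independent units of work, one per thread, mirroring the
    // "AI / graphics / physics" example from the slide.
    static void RunAi()       { Console.WriteLine("AI step on thread "       + Thread.CurrentThread.ManagedThreadId); }
    static void RunPhysics()  { Console.WriteLine("Physics step on thread "  + Thread.CurrentThread.ManagedThreadId); }
    static void RunGraphics() { Console.WriteLine("Graphics step on thread " + Thread.CurrentThread.ManagedThreadId); }

    static void Main()
    {
        var threads = new[]
        {
            new Thread(RunAi),
            new Thread(RunPhysics),
            new Thread(RunGraphics)
        };
        foreach (var t in threads) t.Start();   // each thread can run on its own core
        foreach (var t in threads) t.Join();    // wait for all three to finish
    }
}
```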
  • 12. A technique complementary to multi-core: simultaneous multithreading. Problem addressed: the processor pipeline can get stalled while waiting for the result of a long floating-point (or integer) operation, or waiting for data to arrive from memory, so the other execution units wait unused. [Diagram: processor pipeline with L1 D-cache and D-TLB, L2 cache and control, integer and floating-point schedulers, uop queues, rename/alloc, BTB trace cache, uCode ROM, decoder, bus, and BTB and I-TLB. Source: Intel.]
  • 13. Simultaneous multithreading (SMT). Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core, weaving together multiple "threads" on the same core. Example: if one thread is waiting for a floating-point operation to complete, another thread can use the integer units.
  • 14. Without SMT, only a single thread can run at any given time. [Diagram: the pipeline occupied only by Thread 1, a floating-point operation.]
  • 15. Without SMT, only a single thread can run at any given time. [Diagram: the pipeline occupied only by Thread 2, an integer operation.]
  • 16. SMT processor: both threads can run concurrently. [Diagram: Thread 1 (floating point) and Thread 2 (integer) sharing the same pipeline at the same time.]
  • 17. But: the threads can't simultaneously use the same functional unit. [Diagram: Thread 1 and Thread 2 both needing the integer unit; this scenario is impossible with SMT on a single core, assuming a single integer unit.]
  • 18. SMT is not a "true" parallel processor. It enables better threading (e.g. up to 30%). The OS and applications perceive each simultaneous thread as a separate "virtual processor", but the chip has only a single copy of each resource. Compare to multi-core: each core has its own copy of resources.
  • 19. Multi-core: threads can run on separate cores. [Diagram: two complete pipelines, with Thread 1 on one core and Thread 2 on the other.]
  • 20. Multi-core: threads can run on separate cores. [Diagram: the same two cores, now running Thread 3 and Thread 4.]
  • 21. Combining multi-core and SMT. Cores can be SMT-enabled (or not). The different combinations: single-core, non-SMT (a standard uniprocessor); single-core with SMT; multi-core, non-SMT; multi-core with SMT (our fish machines). The number of SMT threads per core: 2, 4, or sometimes 8 simultaneous threads. Intel calls them "hyper-threads".
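As a minimal sketch (not from the slides): .NET reports the logical processors, i.e. the hardware threads the OS sees, so on an SMT/hyper-threaded machine this count is typically a multiple of the physical core count.

```csharp
using System;

class CpuInfo
{
    static void Main()
    {
        // Environment.ProcessorCount returns the number of logical processors
        // (hardware threads) visible to the OS. On a hyper-threaded dual-core
        // machine this typically prints 4, even though there are only 2 cores.
        Console.WriteLine("Logical processors: " + Environment.ProcessorCount);
    }
}
```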
  • 22. SMT dual-core: all four threads can run concurrently. [Diagram: two SMT-enabled cores, with Threads 1 and 2 on one core and Threads 3 and 4 on the other.]
  • 23. Designs with private L2 caches. [Diagram, left: each core has private L1 and L2 caches above memory; examples: AMD Opteron, AMD Athlon, Intel Pentium D. Diagram, right: a design that adds L3 caches between the L2s and memory; example: Intel Itanium 2.]
  • 24. Private vs. shared caches? What are the advantages and disadvantages?
  • 25. Private vs. shared caches. Advantages of private caches: they are closer to the core, so access is faster, and contention is reduced. Advantages of shared caches: threads on different cores can share the same cached data, and more cache space is available if a single (or a few) high-performance thread runs on the system.
  • 26. Parallel architectures use multiple datapaths, memory units, and processing units.
  • 27. Parallel architectures: SIMD (single instruction stream, multiple data streams). [Diagram: one control unit driving several processing units through an interconnect.]
  • 28. Parallel architectures: MIMD (multiple instruction streams, multiple data streams). [Diagram: several independent processing/control units connected by an interconnect.]
  • 29. Parallelism in Visual Studio 2010. [Diagram of the parallel stack. Integrated tooling: Parallel Debugger Toolwindows and Profiler Concurrency Analysis. Managed programming models: PLINQ and the Task Parallel Library, over Data Structures and a Concurrency Runtime (ThreadPool with Task Scheduler and Resource Manager). Native programming models: the Parallel Pattern Library and Agents Library, over Data Structures and a Concurrency Runtime (Task Scheduler and Resource Manager). Everything sits on operating-system threads. Key: tools, native library, managed library.]
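As a minimal sketch of the managed side of this stack (assuming .NET 4.0; the data and per-iteration work are invented for illustration), PLINQ and the Task Parallel Library let a query or a loop be spread across all cores:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class Vs2010ParallelSketch
{
    static void Main()
    {
        var numbers = Enumerable.Range(1, 1000000);

        // PLINQ: the query is partitioned and executed on multiple cores.
        long sumOfSquares = numbers.AsParallel()
                                   .Select(n => (long)n * n)
                                   .Sum();

        // Task Parallel Library: a data-parallel loop over the same range.
        Parallel.For(1, 1000001, i => { /* independent per-iteration work */ });

        Console.WriteLine(sumOfSquares);
    }
}
```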
  • 30. Multithreading today: divide the total number of activities across the n available processors; with 2 processors, divide it by 2. A sketch of this manual partitioning follows below.
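A rough C# sketch of this manual partitioning (the array and the per-element work are invented for illustration): split the index range into one contiguous chunk per processor and give each chunk its own thread.

```csharp
using System;
using System.Threading;

class ManualPartition
{
    static void Main()
    {
        double[] data = new double[1000000];     // illustrative workload
        int procs = Environment.ProcessorCount;  // the n processors
        int chunk = data.Length / procs;
        var threads = new Thread[procs];

        for (int p = 0; p < procs; p++)
        {
            int start = p * chunk;
            // The last thread also takes any remainder of the division.
            int end = (p == procs - 1) ? data.Length : start + chunk;
            threads[p] = new Thread(() =>
            {
                for (int i = start; i < end; i++)
                    data[i] = Math.Sqrt(i);      // independent per-element work
            });
            threads[p].Start();
        }
        foreach (var t in threads) t.Join();
        Console.WriteLine("Done on " + procs + " threads.");
    }
}
```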
  • 31. User-mode scheduler: the CLR thread pool. [Diagram: the program thread places work items on a single global queue; worker threads 1 through p dequeue and run them.]
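A minimal sketch of handing work to that global queue (the work-item body is invented for illustration):

```csharp
using System;
using System.Threading;

class ThreadPoolSketch
{
    static void Main()
    {
        using (var done = new CountdownEvent(10))
        {
            for (int i = 0; i < 10; i++)
            {
                int id = i;  // capture a copy, not the loop variable
                // Each call places a work item on the thread pool's queue;
                // a pool worker thread picks it up and runs it.
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    Console.WriteLine("Item " + id + " on worker thread " +
                                      Thread.CurrentThread.ManagedThreadId);
                    done.Signal();
                });
            }
            done.Wait();  // block until all 10 work items have completed
        }
    }
}
```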
  • 32. User-mode scheduler for tasks: the CLR thread pool with work-stealing. [Diagram: the program thread still feeds a global queue, but each worker thread (1 through p) also has a local queue of tasks (Task 1 through Task 6 in the figure); idle workers steal tasks from other workers' local queues.]
  • 33. DEMO
  • 34. Task-based programming summary.
    ThreadPool: ThreadPool.QueueUserWorkItem(…);
    System.Threading.Tasks:
    Starting: Task.Factory.StartNew(…);
    Parent/child: var p = new Task(() => { var t = new Task(…); });
    Continue/Wait/Cancel: Task t = …; Task p = t.ContinueWith(…); t.Wait(2000); t.Cancel();
    Tasks with results: Task<int> f = new Task<int>(() => C()); …; int result = f.Result;
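A runnable C# sketch of these pieces (assuming .NET 4.0; the work done by the tasks and the C() helper are invented for illustration, and note that in the released API a task is cancelled through a CancellationToken rather than a t.Cancel() method as shown on the slide):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class TaskSummarySketch
{
    static int C() { return 42; }   // illustrative work returning a result

    static void Main()
    {
        // Classic thread pool work item.
        ThreadPool.QueueUserWorkItem(_ => Console.WriteLine("pool work item"));

        // Starting a task.
        Task t = Task.Factory.StartNew(() => Console.WriteLine("task body"));

        // Continuation, then a bounded wait.
        Task cont = t.ContinueWith(prev => Console.WriteLine("after task"));
        cont.Wait(2000);

        // Task with a result: Result blocks until the value is available.
        Task<int> f = Task.Factory.StartNew(() => C());
        int result = f.Result;
        Console.WriteLine("result = " + result);

        // Cancellation in the released API: pass a CancellationToken.
        var cts = new CancellationTokenSource();
        Task cancellable = Task.Factory.StartNew(() =>
        {
            cts.Token.ThrowIfCancellationRequested();
        }, cts.Token);
        cts.Cancel();
    }
}
```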
  • 35. Coordination data structures (1 of 3): concurrent collections. BlockingCollection<T>, ConcurrentBag<T>, ConcurrentDictionary<TKey,TValue>, ConcurrentLinkedList<T>, ConcurrentQueue<T>, ConcurrentStack<T>, IProducerConsumerCollection<T>, Partitioner, Partitioner<T>, OrderablePartitioner<T>. [Diagram: producers and consumers on a bounded BlockingCollection<T>; producers block if it is full, consumers block if it is empty.]
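A small producer/consumer sketch with BlockingCollection<T> (bounded capacity and item values are invented for illustration):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumerSketch
{
    static void Main()
    {
        // Bounded to 4 items: Add blocks when full, Take blocks when empty.
        using (var queue = new BlockingCollection<int>(boundedCapacity: 4))
        {
            var producer = Task.Factory.StartNew(() =>
            {
                for (int i = 0; i < 10; i++)
                    queue.Add(i);              // blocks if the collection is full
                queue.CompleteAdding();        // signal that no more items will arrive
            });

            var consumer = Task.Factory.StartNew(() =>
            {
                // GetConsumingEnumerable blocks when empty and ends after CompleteAdding.
                foreach (int item in queue.GetConsumingEnumerable())
                    Console.WriteLine("consumed " + item);
            });

            Task.WaitAll(producer, consumer);
        }
    }
}
```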
  • 36. Coordination data structures (2 of 3): synchronization primitives. Barrier, CountdownEvent, ManualResetEventSlim, SemaphoreSlim, SpinLock, SpinWait. [Diagram: threads looping through Barrier phases with a post-phase action, and a CountdownEvent being signalled down to zero.]
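A minimal sketch of two of these primitives (the participant counts, phase counts, and messages are invented for illustration):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class SyncPrimitivesSketch
{
    static void Main()
    {
        // Barrier: 3 participants meet at the end of each phase; the
        // post-phase action runs once per phase after all have arrived.
        var barrier = new Barrier(3, b => Console.WriteLine("phase " + b.CurrentPhaseNumber + " done"));
        var workers = new Task[3];
        for (int w = 0; w < 3; w++)
        {
            workers[w] = Task.Factory.StartNew(() =>
            {
                for (int phase = 0; phase < 2; phase++)
                    barrier.SignalAndWait();   // wait here for the other participants
            });
        }
        Task.WaitAll(workers);

        // CountdownEvent: Wait releases once Signal has been called 3 times.
        var countdown = new CountdownEvent(3);
        var signalers = new Task[3];
        for (int i = 0; i < 3; i++)
            signalers[i] = Task.Factory.StartNew(() => countdown.Signal());
        countdown.Wait();
        Task.WaitAll(signalers);
        Console.WriteLine("all three signals received");
    }
}
```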
  • 37. Coordination data structures (3 of 3). Initialization primitives: Lazy<T>, LazyVariable<T>, LazyInitializer, ThreadLocal<T>. Cancellation primitives: CancellationToken, CancellationTokenSource, ICancelableOperation. [Diagram: a CancellationTokenSource in MyMethod() passes its token across the thread boundary into Foo(…, CancellationToken ct) and Bar(…, CancellationToken ct), down to ManualResetEventSlim.Wait(ct).]
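A small sketch of co-operative cancellation flowing through calls as in the diagram (the names Foo and Bar come from the slide; the work they do and the timings are invented for illustration):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class CancellationSketch
{
    static void Foo(CancellationToken ct)
    {
        for (int i = 0; i < 100; i++)
        {
            ct.ThrowIfCancellationRequested();   // co-operative cancellation check
            Thread.Sleep(10);                    // stand-in for real work
        }
    }

    static void Bar(ManualResetEventSlim gate, CancellationToken ct)
    {
        gate.Wait(ct);   // this blocking wait is released early if the token is cancelled
    }

    static void Main()
    {
        var source = new CancellationTokenSource();
        var gate = new ManualResetEventSlim(false);

        Task work = Task.Factory.StartNew(() => Foo(source.Token), source.Token);
        Task wait = Task.Factory.StartNew(() => Bar(gate, source.Token), source.Token);

        Thread.Sleep(50);
        source.Cancel();   // both the loop and the blocked wait observe the same token

        try { Task.WaitAll(work, wait); }
        catch (AggregateException) { Console.WriteLine("tasks were cancelled"); }
    }
}
```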