Multicore Programming and TPL (.NET Day)

Multicore Programming

 Prepared by: Yan Drugalya
 e-mail: ydrugalya@gmail.com
 Twitter: @ydrugalya
Agenda
 Part 1 - Current state of affairs
 Part 2 - Multithreaded algorithms
 Part 3 - Task Parallel Library
Multicore Programming
Part 1: Current state of affairs
Why Moore's law is not working anymore
 Power consumption
 Wire delays
 DRAM access latency
 Diminishing returns of more instruction-level parallelism
Power consumption

[Figure: power density (W/cm²) on a log scale from 1 to 10,000 for the 8080, 386, 486, and Pentium® processors, '70-'10, with reference levels for Hot Plate, Nuclear Reactor, Rocket Nozzle, and Sun's Surface. Source: Intel Developer Forum, Spring 2004 - Pat Gelsinger]
Wire delays
DRAM access latency
Diminishing returns
 '80s: 10 CPI → 1 CPI
 '90s: 1 CPI → 0.5 CPI
 '00s: multicore
The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software

                        Herb Sutter
Survival
 To scale performance, put many processing cores on the
  microprocessor chip
 The new edition of Moore's law is about doubling the number
  of cores
Quotations
 No matter how fast processors get, software
  consistently finds new ways to eat up the extra speed
 "If you haven't done so already, now is the time to take
  a hard look at the design of your application,
  determine what operations are CPU-sensitive now or
  are likely to become so soon, and identify how those
  places could benefit from concurrency."
        -- Herb Sutter, C++ Architect at Microsoft (March 2005)
 "After decades of single core processors, the high
  volume processor industry has gone from single to
  dual to quad-core in just the last two years. Moore's
  Law scaling should easily let us hit the 80-core mark
  in mainstream processors within the next ten years
  and quite possibly even less."
        -- Justin Rattner, CTO, Intel (February 2007)
What keeps us away from multicore
 Sequential way of thinking
 The belief that parallel programming is difficult and
  error-prone
 Unwillingness to accept that the sequential era is over
 Neglecting performance
What has been done
 Many frameworks have been created that bring
  parallelism to the application level
 Vendors try hard to teach the programming
  community how to write parallel programs
 MIT and other education centers have done a lot of
  research in this area
Multicore Programming
Part 2: Multithreaded algorithms
Chapter 27: Multithreaded Algorithms
Multithreaded algorithms
 No single architecture of parallel computers → no
  single, widely accepted model of parallel computing
 We rely on a parallel shared-memory computer
Dynamic multithreaded model (DMM)
 Allows the programmer to work with "logical
  parallelism" without worrying about the issues of
  static-thread programming
 Two main features:
   Nested parallelism (the parent can proceed while a
    spawned child is computing its result)
   Parallel loops (iterations of a loop can execute
    concurrently)
DMM - advantages
 A simple extension of the serial model: only 3 new
  keywords: parallel, spawn and sync
 Provides a theoretically clean way to quantify
  parallelism, based on the notions of "work" and "span"
 Many MT algorithms based on nested parallelism
  follow naturally from the divide-and-conquer approach
Multithreaded execution model
Work

Span

Speedup

Parallelism

Performance summary
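In brief, the definitions these slides cover, as given in the cited Chapter 27:

  Work: $T_1$, the total running time on a single processor
  Span: $T_\infty$, the running time on unboundedly many processors (the longest chain of dependencies)
  Speedup on $P$ processors: $T_1 / T_P$, bounded by the work law $T_P \ge T_1 / P$ and the span law $T_P \ge T_\infty$
  Parallelism: $T_1 / T_\infty$, the maximum achievable speedup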

Example: fib(4)
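The slide's figure of the fib(4) computation DAG is not reproduced here. As a rough TPL sketch of the same spawn/sync idea (a hand-written illustration, not the talk's code), Task.Run plays the role of spawn and waiting on the task plays the role of sync:

using System;
using System.Threading.Tasks;

class Fib
{
    // Parallel Fibonacci in the style of CLRS P-FIB.
    // Note: spawning a task per call is wasteful in practice;
    // this only illustrates the spawn/sync structure.
    static long ParallelFib(int n)
    {
        if (n < 2) return n;                        // base case
        var x = Task.Run(() => ParallelFib(n - 1)); // "spawn" the child
        long y = ParallelFib(n - 2);                // runs in parallel with x
        return x.Result + y;                        // "sync": wait for the child
    }

    static void Main()
    {
        Console.WriteLine(ParallelFib(4)); // prints 3
    }
}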
Scheduler role
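From the cited Chapter 27: a greedy scheduler, which at each step assigns as many ready strands as there are available processors, achieves $T_P \le T_1/P + T_\infty$, within a factor of two of optimal.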

Analyzing MT algorithms: Matrix multiplication
P-Square-Matrix-Multiply(A, B):
  n = A.rows
  let C be a new n x n matrix
  parallel for i = 1 to n
    parallel for j = 1 to n
      C[i,j] = 0
      for k = 1 to n
        C[i,j] = C[i,j] + A[i,k] * B[k,j]
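A minimal C# sketch of the same loop structure (assuming square double[,] matrices; Parallel.For stands in for the outer parallel for):

using System;
using System.Threading.Tasks;

class MatrixMultiply
{
    // Only the outer loop is parallelized; the TPL scheduler
    // decides how its iterations map to threads.
    static double[,] Multiply(double[,] a, double[,] b)
    {
        int n = a.GetLength(0);
        var c = new double[n, n];
        Parallel.For(0, n, i =>
        {
            for (int j = 0; j < n; j++)
            {
                c[i, j] = 0;
                for (int k = 0; k < n; k++)
                    c[i, j] += a[i, k] * b[k, j];
            }
        });
        return c;
    }
}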
Analyzing MT algorithms: Matrix multiplication
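Per the cited Chapter 27, the analysis this slide illustrated: work $T_1(n) = \Theta(n^3)$; span $T_\infty(n) = \Theta(n)$, since each parallel for contributes $\Theta(\lg n)$ and the serial innermost loop contributes $\Theta(n)$; hence parallelism $T_1/T_\infty = \Theta(n^2)$.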

Chess Lesson

Multicore Programming
Part 3: Task Parallel Library
TPL building blocks
 Consists of:
  - Tasks
  - Thread-Safe Scalable Collections
  - Phases and Work Exchange
  - Partitioning
  - Looping
  - Control
  - Breaking
  - Exceptions
  - Results
Data parallelism

Parallel.ForEach(letters, ch => Capitalize(ch));
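A self-contained version of the snippet for reference; letters and Capitalize are the slide's placeholder names, so their definitions below are assumptions:

using System;
using System.Threading.Tasks;

class DataParallelismDemo
{
    static void Main()
    {
        // Hypothetical input matching the slide's snippet.
        string[] letters = { "a", "b", "c", "d" };

        // Parallel.ForEach partitions the input and runs the body
        // on worker threads from the thread pool.
        Parallel.ForEach(letters, ch => Capitalize(ch));
    }

    static void Capitalize(string ch)
    {
        Console.WriteLine(ch.ToUpper());
    }
}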
Task parallelism

Parallel.Invoke(() => Average(), () => Minimum(), …);
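Likewise, a self-contained sketch; Average and Minimum are the slide's placeholder names, and the data they operate on is assumed:

using System;
using System.Linq;
using System.Threading.Tasks;

class TaskParallelismDemo
{
    static int[] data = { 5, 1, 4, 2, 3 };

    static void Average() => Console.WriteLine(data.Average());
    static void Minimum() => Console.WriteLine(data.Min());

    static void Main()
    {
        // Parallel.Invoke runs the delegates potentially in parallel
        // and returns once all of them have completed.
        Parallel.Invoke(() => Average(), () => Minimum());
    }
}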
Thread Pool in .NET 3.5
Thread Pool in .NET 4.0
Task Scheduler & Thread pool
 .NET 3.5 ThreadPool.QueueUserWorkItem disadvantages:
   Zero information about each work item
   A fair FIFO queue must be maintained
 .NET 4.0 improvements:
   More efficient FIFO queue (ConcurrentQueue)
   Enhanced API to get more information from the user:
     Task
     Work stealing
     Thread injection
     Waiting for completion, handling exceptions, getting the
      computation result (see the sketch below)
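A minimal sketch of what the Task API adds over QueueUserWorkItem (the computed values are illustrative):

using System;
using System.Threading.Tasks;

class TaskBasicsDemo
{
    static void Main()
    {
        // Unlike QueueUserWorkItem, a Task carries information about
        // the work item: you can wait on it, read its result, and
        // observe its exceptions.
        Task<int> sum = Task.Run(() =>
        {
            int total = 0;
            for (int i = 1; i <= 100; i++) total += i;
            return total;
        });

        try
        {
            // Accessing Result waits for the task to complete.
            Console.WriteLine(sum.Result); // 5050
        }
        catch (AggregateException ae)
        {
            // Exceptions thrown inside the task surface here,
            // wrapped in an AggregateException.
            foreach (var e in ae.InnerExceptions)
                Console.WriteLine(e.Message);
        }
    }
}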
New Primitives
 Thread-safe, scalable collections
   IProducerConsumerCollection<T>
     ConcurrentQueue<T>
     ConcurrentStack<T>
     ConcurrentBag<T>
     ConcurrentDictionary<TKey,TValue>
 Phases and work exchange
   Barrier
   BlockingCollection<T>
   CountdownEvent
 Partitioning
   {Orderable}Partitioner<T>
     Partitioner.Create
 Exception handling
   AggregateException
 Initialization
   Lazy<T>
     LazyInitializer.EnsureInitialized<T>
   ThreadLocal<T>
 Locks
   ManualResetEventSlim
   SemaphoreSlim
   SpinLock
   SpinWait
 Cancellation
   CancellationToken{Source}
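As one illustration of how several of these primitives combine (an illustrative sketch, not from the talk): a producer/consumer pipeline using BlockingCollection<T> (backed by ConcurrentQueue<T> by default) together with a CancellationToken:

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class PrimitivesDemo
{
    static void Main()
    {
        // BlockingCollection blocks consumers until items arrive
        // or adding is marked complete.
        using var queue = new BlockingCollection<int>(boundedCapacity: 10);
        using var cts = new CancellationTokenSource();

        var producer = Task.Run(() =>
        {
            for (int i = 0; i < 5; i++) queue.Add(i, cts.Token);
            queue.CompleteAdding(); // signal: no more items
        });

        var consumer = Task.Run(() =>
        {
            // GetConsumingEnumerable ends when adding is complete
            // and the collection has been drained.
            foreach (int item in queue.GetConsumingEnumerable(cts.Token))
                Console.WriteLine(item);
        });

        Task.WaitAll(producer, consumer);
    }
}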
References
 The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software
 MIT Introduction to Algorithms video lectures
 Chapter 27 "Multithreaded Algorithms" from Introduction to Algorithms, 3rd edition
 CLR 4.0 ThreadPool Improvements: Part 1
 Multicore Programming Primer
 ThreadPool on Channel 9