Introduction To Parallel Computing
This talk gives a first introduction to parallel, concurrent, and distributed computing.
  1. Introduction to Parallel Computing. Jörn Dinkla, http://www.dinkla.com, Version 1.1
  2. Dipl.-Inform. Jörn Dinkla: Java (J2SE, JEE); programming languages (Scala, Groovy, Haskell); parallel computing, GPU computing; model-driven Eclipse plug-ins.
  3. Overview: progress in computing; traditional hardware and software; theoretical computer science (algorithms, machines, optimization); parallelization; parallel hardware and software.
  4. Progress in Computing: (1) new applications that were not feasible, not needed, or not possible before; (2) better applications: faster, more data, better quality (precision, accuracy, exactness).
  5. Progress in Computing needs two ingredients. Hardware: machine(s) to execute the program. Software: a model/language to formulate the program, libraries, methods.
  6. How was progress achieved? Hardware: CPUs, memory, disks, and networks became faster and larger. Software: new and better algorithms, programming methods and languages.
  7. Traditional Hardware: the von Neumann architecture, with CPU, I/O, and memory connected by a bus. John Backus called this bus the "von Neumann bottleneck" (1977); caches alleviate it.
  8. Improvements: increasing clock frequency, memory hierarchy / caches, parallelizing the ALU, pipelining, very long instruction words (VLIW), instruction-level parallelism (ILP), superscalar processors, vector data types, multithreading, multicore / manycore.
  9. Moore's law (transistor counts double roughly every two years): expected to hold until about 2020.
  10. Clock frequency: no increase since 2005.
  11. Physical Limits: increasing the clock frequency sharply increases energy consumption and heat dissipation, and transistor size cannot shrink forever. Faster processors impossible!?
  12. 2005: "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", Herb Sutter, Dr. Dobb's Journal, March 2005.
  13. Multicore: the transistor count still doubles every 2-3 years, but single-core calculation speed no longer increases; hence multicore. Can it be used efficiently?
  14. How to use the cores? A multi-tasking OS runs different tasks on them. Speeding up the same task: assume 2 CPUs, divide the problem in half, let each CPU calculate one half. Is the time taken half of the original time?
  15. Traditional Software: computation is expressed as an algorithm, "a step-by-step procedure for calculations"; algorithm = logic + control. Example: (1) open file, (2) for all records in the file, add the salary, (3) close file, (4) print the sum of the salaries. Keywords: sequential, serial, deterministic. (A Java sketch follows.)
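The salary example as a minimal Java sketch; the file name salaries.txt and the one-salary-per-line format are assumptions for illustration:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class SalarySum {
        public static void main(String[] args) throws IOException {
            long sum = 0;
            // (1) open the file, (2) add every salary, one per line
            for (String line : Files.readAllLines(Path.of("salaries.txt"))) {
                sum += Long.parseLong(line.trim());
            }
            // (4) print the sum of the salaries
            System.out.println("Sum of salaries: " + sum);
        }
    }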
  16. Traditional Software improvements: better algorithms, programming languages (OO), development methods (agile). Limits: theoretical computer science, complexity theory (NP, P, NC).
  17. Architecture simplification: ignore the bus; a machine is just CPU, I/O, and memory.
  18. More than one CPU? How should they communicate?
  19. Message Passing: a distributed system, loose coupling; each CPU has its own I/O and memory, and the CPUs exchange messages over a network.
  20. Shared Memory: tight coupling; several CPUs access the same memory.
  21. Shared Memory: global vs. local memory, a memory hierarchy; each CPU has local memory and access to a common shared memory.
  22. Overview: Memory. Unshared memory: message passing, actors. Shared memory: threads. Memory hierarchies / hybrid: Partitioned Global Address Space (PGAS). Transactional memory.
  23. Sequential Algorithms: the Random Access Machine (RAM) works step by step, deterministically. Program: int sum = 0; for i = 0 to 4: sum += mem[i]; mem[5] = sum. Memory before: mem[0..4] = 3, 7, 5, 1, 2; after the run, mem[5] = 18.
  24. Sequential Algorithms: trace of the program. The input cells mem[0..4] = 3, 7, 5, 1, 2 stay fixed while the running sum in mem[5] takes the values 0, 3, 10, 15, 16, and finally 18. (A Java translation follows.)
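The same RAM program as a minimal, runnable Java sketch:

    public class RamSum {
        public static void main(String[] args) {
            // cells 0..4 hold the input, cell 5 receives the result
            int[] mem = {3, 7, 5, 1, 2, 0};
            int sum = 0;
            for (int i = 0; i <= 4; i++) {
                sum += mem[i];
            }
            mem[5] = sum;
            System.out.println(mem[5]); // 18, as in the trace
        }
    }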
  25. More than one CPU: how many programs should run? One program: either in lock-step (all processors do the same) or in any order. More than one program: a distributed system.
  26. Two Processors: PC 1 runs int sum = 0; for i = 0 to 2: sum += mem[i]; mem[5] = sum, and PC 2 runs the same with i = 3 to 4, in lockstep. Both work on the shared memory mem[0..4] = 3, 7, 5, 1, 2, and both write mem[5]. Watch the memory access!
  27. Flynn's Taxonomy (1966), classified by instruction and data streams:

                          Instruction
                      Single    Multiple
    Data   Single     SISD      MISD
           Multiple   SIMD      MIMD
  28. Flynn's Taxonomy: SISD: RAM, von Neumann. SIMD: lockstep, vector processors, GPUs. MISD: fault tolerance. MIMD: distributed systems.
  29. Extension of MIMD: how many programs? SPMD: one program, but not in lockstep as in SIMD. MPMD: many programs.
  30. Processes & Threads. Process: managed by the operating system, own address space, IPC, heavyweight, contains 1..* threads. Thread: the smallest unit of execution, lightweight.
  31. Overview: Algorithms: sequential, parallel, concurrent (the categories overlap), distributed, randomized, quantum.
  32. Computer Science: theoretical computer science studied parallelism long before 2005: Gibbons & Rytter (1989), Ben-Ari (1990), Lynch (1996).
  33. Gap between Theory and Practice: "galactic" algorithms, written for abstract machines (PRAM, special networks, etc.) under simplifying assumptions: no boundaries, exact arithmetic, infinite memory, network speed, etc.
  34. Sequential algorithms: implementing one involves the machine architecture, the programming language, performance (processor, memory, and cache speed), and boundary cases. Sometimes hard.
  35. Parallel algorithms: implementing one means adapting the algorithm to the real architecture (no PRAM or sorting network!), plus the problems of shared memory and synchronization. Harder!
  36. Parallelization: transforming a sequential into a parallel algorithm. Tasks: adapt to the architecture, rewrite, and test correctness with respect to the "golden" sequential code (see the sketch below).
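One way to run such a test, as a sketch: Java's parallel streams serve as the parallel version and a plain loop as the golden sequential code; the data size and seed are arbitrary illustrative choices.

    import java.util.Arrays;
    import java.util.Random;

    public class GoldenTest {
        public static void main(String[] args) {
            int[] data = new Random(42).ints(1_000_000, 0, 100).toArray();

            long seq = 0;                 // "golden" sequential sum
            for (int v : data) seq += v;

            // parallel version under test
            long par = Arrays.stream(data).parallel().asLongStream().sum();

            System.out.println(seq == par ? "OK" : "MISMATCH");
        }
    }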
  37. Granularity: how "big" are the threads, i.e. how much computation does each perform? Coarse vs. fine grain; the right choice is important for good performance and part of algorithm design.
  38. Computational thinking "... is the thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent." (Cuny, Snyder, Wing 2010)
  39. Computational thinking "... is the new literacy of the 21st Century." (Cuny, Snyder, Wing 2010) Expert level is needed for parallelization!
  40. Problems: Shared Memory. Destructive updates such as i += 1: with parallel, independent processes, how do the others know that i increased? Synchronization and memory barriers are needed, which is complicated for beginners. (A demonstration of the lost update follows.)
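A minimal demonstration of why i += 1 is dangerous: the increment is a read-modify-write, so concurrent updates get lost.

    public class LostUpdate {
        static int i = 0;

        public static void main(String[] args) throws InterruptedException {
            // two threads increment i 100,000 times each
            Runnable inc = () -> { for (int k = 0; k < 100_000; k++) i += 1; };
            Thread a = new Thread(inc), b = new Thread(inc);
            a.start(); b.start();
            a.join(); b.join();
            // usually prints less than 200000: updates were lost
            System.out.println(i);
        }
    }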
  41. Problems: Shared Memory: in the two-processor sum, both PCs end with mem[5] = sum. Which one writes first? One result simply overwrites the other.
  42. Problems: Shared Memory, corrected: each PC computes its partial sum; PC 1 writes mem[5] = sum, both wait at a sync() barrier, then PC 2 adds its part with mem[5] += sum. Synchronization is needed. (A runnable Java version follows.)
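The same idea in Java, as a sketch: each thread computes a private partial sum and adds it atomically to the shared total, which stands in for the explicit sync()/add sequence on the slide.

    import java.util.concurrent.atomic.AtomicInteger;

    public class TwoThreadSum {
        public static void main(String[] args) throws InterruptedException {
            int[] mem = {3, 7, 5, 1, 2};
            AtomicInteger total = new AtomicInteger(0);

            Thread t1 = new Thread(() -> {
                int sum = 0;
                for (int i = 0; i <= 2; i++) sum += mem[i];
                total.addAndGet(sum); // atomic: no lost update
            });
            Thread t2 = new Thread(() -> {
                int sum = 0;
                for (int i = 3; i <= 4; i++) sum += mem[i];
                total.addAndGet(sum);
            });

            t1.start(); t2.start();
            t1.join(); t2.join(); // wait for both partial sums
            System.out.println(total.get()); // 18
        }
    }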
  43. Problems: Shared Memory: the memory barrier. When is a value actually read or written? Optimizing compilers change the semantics: int a = b + 5 becomes "read b; add 5, store the temporary in c; write c to a". Solutions in Java: volatile and java.util.concurrent.atomic (sketched below).
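The two Java remedies side by side, as a minimal sketch:

    import java.util.concurrent.atomic.AtomicInteger;

    public class Visibility {
        // volatile: writes become visible to all threads (memory barrier),
        // but compound updates are still not atomic
        static volatile boolean done = false;

        // atomic: read-modify-write as one indivisible operation
        static final AtomicInteger counter = new AtomicInteger(0);

        public static void main(String[] args) {
            counter.incrementAndGet(); // a safe i += 1
            done = true;               // visible to every reader
            System.out.println(counter.get() + " " + done);
        }
    }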
  44. Problems: Shared Memory: thread safety, reentrant code. The following class is not thread-safe, because x += 1 is a read-modify-write:

    class X {
        int x;
        void inc() { x += 1; } // not atomic
    }
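Two thread-safe variants of X, as a sketch: one with a lock, one lock-free.

    import java.util.concurrent.atomic.AtomicInteger;

    class XSynchronized {
        private int x;
        synchronized void inc() { x += 1; } // one thread at a time
    }

    class XAtomic {
        private final AtomicInteger x = new AtomicInteger();
        void inc() { x.incrementAndGet(); } // lock-free atomic update
    }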
  45. Problems: Threads. Deadlock: A wants B, B wants A, both wait forever. Starvation: A wants B but never gets it. Race condition: A writes to memory while B reads/writes it.
  46. Shared Memory: Solutions. Shared mutable state: synchronize properly. Isolated mutable state: don't share state. Immutable or unshared state: don't mutate state!
  47. Solutions: transactional memory (every access happens within a transaction; see databases), actor models (message passing), immutable state / pure functional programming.
  48. Speedup and Efficiency: running time T(1) with one processor, T(n) with n processors. Speedup: how much faster? S(n) = T(1) / T(n).
  49. Speedup and Efficiency: efficiency asks whether all processors are actually used: E(n) = S(n) / n = T(1) / (n * T(n)).
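A worked example with illustrative numbers: if one processor needs T(1) = 100 s and four processors need T(4) = 30 s, then S(4) = 100 / 30 ≈ 3.3 and E(4) = 3.3 / 4 ≈ 0.83, i.e. the processors do useful work about 83% of the time.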
  50. Amdahl's law: if p is the fraction of the program that can be parallelized and n the number of processors, then S(n) = 1 / ((1 - p) + p / n).
  51. Amdahl's law: as n grows, the speedup is bounded by S(n) ≤ 1 / (1 - p), no matter how many processors are used.
  52. Amdahl's law, corollary: maximize the parallel part; only parallelize when the parallel part is large enough.
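As an illustration: with a parallel fraction p = 0.9 and n = 10 processors, S(10) = 1 / (0.1 + 0.9 / 10) ≈ 5.3; even with unlimited processors the speedup never exceeds 1 / 0.1 = 10.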
  53. P-Completeness: is there an efficient parallel version of every algorithm? No! Hardly parallelizable problems: P-completeness; an example is the Circuit Value Problem (CVP).
  54. P-Completeness: a P-complete problem is believed to admit no efficient parallel algorithm, i.e. none in NC, unless NC = P.
  55. Optimization: What can I achieve? When do I stop? How many threads should I use?
  56. Optimization: I/O bound: the thread spends its time waiting for memory, disk, etc. Computation bound: the thread is calculating the whole time. Watch the processor utilization!
  57. Optimization: if I/O bound, use asynchronous/non-blocking I/O and increase the number of threads; if computation bound, set number of threads = number of cores (see the sketch below).
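A sizing sketch following these rules of thumb; the factor 4 for the I/O pool is an arbitrary illustrative choice, not a fixed rule.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();

            // computation bound: one thread per core
            ExecutorService cpuPool = Executors.newFixedThreadPool(cores);

            // I/O bound: more threads than cores, since most are waiting
            ExecutorService ioPool = Executors.newFixedThreadPool(cores * 4);

            cpuPool.shutdown();
            ioPool.shutdown();
        }
    }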
  58. Processors: multicore CPU, graphics processing unit (GPU), field-programmable gate array (FPGA).
  59. GPU Computing: finer granularity than a CPU; specialized processors, 512 cores on a Fermi; high memory bandwidth, 192 GB/s.
  60. CPU vs. GPU (figure; source: SGI).
  61. FPGA: configurable hardware circuits, programmed in Verilog or VHDL. Now also OpenCL, a much higher level of abstraction; under development and promising, but no performance test results yet (2011/12).
  62. Networks / Cluster: combinations of CPUs, memories, networks, GPUs, and FPGAs. Vast possibilities.
  63. Example: two nodes connected by a network; each node has 2 CPUs, each with a local cache, plus global memory.
  64. Example: 1 CPU with a local cache, connected via shared memory to 2 GPUs, each with its own local ("device") memory.
  65. Next Step: Hybrid. Hybrid / heterogeneous systems: multi-core / many-core plus special-purpose hardware such as GPUs and FPGAs.
  66. Optimal combination? Which network gives the best performance? Complicated: technical restrictions (e.g. motherboards with 4x PCI-Express 16x), power consumption, cooling.
  67. Example: K computer: SPARC64 VIIIfx at 2.0 GHz, 705,024 cores, 10.51 petaflop/s, no GPUs. #1 in 2011.
  68. Example: Tianhe-1A: 14,336 Xeon X5670, 7,168 Tesla M2050, 2,048 NUDT FT1000, 2.57 petaflop/s. #2 in 2011.
  69. Example: HPC at home: workstations and blades, 8 x 512 cores = 4096 cores.
  70. Frameworks: Shared Memory. C/C++: OpenMP, POSIX Threads (pthreads), Intel Threading Building Blocks, Windows Threads. Java: java.util.concurrent (see the sketch below).
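A small java.util.concurrent sketch: the two partial sums from the earlier example, submitted as tasks to a thread pool.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ExecutorSum {
        public static void main(String[] args) throws Exception {
            int[] mem = {3, 7, 5, 1, 2};
            ExecutorService pool = Executors.newFixedThreadPool(2);

            // each Callable returns its partial sum
            Future<Integer> lo = pool.submit(() -> mem[0] + mem[1] + mem[2]);
            Future<Integer> hi = pool.submit(() -> mem[3] + mem[4]);

            System.out.println(lo.get() + hi.get()); // 18
            pool.shutdown();
        }
    }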
  71. Frameworks: Actors. C/C++: Theron. Java / JVM: Akka, Scala, GPars (Groovy). (A framework-free sketch of the idea follows.)
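The actor idea without any framework, as a sketch: an actor is a thread that owns its state and communicates only through its mailbox.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class MiniActor {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> mailbox = new ArrayBlockingQueue<>(16);

            Thread actor = new Thread(() -> {
                try {
                    String msg;
                    while (!(msg = mailbox.take()).equals("stop")) {
                        System.out.println("received: " + msg);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            actor.start();

            mailbox.put("hello"); // message passing, no shared state
            mailbox.put("stop");
            actor.join();
        }
    }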
  72. GPU Computing: NVIDIA CUDA (NVIDIA only); OpenCL (AMD, NVIDIA, Intel, Altera, Apple); WebCL (Nokia, Samsung).
  73. Advanced courses: best practices for concurrency in Java: java.util.concurrent, actor models, transactional memory. See http://www.dinkla.com
  74. Advanced courses: GPU computing: NVIDIA CUDA, OpenCL, using NVIDIA CUDA with Java, using OpenCL with Java. See http://www.dinkla.com
  75. References: Practice. Mattson, Sanders, Massingill: Patterns for Parallel Programming. Breshears: The Art of Concurrency.
  76. References: Practice. Pacheco: An Introduction to Parallel Programming. Herlihy, Shavit: The Art of Multiprocessor Programming.
  77. References: Theory. Gibbons, Rytter: Efficient Parallel Algorithms. Lynch: Distributed Algorithms. Ben-Ari: Principles of Concurrent and Distributed Programming.
  78. References: GPU Computing. Scarpino: OpenCL in Action. Sanders, Kandrot: CUDA by Example.
  79. References: Background. Hennessy, Patterson: Computer Architecture: A Quantitative Approach.