Introduction To Parallel Computing

Introduction to parallel computing. This talk gives a first introduction to parallel, concurrent and distributed computing.


  1. 1. Introduction to Parallel Computing Jörn Dinkla http://www.dinkla.com Version 1.1
  2. 2. Dipl.-Inform. Jörn Dinkla Java (J2SE, JEE) Programming Languages  Scala, Groovy, Haskell Parallel Computing  GPU Computing Model driven Eclipse-Plugins
  3. 3. Overview Progress in computing Traditional Hard- and Software Theoretical Computer Science  Algorithms  Machines  Optimization Parallelization Parallel Hard- and Software
  4. 4. Progress in Computing 1. New applications  Not feasible before  Not needed before  Not possible before 2. Better applications  Faster  More data  Better quality  precision, accuracy, exactness
  5. 5. Progress in Computing Two ingredients  Hardware  Machine(s) to execute program  Software  Model / language to formulate program  Libraries  Methods
  6. 6. How was progress achieved? Hardware  CPU, memory, disks, networks  Faster and larger Software  New and better algorithms  Programming methods and languages
  7. 7. Traditional Hardware Von Neumann architecture [diagram: CPU, memory, I/O and cache connected by a bus] John Backus 1977: “von Neumann bottleneck”
  8. 8. Improvements Increasing Clock Frequency Memory Hierarchy / Cache Parallelizing ALU Pipelining Very-long Instruction Words (VLIW) Instruction-Level parallelism (ILP) Superscalar processors Vector data types Multithreaded Multicore / Manycore
  9. 9. Moore’s law Guaranteed until 2020
  10. 10. Clock frequency No increase since 2005
  11. 11. Physical Limits Increase of clock frequency  >>> Energy-consumption  >>> Heat-dissipation Limit to transistor size Faster processors impossible !?!
  12. 12. 2005“The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” Herb Sutter Dr. Dobb’s Journal, March 2005
  13. 13. Multicore Transistor count  Doubles every 2-3 years Calculation speed  No increase Multicore Efficient?
  14. 14. How to use the cores? Multi-Tasking OS  Different tasks Speeding up same task  Assume 2 CPUs  Problem is divided in half  Each CPU calculates a half  Time taken is half of the original time?
  15. 15. Traditional Software Computation is expressed as an “algorithm”: “a step-by-step procedure for calculations”  algorithm = logic + control Example: 1. Open file 2. For all records in the file, add the salary 3. Close file 4. Print the sum of the salaries Keywords  Sequential, Serial, Deterministic
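As an illustration, a minimal sequential Java version of the salary example above; the file name salaries.txt and the one-salary-per-line format are assumptions made for this sketch, not part of the slides.

    // Sequential salary sum: open file, add every record, print the result.
    // Assumption: "salaries.txt" contains one numeric salary per line.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SalarySum {
        public static void main(String[] args) throws IOException {
            double sum = 0.0;
            // 1. open the file, 2. for all records add the salary
            for (String line : Files.readAllLines(Paths.get("salaries.txt"))) {
                sum += Double.parseDouble(line.trim());
            }
            // 3. the file is closed by readAllLines, 4. print the sum
            System.out.println("Sum of salaries: " + sum);
        }
    }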
  16. 16. Traditional Software Improvements  Better algorithms  Programming languages (OO)  Development methods (agile) Limits  Theoretical Computer Science  Complexity theory (NP, P, NC)
  17. 17. Architecture Simplification: ignore the bus [diagram: CPU, memory and I/O, with and without the bus]
  18. 18. More than one CPU? How should they communicate? [diagram: two CPUs, each with memory and I/O]
  19. 19. Message Passing Distributed system Loose coupling Messages over a network [diagram: two nodes (CPU, memory, I/O) connected by a network]
  20. 20. Shared Memory Tight coupling [diagram: two CPUs with I/O attached to one shared memory]
  21. 21. Shared Memory Global vs. local Memory hierarchy [diagram: two CPUs, each with local memory and I/O, plus a shared memory]
  22. 22. Overview: Memory Unshared Memory  Message Passing  Actors Shared Memory  Threads Memory hierarchies / hybrid  Partitioned Global Address Space (PGAS) Transactional Memory
  23. 23. Sequential Algorithms Random Access Machine (RAM)  Step by step, deterministic  Program: int sum = 0; for i=0 to 4: sum += mem[i]; mem[5] = sum  [memory: addresses 0..4 hold 3, 7, 5, 1, 2; address 5 holds the result 18]
  24. 24. Sequential Algorithms int sum = 0; for i=0 to 4: sum += mem[i]  [trace: mem[0..4] = 3, 7, 5, 1, 2 stay unchanged; the result cell takes the values 0, 3, 10, 15, 16, 18 step by step]
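For reference, a short Java version of the loop from slides 23-24; the running sum matches the trace above.

    // Sequential sum on the RAM model: mem[0..4] holds the input, mem[5] the result.
    public class RamSum {
        public static void main(String[] args) {
            int[] mem = {3, 7, 5, 1, 2, 0};
            int sum = 0;
            for (int i = 0; i <= 4; i++) {
                sum += mem[i];    // running sum: 3, 10, 15, 16, 18
            }
            mem[5] = sum;         // mem[5] = 18, as in the trace
            System.out.println(mem[5]);
        }
    }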
  25. 25. More than one CPU How many programs should run?  One  In lock-step  All processors do the same  In any order  More than one  Distributed system
  26. 26. Two Processors PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum  PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; mem[5] = sum  Lockstep  Memory access! [memory: 3, 7, 5, 1, 2, 18]
  27. 27. Flynn’s Taxonomy 1966 Single/multiple instruction streams × single/multiple data streams: SISD, SIMD, MISD, MIMD
  28. 28. Flynn’s Taxonomy SISD  RAM, Von Neumann SIMD  Lockstep, vector processor, GPU MISD  Fault tolerance MIMD  Distributed system
  29. 29. Extension MIMD How many programs? SPMD  One program  Not in lockstep as in SIMD MPMD  Many programs
  30. 30. Processes & Threads Process  Operating System  Address space  IPC  Heavy weight  Contains 1..* threads Thread  Smallest unit of execution  Light weight
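A minimal Java sketch of the distinction: one process (the JVM) starting a second, lightweight thread.

    // One process containing two threads: main and a worker.
    public class TwoThreads {
        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(
                () -> System.out.println("hello from " + Thread.currentThread().getName()));
            worker.start();   // lightweight unit of execution inside this process
            worker.join();    // wait for it to finish
            System.out.println("main thread done");
        }
    }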
  31. 31. Overview: Algorithms Sequential Parallel Concurrent Overlap Distributed Randomized Quantum
  32. 32. Computer Science Theoretical Computer Science  A long time before 2005  1989: Gibbons, Rytter  1990: Ben-Ari  1996: Lynch
  33. 33. Gap: Theory and Practice Galactic algorithms Written for abstract machines  PRAM, special networks, etc. Simplifying assumptions  No boundaries  Exact arithmetic  Infinite memory, network speed, etc.
  34. 34. Sequential algorithms Implementing a sequential algorithm  Machine architecture  Programming language  Performance  Processor, memory and cache speed  Boundary cases  Sometimes hard
  35. 35. Parallel algorithms Implementing a parallel algorithm  Adapt algorithm to architecture  No PRAM or sorting network!  Problems with shared memory  Synchronization  Harder!
  36. 36. Parallelization Transforming  a sequential  into a parallel algorithm Tasks  Adapt to architecture  Rewrite  Test correctness w.r.t. the “golden” sequential code
  37. 37. Granularity “Size” of the threads?  How much computation? Coarse vs. fine grain Right choice  Important for good performance  Algorithm design
  38. 38. Computational thinking “… is the thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent.” Cuny, Snyder, Wing 2010
  39. 39. Computational thinking “… is the new literacy of the 21st Century.” Cuny, Snyder, Wing 2010 Expert level needed for parallelization!
  40. 40. Problems: Shared Memory Destructive updates  i += 1 Parallel, independent processes  How do the others know that i increased?  Synchronization needed  Memory barrier  Complicated for beginners
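A sketch of the problem described here: two Java threads doing i += 1 on a shared field without any synchronization, so updates get lost.

    // Unsynchronized "i += 1" from two threads: read-modify-write is not atomic,
    // so the final value is usually well below the expected 2000000.
    public class LostUpdate {
        static int i = 0;   // shared, no synchronization, no memory barrier

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> { for (int k = 0; k < 1_000_000; k++) i += 1; };
            Thread a = new Thread(work), b = new Thread(work);
            a.start(); b.start();
            a.join(); b.join();
            System.out.println(i);   // typically < 2000000
        }
    }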
  41. 41. Problems: Shared Memory PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum  PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; mem[5] = sum  Which one writes mem[5] first? [memory: 3, 7, 5, 1, 2, 18]
  42. 42. Problems: Shared Memory PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum; sync()  PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; sync(); mem[5] += sum  Synchronization needed
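A Java sketch of the scheme on slide 42: each thread sums its half, and the combine step is protected by a lock (standing in for the slide's sync()); join() plays the role of the final barrier.

    // Two threads sum disjoint halves of mem[0..4]; the combine into mem[5]
    // is synchronized so no partial result is lost.
    public class ParallelSum {
        static final int[] mem = {3, 7, 5, 1, 2, 0};
        static final Object lock = new Object();

        static void sumRange(int from, int to) {
            int sum = 0;
            for (int i = from; i <= to; i++) sum += mem[i];
            synchronized (lock) { mem[5] += sum; }   // ordered update, no lost write
        }

        public static void main(String[] args) throws InterruptedException {
            Thread p1 = new Thread(() -> sumRange(0, 2));
            Thread p2 = new Thread(() -> sumRange(3, 4));
            p1.start(); p2.start();
            p1.join(); p2.join();
            System.out.println(mem[5]);   // 18
        }
    }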
  43. 43. Problems: Shared Memory The memory barrier  When is a value read or written?  Optimizing compilers change semantics int a = b + 5  Read b  Add 5 to b, store temporary in c  Write c to a Solutions (Java)  volatile  java.util.concurrent.atomic
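A small Java sketch of the two solutions named on the slide: volatile for visibility across threads and java.util.concurrent.atomic for atomic read-modify-write.

    // volatile makes the write to "done" visible to the worker thread;
    // AtomicInteger makes the increment itself atomic.
    import java.util.concurrent.atomic.AtomicInteger;

    public class VisibilityAndAtomics {
        static volatile boolean done = false;
        static final AtomicInteger counter = new AtomicInteger();

        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(() -> {
                while (!done) {
                    counter.incrementAndGet();   // atomic "i += 1", no lost updates
                }
            });
            worker.start();
            Thread.sleep(10);                    // let the worker run briefly
            done = true;                         // volatile write: worker sees it and stops
            worker.join();
            System.out.println(counter.get());
        }
    }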
  44. 44. Problems: Shared Memory Thread safety Reentrant code class X { int x; void inc() { x+=1; } }
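One possible thread-safe variant of the class shown above, using synchronized so the increment is atomic and the new value is visible to all threads; an AtomicInteger field would be an equally idiomatic alternative.

    // Thread-safe version of the slide's class X.
    class X {
        private int x;
        synchronized void inc() { x += 1; }
        synchronized int get() { return x; }
    }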
  45. 45. Problems: Threads Deadlock  A wants B, B wants A, both waiting Starvation  A wants B, but never gets it Race condition  A writes to mem, B reads/writes mem
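A deliberately broken Java sketch of the deadlock pattern "A wants B, B wants A"; acquiring locks in one fixed global order avoids this cycle.

    // Thread 1 holds lockA and waits for lockB; thread 2 holds lockB and waits for lockA.
    public class DeadlockDemo {
        static final Object lockA = new Object();
        static final Object lockB = new Object();

        public static void main(String[] args) {
            new Thread(() -> {
                synchronized (lockA) {
                    pause();                       // give the other thread time to grab lockB
                    synchronized (lockB) { System.out.println("t1 done"); }
                }
            }).start();
            new Thread(() -> {
                synchronized (lockB) {
                    pause();
                    synchronized (lockA) { System.out.println("t2 done"); }
                }
            }).start();                            // with the pauses, this usually hangs forever
        }

        static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }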
  46. 46. Shared Mem: Solutions Shared mutable state  Synchronize properly Isolated mutable state  Don’t share state Immutable or unshared  Don’t mutate state!
  47. 47. Solutions Transactional Memory  Every access within transaction  See databases Actor models  Message passing Immutable state / pure functional
  48. 48. Speedup and Efficiency Running time  T(1) with one processor  T(n) with n processors Speedup  How much faster?  S(n) = T(1) / T(n)
  49. 49. Speedup and Efficiency Efficiency  Are all the processors used?  E(n) = S(n) / n = T(1) / (n * T(n))
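A small worked example with illustrative numbers (not from the slides): if T(1) = 16 s and T(4) = 5 s, then S(4) = 16 / 5 = 3.2 and E(4) = 3.2 / 4 = 0.8, i.e. the four processors are used at 80 % on average.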
  50. 50. Amdahl’s Law
  51. 51. Amdahl’s Law
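In the usual notation, with P the parallelizable fraction of the running time and n processors, Amdahl's law bounds the speedup by S(n) = 1 / ((1 - P) + P / n), which approaches 1 / (1 - P) as n grows. For example, P = 0.9 caps the speedup at 10, however many processors are used.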
  52. 52. Amdahl’s Law Corollary  Maximize the parallel part  Only parallelize when the parallel part is large enough
  53. 53. P-Completeness Is there an efficient parallel version for every algorithm?  No! Hardly parallelizable problems  P-Completeness  Example Circuit-Value-Problem (CVP)
  54. 54. P-Completeness
  55. 55. Optimization What can I achieve? When do I stop? How many threads should I use?
  56. 56. Optimization I/O bound  Thread is waiting for memory, disk, etc. Computation bound  Thread is calculating the whole time Watch processor utilization!
  57. 57. Optimization I/O bound  Use asynchronous/non-blocking I/O  Increase number of threads Computation bound  Number of threads = Number of cores
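A minimal Java sketch of this rule of thumb; the multiplier used for the I/O-bound pool is a rough assumption, not a value from the slides.

    // Computation-bound: roughly one thread per core.
    // I/O-bound: a larger pool (or asynchronous/non-blocking I/O).
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService cpuBound = Executors.newFixedThreadPool(cores);
            ExecutorService ioBound  = Executors.newFixedThreadPool(cores * 4);  // factor 4 is an assumption
            System.out.println("cores = " + cores);
            cpuBound.shutdown();
            ioBound.shutdown();
        }
    }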
  58. 58. Processors Multicore CPU Graphical Processing Unit (GPU) Field-Programmable Gate Array (FPGA)
  59. 59. GPU Computing Finer granularity than CPU  Specialized processors  512 cores on a Fermi High memory bandwidth 192 GB/sec
  60. 60. CPU vs. GPU Source: SGI
  61. 61. FPGA Configurable hardware circuits Programmed in Verilog, VHDL Now: OpenCL  Much higher level of abstraction Under development, promising No performance test results yet (2011/12)
  62. 62. Networks / Cluster Combinations of CPUs, memory, networks, GPUs and FPGAs  Vast possibilities
  63. 63. Example 2 nodes connected by a network  2 CPUs each with local cache  Global memory [diagram: network linking two nodes, each with two CPUs, local memories and a global memory]
  64. 64. Example 1 CPU with local cache Connected by shared memory  2 GPUs, each with local (“device”) memory [diagram: CPU, shared memory, two GPUs with their own memory]
  65. 65. Next Step: Hybrid Hybrid / Heterogeneous  Multi-Core / Many-Core  Plus special purpose hardware  GPU  FPGA
  66. 66. Optimal combination? Which network gives the best performance?  Complicated  Technical restrictions  e.g. motherboards with 4 x PCI Express x16 slots  Power consumption  Cooling
  67. 67. Example: K-Computer SPARC64 VIIIfx 2.0GHz 705024 Cores 10.51 Petaflop/s No GPUs #1 2011
  68. 68. Example: Tianhe-1A 14336 Xeon X5670 7168 Tesla M2050 2048 NUDT FT1000 2.57 petaflop/s #2 2011
  69. 69. Example: HPC at home Workstations and blades  8 x 512 cores = 4096 cores
  70. 70. Frameworks: Shared Mem C/C++  OpenMP  POSIX Threads (pthreads)  Intel Thread Building Blocks  Windows Threads Java  java.util.concurrent
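A minimal java.util.concurrent sketch: the same split-the-work idea as earlier, but with an ExecutorService and Futures instead of raw threads and locks.

    // Two Callables compute partial sums; Futures deliver the results for combining.
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ExecutorSum {
        public static void main(String[] args) throws InterruptedException, ExecutionException {
            int[] mem = {3, 7, 5, 1, 2};
            ExecutorService pool = Executors.newFixedThreadPool(2);
            Callable<Integer> left  = () -> mem[0] + mem[1] + mem[2];
            Callable<Integer> right = () -> mem[3] + mem[4];
            Future<Integer> f1 = pool.submit(left);
            Future<Integer> f2 = pool.submit(right);
            System.out.println(f1.get() + f2.get());   // 18
            pool.shutdown();
        }
    }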
  71. 71. Frameworks: Actors C/C++  Theron Java / JVM  Akka  Scala  GPars (Groovy)
  72. 72. GPU Computing NVIDIA CUDA: NVIDIA  OpenCL: AMD, NVIDIA, Intel, Altera, Apple  WebCL: Nokia, Samsung
  73. 73. Advanced courses Best practices for concurrency in Java  Java’s java.util.concurrent  Actor models  Transactional Memory See http://www.dinkla.com
  74. 74. Advanced courses GPU Computing  NVIDIA CUDA  OpenCL  Using NVIDIA CUDA with Java  Using OpenCL with Java See http://www.dinkla.com
  75. 75. References: Practice Mattson, Sanders, Massingill  Patterns for Parallel Programming Breshears  The Art of Concurrency
  76. 76. References: Practice Pacheco  An Introduction to Parallel Programming Herlihy, Shavit  The Art of Multiprocessor Programming
  77. 77. References: Theory Gibbons, Rytter  Efficient Parallel Algorithms Lynch  Distributed Algorithms Ben-Ari  Principles of Concurrent and Distributed Programming
  78. 78. References: GPU Computing Scarpino  OpenCL in Action Sanders, Kandrot  CUDA by Example
  79. 79. References: Background Hennessy, Patterson  Computer Architecture: A Quantitative Approach
