Introduction To Parallel Computing

Introduction to parallel computing. This talk gives a first introduction to parallel, concurrent and distributed computing.

Transcript

  • 1. Introduction to Parallel Computing Jörn Dinkla http://www.dinkla.com Version 1.1
  • 2. Dipl.-Inform. Jörn Dinkla Java (J2SE, JEE) Programming Languages  Scala, Groovy, Haskell Parallel Computing  GPU Computing Model-driven Eclipse plug-ins
  • 3. Overview Progress in computing Traditional Hard- and Software Theoretical Computer Science  Algorithms  Machines  Optimization Parallelization Parallel Hard- and Software
  • 4. Progress in Computing 1. New applications  Not feasible before  Not needed before  Not possible before 2. Better applications  Faster  More data  Better quality  precision, accuracy, exactness
  • 5. Progress in Computing Two ingredients  Hardware  Machine(s) to execute program  Software  Model / language to formulate program  Libraries  Methods
  • 6. How was progress achieved? Hardware  CPU, memory, disks, networks  Faster and larger Software  New and better algorithms  Programming methods and languages
  • 7. Traditional Hardware Von Neumann architecture: CPU, memory and I/O connected by a bus, plus cache John Backus 1977  "von Neumann bottleneck"
  • 8. Improvements Increasing Clock Frequency Memory Hierarchy / Cache Parallelizing ALU Pipelining Very-long Instruction Words (VLIW) Instruction-Level parallelism (ILP) Superscalar processors Vector data types Multithreaded Multicore / Manycore
  • 9. Moore‘s law Guaranteed until 2020
  • 10. Clock frequency No increase since 2005
  • 11. Physical Limits Increase of clock frequency  >>> Energy-consumption  >>> Heat-dissipation Limit to transistor size Faster processors impossible !?!
  • 12. 2005 "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", Herb Sutter, Dr. Dobb's Journal, March 2005
  • 13. Multicore Transistor count  Doubles every 2-3 years Calculation speed  No increase Multicore Efficient?
  • 14. How to use the cores? Multi-Tasking OS  Different tasks Speeding up same task  Assume 2 CPUs  Problem is divided in half  Each CPU calculates a half  Time taken is half of the original time?
  • 15. Traditional Software Computation is expressed as an "algorithm": "a step-by-step procedure for calculations"; algorithm = logic + control Example: 1. open file, 2. for all records in the file add the salary, 3. close file, 4. print out the sum of the salaries Keywords  Sequential, Serial, Deterministic
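
A minimal Java sketch of the salary example above, assuming a hypothetical plain-text file with one salary per line (the file name and format are made up for illustration):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SalarySum {
        public static void main(String[] args) throws IOException {
            long sum = 0;
            // 1. open file, 2. for all records add the salary, 3. close file, 4. print the sum
            for (String line : Files.readAllLines(Paths.get("salaries.txt"))) {
                sum += Long.parseLong(line.trim());
            }
            System.out.println("Sum of salaries: " + sum);
        }
    }
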
  • 16. Traditional Software Improvements  Better algorithms  Programming languages (OO)  Development methods (agile) Limits  Theoretical Computer Science  Complexity theory (NP, P, NC)
  • 17. Architecture Simplification: ignore the bus and model the machine as just CPU, memory and I/O
  • 18. More than one CPU? How should they communicate?
  • 19. Message Passing Distributed system Loose coupling Messages sent over a network
  • 20. Shared Memory Tight coupling  The CPUs communicate through a common memory
  • 21. Shared Memory Global vs. local Memory hierarchy  Each CPU has local memory plus access to a shared memory
  • 22. Overview: Memory Unshared Memory  Message Passing  Actors Shared Memory  Threads Memory hierarchies / hybrid  Partitioned Global Address Space (PGAS) Transactional Memory
  • 23. Sequential Algorithms Random Access Machine (RAM)  Step by step, deterministic Program: int sum = 0; for i = 0 to 4: sum += mem[i]; mem[5] = sum Memory: addresses 0..4 hold 3, 7, 5, 1, 2; address 5 receives the result 18
  • 24. Sequential Algorithms int sum = 0; for i = 0 to 4: sum += mem[i] The cells 0..4 stay at 3, 7, 5, 1, 2 while the running sum in cell 5 grows step by step: 0, 3, 10, 15, 16, 18
  • 25. More than one CPU How many programs should run?  One  In lock-step  All processors do the same  In any order  More than one  Distributed system
  • 26. Two Processors Processor 1: int sum = 0; for i = 0 to 2: sum += mem[i]; mem[5] = sum Processor 2: int sum = 0; for i = 3 to 4: sum += mem[i]; mem[5] = sum Memory: cells 0..4 hold 3, 7, 5, 1, 2; cell 5 should receive 18 Lockstep memory access!
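
A minimal Java sketch of the two-processor idea from slide 26, assuming the five values from the memory table; each thread sums its half and the partial results are combined after join() (class and variable names are made up for illustration):

    public class TwoThreadSum {
        static final int[] mem = {3, 7, 5, 1, 2};
        static long sumA, sumB;  // one partial sum per thread, so no shared cell is written concurrently

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> { for (int i = 0; i <= 2; i++) sumA += mem[i]; });
            Thread t2 = new Thread(() -> { for (int i = 3; i <= 4; i++) sumB += mem[i]; });
            t1.start(); t2.start();
            t1.join(); t2.join();  // wait for both halves before combining
            System.out.println("sum = " + (sumA + sumB));  // prints 18
        }
    }
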
  • 27. Flynn's Taxonomy 1966, classified by instruction and data streams: single instruction, single data (SISD); single instruction, multiple data (SIMD); multiple instruction, single data (MISD); multiple instruction, multiple data (MIMD)
  • 28. Flynn‘s Taxonomy SISD  RAM, Von Neumann SIMD  Lockstep, vector processor, GPU MISD  Fault tolerance MIMD  Distributed system
  • 29. Extension MIMD How many programs? SPMD  One program  Not in lockstep as in SIMD MPMD  Many programs
  • 30. Processes & Threads Process  Operating System  Address space  IPC  Heavy weight  Contains 1..* threads Thread  Smallest unit of execution  Light weight
  • 31. Overview: Algorithms Sequential Parallel Concurrent Overlap Distributed Randomized Quantum
  • 32. Computer Science Theoretical Computer Science  A long time before 2005  1989: Gibbons, Rytter  1990: Ben-Ari  1996: Lynch
  • 33. Gap: Theory and Practice Galactic algorithms Written for abstract machines  PRAM, special networks, etc. Simplifying assumptions  No boundaries  Exact arithmetic  Infinite memory, network speed, etc.
  • 34. Sequential algorithms Implementing a sequential algorithm  Machine architecture  Programming language  Performance  Processor, memory and cache speed  Boundary cases  Sometimes hard
  • 35. Parallel algorithms Implementing a parallel algorithm  Adapt algorithm to architecture  No PRAM or sorting network!  Problems with shared memory  Synchronization  Harder!
  • 36. Parallelization Transforming  a sequential  into a parallel algorithm Tasks  Adapt to architecture  Rewrite  Test correctness w.r.t. the "golden" sequential code
  • 37. Granularity “Size” of the threads?  How much computation? Coarse vs. fine grain Right choice  Important for good performance  Algorithm design
  • 38. Computational thinking “… is the thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent.” Cuny, Snyder, Wing 2010
  • 39. Computational thinking “… is the new literacy of the 21st Century.” Cuny, Snyder, Wing 2010 Expert level needed for parallelization!
  • 40. Problems: Shared Memory Destructive updates  i += 1 Parallel, independent processes  How do the others know that i increased?  Synchronization needed  Memory barrier  Complicated for beginners
  • 41. Problems: Shared Memory Processor 1: int sum = 0; for i = 0 to 2: sum += mem[i]; mem[5] = sum Processor 2: int sum = 0; for i = 3 to 4: sum += mem[i]; mem[5] = sum Memory: cells 0..4 hold 3, 7, 5, 1, 2 Both write their result to mem[5]: which one first?
  • 42. Problems: Shared Memory Processor 1: int sum = 0; for i = 0 to 2: sum += mem[i]; mem[5] = sum; sync() Processor 2: int sum = 0; for i = 3 to 4: sum += mem[i]; sync(); mem[5] += sum Synchronization needed
  • 43. Problems: Shared Memory The memory barrier  When is a value read or written?  Optimizing compilers change semantics int a = b + 5  Read b  Add 5 to b, store temporary in c  Write c to a Solutions (Java)  volatile  java.util.concurrent.atomic
  • 44. Problems: Shared Memory Thread safety Reentrant code  Example (not thread-safe): class X { int x; void inc() { x += 1; } }
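
The inc() on slide 44 is not thread-safe: two threads can read the same value of x and lose an update. A minimal sketch of two common Java fixes, one with synchronized and one with java.util.concurrent.atomic (class names are illustrative):

    import java.util.concurrent.atomic.AtomicInteger;

    class SafeCounterSync {
        private int x;
        synchronized void inc() { x += 1; }  // only one thread at a time; the update is visible to all threads
    }

    class SafeCounterAtomic {
        private final AtomicInteger x = new AtomicInteger();
        void inc() { x.incrementAndGet(); }  // lock-free atomic read-modify-write
    }
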
  • 45. Problems: Threads Deadlock  A wants B, B wants A, both waiting Starvation  A wants B, but never gets it Race condition  A writes to mem, B reads/writes mem
  • 46. Shared Mem: Solutions Shared mutable state  Synchronize properly Isolated mutable state  Don‘t share state Immutable or unshared  Don‘t mutate state!
  • 47. Solutions Transactional Memory  Every access within transaction  See databases Actor models  Message passing Immutable state / pure functional
  • 48. Speedup and Efficiency Running time  T(1) with one processor  T(n) with n processors Speedup  How much faster?  S(n) = T(1) / T(n)
  • 49. Speedup and Efficiency Efficiency  Are all the processors used?  E(n) = S(n) / n = T(1) / (n * T(n))
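
A small worked example (numbers invented for illustration): if T(1) = 10 s and T(4) = 4 s, then S(4) = 10 / 4 = 2.5 and E(4) = 2.5 / 4 = 0.625, i.e. on average the four processors do useful work only 62.5 % of the time.
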
  • 50. Amdahl‘s Law
  • 51. Amdahl‘s Law
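
Slides 50 and 51 apparently show only figures; for reference, the usual statement of Amdahl's law is S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the running time that can be parallelized. As n grows, S(n) approaches 1 / (1 - p); with p = 0.9, for example, the speedup can never exceed 10, no matter how many processors are used.
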
  • 52. Amdahl's Law Corollary  Maximize the parallel part  Only parallelize when the parallel part is large enough
  • 53. P-Completeness Is there an efficient parallel version for every algorithm?  No! Hardly parallelizable problems  P-Completeness  Example Circuit-Value-Problem (CVP)
  • 54. P-Completeness
  • 55. Optimization What can i achieve? When do I stop? How many threads should i use?
  • 56. Optimization I/O bound  Thread is waiting for memory, disk, etc. Computation bound  Thread is calculating the whole time Watch processor utilization!
  • 57. Optimization I/O bound  Use asynchronous/non-blocking I/O  Increase number of threads Computation bound  Number of threads = Number of cores
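
A minimal Java sketch for the computation-bound case, sizing a thread pool to the number of cores reported by the JVM (a common rule of thumb rather than a guarantee of optimality):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();  // logical cores seen by the JVM
            ExecutorService pool = Executors.newFixedThreadPool(cores);
            // submit computation-bound tasks here ...
            pool.shutdown();
        }
    }
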
  • 58. Processors Multicore CPU Graphical Processing Unit (GPU) Field-Programmable Gate Array (FPGA)
  • 59. GPU Computing Finer granularity than CPU  Specialized processors  512 cores on a Fermi High memory bandwidth 192 GB/sec
  • 60. CPU vs. GPU Source: SGI
  • 61. FPGA Configurable hardware circuits Programmed in Verilog, VHDL Now: OpenCL  Much higher level of abstraction Under development, promising No performance test results (2011/12)
  • 62. Networks / Cluster Combinations of CPUs, memory, networks, GPUs and FPGAs  Vast possibilities
  • 63. Example 2 x connected by network  2 CPU each with local cache  Global memory Network CPU CPU CPU CPU Memory Memory Memory Memory Memory Memory
  • 64. Example 1 CPU with local cache Connected by shared memory  2 GPU with local memory („device“) CPU Memory GPU Memory GPU Memory Memory
  • 65. Next Step: Hybrid Hybrid / Heterogeneous  Multi-Core / Many-Core  Plus special purpose hardware  GPU  FPGA
  • 66. Optimal combination? Which network gives the best performance?  Complicated  Technical restrictions  e.g. motherboards with 4 PCI Express x16 slots  Power consumption  Cooling
  • 67. Example: K-Computer SPARC64 VIIIfx 2.0GHz 705024 Cores 10.51 Petaflop/s No GPUs #1 2011
  • 68. Example: Tianhe-1A 14336 Xeon X5670 7168 Tesla M2050 2048 NUDT FT1000 2.57 petaflop/s #2 2011
  • 69. Example: HPC at home Workstations and blades  8 x 512 cores = 4096 cores
  • 70. Frameworks: Shared Mem C/C++  OpenMP  POSIX Threads (pthreads)  Intel Thread Building Blocks  Windows Threads Java  java.util.concurrent
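
A minimal java.util.concurrent sketch of the parallel sum from the earlier slides, splitting the work into one Callable per half (the chunking and all names are illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ExecutorSum {
        public static void main(String[] args) throws Exception {
            int[] mem = {3, 7, 5, 1, 2};
            ExecutorService pool = Executors.newFixedThreadPool(2);
            List<Callable<Long>> parts = new ArrayList<>();
            parts.add(() -> { long s = 0; for (int i = 0; i <= 2; i++) s += mem[i]; return s; });
            parts.add(() -> { long s = 0; for (int i = 3; i <= 4; i++) s += mem[i]; return s; });
            long sum = 0;
            for (Future<Long> f : pool.invokeAll(parts)) {
                sum += f.get();  // combine the partial sums
            }
            pool.shutdown();
            System.out.println("sum = " + sum);  // prints 18
        }
    }
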
  • 71. Frameworks: Actors C/C++  Theron Java / JVM  Akka  Scala  GPars (Groovy)
  • 72. GPU Computing CUDA: NVIDIA OpenCL: AMD, NVIDIA, Intel, Altera, Apple WebCL: Nokia, Samsung
  • 73. Advanced courses Best practices for concurrency in Java  Java‘s java.util.concurrent  Actor models  Transactional Memory See http://www.dinkla.com
  • 74. Advanced courses GPU Computing  NVIDIA CUDA  OpenCL  Using NVIDIA CUDA with Java  Using OpenCL with Java See http://www.dinkla.com
  • 75. References: Practice Mattson, Sanders, Massingill  Patterns for Parallel Programming Breshears  The Art of Concurrency
  • 76. References: Practice Pacheco  An Introduction to Parallel Programming Herlihy, Shavit  The Art of Multiprocessor Programming
  • 77. References: Theory Gibbons, Rytter  Efficient Parallel Algorithms Lynch  Distributed Algorithms Ben-Ari  Principles of Concurrent and Distributed Programming
  • 78. References: GPU Computing Scarpino  OpenCL in Action Sanders, Kandrot  CUDA by Example
  • 79. References: Background Hennessy, Patterson  Computer Architecture: A Quantitative Approach
