Introduction To Parallel Computing
Introduction to parallel computing. This talk gives a first introduction to parallel, concurrent, and distributed computing.


    Presentation Transcript

    • Introduction to Parallel Computing Jörn Dinkla http://www.dinkla.com Version 1.1
    • Dipl.-Inform. Jörn Dinkla Java (J2SE, JEE) Programming Languages  Scala, Groovy, Haskell Parallel Computing  GPU Computing Model-driven Eclipse plug-ins
    • Overview Progress in computing Traditional Hard- and Software Theoretical Computer Science  Algorithms  Machines  Optimization Parallelization Parallel Hard- and Software
    • Progress in Computing 1. New applications  Not feasible before  Not needed before  Not possible before 2. Better applications  Faster  More data  Better quality  precision, accuracy, exactness
    • Progress in Computing Two ingredients  Hardware  Machine(s) to execute program  Software  Model / language to formulate program  Libraries  Methods
    • How was progress achieved? Hardware  CPU, memory, disks, networks  Faster and larger Software  New and better algorithms  Programming methods and languages
    • Traditional Hardware Von Neumann-Architecture CPU I/O Memory Bus John Backus 1977  “von Neumann bottleneck“ Cache
    • Improvements Increasing Clock Frequency Memory Hierarchy / Cache Parallelizing ALU Pipelining Very-long Instruction Words (VLIW) Instruction-Level parallelism (ILP) Superscalar processors Vector data types Multithreaded Multicore / Manycore
    • Moore's law Guaranteed until 2020
    • Clock frequency No increase since 2005
    • Physical Limits Increase of clock frequency  >>> Energy-consumption  >>> Heat-dissipation Limit to transistor size Faster processors impossible !?!
    • 2005“The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” Herb Sutter Dr. Dobb’s Journal, March 2005
    • Multicore Transistor count  Doubles every 2-3 years Calculation speed  No increase Multicore Efficient?
    • How to use the cores? Multi-Tasking OS  Different tasks Speeding up same task  Assume 2 CPUs  Problem is divided in half  Each CPU calculates a half  Time taken is half of the original time?
    • Traditional Software Computation is expressed as “algorithm  “a step-by-step procedure for calculations”  algorithm = logic + control Example 1. Open file 2. For all records in the file 1. Add the salary 3. Close file 4. Print out the sum of the salaries Keywords  Sequential, Serial, Deterministic
    • Traditional Software Improvements  Better algorithms  Programming languages (OO)  Developement methods (agile) Limits  Theoretical Computer Science  Complexity theory (NP, P, NC)
    • Architecture Simplification: Ignore the bus CPU I/O Memory I/O Memory Bus CPU
    • More than one CPU? How should they communicate ? I/O Memory I/O Memory CPU CPU
    • Message Passing Distributed system Loose coupling Messages Network I/O Memory I/O Memory CPU CPU
    • Shared Memory Shared Memory Tight coupling I/O Memory I/O CPU CPU
    • Shared Memory Global vs. Local Memory hierarchy I/O Memory I/O Memory Shared CPU CPU Memory
    • Overview: Memory Unshared Memory  Message Passing  Actors Shared Memory  Threads Memory hierarchies / hybrid  Partitioned Global Address Space (PGAS) Transactional Memory
    • Sequential Algorithms Random Access Machine (RAM)  Step by step, deterministic Addr Value 0 3 PC int sum = 0 1 2 7 5 for i=0 to 4 3 1 4 2 sum += mem[i] 5 18 mem[5]= sum
    • Sequential Algorithms int sum = 0; for i=0 to 4: sum += mem[i]. Trace of the memory: addresses 0..4 keep their values 3, 7, 5, 1, 2, while address 5 runs through 0, 3, 10, 15, 16, 18.
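    The same sequential sum as a runnable Java sketch; the array plays the role of the RAM's memory, and writing the running sum to mem[5] in every step reproduces the trace 0, 3, 10, 15, 16, 18 from the slide:

        public class RamSum {
            public static void main(String[] args) {
                int[] mem = {3, 7, 5, 1, 2, 0};    // addresses 0..5; the result goes to address 5

                int sum = 0;
                for (int i = 0; i <= 4; i++) {
                    sum += mem[i];
                    mem[5] = sum;                  // running sum, as in the trace on the slide
                    System.out.println("mem[5] = " + mem[5]);
                }
            }
        }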
    • More than one CPU How many programs should run?  One  In lock-step  All processors do the same  In any order  More than one  Distributed system
    • Two Processors Processor 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum. Processor 2: int sum = 0; for i=3 to 4: sum += mem[i]; mem[5] = sum. Memory as before (3, 7, 5, 1, 2 at addresses 0..4, result at address 5). Lockstep? Conflicting memory access!
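    A sketch of the same split with two Java threads. As the slide warns, both threads write their own partial sum to the shared cell mem[5], so without synchronization the result is wrong:

        public class TwoThreadSum {
            public static void main(String[] args) throws InterruptedException {
                int[] mem = {3, 7, 5, 1, 2, 0};

                // "Processor 1": sums mem[0..2] and writes its result to mem[5].
                Thread p1 = new Thread(() -> {
                    int sum = 0;
                    for (int i = 0; i <= 2; i++) sum += mem[i];
                    mem[5] = sum;                  // conflicting write to the shared cell!
                });

                // "Processor 2": sums mem[3..4] and also writes to mem[5].
                Thread p2 = new Thread(() -> {
                    int sum = 0;
                    for (int i = 3; i <= 4; i++) sum += mem[i];
                    mem[5] = sum;                  // conflicting write to the shared cell!
                });

                p1.start(); p2.start();
                p1.join(); p2.join();

                // Depending on which write happens last this prints 15 or 3, never 18.
                System.out.println("mem[5] = " + mem[5]);
            }
        }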
    • Flynn's Taxonomy (1966) Classified by instruction and data streams: SISD (single instruction, single data), SIMD (single instruction, multiple data), MISD (multiple instruction, single data), MIMD (multiple instruction, multiple data)
    • Flynn's Taxonomy SISD  RAM, Von Neumann SIMD  Lockstep, vector processor, GPU MISD  Fault tolerance MIMD  Distributed system
    • Extension MIMD How many programs? SPMD  One program  Not in lockstep as in SIMD MPMD  Many programs
    • Processes & Threads Process  Operating System  Address space  IPC  Heavy weight  Contains 1..* threads Thread  Smallest unit of execution  Light weight
    • Overview: Algorithms Sequential Parallel Concurrent Overlap Distributed Randomized Quantum
    • Computer Science Theoretical Computer Science  A long time before 2005  1989: Gibbons, Rytter  1990: Ben-Ari  1996: Lynch
    • Gap: Theory and Practice Galactic algorithms Written for abstract machines  PRAM, special networks, etc. Simplifying assumptions  No boundaries  Exact arithmetic  Infinite memory, network speed, etc.
    • Sequential algorithms Implementing a sequential algorithm  Machine architecture  Programming language  Performance  Processor, memory and cache speed  Boundary cases  Sometimes hard
    • Parallel algorithms Implementing a parallel algorithm  Adapt algorithm to architecture  No PRAM or sorting network!  Problems with shared memory  Synchronization  Harder!
    • Parallelization Transforming  a sequential  into a parallel algorithm Tasks  Adapt to architecture  Rewrite  Test correctness wrt „golden“ seq. code
    • Granularity “Size” of the threads?  How much computation? Coarse vs. fine grain Right choice  Important for good performance  Algorithm design
    • Computational thinking “… is the thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent.” Cuny, Snyder, Wing 2010
    • Computational thinking “… is the new literacy of the 21st Century.” Cuny, Snyder, Wing 2010 Expert level needed for parallelization!
    • Problems: Shared Memory Destructive updates  i += 1 Parallel, independent processes  How do the others now that i increased?  Synchronization needed  Memory barrier  Complicated for beginners
    • Problems: Shared Memory Processor 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum. Processor 2: int sum = 0; for i=3 to 4: sum += mem[i]; mem[5] = sum. Which write to mem[5] comes first? (Memory: 3, 7, 5, 1, 2 at addresses 0..4; expected result 18.)
    • Problems: Shared Memory Processor 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum; sync(). Processor 2: int sum = 0; for i=3 to 4: sum += mem[i]; sync(); mem[5] += sum. Synchronization needed.
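    A Java version of the corrected scheme, in a slightly simplified form: each thread computes its partial sum locally and then adds it to the shared cell under a lock (the lock plays the role of sync() on the slide):

        public class SyncedTwoThreadSum {
            static final Object lock = new Object();

            public static void main(String[] args) throws InterruptedException {
                int[] mem = {3, 7, 5, 1, 2, 0};

                Thread p1 = new Thread(() -> {
                    int sum = 0;
                    for (int i = 0; i <= 2; i++) sum += mem[i];
                    synchronized (lock) { mem[5] += sum; }   // add the partial sum under the lock
                });
                Thread p2 = new Thread(() -> {
                    int sum = 0;
                    for (int i = 3; i <= 4; i++) sum += mem[i];
                    synchronized (lock) { mem[5] += sum; }   // add the partial sum under the lock
                });

                p1.start(); p2.start();
                p1.join(); p2.join();
                System.out.println("mem[5] = " + mem[5]);    // always 18
            }
        }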
    • Problems: Shared Memory The memory barrier  When is a value read or written?  Optimizing compilers change semantics int a = b + 5  Read b  Add 5 to b, store temporary in c  Write c to a Solutions (Java)  volatile  java.util.concurrent.atomic
    • Problems: Shared Memory Thread safety Reentrant code class X { int x; void inc() { x+=1; } }
    • Problems: Threads Deadlock  A wants B, B wants A, both waiting Starvation  A wants B, but never gets it Race condition  A writes to mem, B reads/writes mem
    • Shared Mem: Solutions Shared mutable state  Synchronize properly Isolated mutable state  Don't share state Immutable or unshared  Don't mutate state!
    • Solutions Transactional Memory  Every access within transaction  See databases Actor models  Message passing Immutable state / pure functional
    • Speedup and Efficiency Running time  T(1) with one processor  T(n) with two processors Speedup  How much faster?  S(n) = T(1) / T(n)
    • Speedup and Efficiency Efficiency  Are all the processors used?  E(n) = S(n) / n = T(1) / (n * T(n))
    • Amdahl's Law
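    The standard statement of Amdahl's Law: if a fraction p of the work can be parallelized over n processors, the speedup is bounded by S(n) = 1 / ((1 - p) + p / n). Even with arbitrarily many processors the speedup never exceeds 1 / (1 - p).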
    • Amdahl's Law Corollary  Maximize the parallel part  Only parallelize when the parallel part is large enough
    • P-Completeness Is there an efficient parallel version for every algorithm?  No! Hardly parallelizable problems  P-Completeness  Example Circuit-Value-Problem (CVP)
    • P-Completeness
    • Optimization What can i achieve? When do I stop? How many threads should i use?
    • Optimization I/O bound  Thread is waiting for memory, disk, etc. Computation bound  Thread is calculating the whole time Watch processor utilization!
    • Optimization I/O bound  Use asynchronous/non-blocking I/O  Increase number of threads Computation bound  Number of threads = Number of cores
    • Processors Multicore CPU Graphical Processing Unit (GPU) Field-Programmable Gate Array (FPGA)
    • GPU Computing Finer granularity than CPU  Specialized processors  512 cores on a Fermi High memory bandwidth 192 GB/sec
    • CPU vs. GPU Source: SGI
    • FPGA Configurable hardware circuits Programmed in Verilog, VHDL Now: OpenCL  Much higher level of abstraction Under development, promising No performance tests results (2011/12)
    • Networks / Cluster Combination of CPUs, memories, networks, GPUs, and FPGAs  Vast possibilities
    • Example 2 x connected by network  2 CPU each with local cache  Global memory Network CPU CPU CPU CPU Memory Memory Memory Memory Memory Memory
    • Example 1 CPU with local cache Connected by shared memory  2 GPU with local memory („device“) CPU Memory GPU Memory GPU Memory Memory
    • Next Step: Hybrid Hybrid / Heterogeneous  Multi-Core / Many-Core  Plus special-purpose hardware  GPU  FPGA
    • Optimal combination? Which network gives the best performance?  Complicated  Technical restrictions (e.g. motherboards with 4 x PCI-Express x16 slots)  Power consumption  Cooling
    • Example: K-Computer SPARC64 VIIIfx 2.0GHz 705024 Cores 10.51 Petaflop/s No GPUs #1 2011
    • Example: Tianhe-1A 14336 Xeon X5670 7168 Tesla M2050 2048 NUDT FT1000 2.57 petaflop/s #2 2011
    • Example: HPC at home Workstations and blades  8 x 512 cores = 4096 cores
    • Frameworks: Shared Mem C/C++  OpenMP  POSIX Threads (pthreads)  Intel Thread Building Blocks  Windows Threads Java  java.util.concurrent
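    As a small taste of java.util.concurrent: the running sum example again, split into two tasks whose partial results are collected through Futures:

        import java.util.List;
        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;

        public class FutureSum {
            public static void main(String[] args) throws Exception {
                int[] mem = {3, 7, 5, 1, 2};
                ExecutorService pool = Executors.newFixedThreadPool(2);

                // Each task sums one half of the array and returns its partial result.
                Callable<Integer> firstHalf  = () -> mem[0] + mem[1] + mem[2];
                Callable<Integer> secondHalf = () -> mem[3] + mem[4];

                List<Future<Integer>> parts = pool.invokeAll(List.of(firstHalf, secondHalf));
                int sum = 0;
                for (Future<Integer> part : parts) {
                    sum += part.get();             // blocks until the partial sum is ready
                }
                pool.shutdown();
                System.out.println(sum);           // 18
            }
        }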
    • Frameworks: Actors C/C++  Theron Java / JVM  Akka  Scala  GPars (Groovy)
    • GPU Computing NVIDIA CUDA  NVIDIA OpenCL  AMD  NVIDIA  Intel  Altera  Apple WebCL  Nokia  Samsung
    • Advanced courses Best practices for concurrency in Java  Java‘s java.util.concurrent  Actor models  Transactional Memory See http://www.dinkla.com
    • Advanced courses GPU Computing  NVIDIA CUDA  OpenCL  Using NVIDIA CUDA with Java  Using OpenCL with Java See http://www.dinkla.com
    • References: Practice Mattson, Sanders, Massingill  Patterns for Parallel Programming Breshears  The Art of Concurrency
    • References: Practice Pacheco  An Introduction to Parallel Programming Herlihy, Shavit  The Art of Multiprocessor Programming
    • References: Theory Gibbons, Rytter  Efficient Parallel Algorithms Lynch  Distributed Algorithms Ben-Ari  Principles of Concurrent and Distributed Programming
    • References: GPU Computing Scarpino  OpenCL in Action Sanders, Kandrot  CUDA by Example
    • References: Background Hennessy, Paterson  Computer Architecture: A Quantitative Approach