[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
http://cs264.org
Presentation Transcript

  • Massively Parallel Computing CS 264 / CSCI E-292 | Lecture #2: Architecture, Theory & Patterns | February 1st, 2011 | Nicolas Pinto (MIT, Harvard) pinto@mit.edu
  • Objectives• introduce important computational thinking skills for massively parallel computing• understand hardware limitations• understand algorithm constraints• identify common patterns
  • During this course, we’ll try to “…” and use existing material ;-) (adapted for CS264)
  • Outline• Thinking Parallel• Architecture• Programming Model• Bits of Theory• Patterns
  • Motivation • “The most economic number of components in an IC will double every year” • Historically: CPUs get faster • Hardware reaching frequency limitations • Now: CPUs get wider slide by Matthew Bolitho
  • Motivation • Rather than expecting CPUs to get twice as fast, expect to have twice as many! • Parallel processing for the masses • Unfortunately, parallel programming is hard: • Algorithms and Data Structures must be fundamentally redesigned slide by Matthew Bolitho
  • Thinking Parallel
  • Getting your feet wet• Common scenario: “I want to make the algorithm X run faster, help me!”• Q: How do you approach the problem?
  • How?
  • How?• Option 1: wait• Option 2: gcc -O3 -msse4.2• Option 3: xlc -O5• Option 4: use parallel libraries (e.g. (cu)blas)• Option 5: hand-optimize everything!• Option 6: wait more
  • What else ?
  • How about analysis ?
  • Getting your feet wet — Algorithm X v1.0, profiling analysis on input 10x10x10: total time 100 s across load_data(), foo(), bar() and yey() (roughly 50, 29, 10 and 11 s); part of the code is 100% parallelizable, the rest is sequential in nature. Q: What is the maximum speed up?
  • Getting your feet wet — same profile as above. A: 2X! :-(
  • Getting your feet wet — Algorithm X v1.0, profiling analysis on input 100x100x100: total time ≈ 9,900 s, dominated by one routine at 9,000 s with the others around 350, 250 and 300 s; same 100%-parallelizable vs. sequential-in-nature split. Q: and now?
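A rough check using Amdahl-style reasoning (assuming, as in the smaller run, that only the 100%-parallelizable routine speeds up and the sequential-in-nature parts do not): the sequential work is now about 350 + 250 + 300 = 900 s out of roughly 9,900 s, so the best possible speedup is 9,900 / 900 ≈ 11x. The same algorithm becomes far more worth parallelizing once it is profiled on a realistic input size.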
  • You need to...• ... understand the problem (duh!)• ... study the current (sequential?) solutions and their constraints• ... know the input domain• ... profile accordingly• ... “refactor” based on new constraints (hw/sw)
  • A better way? ... doesn’t scale! Speculation: (input) domain-aware optimization using some sort of probabilistic modeling?
  • Some Perspective — The “problem tree” for scientific problem solving: Technical Problem to be Analyzed → Consultation with experts → Scientific Model “A” / Model “B” → Theoretical analysis → Discretization “A” / Discretization “B” → Experiments → Iterative equation solver / Direct elimination equation solver → Parallel implementation / Sequential implementation. There are many options to try to achieve the same goal. from Scott et al. “Scientific Parallel Computing” (2005)
  • Computational Thinking• translate/formulate domain problems into computational models that can be solved efficiently by available computing resources• requires a deep understanding of their relationships adapted from Hwu & Kirk (PASI 2011)
  • Getting ready... Architecture, Programming Models, Algorithms, Languages, Patterns, Compilers → Parallel Thinking → Parallel Computing → APPLICATIONS adapted from Scott et al. “Scientific Parallel Computing” (2005)
  • Fundamental Skills• Computer architecture• Programming models and compilers• Algorithm techniques and patterns• Domain knowledge
  • Computer Architecture — critical in understanding tradeoffs btw algorithms • memory organization, bandwidth and latency; caching and locality (memory hierarchy) • floating-point precision vs. accuracy • SISD, SIMD, MISD, MIMD vs. SIMT, SPMD
  • Programming models for optimal data structure and code execution• parallel execution models (threading hierarchy)• optimal memory access patterns• array data layout and loop transformations
  • Algorithms and patterns• toolbox for designing good parallel algorithms• it is critical to understand their scalability and efficiency• many have been exposed and documented• sometimes hard to “extract”• ... but keep trying!
  • Domain Knowledge• abstract modeling• mathematical properties• accuracy requirements• coming back to the drawing board to expose more/better parallelism ?
  • You can do it!• thinking parallel is not as hard as you may think• many techniques have been thoroughly explained...• ... and are now “accessible” to non-experts !
  • Architecture
  • Architecture• What’s in a (basic) computer?• Basic Subsystems• Machine Language• Memory Hierarchy• Pipelines• CPUs to GPUs
  • What’s in a computer?adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • What’s in a computer? Processor Intel Q6600 Core2 Quad, 2.4 GHzadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • What’s in a computer? Die Processor (2×) 143 mm2 , 2 × 2 cores Intel Q6600 Core2 Quad, 2.4 GHz 582,000,000 transistors ∼ 100Wadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • What’s in a computer?adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • What’s in a computer? Memoryadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Architecture• What’s in a (basic) computer?• Basic Subsystems• Machine Language• Memory Hierarchy• Pipelines
  • A Basic Processor Memory Interface Address ALU Address Bus Data Bus Register File Flags Internal Bus Insn. fetch PC Data ALU Control Unit (loosely based on Intel 8086)adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • How all of this fits together Everything synchronizes to the Clock. Control Unit (“CU”): The brains of the Memory Interface operation. Everything connects to it. Address ALU Address Bus Data Bus Bus entries/exits are gated and Register File Flags (potentially) buffered. Internal Bus CU controls gates, tells other units Insn. fetch PC Control Unit Data ALU about ‘what’ and ‘how’: • What operation? • Which register? • Which addressing mode?adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • What is. . . an ALU? Arithmetic Logic Unit One or two operands A, B Operation selector (Op): • (Integer) Addition, Subtraction • (Logical) And, Or, Not • (Bitwise) Shifts (equivalent to multiplication by power of two) • (Integer) Multiplication, Division Specialized ALUs: • Floating Point Unit (FPU) • Address ALU Operates on binary representations of numbers. Negative numbers represented by two’s complement.adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • What is. . . a Register File? Registers are On-Chip Memory %r0 • Directly usable as operands in %r1 Machine Language %r2 • Often “general-purpose” %r3 • Sometimes special-purpose: Floating %r4 point, Indexing, Accumulator %r5 • Small: x86 64: 16×64 bit GPRs %r6 • Very fast (near-zero latency) %r7adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • How does computer memory work? One (reading) memory transaction (simplified): D0..15 Processor Memory A0..15 ¯ R/W CLKadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • How does computer memory work? One (reading) memory transaction (simplified): D0..15 Processor Memory A0..15 ¯ R/W CLK Observation: Access (and addressing) happens in bus-width-size “chunks”.adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • What is. . . a Memory Interface? Memory Interface gets and stores binary words in off-chip memory. Smallest granularity: Bus width Tells outside memory • “where” through address bus • “what” through data bus Computer main memory is “Dynamic RAM” (DRAM): Slow, but small and cheap.adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Architecture• What’s in a (basic) computer?• Basic Subsystems• Machine Language• Memory Hierarchy• Pipelines• CPUs to GPUs
  • A Very Simple Program 4: c7 45 f4 05 00 00 00 movl $0x5,−0xc(%rbp) b: c7 45 f8 11 00 00 00 movl $0x11,−0x8(%rbp) int a = 5; 12: 8b 45 f4 mov −0xc(%rbp),%eax int b = 17; 15: 0f af 45 f8 imul −0x8(%rbp),%eax int z = a ∗ b; 19: 89 45 fc mov %eax,−0x4(%rbp) 1c: 8b 45 fc mov −0x4(%rbp),%eax Things to know: • Addressing modes (Immediate, Register, Base plus Offset) • 0xHexadecimal • “AT&T Form”: (we’ll use this) <opcode><size> <source>, <dest>adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • A Very Simple Program: Intel Form 4: c7 45 f4 05 00 00 00 mov DWORD PTR [rbp−0xc],0x5 b: c7 45 f8 11 00 00 00 mov DWORD PTR [rbp−0x8],0x11 12: 8b 45 f4 mov eax,DWORD PTR [rbp−0xc] 15: 0f af 45 f8 imul eax,DWORD PTR [rbp−0x8] 19: 89 45 fc mov DWORD PTR [rbp−0x4],eax 1c: 8b 45 fc mov eax,DWORD PTR [rbp−0x4] • “Intel Form”: (you might see this on the net) <opcode> <sized dest>, <sized source> • Goal: Reading comprehension. • Don’t understand an opcode? Google “<opcode> intel instruction”.adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Machine Language Loops 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp int main() 4: c7 45 f8 00 00 00 00 movl $0x0,−0x8(%rbp) { b: c7 45 fc 00 00 00 00 movl $0x0,−0x4(%rbp) int y = 0, i ; 12: eb 0a jmp 1e <main+0x1e> 14: 8b 45 fc mov −0x4(%rbp),%eax for ( i = 0; 17: 01 45 f8 add %eax,−0x8(%rbp) y < 10; ++i) 1a: 83 45 fc 01 addl $0x1,−0x4(%rbp) y += i; 1e: 83 7d f8 09 cmpl $0x9,−0x8(%rbp) return y; 22: 7e f0 jle 14 <main+0x14> 24: 8b 45 f8 mov −0x8(%rbp),%eax } 27: c9 leaveq 28: c3 retq Things to know: • Condition Codes (Flags): Zero, Sign, Carry, etc. • Call Stack: Stack frame, stack pointer, base pointer • ABI: Calling conventionsadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Machine Language Loops — Want to make those yourself? Write myprogram.c, then: $ cc -c myprogram.c and $ objdump --disassemble myprogram.o Things to know: • Condition Codes (Flags): Zero, Sign, Carry, etc. • Call Stack: Stack frame, stack pointer, base pointer • ABI: Calling conventions adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • We know how a computer works! All of this can be built in about 4000 transistors. (e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600) So what exactly is Intel doing with the other 581,996,000 transistors? Answer:adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • We know how a computer works! All of this can be built in about 4000 transistors. (e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600) So what exactly is Intel doing with the other 581,996,000 transistors? Answer: Make things go faster!adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • We know how a computer works! All of this can be built in about 4000 transistors. (e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600) So what exactly is Intel doing with the other 581,996,000 transistors? Answer: Make things go faster! Goal now: Understand sources of slowness, and how they get addressed. Remember: High Performance Computingadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • The High-Performance Mindset — Writing high-performance codes. Mindset: What is going to be the limiting factor? • ALU? • Memory? • Communication? (if multi-machine) Benchmark the assumed limiting factor right away. Evaluate: • Know your peak throughputs (roughly) • Are you getting close? • Are you tracking the right limiting factor? adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
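A back-of-the-envelope illustration of “know your peak throughputs” (the numbers here are hypothetical, not from the lecture): a memory interface 8 bytes wide clocked at 1 GHz can move at most 8 GB/s; streaming 4-byte floats through it allows at most about 2 billion loads per second, so a kernel doing one floating-point operation per loaded value cannot exceed roughly 2 GFLOPS no matter how fast the ALUs are. Comparing measured throughput against such peaks tells you whether you are tracking the right limiting factor.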
  • Architecture• What’s in a (basic) computer?• Basic Subsystems• Machine Language• Memory Hierarchy• Pipelines• CPUs to GPUs
  • Source of Slowness: Memory Memory is slow. Distinguish two different versions of “slow”: • Bandwidth • Latency → Memory has long latency, but can have large bandwidth. Size of die vs. distance to memory: big! Dynamic RAM: long intrinsic latency!adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Source of Slowness: Memory Memory is slow. Distinguish two different versions of “slow”: • Bandwidth • Latency → Memory has long latency, but can have large bandwidth. Idea: Put a look-up table of recently-used data onto the chip. Size of die vs. distance to memory: big! → “Cache” Dynamic RAM: long intrinsic latency!adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • The Memory Hierarchy Hierarchy of increasingly bigger, slower memories: faster Registers 1 kB, 1 cycle L1 Cache 10 kB, 10 cycles L2 Cache 1 MB, 100 cycles DRAM 1 GB, 1000 cycles Virtual Memory 1 TB, 1 M cycles (hard drive) biggeradapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Impact on performance of the memory system (figure): performance of the computer system vs. size of problem being solved, dropping as the entire problem no longer fits within registers, then within cache, then within main memory, until the problem is too big and requires secondary (disk) memory. from Scott et al. “Scientific Parallel Computing” (2005)
  • The Memory Hierarchy Hierarchy of increasingly bigger, slower memories: Registers 1 kB, 1 cycle L1 Cache 10 kB, 10 cycles L2 Cache 1 MB, 100 cycles DRAM 1 GB, 1000 cycles Virtual Memory 1 TB, 1 M cycles (hard drive) How might data locality factor into this? What is a working set?adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Cache: Actual Implementation Demands on cache implementation: • Fast, small, cheap, low-power • Fine-grained • High “hit”-rate (few “misses”) Problem: Goals at odds with each other: Access matching logic expensive! Solution 1: More data per unit of access matching logic → Larger “Cache Lines” Solution 2: Simpler/less access matching logic → Less than full “Associativity” Other choices: Eviction strategy, sizeadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Cache: Associativity Direct Mapped 2-way set associative Memory Cache Memory Cache 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 4 4 5 5 6 6 . . . . . .adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Cache: Associativity Direct Mapped 2-way set associative Memory Cache Memory Cache 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 4 4 5 5 6 6 . . . . . . Miss rate versus cache size on the Integer por- tion of SPEC CPU2000 [Cantin, Hill 2003]adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Cache Example: Intel Q6600/Core2 Quad --- L1 data cache --- fully associative cache = false threads sharing this cache = 0x0 (0) processor cores on this die= 0x3 (3) system coherency line size = 0x3f (63) ways of associativity = 0x7 (7) number of sets - 1 (s) = 63 --- L1 instruction --- fully associative cache = false --- L2 unified cache --- threads sharing this cache = 0x0 (0) fully associative cache false processor cores on this die= 0x3 (3) threads sharing this cache = 0x1 (1) system coherency line size = 0x3f (63) processor cores on this die= 0x3 (3) ways of associativity = 0x7 (7) system coherency line size = 0x3f (63) number of sets - 1 (s) = 63 ways of associativity = 0xf (15) number of sets - 1 (s) = 4095 More than you care to know about your CPU: http://www.etallen.com/cpuid.htmladapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Measuring the Cache I
#include <stdlib.h>

void go(unsigned count, unsigned stride)
{
  const unsigned arr_size = 64 * 1024 * 1024;
  int *ary = (int *) malloc(sizeof(int) * arr_size);
  /* repeatedly touch every stride-th element of a large array */
  for (unsigned it = 0; it < count; ++it) {
    for (unsigned i = 0; i < arr_size; i += stride)
      ary[i] *= 17;
  }
  free(ary);
}
adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Measuring the Cache II
#include <stdlib.h>

void go(unsigned array_size, unsigned steps)
{
  int *ary = (int *) malloc(sizeof(int) * array_size);
  unsigned asm1 = array_size - 1;   /* array_size must be a power of two */
  /* hop through the array 16 ints (64 bytes, one cache line) at a time */
  for (unsigned i = 0; i < steps; ++i)
    ary[(i * 16) & asm1]++;
  free(ary);
}
adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Measuring the Cache III
#include <stdlib.h>

void go(unsigned array_size, unsigned stride, unsigned steps)
{
  char *ary = (char *) malloc(sizeof(int) * array_size);
  unsigned p = 0;
  /* walk the array with a fixed byte stride, wrapping around at the end */
  for (unsigned i = 0; i < steps; ++i) {
    ary[p]++;
    p += stride;
    if (p >= array_size)
      p = 0;
  }
  free(ary);
}
adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Mike Bauer (Stanford)
  • http://sequoia.stanford.edu/ — Tue 4/5/11: Guest Lecture by Mike Bauer (Stanford)
  • Architecture• What’s in a (basic) computer?• Basic Subsystems• Machine Language• Memory Hierarchy• Pipelines• CPUs to GPUs
  • Source of Slowness: Sequential Operation IF Instruction fetch ID Instruction Decode EX Execution MEM Memory Read/Write WB Result Writebackadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Solution: Pipeliningadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Pipelining (MIPS, 110,000 transistors)adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Issues with Pipelines Pipelines generally help performance–but not always. Possible issues: • Stalls • Dependent Instructions • Branches (+Prediction) • Self-Modifying Code “Solution”: Bubbling, extra circuitryadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Intel Q6600 Pipelineadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Intel Q6600 Pipeline New concept: Instruction-level parallelism (“Superscalar”)adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Programming for the Pipeline How to upset a processor pipeline: for (int i = 0; i < 1000; ++i) for (int j = 0; j < 1000; ++j) { if (j % 2 == 0) do_something(i, j); } . . . why is this bad? (a possible restructuring follows) adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
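One possible way to fix it, sketched below (do_something here is only a stub standing in for the slide's placeholder; this is an illustration, not part of the lecture): hoist the even/odd test into the loop structure, so the hot inner loop contains no data-dependent branch at all.

#include <stdio.h>

static long acc = 0;
static void do_something(int i, int j) { acc += i + j; }  /* stub for the slide's call */

int main(void)
{
  /* Instead of testing j % 2 == 0 on every iteration, visit only the even
   * values of j: the inner loop body now has no conditional to predict. */
  for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; j += 2)
      do_something(i, j);
  printf("%ld\n", acc);
  return 0;
}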
  • A Puzzle int steps = 256 * 1024 * 1024; int a[2] = {0, 0}; // Loop 1 for (int i = 0; i < steps; i++) { a[0]++; a[0]++; } // Loop 2 for (int i = 0; i < steps; i++) { a[0]++; a[1]++; } Which is faster? . . . and why? adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • Two useful Strategies
Loop unrolling:
  for (int i = 0; i < 1000; ++i)
    do_something(i);
→
  for (int i = 0; i < 1000; i += 2) {
    do_something(i);
    do_something(i + 1);
  }
Software pipelining:
  for (int i = 0; i < 1000; ++i) {
    do_a(i);
    do_b(i);
  }
→
  for (int i = 0; i < 1000; i += 2) {
    do_a(i);
    do_a(i + 1);
    do_b(i);
    do_b(i + 1);
  }
adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • SIMD Control Units are large and expensive. Functional Units are simple and cheap. → Increase the Function/Control ratio: control several functional units with one control unit (one SIMD Instruction Pool, one Data Pool). All execute the same operation. GCC vector extensions: typedef int v4si __attribute__ ((vector_size (16))); v4si a, b, c; c = a + b; // +, −, *, /, unary minus, ^, |, &, ~, % Will revisit for OpenCL, GPUs. adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
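A minimal self-contained sketch of the same GCC vector extension in use (GCC/Clang-specific; the brace initializers and element subscripting shown here are supported by recent GCC and Clang, but this is an illustration rather than lecture code):

#include <stdio.h>

/* Four 32-bit ints packed into one 16-byte SIMD value (GCC/Clang extension). */
typedef int v4si __attribute__ ((vector_size (16)));

int main(void)
{
  v4si a = {1, 2, 3, 4};
  v4si b = {10, 20, 30, 40};
  v4si c = a + b;              /* one vector add: all four lanes at once */

  for (int i = 0; i < 4; ++i)
    printf("%d ", c[i]);
  printf("\n");
  return 0;
}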
  • Architecture• What’s in a (basic) computer?• Basic Subsystems• Machine Language• Memory Hierarchy• Pipelines• CPUs to GPUs
  • GPUs ?! Designed for math-intensive, parallel problems! • More transistors dedicated to ALU than to flow control and data cache slide by Matthew Bolitho
  • Intro PyOpenCL What and Why? OpenCL“CPU-style” Cores CPU-“style” cores Fetch/ Out-of-order control logic Decode Fancy branch predictor ALU (Execute) Memory pre-fetcher Execution Context Data cache (A big one) SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ 13 Credit: Kayvon Fatahalian (Stanford)
  • Intro PyOpenCL What and Why? OpenCLSlimming down Slimming down Fetch/ Decode Idea #1: ALU Remove components that (Execute) help a single instruction Execution stream run fast Context SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ 14 Credit: Kayvon Fatahalian (Stanford) slide by Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCL More Space: Double the Number of Cores — two cores (two fragments in parallel), each with its own Fetch/Decode, ALU (Execute) and Execution Context, running the same fragment program on different fragments. SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCLFouragain . . . cores (four fragments in parallel) Fetch/ Fetch/ Decode Decode ALU ALU (Execute) (Execute) Execution Execution Context Context Fetch/ Fetch/ Decode Decode ALU ALU (Execute) (Execute) Execution Execution Context ContextGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ 16 Credit: Kayvon Fatahalian (Stanford) slide by Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCL Sixteen cores . . . and again (sixteen fragments in parallel): 16 cores = 16 simultaneous instruction streams. SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCL Sixteen cores . . . and again (sixteen fragments in parallel): → 16 independent instruction streams. Reality: instruction streams not actually very different/independent. 16 cores = 16 simultaneous instruction streams. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCL Recall: simple processing core (Saving Yet More Space): Fetch/Decode, ALU (Execute), Execution Context. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCL Recall: simple processing core (Saving Yet More Space). Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs → SIMD. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCL Add ALUs (Saving Yet More Space). Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs (ALU 1–8, Ctx ×8, Shared Ctx Data) → SIMD processing. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • http://www.youtube.com/watch?v=1yH_j8-VVLo Intro PyOpenCL What and Why? OpenCL Gratuitous Amounts of Parallelism!ragments in parallel 16 cores = 128 ALUs = 16 simultaneous instruction streams Credit: Shading: http://s09.idav.ucdavis.edu/ Kayvon Fatahalian (Stanford)Beyond Programmable 24 slide by Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • http://www.youtube.com/watch?v=1yH_j8-VVLo Intro PyOpenCL What and Why? OpenCL Gratuitous Amounts of Parallelism!ragments in parallel Example: 128 instruction streams in parallel 16 independent groups of 8 synchronized streams 16 cores = 128 ALUs = 16 simultaneous instruction streams Credit: Shading: http://s09.idav.ucdavis.edu/ Kayvon Fatahalian (Stanford)Beyond Programmable 24 slide by Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCLRemaining Problem: Slow Memory Problem Memory still has very high latency. . . . . . but we’ve removed most of the hardware that helps us deal with that. We’ve removed caches branch prediction out-of-order execution So what now? slide by Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCLRemaining Problem: Slow Memory Problem Memory still has very high latency. . . . . . but we’ve removed most of the hardware that helps us deal with that. We’ve removed caches branch prediction Idea #3 out-of-order execution Even more parallelism So what now? + Some extra memory = A solution! slide by Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCL Remaining Problem: Slow Memory Fetch/ Decode Problem ALU ALU ALU ALU Memory still has very high latency. . . ALU ALU ALU ALU . . . but we’ve removed most of the hardware that helps us deal with that. Ctx Ctx Ctx Ctx We’ve removedCtx Ctx Ctx Ctx caches Shared Ctx Data branch prediction Idea #3 out-of-order execution Even more parallelismv.ucdavis.edu/ So what now? + 33 Some extra memory = A solution! slide by Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCL Remaining Problem: Slow Memory Fetch/ Decode Problem ALU ALU ALU ALU Memory still has very high latency. . . ALU ALU ALU ALU . . . but we’ve removed most of the hardware that helps us deal with that. 1 2 We’ve removed caches 3 4 branch prediction Idea #3 out-of-order execution Even more parallelismv.ucdavis.edu/ now? So what + 34 Some extra memory = A solution! slide by Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCLGPU Architecture Summary Core Ideas: 1 Many slimmed down cores → lots of parallelism 2 More ALUs, Fewer Control Units 3 Avoid memory stalls by interleaving execution of SIMD groups (“warps”) Credit: Kayvon Fatahalian (Stanford) slide by Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • Is it free?! What are the consequences? • Program must be more predictable: • Data access coherency • Program flow slide by Matthew Bolitho
  • Some terminology (diagram): “shared memory” — processors P all connected through an interconnection network to a global memory M; “distributed memory” — each processor P has its own private memory M, and processors are connected by an interconnection network; now: mostly hybrid.
  • Some More Terminology — One way to classify machines distinguishes between: shared memory: global memory can be accessed by all processors or cores; information exchanged between threads using shared variables written by one thread and read by another; need to coordinate access to shared variables. distributed memory: private memory for each processor, only accessible by this processor, so no synchronization for memory accesses needed; information exchanged by sending data from one processor to another via an interconnection network using explicit communication operations.
  • Programming Model (Overview)
  • GPU ArchitectureCUDA Programming Model
  • Intro PyOpenCL What and Why? OpenCLConnection: Hardware ↔ Programming Model Fetch/ Decode Fetch/ Decode Fetch/ Decode Fetch/ Decode 32 kiB Ctx 32 kiB Ctx 32 kiB Ctx Private Private Private (“Registers”) (“Registers”) (“Registers”) 16 kiB Ctx 16 kiB Ctx 16 kiB Ctx Shared Shared Shared Fetch/ Fetch/ Fetch/ Decode Decode Decode 32 kiB Ctx 32 kiB Ctx 32 kiB Ctx 32 kiB Ctx Private Private Private Private (“Registers”) (“Registers”) (“Registers”) 16 kiB Ctx 16 kiB Ctx 16 kiB Ctx (“Registers”) Shared Shared Shared Fetch/ Fetch/ Fetch/ Decode Decode Decode 16 kiB Ctx 32 kiB Ctx Private (“Registers”) 32 kiB Ctx Private (“Registers”) 32 kiB Ctx Private (“Registers”) Shared 16 kiB Ctx Shared 16 kiB Ctx Shared 16 kiB Ctx Shared slide by Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCL Connection: Hardware ↔ Programming Model — Who cares how many cores? Idea: Program as if there were “infinitely” many cores; Program as if there were “infinitely” many ALUs per core. slide by Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCL Connection: Hardware ↔ Programming Model — Who cares how many cores? Idea: Program as if there were “infinitely” many cores and “infinitely” many ALUs per core. Consider: Which is easy to do automatically? Parallel program → sequential hardware, or Sequential program → parallel hardware? slide by Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCL Connection: Hardware ↔ Programming Model — Software representation (a grid indexed along Axis 0 and Axis 1) vs. Hardware (an array of cores, each with Fetch/Decode, 32 kiB private Ctx (“Registers”) and 16 kiB shared Ctx). slide by Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCL Connection: Hardware ↔ Programming Model — Software representation: Grid (Kernel: Function on Grid) along Axis 0 / Axis 1; (Work) Group or “Block”; (Work) Item or “Thread”. Hardware: cores with Fetch/Decode, 32 kiB private Ctx (“Registers”), 16 kiB shared Ctx. slide by Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • Intro PyOpenCL What and Why? OpenCL Connection: Hardware ↔ Programming Model — Really: a (work) group/block provides a pool of parallelism to draw from. X,Y,Z order within a group (block) matters. (Not among groups, though.) slide by Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
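One rough mental model of the software side of this picture (my own sketch, not from the slides; the names mirror the grid/block/item vocabulary above but are plain C identifiers here): a kernel launched over a grid behaves like the nested loops below, except that every iteration may run in parallel and only the items of one block share a core's on-chip memory.

#include <stdio.h>

/* The per-(work-)item body; here it just prints its coordinates. */
static void kernel_body(int bx, int by, int tx, int ty)
{
  printf("block (%d,%d) item (%d,%d)\n", bx, by, tx, ty);
}

/* Sequential stand-in for launching a kernel on a grid of blocks:
 * the two outer loops enumerate blocks (scheduled onto any core, in any
 * order), the two inner loops enumerate the items of one block (which
 * share that core's 16 kiB of shared context). */
static void launch(int grid_x, int grid_y, int block_x, int block_y)
{
  for (int bx = 0; bx < grid_x; ++bx)
    for (int by = 0; by < grid_y; ++by)
      for (int tx = 0; tx < block_x; ++tx)
        for (int ty = 0; ty < block_y; ++ty)
          kernel_body(bx, by, tx, ty);
}

int main(void) { launch(2, 2, 4, 4); return 0; }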
  • more next time ;-)
  • Bits of Theory (or “common sense”)
  • Speedup: S(p) = T(1) / T(p) • T(1): Performance of best serial algorithm • p: Number of processors • S(p) ≤ p Peter Arbenz, Andreas Adelmann, ETH Zurich
  • Efficiency: E(p) = S(p) / p = T(1) / (p T(p)) • Fraction of time for which a processor does useful work • S(p) ≤ p means E(p) ≤ 1 Peter Arbenz, Andreas Adelmann, ETH Zurich
  • Amdahl’s Law: T(p) = (α + (1 − α)/p) T(1) • α: Fraction of the program that is sequential • Assumes that the non-sequential portion of the program parallelizes optimally Peter Arbenz, Andreas Adelmann, ETH Zurich
  • Example• Sequential portion: 10 sec• Parallel portion: 990 sec• What is the maximal speedup as p → ∞ ?
  • Solution • Sequential fraction of the code: α = 10 / (10 + 990) = 1/100 = 1% • Amdahl’s Law: T(p) = (0.01 + 0.99/p) T(1) • Speedup as p → ∞: S(p) = T(1)/T(p) → 1/α = 100
  • Arithmetic Intensity • W: computational Work in floating-point operations • M: number of Memory accesses (reads and writes) • Memory access is the critical issue!
  • Memory effects — Example: memory access is the critical issue in high-performance computing. Definition 4.2: The work/memory ratio ρWM is the number of floating-point operations divided by the number of memory locations referenced (either reads or writes). A look at a book of mathematical tables tells us that π/4 = 1 − 1/3 + 1/5 − 1/7 + 1/9 − 1/11 + 1/13 − 1/15 + ···, a slowly converging series and a good example for studying the basic operation of computing the sum of a series of numbers: A = Σ_{i=1}^{N} a_i.
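A small sketch contrasting the two extremes the excerpt is about (my own illustration in C, not code from the book): summing stored numbers does roughly one floating-point operation per memory reference (ρWM ≈ 1), while the π series generates each term on the fly, so its work/memory ratio is very high.

#include <stdio.h>

/* Sum of stored numbers: one load per add, work/memory ratio about 1. */
static double array_sum(const double *a, int n)
{
  double s = 0.0;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}

/* Partial sum of pi/4 = 1 - 1/3 + 1/5 - ...: terms are computed, not
 * loaded, so almost no memory traffic per flop. */
static double leibniz_pi(int n)
{
  double s = 0.0, sign = 1.0;
  for (int i = 0; i < n; ++i) {
    s += sign / (2.0 * i + 1.0);
    sign = -sign;
  }
  return 4.0 * s;
}

int main(void)
{
  double a[4] = {1.0, 2.0, 3.0, 4.0};
  printf("sum = %g, pi ~ %g\n", array_sum(a, 4), leibniz_pi(1000000));
  return 0;
}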
  • Figure 9: Hypothetical performance of a parallel implementation of summation: speed-up vs. number of processors. Why? from Scott et al. “Scientific Parallel Computing” (2005)
  • Figure 10: Hypothetical performance of a parallel implementation of summation: efficiency vs. number of processors. Why? from Scott et al. “Scientific Parallel Computing” (2005)
  • Example — Figure 4: A simple memory model: a computational unit (with only a small amount of local memory, not shown) separated from the main memory by a pathway with limited bandwidth µ (here 1 GByte/sec). Q: How many float32 ops/sec maximum? Theorem 4.1: Suppose that a given algorithm has a work/memory ratio ρWM and it is implemented on a system as depicted in Figure 4 with a maximum bandwidth to memory of µ billion floating-point words per second. Then the maximum performance that can be achieved is µ ρWM GFLOPS. (The processing unit can’t be faster than the rate at which data are supplied, and it might be slower.)
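Plugging in the figure's numbers as a worked instance (illustrative arithmetic only): a 1 GByte/s pathway delivers 0.25 billion 4-byte floats per second, so an algorithm with ρWM = 1 (one flop per word moved, as in the simple summation) tops out at about 0.25 GFLOPS on this machine, no matter how fast the computational unit itself is.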
  • Better? — Figure 5: A memory model with a large local data cache separated from the main memory by a pathway with limited bandwidth µ. • Yes? In theory... Why? • No? Why?
  • Cache Performance — The performance of a two-level memory model (as depicted in Figure 5), consisting of a cache and a main memory, can be modeled simplistically as: average cycles per word access = %hits × (cache cycles per word access) + (1 − %hits) × (main memory cycles per word access), where %hits is the fraction of cache hits among all memory references. Figure 6 indicates the performance of a hypothetical application. from Scott et al. “Scientific Parallel Computing” (2005)
  • Cache Performance Cache Performance!"#"$%$&"()(*+,%-",.(" CD6?@?"#"&E(1&4("*/*(2"5(1".(.%1/"3-2$1+*,%-"!*/*("#",.("0%1"&"23-4("51%*(22%1"*/*(" 1.322"#"*&*F(".322"1&$("6*%+-$"#"$%$&"-+.7(1"%0"3-2$1+*,%-2" 1F3$"#"*&*F("F3$"1&$("689:"#"-+.7(1"%0"89:"3-2$1+*,%-2";(<4<"1(432$(1"="1(432$(1>" CD6?@?G?6HH"#"*/*(2"5(1"*&*F(".322"6?@?"#"-+.7(1"%0".(.%1/"&**(22"3-2$1+*,%-2";"(<4<"%&AB"2$%1(>" CD6?@?GI6!#*/*(2"5(1"*&*F("F3$"CD6"#"&E(1&4("*/*(2"5(1"3-2$1+*,%-2" ?89:"#"3-2$1+*,%-".3)"0%1"89:"3-2$1+*,%-2"CD689:"#"&E(1&4("*/*(2"5(1"89:"3-2$1+*,%-2" ??@?"#"3-2$1+*,%-".3)"0%1".(.%1/"&**(22"3-2$1+*,%-" from V. Sarkar (COMP 322, 2009)
  • Cache Performance: Example from V. Sarkar (COMP 322, 2009)
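A worked instance of the two-level formula above (hypothetical numbers chosen only for illustration): with a 95% hit rate, a 10-cycle cache and a 1,000-cycle main memory, the average cost per access is 0.95 × 10 + 0.05 × 1,000 = 59.5 cycles; missing just one access in twenty already makes memory look about six times slower than the cache.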
  • Algorithmic Parallel Complexity — TP = execution time on P processors. Computation graph abstraction (DAG): Node: arbitrary sequential computation; Edge: dependence. Assume: identical processors, executing one node at a time. adapted from V. Sarkar (COMP 322, 2009)
  • Algorithmic Parallel Complexity — TP = execution time on P processors; T1 = “work complexity” = total number of operations performed. adapted from V. Sarkar (COMP 322, 2009)
  • Algorithmic Parallel Complexity — T∞ = “step complexity” = minimum number of steps with unlimited processors; also called critical path length or computational depth. adapted from V. Sarkar (COMP 322, 2009)
  • Algorithmic Parallel Complexity — Lower bounds: TP ≥ T1 / P and TP ≥ T∞. adapted from V. Sarkar (COMP 322, 2009)
  • Algorithmic Parallel Complexity — Parallelism (i.e. ideal speed-up): T1 / T∞. adapted from V. Sarkar (COMP 322, 2009)
  • Example 1: Array Sum (sequential version) • Problem: compute the sum of the elements X[0] … X[n-1] of array X • Sequential algorithm: sum = 0; for (i = 0; i < n; i++) sum += X[i]; • Computation graph: 0 → +X[0] → +X[1] → +X[2] → … • Work = O(n), Span = O(n), Parallelism = O(1) • How can we design an algorithm (computation graph) with more parallelism? adapted from V. Sarkar (COMP 322, 2009)
  • Example 1: Array Sum (parallel iterative version) • Computation graph for n = 8: first X[0]+X[1], X[2]+X[3], X[4]+X[5], X[6]+X[7] in parallel, then X[0]+X[2] and X[4]+X[6], then X[0]+X[4] • Extra dependence edges due to the forall construct • Work = O(n), Span = O(log n), Parallelism = O(n / log n) adapted from V. Sarkar (COMP 322, 2009)
  • Example 1: Array Sum (parallel recursive version) • Computation graph for n = 8: a balanced binary tree of + operations over X[0] … X[7] • Work = O(n), Span = O(log n), Parallelism = O(n / log n) • No extra dependences as in the forall case adapted from V. Sarkar (COMP 322, 2009)
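A sketch of the recursive version in C with OpenMP tasks (one possible realization, not the lecture's own code): the two halves are independent, so the work stays O(n) while the span drops to O(log n).

#include <stdio.h>

/* Divide-and-conquer array sum: the two recursive halves do not depend on
 * each other, so they can run as parallel tasks.
 * Work = O(n), span (critical path) = O(log n). */
static long sum(const int *x, int lo, int hi)
{
  if (hi - lo <= 1024) {                 /* small ranges: plain serial loop */
    long s = 0;
    for (int i = lo; i < hi; ++i) s += x[i];
    return s;
  }
  int mid = lo + (hi - lo) / 2;
  long left, right;
  #pragma omp task shared(left)          /* left half as a separate task */
  left = sum(x, lo, mid);
  right = sum(x, mid, hi);               /* right half on the current thread */
  #pragma omp taskwait                   /* join before combining */
  return left + right;
}

int main(void)
{
  enum { N = 1 << 20 };
  static int x[N];
  for (int i = 0; i < N; ++i) x[i] = 1;

  long total;
  #pragma omp parallel
  #pragma omp single                     /* one thread starts the recursion */
  total = sum(x, 0, N);

  printf("%ld\n", total);
  return 0;
}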
  • Patterns
  • Task vs Data Parallelism
  • Task parallelism • Distribute the tasks across processors based on dependency • Coarse-grain parallelism (figure: a task dependency graph of Tasks 1–9 and their assignment over time across 3 processors P1–P3)
  • Data parallelism • Run a single kernel over many elements – Each element is independently updated – Same operation is applied on each element • Fine-grain parallelism – Many lightweight threads, easy to switch context – Maps well to ALU-heavy architecture: GPU (figure: one kernel applied to data elements on P1 … Pn)
  • Task vs. Data parallelism• Task parallel – Independent processes with little communication – Easy to use • “Free” on modern operating systems with SMP• Data parallel – Lots of data on which the same computation is being executed – No dependencies between data elements in each step in the computation – Can saturate many ALUs – But often requires redesign of traditional algorithms 4 slide by Mike Houston
  • CPU vs. GPU• CPU – Really fast caches (great for data reuse) – Fine branching granularity – Lots of different processes/threads – High performance on a single thread of execution• GPU – Lots of math units – Fast access to onboard memory – Run a program on each fragment/vertex – High throughput on parallel tasks• CPUs are great for task parallelism• GPUs are great for data parallelism slide by Mike Houston 5
  • GPU-friendly Problems• Data-parallel processing• High arithmetic intensity –Keep GPU busy all the time –Computation offsets memory latency• Coherent data access –Access large chunk of contiguous memory –Exploit fast on-chip shared memory 161
  • The Algorithm Matters
• Jacobi: Parallelizable
  for (int i = 1; i < num - 1; i++) { v_new[i] = (v_old[i-1] + v_old[i+1]) / 2.0; }
• Gauss-Seidel: Difficult to parallelize
  for (int i = 1; i < num - 1; i++) { v[i] = (v[i-1] + v[i+1]) / 2.0; }
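Spelling the difference out (a small sketch with names of my own choosing, not from the slide): Jacobi writes into a separate buffer, so every i reads only old values and the loop iterations are independent; Gauss-Seidel reuses v[i-1] from the current sweep, creating a serial chain of dependences.

#include <stdio.h>

/* One Jacobi sweep: reads v_old, writes v_new, so all iterations are
 * independent and can be distributed across threads. */
static void jacobi_sweep(const double *v_old, double *v_new, int n)
{
  for (int i = 1; i < n - 1; ++i)
    v_new[i] = 0.5 * (v_old[i - 1] + v_old[i + 1]);
  v_new[0] = v_old[0];                  /* keep boundary values */
  v_new[n - 1] = v_old[n - 1];
}

/* One Gauss-Seidel sweep: v[i] uses the freshly updated v[i-1], so
 * iteration i depends on iteration i-1. */
static void gauss_seidel_sweep(double *v, int n)
{
  for (int i = 1; i < n - 1; ++i)
    v[i] = 0.5 * (v[i - 1] + v[i + 1]);
}

int main(void)
{
  double a[8] = {0, 1, 2, 3, 4, 5, 6, 7}, b[8];
  jacobi_sweep(a, b, 8);
  gauss_seidel_sweep(a, 8);
  printf("%g %g\n", b[3], a[3]);
  return 0;
}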
  • Example: Reduction
• Serial version (O(N)):
  for (int i = 1; i < N; i++) { v[0] += v[i]; }
• Parallel version (O(log N)):
  width = N/2;
  while (width >= 1) {
    for (int i = 0; i < width; i++) {
      v[i] += v[i + width];   // computed in parallel
    }
    width /= 2;
  }
• The Importance of Data Parallelism for GPUs • GPUs are designed for highly parallel tasks like rendering • GPUs process independent vertices and fragments – Temporary registers are zeroed – No shared or static data – No read-modify-write buffers – In short, no communication between vertices or fragments • Data-parallel processing – GPU architectures are ALU-heavy • Multiple vertex & pixel pipelines • Lots of compute power – GPU memory systems are designed to stream data • Linear access patterns can be prefetched • Hide memory latency slide by Mike Houston
• [Figure: Flynn's matrix of single/multiple instruction streams vs. single/multiple data streams: SISD, SIMD, MISD, MIMD] slide by Matthew Bolitho
• Flynn’s Taxonomy Early classification of parallel computing architectures given by M. Flynn (1972) using number of instruction streams and data streams. Still used. • Single Instruction Single Data (SISD) conventional sequential computer with one processor, single program and data storage. • Multiple Instruction Single Data (MISD) used for fault tolerance (Space Shuttle) - from Wikipedia • Single Instruction Multiple Data (SIMD) each processing element uses the same instruction applied synchronously in parallel to different data elements (Connection Machine, GPUs). If-then-else statements take two steps to execute. • Multiple Instruction Multiple Data (MIMD) each processing element loads separate instructions and separate data elements; processors work asynchronously. Since 2006 top ten supercomputers of this type (w/o 10K node SGI Altix Columbia at NASA Ames) Update: Single Program Multiple Data (SPMD) autonomous processors executing same program but not in lockstep. Most common style of programming. adapted from Berger & Klöckner (NYU 2010)
  • Finding Concurrency
• Serial algorithms can be made parallel by decomposition. Must understand the parts of the algorithm that are parallel. slide by Matthew Bolitho
• Decomposition: Task Decomposition, Data Decomposition. Dependency Analysis: Group Tasks, Order Tasks, Data Sharing. see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
• Algorithms can be decomposed by both task and data. Task: find groups of instructions that can be executed in parallel. Data: find partitions in the data that can be used (relatively) independently. see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
• Decomposition: Task Decomposition, Data Decomposition. Dependency Analysis: Group Tasks, Order Tasks, Data Sharing. see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
  • ! !"#$%&()*(#$+,-.)*/(#"0(1."0(!"#$%&#( )*&+"$,+)#*& )*#)(#-(2-$#).3$%4(."05"0") ! 6+7(8,$9:$#-(;%"#/.9< ! =,/5:)>.?-#).,"#$@,-9< ! =,/5:)A,)#).,"#$@,-9< ! =,/5:);.*0-#$@,-9< ! =,/5:)B.+*?,:-< ! =,/5:)B,"C,"0."+@,-9< ! D50#)E,<.).,"<!"0>$,9.).<see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
• Decomposition: Task Decomposition, Data Decomposition. Dependency Analysis: Group Tasks, Order Tasks, Data Sharing. see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
  • ! !"#$%&()*(#$+,-.)*/(),(1."0(K#%<(),( %-"+)+)#*+./0-+- ! 6+7(8#)-.L(8:$).5$.9#).,"7(=,$:/"<(#"0(A,K< 1 2see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
  • ! !"#$%&()*(#$+,-.)*/(),(1."0(K#%<(),( %-"+)+)#*+./0-+- ! 6+7(8#)-.L(8:$).5$.9#).,"7(C$,9G< 1 2see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
  • ! !50%"0%*".7%:"7#%-)%3()*+)#%".7% 6,;.%"96)0,-5* ! 4)*-,*#%3"-"%3()*+)#%"#,97 ! 4)*-,*#%-"#$#%3()*+)#%"#,97 ! 4)*-,*#%<)-5= ! 4)*-,*#%.,-50=see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
• Decomposition: Task Decomposition, Data Decomposition. Dependency Analysis: Group Tasks, Order Tasks, Data Sharing. see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
• Once the algorithm has been decomposed into data and tasks: analyze the interactions. slide by Matthew Bolitho
  • ! !)%"#%-5%*"."6*.-%)A%3+.3.(,#% A,.3%-"#$#%-5"-%"0%#,*,9"0%".3%60)1+%-5* ! !5.%"."97?%().#-0",.-#%-)%3-0*,.%".7% .(##"07%)030see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
  • ! !"#$%&$#($#)%*%+$)$*#",#-$.$*-$*/0$&# ,0*-#%&1&#(%#%2$#&0)03%2#%*-#+2"4.#($) ! 5+6#7"3$/43%2#89*%)0/& ! :").4$;0<2%0"*%3="2/$& ! :").4$>"%0"*%3="2/$& ! :").4$80($-2%3="2/$& ! :").4$?$0+(<"42& ! :").4$?"*@"*-0*+="2/$& ! A.-%$B"&00"*&C*-;$3"/00$&see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
  • ! :").4$#@"*-$-#="2/$& ! :").4$;0<2%0"*%3="2/$& ! :").4$>"%0"*%3="2/$& ! :").4$80($-2%3="2/$& ! :").4$#?$0+(<"42& ! :").4$#?"*D@"*-0*+#="2/$& ! A.-%$B"&00"*&C*-;$3"/00$&see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
  • ! E*/$#+2"4.&#",#%&1&#%2$#0-$*0,0$-F#-%%#,3"G# /"*&2%0*&#$*,"2/$#%#.%20%3#"2-$26 ?$0+(<"2#H0& @"*-$-#="2/$& ?"*#@"*-$-#="2/$& A.-%$#B"&00"*&#%*-#;$3"/00$&see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
  • 8$/")."&00"* 8$.$*-$*/9#C*%39&0& !%&1#8$/")."&00"* I2"4.#!%&1& 8%%#8$/")."&00"* E2-$2#!%&1& 8%%#J(%20*+see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
  • ! !"#$%&()*++,%-(.$($.%/(-0&1%-2%)131%".% &()*)*-"1%-2%.)%($%*.$")*2*$.4%"+,5$%)6$% !"#"$%&"()*$)6)%-##0(1see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
• Data sharing can be categorized as: Read-Only, Effectively-Local, Read-Write, Accumulate, Multiple-Read/Single-Write. see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
  • +,"!-.)/0 ! 7)%*1%($.4%80)%"-)%E(*))$" ! F-%#-"1*1)$"#,%&(-8+$A1 ! :$&+*#)*-"%*"%.*1)(*80)$.%1,1)$Asee Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
• Effectively-Local: it is read and written. It is partitioned into subsets, one task per subset. Can distribute subsets. see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
  • +,"!-6(#, ! 7)%*1%($.%".%E(*))$" ! B",%)131%##$11%A",%.) ! G-"1*1)$"#,%*110$1 ! B-1)%.*22*#0+)%)-%.$+%E*)6see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
  • ! :8(;$/%<#=(-&,8#=2/-,$/,9(-,14 %()*>4/5 3 %()*>4/5 4 :??%9-,@%/5* A19(/see Mattson et al “Patterns for Parallel Programming“ (2004) slide by Matthew Bolitho
• Example: Molecular Dynamics. Task graph (Bonded Forces, Non-Bonded Forces, Update Positions and Velocities) annotated with shared data: Neighbor List, Forces. slide by Matthew Bolitho
• Example: Molecular Dynamics. Task graph (Bonded Forces, Non-Bonded Forces, Update Positions and Velocities) annotated with shared data: Neighbor List, Atomic Coordinates, Forces. slide by Matthew Bolitho
  • Useful patterns (for reference)
  • Embarrassingly Parallel yi = fi (xi )where i ∈ {1, . . . , N}.Notation: (also for rest of this lecture) • xi : inputs • yi : outputs • fi : (pure) functions (i.e. no side effects) slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
• Embarrassingly Parallel. When does a function have a “side effect”? In addition to producing a value, it modifies non-local state, or has an observable interaction with the outside world. yi = fi (xi ) where i ∈ {1, . . . , N}. Notation: (also for rest of this lecture) • xi : inputs • yi : outputs • fi : (pure) functions (i.e. no side effects) slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Embarrassingly Parallel yi = fi (xi )where i ∈ {1, . . . , N}.Notation: (also for rest of this lecture) • xi : inputs • yi : outputs • fi : (pure) functions (i.e. no side effects) Often: f1 = · · · = fN . Then • Lisp/Python function map • C++ STL std::transform slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Embarrassingly Parallel: Graph Representation x0 x1 x2 x3 x4 x5 x6 x7 x8 f0 f1 f2 f3 f4 f5 f6 f7 f8 y0 y1 y2 y3 y4 y5 y6 y7 y8 Trivial? Often: no. slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
• Embarrassingly Parallel: Examples Surprisingly useful: • Element-wise linear algebra: Addition, scalar multiplication (not inner product) • Image Processing: Shift, rotate, clip, scale, . . . • Monte Carlo simulation • (Brute-force) Optimization • Random Number Generation • Encryption, Compression (after blocking) • Software compilation • make -j8 But: Still needs a minimum of coordination. How can that be achieved? slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
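As a minimal illustration (a sketch, not taken from the slides), the whole pattern is just a map with a pure per-element function, e.g. using the std::transform call named above; each element could equally well be handled by one GPU thread:

#include <algorithm>
#include <cmath>
#include <vector>

int main() {
    std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> y(x.size());
    // y[i] = f(x[i]); no element depends on any other, so the loop
    // (or a kernel with one element per thread) parallelizes trivially.
    std::transform(x.begin(), x.end(), y.begin(),
                   [](float v) { return std::sqrt(v); });
    return 0;
}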
  • Mother-Child ParallelismMother-Child parallelism: Send initial data Children Mother 0 1 2 3 4 Collect results(formerly called “Master-Slave”) slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Embarrassingly Parallel: Issues • Process Creation: Dynamic/Static? • MPI 2 supports dynamic process creation • Job Assignment (‘Scheduling’): Dynamic/Static? • Operations/data light- or heavy-weight? • Variable-size data? • Load Balancing: • Here: easy slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Partition yi = fi (xi−1, xi , xi+1)where i ∈ {1, . . . , N}.Includes straightforward generalizations to dependencies on a larger(but not O(P)-sized!) set of neighbor inputs. slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Partition: Graphx0 x1 x2 x3 x4 x5 x6 y1 y2 y3 y4 y5 slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
• Partition: Examples • Time-marching (in particular: PDE solvers) • (Including finite differences → HW3!) • Iterative Methods • Solve Ax = b (Jacobi, . . . ) • Optimization (all P on single problem) • Eigenvalue solvers • Cellular Automata (Game of Life :-) slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
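A minimal sketch of the partition idea (hypothetical names; a serial loop stands in for P workers): the index range is split into contiguous chunks, and each chunk needs only one "halo" value from each neighboring chunk, as in the stencil formula above:

#include <algorithm>
#include <vector>

void partitioned_update(const std::vector<float>& x, std::vector<float>& y, int P) {
    const int n = static_cast<int>(x.size());
    for (int p = 0; p < P; p++) {                      // each p could be a separate worker
        const int lo = p * n / P, hi = (p + 1) * n / P; // this worker's chunk
        for (int i = std::max(lo, 1); i < std::min(hi, n - 1); i++) {
            y[i] = x[i - 1] + x[i] + x[i + 1];         // reads at most one halo value per side
        }
    }
}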
  • Partition: Issues • Only useful when the computation is mainly local • Responsibility for updating one datum rests with one processor • Synchronization, Deadlock, Livelock, . . . • Performance Impact • Granularity • Load Balancing: Thorny issue • → next lecture • Regularity of the Partition? slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Pipelined Computation y = fN (· · · f2(f1(x)) · · · ) = (fN ◦ · · · ◦ f1)(x)where N is fixed. slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Pipelined Computation: Graph f1 f1 f2 f3 f4 f6x y Processor Assignment? slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
• Pipelined Computation: Examples • Image processing • Any multi-stage algorithm • Pre/post-processing or I/O • Out-of-Core algorithms Specific simple examples: • Sorting (insertion sort) • Triangular linear system solve (‘backsubstitution’) • Key: Pass on values as soon as they’re available (will see more efficient algorithms for both later) slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Pipelined Computation: Issues • Non-optimal while pipeline fills or empties • Often communication-inefficient • for large data • Needs some attention to synchronization, deadlock avoidance • Can accommodate some asynchrony But don’t want: • Pile-up • Starvation slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Reductiony = f (· · · f (f (x1, x2), x3), . . . , xN ) where N is the input size. Also known as. . . • Lisp/Python function reduce (Scheme: fold) • C++ STL std::accumulate slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
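A minimal usage example of the std::accumulate form mentioned above (the values are taken from the worked example later in these slides):

#include <numeric>
#include <vector>

int main() {
    std::vector<int> x = {3, 1, 7, 0, 4, 1, 6, 3};
    int sum = std::accumulate(x.begin(), x.end(), 0);  // f = +, neutral element 0
    return sum == 25 ? 0 : 1;                          // the sum is 25
}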
  • Reduction: Graphx1 x2 x3 x4 x5 x6 y Painful! Not parallelizable. slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
• Approach to Reduction Can we do better? “Tree” very imbalanced. What property of f would allow ‘rebalancing’ ? f (f (x, y ), z) = f (x, f (y , z)) Looks less improbable if we let x ◦ y = f (x, y ): (x ◦ y ) ◦ z = x ◦ (y ◦ z) Has a very familiar name: Associativity slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Reduction: A Better Graph x0 x1 x2 x3 x4 x5 x6 x7 yProcessor allocation? slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
• Mapping Reduction to the GPU • Obvious: Want to use tree-based approach. • Problem: Two scales, Work group and Grid • Need to occupy both to make good use of the machine. • In particular, need synchronization after each tree stage. • Solution: Use a two-scale algorithm. Solution: Kernel Decomposition: avoid global sync by decomposing the computation into multiple kernel invocations; in the case of reductions, the code for all levels is the same (recursive kernel invocation). In particular: Use multiple grid invocations to achieve inter-workgroup synchronization. [Figure: two-level tree reduction of the values 3 1 7 0 4 1 6 3; level 0: 8 blocks each producing a partial sum (4 7 5 9, then 11 14), level 1: 1 block combining them into 25] With material by M. Harris (Nvidia Corp.) slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Interleaved Addressing Parallel Reduction: Interleaved Addressing Values (shared memory) 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2 Step 1 Thread Stride 1 IDs 0 2 4 6 8 10 12 14 Values 11 1 7 -1 -2 -2 8 5 -5 -3 9 7 11 11 2 2 Step 2 Thread Stride 2 IDs 0 4 8 12 Values 18 1 7 -1 6 -2 8 5 4 -3 9 7 13 11 2 2 Step 3 Thread Stride 4 IDs 0 8 Values 24 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2 Step 4 Thread 0 Stride 8 IDs Values 41 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2Issue: Slow modulo, Divergence 8 With material by M. Harris (Nvidia Corp.) slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
• Parallel Reduction: Sequential Addressing Values (shared memory): 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2 Step 1 (Stride 8, Thread IDs 0-7) Values: 8 -2 10 6 0 9 3 7 -2 -3 2 7 0 11 0 2 Step 2 (Stride 4, Thread IDs 0-3) Values: 8 7 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2 Step 3 (Stride 2, Thread IDs 0-1) Values: 21 20 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2 Step 4 (Stride 1, Thread ID 0) Values: 41 20 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2 Sequential addressing is conflict free. Better! But still not “efficient”: only half of all work items after the first round, then a quarter, . . . With material by M. Harris (Nvidia Corp.) slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
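A minimal CUDA sketch of the sequential-addressing scheme above (kernel name and launch parameters are illustrative; a power-of-two block size and one input element per thread are assumed). Each block reduces its tile in shared memory and writes one partial sum; the kernel is then launched again on the partial sums until a single value remains, which is the kernel-decomposition idea from the previous slide:

__global__ void reduce_seq(const float* in, float* out, unsigned int n) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;        // load one element per thread
    __syncthreads();
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];  // sequential addressing: contiguous, conflict-free
        __syncthreads();                        // synchronize after each tree stage
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial sum per block
}
// Launched e.g. as: reduce_seq<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_out, n);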
  • Reduction: Examples• Sum, Inner Product, Norm • Occurs in iterative methods• Minimum, Maximum• Data Analysis • Evaluation of Monte Carlo Simulations• List Concatenation, Set Union• Matrix-Vector product (but. . . ) slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Reduction: Issues • When adding: floating point cancellation? • Serial order goes faster: can use registers for intermediate results • Requires availability of neutral element • GPU-Reduce: Optimization sensitive to data type slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Map-Reducey = f (· · · f (f (g (x1), g (x2)), g (x3)), . . . , g (xN )) where N is the input size. • Lisp naming, again • Mild generalization of reduction slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Map-Reduce: Graph x0 x1 x2 x3 x4 x5 x6 x7g g g g g g g g y slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • MapReduce: DiscussionMapReduce ≥ map + reduce: • Used by Google (and many others) for large-scale data processing • Map generates (key, value) pairs • Reduce operates only on pairs with identical keys • Remaining output sorted by key • Represent all data as character strings • User must convert to/from internal repr. • Messy implementation • Parallelization, fault tolerance, monitoring, data management, load balance, re-run “stragglers”, data locality • Works for Internet-size data • Simple to use even for inexperienced users slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
• MapReduce: Examples • String search • Hit count from logs (e.g. per URL) • Reverse web-link graph • desired: (target URL, sources) • Sort • Indexing • desired: (word, document IDs) • Machine Learning, Clustering, . . . slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
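A minimal single-process sketch of the map/reduce structure for the counting-style examples above (hypothetical word-count code, not Google's implementation): the map phase emits (key, value) pairs, pairs with identical keys are grouped, and the reduce phase sums each group; here both phases are folded into one serial pass:

#include <map>
#include <sstream>
#include <string>
#include <vector>

std::map<std::string, int> word_count(const std::vector<std::string>& docs) {
    std::map<std::string, int> counts;              // grouping by key
    for (const auto& doc : docs) {                  // map: emit (word, 1) per word
        std::istringstream words(doc);
        std::string w;
        while (words >> w) counts[w] += 1;          // reduce: sum the values per key
    }
    return counts;
}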
• Scan y1 = x1, y2 = f (y1, x2), . . . , yN = f (yN−1, xN ) where N is the input size. • Also called “prefix sum”. • Or cumulative sum (‘cumsum’) in Matlab/NumPy. slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
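A minimal sequential sketch of the recurrence above with f = + (an inclusive scan; this is what std::partial_sum computes), which makes the loop-carried dependence visible:

#include <cstddef>
#include <vector>

std::vector<float> inclusive_scan(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    float running = 0.0f;
    for (std::size_t i = 0; i < x.size(); i++) {
        running += x[i];         // y[i] = f(y[i-1], x[i]): depends on the previous iteration
        y[i] = running;
    }
    return y;
}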
  • Scan: Graph x0 x1 x2 x3 x4 x5 y1 y2Id y3 Id Id y4 Id Id y5 Id y0 y1 y2 y3 y4 y5 slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
• Scan: Graph This can’t possibly be parallelized. Or can it? Again: Need assumptions on f . Associativity, commutativity. [Same dependence graph as before: x0 … x5 feeding y0 … y5 through a chain of Id nodes] slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Scan: Implementation Work-efficient? slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Scan: Implementation IITwo sweeps: Upward, downward,both tree-shapeOn upward sweep: • Get values L and R from left and right child • Save L in local variable Mine • Compute Tmp = L + R and pass to parentOn downward sweep: • Get value Tmp from parent • Send Tmp to left child • Sent Tmp+Mine to right child slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Scan: Implementation IITwo sweeps: Upward, downward,both tree-shapeOn upward sweep: • Get values L and R from left and right child • Save L in local variable Mine • Compute Tmp = L + R and pass to parentOn downward sweep: • Get value Tmp from parent • Send Tmp to left child Work-efficient? • Sent Tmp+Mine to right child Span rel. to first attempt? slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
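An array-based sketch of the two-sweep idea above (an exclusive +-scan in place, assuming n is a power of two; the recursion over tree nodes is flattened into stride loops). This is only a serial rendering of the data flow: within each stride level the iterations are independent and could run in parallel:

#include <cstddef>

void exclusive_scan(float* a, std::size_t n) {
    // Upward sweep: each internal node accumulates the sum of its subtree (Tmp = L + R).
    for (std::size_t stride = 1; stride < n; stride *= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            a[i] += a[i - stride];
    a[n - 1] = 0.0f;                                // identity value at the root
    // Downward sweep: pass the prefix (Tmp) to the left child, Tmp + Mine to the right.
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            float mine = a[i - stride];             // "Mine"
            a[i - stride] = a[i];                   // left child gets Tmp
            a[i] += mine;                           // right child gets Tmp + Mine
        }
}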
• Scan: Examples • Anything with a loop-carried dependence • One row of Gauss-Seidel • One row of triangular solve • Segment numbering if boundaries are known • Low-level building block for many higher-level algorithms • FIR/IIR Filtering • G.E. Blelloch: Prefix Sums and their Applications slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
  • Scan: Issues • Subtlety: Inclusive/Exclusive Scan • Pattern sometimes hard to recognize • But shows up surprisingly often • Need to prove associativity/commutativity • Useful in Implementation: algorithm cascading • Do sequential scan on parts, then parallelize at coarser granularities slide from Berger & Klöckner (NYU 2010) Embarrassing Partition Pipelines Reduction Scan
• Divide and Conquer yi = fi (x1, . . . , xN ) for i ∈ {1, . . . , M}. Main purpose: A way of partitioning up fully dependent tasks. [Figure: x0 … x7 recursively split into halves, processed, and merged back into y0 … y7] Processor allocation? slide from Berger & Klöckner (NYU 2010) D&C General
  • Divide and Conquer: Examples • GEMM, TRMM, TRSM, GETRF (LU) • FFT • Sorting: Bucket sort, Merge sort • N-Body problems (Barnes-Hut, FMM) • Adaptive IntegrationMore fun with work and span:D&C analysis lecture slide from Berger & Klöckner (NYU 2010) D&C General
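A minimal recursive sketch of the divide-and-conquer skeleton (a hypothetical summation example with an arbitrary leaf cutoff): the two recursive calls are independent and could run in parallel, while the leaf does enough serial work that the deep levels stay busy:

#include <cstddef>

float dc_sum(const float* x, std::size_t n) {
    if (n <= 1024) {                               // leaf: plain serial work below the cutoff
        float s = 0.0f;
        for (std::size_t i = 0; i < n; i++) s += x[i];
        return s;
    }
    const std::size_t half = n / 2;                // divide
    float left = dc_sum(x, half);                  // independent subproblem
    float right = dc_sum(x + half, n - half);      // independent subproblem
    return left + right;                           // conquer / merge
}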
• Divide and Conquer: Issues • “No idea how to parallelize that” • → Try D&C • Non-optimal during partition, merge • But: Does not matter if deep levels do heavy enough processing • Subtle to map to fixed-width machines (e.g. GPUs) • Varying data size along tree • Bookkeeping nontrivial for non-2^n sizes • Side benefit: D&C is generally cache-friendly slide from Berger & Klöckner (NYU 2010) D&C General