[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics

Published on http://cs264.org
  1. Massively Parallel Computing, CS 264 / CSCI E-292. Lecture #3: GPU Programming with CUDA | February 8th, 2011. Nicolas Pinto (MIT, Harvard), pinto@mit.edu
  2. Administrivia • New here? Welcome! • HW0: Forum, RSS, Survey • Lecture 1 & 2 slides posted • Project teams allowed (up to 2 students) • innocentive-like / challenge-driven? • HW1: out tonight/tomorrow, due Fri 2/18/11 • New guest lecturers! • Wen-mei Hwu (UIUC/NCSA), Cyrus Omar (CMU), Cliff Wooley (NVIDIA), Richard Lethin (Reservoir Labs), James Malcom (Accelereyes), David Cox (Harvard)
  3. During this course, we’ll try to adapt and use existing material (“adapted for CS264”) ;-)
  4. Today, yey!!
  5. Objectives • Get you started with GPU programming • Introduce CUDA • “20,000 foot view” • Get used to the jargon... • ...with just enough details • Point to relevant external resources
  6. Outline • Thinking Parallel (review) • Why GPUs? • CUDA Overview • Programming Model • Threading/Execution Hierarchy • Memory/Communication Hierarchy • CUDA Programming
  7. Outline (repeated; section marker: Thinking Parallel)
  8. Review: Thinking Parallel (last week)
  9. Getting your feet wet • Common scenario: “I want to make the algorithm X run faster, help me!” • Q: How do you approach the problem?
  10. How?
  11. How? • Option 1: wait • Option 2: gcc -O3 -msse4.2 • Option 3: xlc -O5 • Option 4: use parallel libraries (e.g. (cu)blas) • Option 5: hand-optimize everything! • Option 6: wait more
  12. What else?
  13. How about analysis?
  14. Getting your feet wet. Algorithm X v1.0 profiling analysis on input 10x10x10. [Bar chart, time (s) per function: load_data(), foo(), bar(), yey(); bars of 50, 29, 10, and 11 s, 100 s in total. Half of the total is labeled “100% parallelizable”, the other half “sequential in nature”.] Q: What is the maximum speedup?
  15. Getting your feet wet. Same profiling chart as above. A: 2X! :-(
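Why 2X: by Amdahl’s law, if a fraction p of the runtime can be parallelized over N processors, the overall speedup is bounded by

    S(N) = 1 / ((1 - p) + p / N)

Here p = 50/100 = 0.5 (only half of the 100 s profile is parallelizable), so even as N → ∞, S ≤ 1 / (1 - 0.5) = 2.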
  16. You need to... • ...understand the problem (duh!) • ...study the current (sequential?) solutions and their constraints • ...know the input domain • ...profile accordingly • ...“refactor” based on new constraints (hw/sw)
  17. Some perspective: the “problem tree” for scientific problem solving. [Figure: Technical problem to be analyzed → consultation with experts → scientific model “A” or “B” (theoretical analysis) → discretization “A” or “B” (experiments) → iterative equation solver or direct elimination equation solver → parallel or sequential implementation. “There are many options to try to achieve the same goal.”] from Scott et al., “Scientific Parallel Computing” (2005)
  18. Computational thinking • translate/formulate domain problems into computational models that can be solved efficiently by available computing resources • requires a deep understanding of their relationships. adapted from Hwu & Kirk (PASI 2011)
  19. Getting ready... [Word cloud: Architecture, Algorithms, Programming Models, Languages, Patterns, Compilers → Parallel Thinking → Parallel Computing → APPLICATIONS] adapted from Scott et al., “Scientific Parallel Computing” (2005)
  20. You can do it! • thinking parallel is not as hard as you may think • many techniques have been thoroughly explained... • ...and are now “accessible” to non-experts!
  21. Outline (repeated; section marker: Why GPUs?)
  22. Why GPUs?
  23. Motivation • “The most economic number of components on an IC will double every year” (Moore) • Historically: GPUs get faster • Hardware is reaching frequency limitations • Now: GPUs get wider. slide by Matthew Bolitho
  24. Motivation?
  25. Motivation. GPU Fact: nobody cares about theoretical peak. Challenge: harness GPU power for real application performance. [Chart: GFLOPS over time, GPU vs. CPU]
  26. Rather than expecting GPUs to get twice as fast, expect to have twice as many! • Parallel processing for the masses • Unfortunately, parallel programming is hard! • Algorithms and data structures must be fundamentally redesigned. slide by Matthew Bolitho
  27. Task vs Data Parallelism: CPUs vs GPUs
  28. Task parallelism • Distribute the tasks across processors based on dependency • Coarse-grain parallelism. [Figure: a task dependency graph of Tasks 1–9 and their assignment over time across 3 processors P1–P3]
  29. Data parallelism • Run a single kernel over many elements – Each element is independently updated – Same operation is applied on each element • Fine-grain parallelism – Many lightweight threads, easy to switch context – Maps well to ALU-heavy architecture: GPU. [Figure: one kernel applied to a stream of data elements on P1, P2, ..., Pn]
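A minimal CUDA sketch of this pattern (a hypothetical kernel, not from the slides): one lightweight thread per element, all applying the same operation independently.

    // Data parallelism: each thread updates exactly one element,
    // with no dependencies between elements.
    __global__ void scale(int n, float a, float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (i < n)                                      // grid may overshoot n
            x[i] = a * x[i];
    }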
  30. Task vs. data parallelism • Task parallel – Independent processes with little communication – Easy to use: “free” on modern operating systems with SMP • Data parallel – Lots of data on which the same computation is being executed – No dependencies between data elements in each step of the computation – Can saturate many ALUs – But often requires redesign of traditional algorithms. slide by Mike Houston
  31. CPU vs. GPU • CPU – Really fast caches (great for data reuse) – Fine branching granularity – Lots of different processes/threads – High performance on a single thread of execution • GPU – Lots of math units – Fast access to onboard memory – Run a program on each fragment/vertex – High throughput on parallel tasks • Design target for CPUs: make a single thread very fast, hide control away from the programmer • GPU computing takes a different approach: throughput matters, single threads do not; give explicit control to the programmer • CPUs are great for task parallelism • GPUs are great for data parallelism. slide by Mike Houston
  32. GPUs? Designed for math-intensive, highly parallel problems: more transistors dedicated to ALUs than to flow control and data cache. slide by Matthew Bolitho
  33. From CPUs to GPUs (how did we end up there?)
  34. “CPU-style” cores: Fetch/Decode, ALU (Execute), Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a data cache (a big one). Credit: Kayvon Fatahalian (Stanford), SIGGRAPH 2009 “Beyond Programmable Shading” (http://s09.idav.ucdavis.edu/). slide by Andreas Klöckner
  35. Slimming down. Idea #1: remove the components that help a single instruction stream run fast.
  36. More space: double the number of cores. Two cores run two fragments in parallel, each with its own Fetch/Decode, ALU, and execution context.
  37. ...and again: four cores, four fragments in parallel.
  38. ...and again: sixteen cores, sixteen fragments in parallel. 16 cores = 16 simultaneous instruction streams.
  39. → 16 independent instruction streams. Reality: the instruction streams are not actually very different/independent.
  40. Recall the simple processing core (Fetch/Decode, ALU, Execution Context): saving yet more space.
  41. Idea #2: amortize the cost/complexity of managing an instruction stream across many ALUs → SIMD.
  42. Add ALUs: one Fetch/Decode unit drives ALUs 1–8, each with its own context, over shared context data → SIMD processing.
  43. (the same SIMD core, shown again)
  44. Gratuitous amounts of parallelism! 16 cores × 8 ALUs = 128 ALUs, 16 simultaneous instruction streams. http://www.youtube.com/watch?v=1yH_j8-VVLo
  45. Example: 128 instruction streams in parallel, as 16 independent groups of 8 synchronized streams.
  46. Remaining problem: slow memory. Memory still has very high latency... but we’ve removed most of the hardware that helps us deal with that: caches, branch prediction, out-of-order execution. So what now? Idea #3: even more parallelism + some extra memory = a solution! slide by Andreas Klöckner
  47. (the same problem, shown on the SIMD core diagram)
  48. (the solution: four thread contexts 1–4 kept resident on the core)
  49. Hiding memory latency (“hiding shader stalls”). [Figure: time (clocks) vs. four fragment groups, Frag 1–8, 9–16, 17–24, 25–32, on one core with contexts 1–4.] Credit: Kayvon Fatahalian (Stanford)
  50. When a group stalls on memory, it is marked Stall and another Runnable group is executed.
  51. (continued: more stalls overlapped with runnable groups)
  52. With enough groups, each Stall is covered by Runnable work from the other groups, so the ALUs stay busy.
  53. GPU architecture summary. Core ideas: 1. many slimmed-down cores → lots of parallelism; 2. more ALUs, fewer control units; 3. avoid memory stalls by interleaving execution of SIMD groups (“warps”). Credit: Kayvon Fatahalian (Stanford). slide by Andreas Klöckner
  54. Is it free? What are the consequences? Programs must be more predictable: data access coherency, program flow. slide by Matthew Bolitho
  55. Outline (repeated; section marker: CUDA Overview)
  56. CUDA Overview
  57. The problem with “old-school” GPGPU: trick the GPU into general-purpose computing by casting the problem as graphics • Turn data into images (“texture maps”) • Turn algorithms into image synthesis (“rendering passes”) • Promising results, but: • tough learning curve, particularly for non-graphics experts • potentially high overhead of the graphics API • highly constrained memory layout & access model • need for many passes drives up bandwidth consumption
  58. Enter “Compute Unified Device Architecture” • Created by NVIDIA • A way to perform computation on the GPU • Specification for: • a computer architecture • a language • an application interface (API). slide by Matthew Bolitho
  59. CUDA advantages over legacy GPGPU • Random access to memory: a thread can access any memory location • Unlimited access to memory: a thread can read/write as many locations as needed • User-managed cache (per block): threads can cooperatively load data into SMEM, and any thread can then access any SMEM location • Low learning curve: just a few extensions to C, no knowledge of graphics required • No graphics API overhead. © NVIDIA Corporation 2006
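A minimal sketch of the “user-managed cache” bullet (a hypothetical kernel, not from the slides): each thread of a block stages one element into shared memory, the block synchronizes, and then threads freely read each other’s staged values.

    #define BLOCK 256

    // Cooperative staging in SMEM: load, barrier, then share.
    // Launch with exactly BLOCK threads per block.
    __global__ void neighbor_sum(const float *in, float *out, int n)
    {
        __shared__ float tile[BLOCK];           // per-block user-managed cache
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            tile[threadIdx.x] = in[i];          // each thread loads one element
        __syncthreads();                        // make all loads visible to the block
        if (i < n) {
            int left = (threadIdx.x > 0) ? threadIdx.x - 1 : threadIdx.x;
            out[i] = tile[threadIdx.x] + tile[left];  // read a neighbor's value
        }
    }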
  60. CUDA parallel paradigm • Scale to 100s of cores, 1000s of parallel threads, transparently, with one source and the same binary • Let programmers focus on parallel algorithms, not the mechanics of a parallel programming language • Enable CPU+GPU co-processing: the CPU & GPU are separate devices with separate memories
  61. C with CUDA extensions: C with a few keywords.

    // Standard C code
    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

    // Parallel C code
    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
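The slide shows only the kernels; a hedged sketch of the host side that slide 60 implies (separate memories, so explicit device allocation and copies; h_x and h_y are assumed host arrays, error checking omitted):

    // Host driver for saxpy_parallel: allocate, copy in, launch, copy out.
    float *d_x, *d_y;
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMalloc((void**)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);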
  62. Compiling C with CUDA applications: the key kernels (C CUDA) are modified into parallel CUDA code and compiled by NVCC (Open64) into CUDA object files; the rest of the C application goes through the CPU compiler into CPU object files; the linker combines both into a CPU-GPU executable.
  63. Compiling CUDA code: a C/C++ CUDA application goes through NVCC, which emits CPU code plus virtual PTX code; a PTX-to-target compiler then produces physical code for each GPU target (G80, ...). © 2008 NVIDIA Corporation
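In practice the whole pipeline in this figure is driven by one nvcc invocation; a minimal example (the file name and architecture flag are illustrative):

    # Compile host + device code in one step; nvcc splits the work
    # between the CPU compiler and the PTX toolchain as in the figure.
    nvcc -arch=sm_13 -O2 -o saxpy saxpy.cu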
  64. CUDA software development: CUDA-optimized libraries (math.h, FFT, BLAS, ...) and integrated CPU + GPU C source code feed the NVIDIA C compiler, which produces NVIDIA assembly for computing (PTX) for the GPU (run via the CUDA driver and profiler) and CPU host code for a standard C compiler.
  65. CUDA development tools: cuda-gdb • Integrated into gdb • Supports CUDA C • Seamless CPU+GPU development experience • Enabled on all CUDA-supported 32/64-bit Linux distros • Set breakpoints and single-step any source line • Access and print all CUDA memory allocations and local, global, constant, and shared variables. © NVIDIA Corporation 2009
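A hedged sketch of a session (standard gdb commands, which the slide says cuda-gdb inherits; the kernel name comes from the SAXPY example above):

    $ cuda-gdb ./saxpy
    (cuda-gdb) break saxpy_parallel    # stop at the __global__ kernel
    (cuda-gdb) run
    (cuda-gdb) print i                 # inspect a per-thread local variable
    (cuda-gdb) next                    # single-step a source line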
  66. Parallel source debugging: CUDA-GDB in emacs. © NVIDIA Corporation 2009
  67. Parallel source debugging: CUDA-GDB in DDD. © NVIDIA Corporation 2009
  68. CUDA development tools: cuda-memcheck • Coming with the CUDA 3.0 release • Tracks out-of-bounds and misaligned accesses • Supports CUDA C • Integrated into the CUDA-GDB debugger • Available as a standalone tool on all OS platforms. © NVIDIA Corporation 2009
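As a standalone tool it wraps an unmodified binary; a minimal invocation (binary name reused from the SAXPY example):

    $ cuda-memcheck ./saxpy    # reports out-of-bounds / misaligned device accesses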
  69. Parallel source memory checker: CUDA-MemCheck. © NVIDIA Corporation 2009
  70. CUDA development tools: (visual) profiler. The CUDA Visual Profiler.
  71. Outline (repeated; section marker: Programming Model)
  72. Programming Model
  73. GPU Architecture ↔ CUDA Programming Model
  74. Connection: Hardware ↔ Programming Model. [Figure, reused for the rest of this sequence: a 3×3 array of cores, each with Fetch/Decode, 32 kiB private context (“registers”), and 16 kiB shared context.] slide by Andreas Klöckner
  75. How do we program so many cores? Idea: program as if there were “infinitely” many cores, and “infinitely” many ALUs per core.
  76. Consider: which is easy to do automatically, a parallel program → sequential hardware, or a sequential program → parallel hardware?
  77. [The figure gains a software side: a grid with Axis 0 and Axis 1 on the left, the hardware cores on the right; software representation ↔ hardware.]
  78. Software terms: the grid is divided into (work) groups or “blocks”, each made of (work) items or “threads”; the whole grid is executed by a kernel (a function on the grid).
  79–89. (The remaining slides animate the same figure, mapping the blocks of the software grid onto the hardware cores group by group: software representation ↔ hardware.)
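A sketch tying the two grid axes of the figure to CUDA index arithmetic (hypothetical 2D kernel; width and height are assumed dimensions):

    // Axis 0 / Axis 1 of the software grid become (blockIdx.x, blockIdx.y);
    // each (work) item recovers its global coordinates from block + thread indices.
    __global__ void on_grid(float *data, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // Axis 0
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // Axis 1
        if (x < width && y < height)
            data[y * width + x] *= 2.0f;                 // same program on every item
    }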