[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
Presentation Transcript

    • Massively Parallel Computing — CS 264 / CSCI E-292, Lecture #3: GPU Programming with CUDA | February 8th, 2011. Nicolas Pinto (MIT, Harvard), pinto@mit.edu
    • Administrivia • New here? Welcome! • HW0: Forum, RSS, Survey • Lecture 1 & 2 slides posted • Project teams allowed (up to 2 students); innocentive-like / challenge-driven? • HW1: out tonight/tomorrow, due Fri 2/18/11 • New guest lecturers! Wen-mei Hwu (UIUC/NCSA), Cyrus Omar (CMU), Cliff Woolley (NVIDIA), Richard Lethin (Reservoir Labs), James Malcolm (Accelereyes), David Cox (Harvard)
    • During this course, we'll try to "borrow" and use existing material, adapted for CS264 ;-)
    • Today: yey!!
    • Objectives • Get you started with GPU programming • Introduce CUDA • "20,000 foot view" • Get used to the jargon... • ...with just enough details • Point to relevant external resources
    • Outline• Thinking Parallel (review)• Why GPUs ?• CUDA Overview• Programming Model• Threading/Execution Hierarchy• Memory/Communication Hierarchy• CUDA Programming
    • Review: Thinking Parallel (last week)
    • Getting your feet wet• Common scenario: “I want to make the algorithm X run faster, help me!”• Q: How do you approach the problem?
    • How?
    • How?• Option 1: wait• Option 2: gcc -O3 -msse4.2• Option 3: xlc -O5• Option 4: use parallel libraries (e.g. (cu)blas)• Option 5: hand-optimize everything!• Option 6: wait more
    • What else ?
    • How about analysis ?
    • Getting your feet wet — Algorithm X v1.0, profiling analysis on input 10x10x10 [bar chart, time (s)]: load_data() 10 s, foo() 50 s (100% parallelizable), bar() 29 s, yey() 11 s (the rest is sequential in nature). Q: What is the maximum speedup?
    • Getting your feet wet (same profile as above). A: 2X ! :-(
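    A quick sanity check with Amdahl's law (a sketch, assuming only foo()'s 50 s is parallelizable across N processors and the remaining 50 s stays sequential):

      $$ S(N) = \frac{1}{(1-p) + p/N}, \qquad p = \frac{50}{100} = 0.5 \quad\Rightarrow\quad \lim_{N \to \infty} S(N) = \frac{1}{1 - p} = 2 $$

    Even with infinitely many processors, the sequential half caps the speedup at 2X.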
    • You need to...• ... understand the problem (duh!)• ... study the current (sequential?) solutions and their constraints• ... know the input domain• ... profile accordingly• ... “refactor” based on new constraints (hw/sw)
    • Some perspective: the "problem tree" for scientific problem solving — technical problem to be analyzed → consultation with experts → scientific model "A" / model "B" → theoretical analysis → discretization "A" / discretization "B" → experiments → iterative equation solver / direct elimination equation solver → parallel implementation / sequential implementation. There are many options to try to achieve the same goal. From Scott et al., "Scientific Parallel Computing" (2005)
    • Computational Thinking• translate/formulate domain problems into computational models that can be solved efficiently by available computing resources• requires a deep understanding of their relationships adapted from Hwu & Kirk (PASI 2011)
    • Getting ready... Parallel Thinking → Parallel Computing → APPLICATIONS, built on: Architecture, Programming Models, Algorithms, Languages, Patterns, Compilers. Adapted from Scott et al., "Scientific Parallel Computing" (2005)
    • You can do it!• thinking parallel is not as hard as you may think• many techniques have been thoroughly explained...• ... and are now “accessible” to non-experts !
    • Outline• Thinking Parallel (review)• Why GPUs ?• CUDA Overview• Programming Model• Threading/Execution Hierarchy• Memory/Communication Hierarchy• CUDA Programming
    • Why GPUs?
    • Motivation: the most economic number of components on an IC will double every year [Moore]. Historically, GPUs got faster; but hardware is reaching frequency limitations. Now: GPUs get wider. slide by Matthew Bolitho
    • Motivation?
    • Motivation — GPU. Fact: nobody cares about theoretical peak. Challenge: harness GPU power for real application performance. [chart: GFLOPS over time, GPU vs. CPU]
    • Rather than expecting GPUs to get twice as fast, expect to have twice as many! Parallel processing for the masses. Unfortunately, parallel programming is hard: algorithms and data structures must be fundamentally redesigned. slide by Matthew Bolitho
    • Task vs Data Parallelism CPUs vs GPUs
    • Task parallelism • Distribute the tasks across processors based on dependency • Coarse-grain parallelism. [figure: task dependency graph for tasks 1–9, and their assignment across 3 processors over time]
    • Data parallelism • Run a single kernel over many elements — each element is independently updated; the same operation is applied on each element • Fine-grain parallelism — many lightweight threads, easy to switch context; maps well to ALU-heavy architectures: GPUs. [figure: data elements fed through a kernel onto processors P1 ... Pn]
    • Task vs. Data parallelism• Task parallel – Independent processes with little communication – Easy to use • “Free” on modern operating systems with SMP• Data parallel – Lots of data on which the same computation is being executed – No dependencies between data elements in each step in the computation – Can saturate many ALUs – But often requires redesign of traditional algorithms 4 slide by Mike Houston
    • CPU vs. GPU • CPU: really fast caches (great for data reuse), fine branching granularity, lots of different processes/threads, high performance on a single thread of execution. Design target for CPUs: make a single thread very fast; take control away from the programmer. • GPU: lots of math units, fast access to onboard memory, runs a program on each fragment/vertex, high throughput on parallel tasks. GPU computing takes a different approach: throughput matters, single threads do not; give explicit control to the programmer. • CPUs are great for task parallelism • GPUs are great for data parallelism. slide by Mike Houston
    • GPUs?! Designed for math-intensive, parallel problems: more transistors dedicated to ALUs than to flow control and data caches. slide by Matthew Bolitho
    • From CPUs to GPUs (how did we end up there?)
    • "CPU-style" cores: fetch/decode, ALU (execute), execution context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a (big) data cache. Credit: Kayvon Fatahalian (Stanford), SIGGRAPH 2009 "Beyond Programmable Shading": http://s09.idav.ucdavis.edu/
    • Slimming down — Idea #1: remove the components that help a single instruction stream run fast, keeping just fetch/decode, ALU (execute), and execution context. Credit: Kayvon Fatahalian (Stanford); slide by Andreas Klöckner
    • More space: double the number of cores — two cores run two fragments in parallel, each with its own fetch/decode, ALU, and execution context. [two copies of the same fragment shader instruction stream] Credit: Kayvon Fatahalian (Stanford); slide by Andreas Klöckner
    • ...and again: four cores (four fragments in parallel).
    • ...and again: sixteen cores (sixteen fragments in parallel) — 16 cores = 16 simultaneous, independent instruction streams. Reality: the instruction streams are not actually very different/independent. Credit: Kayvon Fatahalian (Stanford); slide by Andreas Klöckner
    • Saving yet more space — recall the simple processing core: fetch/decode, ALU (execute), execution context.
    • Idea #2: amortize the cost/complexity of managing an instruction stream across many ALUs → SIMD processing. Add ALUs: 8 ALUs (ALU 1–8) share one fetch/decode unit, with per-lane contexts (Ctx) and shared context data. Credit: Kayvon Fatahalian (Stanford); slide by Andreas Klöckner
    • Gratuitous amounts of parallelism! 128 fragments in parallel: 16 cores = 128 ALUs = 16 simultaneous instruction streams. Example: 128 instruction streams in parallel, as 16 independent groups of 8 synchronized streams. http://www.youtube.com/watch?v=1yH_j8-VVLo Credit: Kayvon Fatahalian (Stanford); slide by Andreas Klöckner
    • Remaining problem: slow memory. Memory still has very high latency... but we've removed most of the hardware that helps us deal with that: caches, branch prediction, out-of-order execution. So what now? Idea #3: even more parallelism + some extra memory = a solution! slide by Andreas Klöckner
    • Hiding memory latency ("hiding shader stalls"): interleave fragment groups Frag 1–8, 9–16, 17–24, 25–32 on one core; when one group stalls on memory, switch to another runnable group so the ALUs stay busy. Credit: Kayvon Fatahalian (Stanford), SIGGRAPH 2009 "Beyond Programmable Shading": http://s09.idav.ucdavis.edu/
    • GPU architecture summary — core ideas: 1. many slimmed-down cores → lots of parallelism; 2. more ALUs, fewer control units; 3. avoid memory stalls by interleaving execution of SIMD groups ("warps"). Credit: Kayvon Fatahalian (Stanford); slide by Andreas Klöckner
    • Is it free?! What are the consequences? Programs must be more predictable: data access coherency, program flow. slide by Matthew Bolitho
    • Outline• Thinking Parallel (review)• Why GPUs ?• CUDA Overview• Programming Model• Threading/Execution Hierarchy• Memory/Communication Hierarchy• CUDA Programming
    • CUDA Overview
    • Problem: GPGPU... The trick: how to do general-purpose computing with a graphics API — turn data into images ("texture maps"), turn algorithms into image synthesis ("rendering passes"). Promising results, but: tough learning curve, particularly for non-graphics experts; potentially high overhead of the graphics API; highly constrained memory layout & access model; need for many passes drives up bandwidth consumption
    • CUDA: Compute Unified Device Architecture, created by NVIDIA — a way to perform computation on the GPU. A specification for: a computer architecture, a language, an application interface (API). slide by Matthew Bolitho
    • CUDA advantages over legacy GPGPU • Random access to memory: a thread can access any memory location • Unlimited access to memory: a thread can read/write as many locations as needed • User-managed cache (per block): threads can cooperatively load data into SMEM, and any thread can then access any SMEM location • Low learning curve: just a few extensions to C, no knowledge of graphics required • No graphics API overhead. © NVIDIA Corporation 2006
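    To make the "user-managed cache" concrete, here is a minimal sketch (not from the slides; the kernel and variable names are illustrative, and it assumes blockDim.x == 256) of a block cooperatively staging data into shared memory:

      __global__ void stage_and_use(const float *in, float *out, int n)
      {
          __shared__ float tile[256];                  // per-block user-managed cache (SMEM)
          int idx = blockIdx.x * blockDim.x + threadIdx.x;

          if (idx < n)
              tile[threadIdx.x] = in[idx];             // each thread loads one element
          __syncthreads();                             // wait until the whole tile is loaded

          if (idx < n) {
              // any thread can now read any SMEM location, e.g. its neighbor's element
              int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
              out[idx] = 0.5f * (tile[threadIdx.x] + tile[left]);
          }
      }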
    • CUDA parallel paradigm: scale to 100s of cores, 1000s of parallel threads — transparently, with one source and the same binary. Let programmers focus on parallel algorithms, not the mechanics of a parallel programming language. Enable CPU+GPU co-processing: CPU & GPU are separate devices with separate memories
    • C with CUDA extensions: C with a few keywords

      // Standard C code
      void saxpy_serial(int n, float a, float *x, float *y)
      {
          for (int i = 0; i < n; ++i)
              y[i] = a*x[i] + y[i];
      }
      // Invoke serial SAXPY kernel
      saxpy_serial(n, 2.0, x, y);

      // Parallel C code
      __global__ void saxpy_parallel(int n, float a, float *x, float *y)
      {
          int i = blockIdx.x*blockDim.x + threadIdx.x;
          if (i < n) y[i] = a*x[i] + y[i];
      }
      // Invoke parallel SAXPY kernel with 256 threads/block
      int nblocks = (n + 255) / 256;
      saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
    • Compiling C with CUDA applications: the key kernels (C CUDA) go through NVCC (Open64) into CUDA object files; the rest of the C application goes through the CPU compiler into CPU object files; the linker combines them into a CPU–GPU executable.
    • Compiling CUDA code: C/C++ CUDA application → NVCC → PTX code (virtual) → PTX-to-target compiler → GPU target code (physical, e.g. G80, ...). © 2008 NVIDIA Corporation.
    • CUDA software development: integrated CPU + GPU C source code → NVIDIA C compiler → NVIDIA assembly for computing (PTX) plus CPU host code; the PTX feeds the CUDA driver (and profiler) targeting the GPU, while the host code goes through a standard C compiler to the CPU. CUDA optimized libraries: math.h, FFT, BLAS, ...
    • CUDA development tools: cuda-gdb — integrated into gdb; supports CUDA C; seamless CPU+GPU development experience; enabled on all CUDA-supported 32/64-bit Linux distros; set breakpoints and single-step any source line; access and print all CUDA memory allocations: local, global, constant, and shared variables. © NVIDIA Corporation 2009
    • Parallel source debugging: CUDA-GDB in emacs. © NVIDIA Corporation 2009
    • Parallel source debugging: CUDA-GDB in DDD. © NVIDIA Corporation 2009
    • CUDA development tools: cuda-memcheck — coming with the CUDA 3.0 release; tracks out-of-bounds and misaligned accesses; supports CUDA C; integrated into the CUDA-GDB debugger; available as a standalone tool on all OS platforms. © NVIDIA Corporation 2009
    • Parallel source memory checker: CUDA-MemCheck. © NVIDIA Corporation 2009
    • CUDA development tools: the CUDA Visual Profiler
    • Outline• Thinking Parallel (review)• Why GPUs ?• CUDA Overview• Programming Model• Threading/Execution Hierarchy• Memory/Communication Hierarchy• CUDA Programming
    • Programming Model
    • GPU Architecture ↔ CUDA Programming Model
    • Connection: hardware ↔ programming model. [figure: a grid of cores, each with fetch/decode, 32 kiB private context ("registers"), and 16 kiB shared context] slide by Andreas Klöckner
    • How are so many cores exposed? Idea: program as if there were "infinitely" many cores, and "infinitely" many ALUs per core. slide by Andreas Klöckner
    • Consider: which is easy to do automatically? Parallel program → sequential hardware, or sequential program → parallel hardware? slide by Andreas Klöckner
    • Software representation ↔ hardware: a kernel is a function on a grid (indexed along Axis 0 and Axis 1); the grid is divided into (work) groups, or "blocks", each mapped onto one core; each group consists of (work) items, or "threads", mapped onto that core's ALU lanes. slide by Andreas Klöckner
    • Really: a block ("group") provides a pool of parallelism to draw from; X, Y, Z order within a group matters (not among groups, though). slide by Andreas Klöckner
    • Need a pause?
    • Outline• Thinking Parallel (review)• Why GPUs ?• CUDA Overview• Programming Model• Threading/Execution Hierarchy• Memory/Communication Hierarchy• CUDA Programming
    • Threading Hierarchy
    • Some definitions • Kernel — GPU program that runs on a thread grid • Thread hierarchy — Grid: a set of blocks; Block: a set of warps; Warp: a SIMD group of 32 threads; grid size * block size = total # of threads. [figure: kernel → grid → blocks 1..n → warps, each warp executing the same shader instruction stream]
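    A minimal sketch of that arithmetic (the sizes are illustrative and the empty kernel is a placeholder, just to have something to launch):

      #include <cstdio>

      __global__ void noop() { }               // placeholder kernel

      int main()
      {
          dim3 block(256);                     // 256 threads/block = 256/32 = 8 warps per block
          dim3 grid(16);                       // 16 blocks in the grid
          noop<<<grid, block>>>();             // grid size * block size = 16*256 = 4096 threads
          cudaThreadSynchronize();             // wait for the launch to finish
          printf("launched %d threads\n", 16 * 256);
          return 0;
      }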
    • CUDA Kernels and Threads Parallel portions of an application are executed on the device as kernels One kernel is executed at a time Many threads execute each kernel Differences between CUDA and CPU threads CUDA threads are extremely lightweight Very little creation overhead Instant switching CUDA uses 1000s of threads to achieve efficiency Multi-core CPUs can use only a few Definitions Device = GPU Host = CPU Kernel = function that runs on the device © 2008 NVIDIA Corporation.
    • Arrays of Parallel Threads A CUDA kernel is executed by an array of threads All threads run the same code Each thread has an ID that it uses to compute memory addresses and make control decisions threadID 0 1 2 3 4 5 6 7 … float x = input[threadID]; float y = func(x); output[threadID] = y; … © 2008 NVIDIA Corporation.
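    A hedged, compilable version of that fragment (func, input, and output are placeholders for whatever the kernel actually computes):

      __device__ float func(float x) { return 2.0f * x + 1.0f; }   // stand-in for "func"

      __global__ void apply(const float *input, float *output, int n)
      {
          int threadID = blockIdx.x * blockDim.x + threadIdx.x;    // unique global thread ID
          if (threadID < n) {                                      // guard: the grid may overshoot n
              float x = input[threadID];
              float y = func(x);
              output[threadID] = y;
          }
      }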
    • Thread Batching Kernel launches a grid of thread blocks Threads within a block cooperate via shared memory Threads within a block can synchronize Threads in different blocks cannot cooperate Allows programs to transparently scale to different GPUs Grid Thread Block 0 Thread Block 1 Thread Block N-1 … Shared Memory Shared Memory Shared Memory © 2008 NVIDIA Corporation.
    • Transparent scalability: hardware is free to schedule thread blocks on any processor; a kernel scales across parallel multiprocessors. [diagram: the same kernel grid of blocks 0–7 scheduled on a 2-multiprocessor device and on a 4-multiprocessor device] © 2008 NVIDIA Corporation.
    • Transparent scalability = gratuitous amounts of parallelism! Hardware is free to schedule thread blocks on any processor; a kernel scales across parallel multiprocessors. © 2008 NVIDIA Corporation.
    • Wake-up call! http://www.youtube.com/watch?v=1yH_j8-VVLo http://www.youtube.com/watch?v=qRuNxHqwazs
    • 8-series architecture (G80): 128 thread processors execute kernel threads; 16 multiprocessors, each contains 8 thread processors; shared memory enables thread cooperation. [diagram: multiprocessors, each with 8 thread processors and a shared memory] © 2008 NVIDIA Corporation.
    • 10-series architecture: 240 thread processors execute kernel threads; 30 multiprocessors, each contains 8 thread processors and one double-precision unit; shared memory enables thread cooperation. [diagram: multiprocessor with thread processors, double-precision unit, shared memory] © 2008 NVIDIA Corporation.
    • Fermi architecture — e.g. GTX 480: 480 processors ("CUDA cores"), organized as 15 multiprocessors (SMs) of 32 cores each; up to 1.5 GB of GDDR5 RAM; ECC memory support; L1 and L2 caches. Note: the GTX 580 now has 512 processors!
    • Hardware multithreading: hardware allocates resources to blocks — blocks need thread slots, registers, shared memory, and blocks don't run until resources are available. Hardware schedules threads — threads have their own registers, any thread not waiting for something can run, and context switching is free (every cycle). Hardware relies on threads to hide latency; i.e., parallelism is necessary for performance.
    • Summary: execution model, software ↔ hardware. Thread ↔ thread processor: threads are executed by thread processors. Thread block ↔ multiprocessor: thread blocks are executed on multiprocessors; thread blocks do not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file). Grid ↔ device: a kernel is launched as a grid of thread blocks; only one kernel can execute on a device at one time. © 2008 NVIDIA Corporation.
    • Outline• Thinking Parallel (review)• Why GPUs ?• CUDA Overview• Programming Model• Threading/Execution Hierarchy• Memory/Communication Hierarchy• CUDA Programming
    • Memory/Communication Hierarchy
    • Example...
    • Example: the memory hierarchy — a hierarchy of increasingly bigger, slower memories: registers (1 kB, 1 cycle) → L1 cache (10 kB, 10 cycles) → L2 cache (1 MB, 100 cycles) → DRAM (1 GB, 1000 cycles) → virtual memory on a hard drive (1 TB, 1 M cycles). Adapted from Berger & Klöckner (NYU 2010)
    • GPU in PC Architecture
    • GPU in PC architecture [diagram]: CPU ↔ front-side bus ↔ northbridge; northbridge ↔ memory bus ↔ DRAM; northbridge ↔ PCIe Express bus ↔ graphics card; southbridge with PCI bus, SATA, Ethernet. Bandwidth figures shown: 8 GB/s, 3+ Gb/s, 25+ GB/s, and 160+ GB/s to VRAM. modified from Matthew Bolitho
    • PCI not-so-Express bus: a point-to-point, full-duplex, serial, symmetric bus; 250 MB/s bandwidth in each direction per lane; lanes combine into wider configurations; PCIe x16 = 16 lanes, i.e. 16 times the bandwidth (4 GB/s). modified from Matthew Bolitho
    • Back to the GPU...
    • Multiple memory scopes • Per-thread private memory: each thread has its own local memory — stacks, other private data • Per-thread-block shared memory: small memory close to the processor, low latency; allocated per thread block • Main memory: the GPU frame buffer, per-device global memory; can be accessed by any thread in any thread block; kernels (Kernel 0, Kernel 1, ...) run as sequential grids of blocks. © NVIDIA 2010
    • Thread Cooperation The Missing Piece: threads may need to cooperate Thread cooperation is valuable Share results to avoid redundant computation Share memory accesses Drastic bandwidth reduction Thread cooperation is a powerful feature of CUDA Cooperation between a monolithic array of threads is not scalable Cooperation within smaller batches of threads is scalable © 2008 NVIDIA Corporation.
    • Kernel memory access • Per-thread: registers (on-chip), local memory (off-chip, uncached) • Per-block: shared memory (on-chip, small, fast) • Per-device: global memory (off-chip, large, uncached, persistent across kernel launches — kernel I/O)
    • Global memory: there are different types of "global memory" — linear memory, texture memory, constant memory.
    • Memory architecture:

      Memory     Location   Cached   Access   Scope                    Lifetime
      --------   --------   ------   ------   ----------------------   -----------
      Register   On-chip    N/A      R/W      One thread               Thread
      Local      Off-chip   No       R/W      One thread               Thread
      Shared     On-chip    N/A      R/W      All threads in a block   Block
      Global     Off-chip   No       R/W      All threads + host       Application
      Constant   Off-chip   Yes      R        All threads + host       Application
      Texture    Off-chip   Yes      R        All threads + host       Application

      © NVIDIA Corporation 2009
    • Managing memory: CPU and GPU have separate memory spaces. Host (CPU) code manages device (GPU) memory: allocate/free, copy data to and from the device; this applies to global device memory (DRAM). [diagram: host (CPU, chipset, DRAM) ↔ device (GPU multiprocessors with registers and shared memory; local and global memory in device DRAM)] © 2008 NVIDIA Corporation.
    • Caches: Fermi adds a configurable L1 cache per SM — 16 KB L1 / 48 KB shared memory, or 48 KB L1 / 16 KB shared memory — plus a shared 768 KB L2 cache (Tesla's memory hierarchy had only register file, shared memory, and DRAM). Compute motivation: caching captures locality and amplifies bandwidth; caching is more effective than shared memory for irregular or unpredictable access (ray tracing, sparse matrices); caching helps latency-sensitive cases. © NVIDIA 2010
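    On Fermi that split can be requested per kernel from host code with the runtime call cudaFuncSetCacheConfig — a sketch (my_kernel is a placeholder, and the setting is a preference, not a guarantee):

      // prefer 48 KB L1 / 16 KB shared memory for this kernel
      cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);

      // ...or prefer 16 KB L1 / 48 KB shared memory instead
      cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);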
    • ... how do I program these &#*@ GPUs ??
    • Outline• Thinking Parallel (review)• Why GPUs ?• CUDA Overview• Programming Model• Threading/Execution Hierarchy• Memory/Communication Hierarchy• CUDA Programming
    • CUDA Programming
    • Managing Memory in CUDA
    • Review: kernel memory access — per-thread registers (on-chip) and local memory (off-chip, uncached); per-block shared memory (on-chip, small, fast); per-device global memory (off-chip, large, uncached, persistent across kernel launches — kernel I/O).
    • Review: global memory — different types of "global memory": linear memory, texture memory, constant memory.
    • Review: managing memory — CPU and GPU have separate memory spaces; host (CPU) code manages device (GPU) memory: allocate/free, copy data to and from the device; applies to global device memory (DRAM). © 2008 NVIDIA Corporation.
    • CUDA variable type qualifiers:

      Variable declaration                Memory     Scope    Lifetime
      ---------------------------------   --------   ------   -----------
      int var;                            register   thread   thread
      int array_var[10];                  local      thread   thread
      __shared__   int shared_var;        shared     block    block
      __device__   int global_var;        global     grid     application
      __constant__ int constant_var;      constant   grid     application

      "Automatic" scalar variables without a qualifier reside in a register (the compiler will spill to thread-local memory); "automatic" array variables without a qualifier reside in thread-local memory. © 2008 NVIDIA Corporation
    • CUDA variable type performance:

      Variable declaration                Memory     Penalty
      ---------------------------------   --------   -------
      int var;                            register   1x
      int array_var[10];                  local      100x
      __shared__   int shared_var;        shared     1x
      __device__   int global_var;        global     100x
      __constant__ int constant_var;      constant   1x

      Scalar variables reside in fast, on-chip registers; shared variables reside in fast, on-chip memories; thread-local arrays & global variables reside in uncached off-chip memory; constant variables reside in cached off-chip memory. © 2008 NVIDIA Corporation
    • CUDA variable type scale:

      Variable declaration                Instances   Visibility
      ---------------------------------   ---------   ----------
      int var;                            100,000s    1
      int array_var[10];                  100,000s    1
      __shared__   int shared_var;        100s        100s
      __device__   int global_var;        1           100,000s
      __constant__ int constant_var;      1           100,000s

      100Ks of per-thread variables, each R/W by 1 thread; 100s of shared variables, each R/W by 100s of threads; 1 global variable, R/W by 100Ks of threads; 1 constant variable, readable by 100Ks of threads. © 2008 NVIDIA Corporation
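    All of the qualifiers in one hedged sketch (the names are illustrative; assumes blockDim.x <= 128 and a launch that exactly covers the output array):

      __constant__ float coeff[16];        // constant: cached, read-only in kernels, set by host
      __device__   int   global_counter;   // global: visible to all threads, lives for the app

      __global__ void example(float *out)
      {
          int tid = blockIdx.x * blockDim.x + threadIdx.x;   // scalar "tid" lives in a register
          float scratch[4];                                  // per-thread array: local memory
          __shared__ float partial[128];                     // shared: one copy per block

          scratch[0] = coeff[0];
          partial[threadIdx.x] = scratch[0] * tid;
          __syncthreads();                                   // make the block's writes visible
          out[tid] = partial[threadIdx.x];
      }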
    • GPU memory allocation / release:

      cudaMalloc(void **pointer, size_t nbytes)
      cudaMemset(void *pointer, int value, size_t count)
      cudaFree(void *pointer)

      int n = 1024;
      int nbytes = 1024*sizeof(int);
      int *a_d = 0;
      cudaMalloc( (void**)&a_d, nbytes );
      cudaMemset( a_d, 0, nbytes);
      cudaFree(a_d);

      © 2008 NVIDIA Corporation.
    • Data Copies cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction); direction specifies locations (host or device) of src and dst Blocks CPU thread: returns after the copy is complete Doesn’t start copying until previous CUDA calls complete enum cudaMemcpyKind cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice © 2008 NVIDIA Corporation.
    • Data movement example:

      int main(void)
      {
          float *a_h, *b_h;   // host data
          float *a_d, *b_d;   // device data
          int N = 14, nBytes, i;

          nBytes = N*sizeof(float);
          a_h = (float *)malloc(nBytes);
          b_h = (float *)malloc(nBytes);
          cudaMalloc((void **) &a_d, nBytes);
          cudaMalloc((void **) &b_d, nBytes);

          for (i=0; i<N; i++) a_h[i] = 100.f + i;

          cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
          cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
          cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

          for (i=0; i<N; i++) assert( a_h[i] == b_h[i] );

          free(a_h); free(b_h);
          cudaFree(a_d); cudaFree(b_d);
          return 0;
      }

      © 2008 NVIDIA Corporation.
    • Execution in CUDA
    • Executing Code on the GPU Kernels are C functions with some restrictions Cannot access host memory Must have void return type No variable number of arguments (“varargs”) Not recursive No static variables Function arguments automatically copied from host to device © 2008 NVIDIA Corporation.
    • Function Qualifiers Kernels designated by function qualifier: __global__ Function called from host and executed on device Must return void Other CUDA function qualifiers __device__ Function called from device and run on device Cannot be called from host code __host__ Function called from host and executed on host (default) __host__ and __device__ qualifiers can be combined to generate both CPU and GPU code © 2008 NVIDIA Corporation.
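    A small sketch of the qualifiers side by side (the function names are illustrative; assumes the launch exactly covers the array):

      __device__ float square(float x) { return x * x; }            // device-only helper

      __host__ __device__ float twice(float x) { return 2.0f * x; } // compiled for CPU and GPU

      __global__ void transform(float *data)                        // called from host, runs on device
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          data[i] = square(twice(data[i]));
      }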
    • CUDA Built-in Device Variables All __global__ and __device__ functions have access to these automatically defined variables dim3 gridDim; Dimensions of the grid in blocks (at most 2D) dim3 blockDim; Dimensions of the block in threads dim3 blockIdx; Block index within the grid dim3 threadIdx; Thread index within the block © 2008 NVIDIA Corporation.
    • Launching Kernels Modified C function call syntax: kernel<<<dim3 dG, dim3 dB>>>(…) Execution Configuration (“<<< >>>”) dG - dimension and size of grid in blocks Two-dimensional: x and y Blocks launched in the grid: dG.x * dG.y dB - dimension and size of blocks in threads: Three-dimensional: x, y, and z Threads per block: dB.x * dB.y * dB.z Unspecified dim3 fields initialize to 1 © 2008 NVIDIA Corporation.
    • Execution configuration examples:

      dim3 grid, block;
      grid.x = 2; grid.y = 4;
      block.x = 8; block.y = 16;
      kernel<<<grid, block>>>(...);

      dim3 grid(2, 4), block(8,16);   // equivalent assignment using constructor functions
      kernel<<<grid, block>>>(...);

      kernel<<<32,512>>>(...);

      © 2008 NVIDIA Corporation.
    • Unique Thread IDs Built-in variables are used to determine unique thread IDs Map from local thread ID (threadIdx) to a global ID which can be used as array indices Grid blockIdx.x 0 1 2blockDim.x = 5 threadIdx.x 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14blockIdx.x*blockDim.x + threadIdx.x © 2008 NVIDIA Corporation.
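    The same mapping extended to two dimensions — a sketch (the kernel name, width, and height are illustrative):

      __global__ void kernel2d(float *img, int width, int height)
      {
          int ix = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
          int iy = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
          if (ix < width && iy < height)
              img[iy * width + ix] *= 2.0f;                 // flatten (ix, iy) into a 1D index
      }

      // launch with e.g.: dim3 block(8,16);
      //                   dim3 grid((width+7)/8, (height+15)/16);
      //                   kernel2d<<<grid, block>>>(img_d, width, height);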
• Minimal Kernels (Basics)

__global__ void minimal(int *a_d, int value)
{
    *a_d = value;
}

__global__ void assign(int *a_d, int value)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    a_d[idx] = value;
}

© 2008 NVIDIA Corporation.
• Increment Array Example

CPU program:

void inc_cpu(int *a, int N)
{
    int idx;
    for (idx = 0; idx < N; idx++)
        a[idx] = a[idx] + 1;
}

int main()
{
    ...
    inc_cpu(a, N);
}

CUDA program:

__global__ void inc_gpu(int *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + 1;
}

int main()
{
    ...
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    inc_gpu<<<dimGrid, dimBlock>>>(a, N);
}

© 2008 NVIDIA Corporation.
    • Synchronization in CUDA
• Host Synchronization
  - All kernel launches are asynchronous
    - control returns to the CPU immediately
    - the kernel executes after all previous CUDA calls have completed
  - cudaMemcpy() is synchronous
    - control returns to the CPU after the copy completes
    - the copy starts after all previous CUDA calls have completed
  - cudaThreadSynchronize()
    - blocks until all previous CUDA calls complete
© 2008 NVIDIA Corporation.
• Host Synchronization Example

// copy data from host to device
cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);

// execute the kernel
inc_gpu<<<ceil(N/(float)blocksize), blocksize>>>(a_d, N);

// run independent CPU code
run_cpu_stuff();

// copy data from device back to host
cudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);

© 2008 NVIDIA Corporation.
• Thread Synchronization
  - __syncthreads()
    - barrier for the threads within their block
    - e.g. to avoid "memory hazards" when accessing shared memory
  - __threadfence()
    - inter-block synchronization
    - flushes global memory writes to make them visible to all threads
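A minimal sketch of __syncthreads() in action — reversing an array within one block through shared memory (the kernel name and BLOCK_SIZE are illustrative):

#define BLOCK_SIZE 64

__global__ void reverse_block(int *d)
{
    __shared__ int s[BLOCK_SIZE];
    int t = threadIdx.x;

    s[t] = d[t];
    __syncthreads();  // barrier: all writes to s[] finish before any thread reads

    d[t] = s[BLOCK_SIZE - 1 - t];  // without the barrier, this races on s[]
}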
• More?
  - CUDA C Programming Guide
  - CUDA C Best Practices Guide
  - CUDA Reference Manual
  - API Reference, PTX ISA 2.2
  - CUDA-GDB User Manual
  - Visual Profiler Manual
  - User Guides: CUBLAS, CUFFT, CUSPARSE, CURAND

http://developer.nvidia.com/object/gpucomputing.html
    • one more thing or two...
    • Life/Code Hacking #1 Getting Things Done
• Phase 1: Organize
• Phase 2: DO
• Phase 3: Review
• Tools
  - Notepad + Pen ;-)
  - Gmail: labels, shortcuts, quick links and advanced search
  - Lists: e.g. Remember the Milk
  - Many more: Google "gtd tools"
• Back pocket slides (slide by David Cox)
• GPU History
• History ("not true!")
  - Goal: render photorealistic, computer-generated scenes; each frame is complex; need 30 frames per second
  - Classic graphics pipeline: Application → Command → Geometry → Rasterization → Texture → Fragment → Display
  - To improve performance, work moved to dedicated hardware; the hardware could process each vertex and each fragment independently — massively parallel
  - The pipeline stages were originally hard-wired to perform fixed operations; eventually the pipeline became more programmable
  - The texture and fragment stages became programmable (the fragment unit): programmable via assembly language, memory reads via texture lookups, limited program size, no real branching (thus no looping)
  - The geometry stage became programmable (the vertex unit): programmable via assembly language, but no memory reads
  - Things improved over time: the vertex unit gained memory reads, maximum program size increased, branching support arrived, and higher-level languages appeared (e.g. HLSL, Cg); but neither the vertex nor the fragment unit could write to memory — only to a frame buffer — and there was no integer math and no bitwise operators
  - Programmable shaders could still effectively "write" to memory: in pass 1, write to a framebuffer; rebind the framebuffer as a texture; read it in pass 2, etc. — but this was inefficient
  - GPGPU: general-purpose computing on the GPU — store data in textures, draw a full-screen quad, and write multi-pass algorithms that use the fragment unit to perform the processing
  - This under-utilized the hardware: only the fragment unit was exercised, it was often memory-bandwidth limited, only gather-based algorithms worked (no scatter), and everything went through the graphics API
  - Eventually the geometry unit could operate on a primitive and write back to memory; changes to the underlying hardware (the ability to write to memory, more general processing units) paved the way for GPU computing
(slides by Matthew Bolitho)
• CUDA Language
• Language (slides © Matthew Bolitho, Johns Hopkins University)
  - CUDA defines a language that is similar to C/C++; it allows programmers to easily move existing code to CUDA
  - How is CUDA C different from standard C/C++?
  - Syntactic extensions:
    - declaration qualifiers
    - built-in variables
    - built-in types
    - execution configuration
  - Declaration qualifiers modify declarations of variables and functions
  - Variable qualifiers (only apply to global-scope variables): __device__, __shared__, __constant__
    - __device__: the variable is stored on the device, in global memory; it has the lifetime of the entire application; accessible to all GPU threads, and to the CPU via the API
    - __shared__: the variable is stored in shared memory; it has the lifetime of the thread block; accessible to all threads in the block, one copy per thread block; if not declared volatile, writes from different threads are not visible until a synchronization barrier is used; not accessible from the CPU
    - __constant__: the variable is stored in constant memory; it has the lifetime of the entire application; accessible to all GPU threads (read only), and to the CPU via the API (read/write)
  - Function qualifiers: __device__, __host__, __global__
    - __device__: the function is compiled to, and executes on, the device; callable only from another function on the device
    - __host__: the function is compiled to, and executes on, the host; callable only from another host function; functions without any CUDA qualifier are host by default; __host__ and __device__ can be used together
    - __global__: the function is compiled to, and executes on, the device; callable from the host; used as the entry point from host to device
  - CUDA provides a set of built-in vector types: char1–4, uchar1–4, short1–4, ushort1–4, int1–4, uint1–4, long1–4, ulong1–4, float1–4
    - construct a vector type with a special make_<type> function, e.g. make_int2(x, y); access the elements of a vector type with .x, .y, .z, .w
    - dim3 is a special vector type: the same as uint3, except it can be constructed from a scalar to form a vector, e.g. dim3 v(n) gives (n, 1, 1)
  - CUDA provides four global, built-in variables: threadIdx, blockIdx, gridDim, blockDim
    - accessible only from device code; cannot take their address; cannot assign values to them
  - CUDA provides special syntax to launch the execution of kernels: kernel<<<griddim, blockdim>>>(args)
    - kernel is a __global__ function
    - the compiler turns this type of statement into a block of code that configures and launches the kernel
  - Important differences from C/C++: runtime library, functions, classes/structs/unions
    - by default all function calls are inlined (use __noinline__ to prevent this, since CUDA 1.1); all local variables and function arguments are stored in registers; no function recursion; no function pointers
    - CUDA supports some C++ features for device code, e.g. template functions; classes are supported inside .cu source, but must be host only; structs/unions work in device code as in C
• Common Runtime Component: Mathematical Functions
  - pow, sqrt, cbrt, hypot
  - exp, exp2, expm1
  - log, log2, log10, log1p
  - sin, cos, tan, asin, acos, atan, atan2
  - sinh, cosh, tanh, asinh, acosh, atanh
  - ceil, floor, trunc, round
  - etc.
  When executed on the host, a given function uses the C runtime implementation if available.
  These functions are only supported for scalar types, not vector types.
• Device Runtime Component: Mathematical Functions
  Some mathematical functions (e.g. sin(x)) have a less accurate, but faster, device-only version (e.g. __sin(x)):
  - __pow
  - __log, __log2, __log10
  - __exp
  - __sin, __cos, __tan
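A small sketch contrasting the two paths (one naming detail: in the CUDA math API the fast single-precision intrinsics carry a trailing f, e.g. __sinf; the kernel below is illustrative):

__global__ void compare_sines(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float accurate = sinf(in[i]);    // full-precision device version
        float fast     = __sinf(in[i]);  // faster, less accurate intrinsic
        out[i] = accurate - fast;        // inspect the precision loss
    }
}

Compiling with -use_fast_math maps the standard calls to the fast intrinsics automatically.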
• CUDA Compilation
• Compilation
  - CUDA code is stored in .cu files, which can mix host and device code; it is compiled with nvcc
  - nvcc is actually a compiler driver that orchestrates the complete compilation process
    - it works on .cu source files
    - it emits object/executable code for the host and cubin executable code for the device
    - for the host portion of .cu files, nvcc invokes the system's native C/C++ compiler (e.g. g++, cl)
  - Flow (diagram): .cu → nvcc → host code to the system compiler and linker; device code to cubin
  - To see the steps performed by nvcc, use the --dryrun and --keep command-line options
  - The resulting cubin can be loaded as a file at runtime, embedded in the data segment, or embedded as a resource
• Emulation: Floating Point
  - Results of floating-point computations will slightly differ because of:
    - different compiler outputs, instruction sets
    - use of extended precision for intermediate results
  - There are various options to force strict single precision on the host
• CUDA Toolkit (from M02: High Performance Computing with CUDA)
  Software stack (diagram): Application Software (industry-standard C language) → Libraries (CUFFT, CUBLAS, CUDPP) → CUDA Compiler + CUDA Tools → GPU (card, system) and Multicore CPU (4 cores)
• CUDA Many-core + Multi-core support
  Diagram: a C CUDA Application is compiled either with NVCC (many-core path: PTX code → PTX-to-target compiler → many-core binary) or with NVCC --multicore (multi-core path: CPU C code → gcc or MSVC → multi-core binary)
• CUDA Compiler: nvcc
  - Any source file containing CUDA language extensions (.cu) must be compiled with nvcc
  - NVCC is a compiler driver: it works by invoking all the necessary tools and compilers, like cudacc, g++, cl, ...
  - NVCC can output:
    - either C code (CPU code), which must then be compiled with the rest of the application using another tool
    - or PTX or object code directly
  - An executable with CUDA code requires:
    - the CUDA core library (cuda)
    - the CUDA runtime library (cudart)
• CUDA Compiler: nvcc — important flags
  - -arch sm_13            enable double precision (on compatible hardware)
  - -G                     enable debug for device code
  - --ptxas-options=-v     show register and memory usage
  - --maxrregcount <N>     limit the number of registers
  - -use_fast_math         use the fast math library
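A couple of illustrative invocations combining these flags (the file and binary names are hypothetical):

nvcc -arch sm_13 --ptxas-options=-v -o increment increment.cu   # double precision + usage report
nvcc -G -arch sm_13 -o increment_dbg increment.cu               # debuggable device code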
• GPU Tools
  - Profiler
    - available now for all supported OSs
    - command-line or GUI
    - sampling signals on the GPU for: memory access parameters, execution (serialization, divergence)
  - Debugger
    - runs on the GPU
  - Emulation mode
    - compile and execute in emulation on the CPU
    - allows CPU-style debugging in GPU source
    • CUDA API
• The CUDA API is made of three parts:
  - the host API
  - the device API
  - the common API
(slides © Matthew Bolitho, Johns Hopkins University)

• The host API provides functions for:
  - device management
  - memory management
  - stream management
  - event management
  - texture management
  - OpenGL/Direct3D interoperability

• The host API is exposed as two different APIs:
  - the low-level driver API (prefix: cu)
  - the high-level runtime API (prefix: cuda)
  Some things can be done through both APIs, others are specialized; the two can be mixed together (with care).

• All GPU computing is performed on a device; to allocate memory, run a program, etc. on the hardware, we need a device context
  - device contexts are bound 1:1 with host threads (just like OpenGL)
  - so each host thread may have at most one device context, and each device context is accessible from only one host thread

• All driver API calls return an error/success code of type CUresult; all runtime API calls return an error/success code of type cudaError_t
  - an integer value of zero means no error
  - see also cudaGetLastError and cudaGetErrorString

• Runtime API calls initialize automatically; driver API programs must first call cuInit()

• The first (optional) step is to enumerate the available devices:
  - cuDeviceGetCount, cuDeviceGet, cuDeviceGetName, cuDeviceTotalMem, cuDeviceComputeCapability, cuDeviceGetAttribute, ...

• Before using a device you need a device handle of type CUdevice; you can then create a context with cuCtxCreate

• The runtime API provides a simplified interface for creating a context: cudaSetDevice and cudaGetDevice — and the useful cudaChooseDevice
• Device Management
  - The CPU can query and select GPU devices:
    - cudaGetDeviceCount( int* count )
    - cudaSetDevice( int device )
    - cudaGetDevice( int *current_device )
    - cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
    - cudaChooseDevice( int *device, cudaDeviceProp* prop )
  - Multi-GPU setup:
    - device 0 is used by default
    - one CPU thread can control one GPU
    - multiple CPU threads can control the same GPU — calls are serialized by the driver
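A minimal sketch of querying and selecting a device with these calls (error handling omitted):

#include <stdio.h>

int main(void)
{
    int count, dev;
    cudaGetDeviceCount(&count);

    for (dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s (compute %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
    }

    cudaSetDevice(0);  // device 0 would be used by default anyway
    return 0;
}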
• Once you have a context (the runtime API creates one implicitly), you can allocate memory, run a program, etc. on the hardware
  - the context is implicitly associated with the creating thread
  - to synchronize the host with the GPU (wait for all GPU tasks to finish), call cuCtxSynchronize

• Driver API memory management:
  - allocate/free memory: cuMemAlloc, cuMemFree
  - initialize memory: cuMemset
  - copy memory: cuMemcpyHtoD, cuMemcpyDtoH, cuMemcpyDtoD

• When allocating host memory, you can use malloc and free as usual, or cuMemAllocHost and cuMemFreeHost
  - these functions allocate host memory that is page-locked
  - performance is improved for copies to/from page-locked host memory

• Runtime API memory management:
  - allocate/free memory: cudaMalloc, cudaFree
  - initialize memory: cudaMemset
  - copy memory: cudaMemcpy
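A short sketch of the page-locked variant through the runtime API, where the equivalents are cudaMallocHost and cudaFreeHost (sizes and names are illustrative):

float *h_pinned, *d_buf;
int nBytes = 1024 * sizeof(float);

cudaMallocHost((void **)&h_pinned, nBytes);  // page-locked host memory
cudaMalloc((void **)&d_buf, nBytes);

// copies to/from pinned memory run faster than from pageable malloc'd memory
cudaMemcpy(d_buf, h_pinned, nBytes, cudaMemcpyHostToDevice);

cudaFree(d_buf);
cudaFreeHost(h_pinned);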
    • A PI! !"#$%&(!!"#$%"&(%&$#"!"#$%& )#)(*+! ,&-"&.("&(%&$#"%&&% )#)(*+"/012! 3**&+."&*#"%*#&$#4"56$7"&".8#%696%"564$7"&-4" 7#6:7$"&-4"##)#-$"$+8#! ;#)(*+"&+(<$"6."(8$6)6=#4"/#>:>"8&%?6-:2"@+" *<-$6)#! !"&))*+,)$*-$! !"&))*+.$/-)(+! !"#$%!0+.-(&! !"#$%!0+1-(&!"# !"#$$%&$()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<( =3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0
    • A PI! 3")(4<#"6."&"@(@"(9"ABC"%(4#D4&$&"&(-:" 56$7".()#"$+8#"6-9(*)&$6(- ! >%<@6- 96#.! 3")(4<#"6."%*#&$#4"@+"(&46-:"&"%<@6- 56$7" !"#(2"$,)$*-$ (*"!"#(2"$3(*2.*-*! ;(4<#"%&-"@#"<-(&4#4"56$7" !"#(2"$45(*2 !"#$$%&$()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<( =3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0
    • A PI! E(&46-:"&")(4<#"&.("%(86#."6$"$("$7#"4#F6%#! ,&-"$7#-":#$"$7#"&44*#.."(9"9<-%$6(-."&-4" :(@&"F&*6&@#.G !"#(2"$6$-7"5!-8(5 !"#(2"$6$-6(9* !"#(2"$6$-:$;<$= !"#$$%&$()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<( =3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0
    • A PI! H-%#"&")(4<#"6."(&4#4!"&-4"5#"7&F#"&" 9<-%$6(-"8(6-$#*!"5#"%&-"%&"&"9<-%$6(-! I#")<.$".#$<8"$7#"!"!#$%&()!(*&+(,!(%) 96*.$ !"#$$%&$()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<( =3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0
    • A PI! JK#%<$6(-"#-F6*(-)#-$"6-%<4#.G " L7*#&4"M(%?"N6=# " N7&*#4";#)(*+"N6=# " O<-%$6(-"B&*&)#$#*. " A*64"N6=# !"#$$%&$()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<( =3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0
    • A PI! L7*#&4"M(%?"N6=#G" !"7"5!>$-?(!@>A*0$! N7&*#4";#)(*+"N6=#G !"7"5!>$->A*)$2>8B$! O<-%$6(-"B&*&)#$#*.G !"C*)*%>$->8B$DE!"C*)*%>$-8DE !"C*)*%>$-=DE!"C*)*%>$-F !"#$$%&$()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<( =3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0
    • A PI! !"#$%&#(%#)%)(*%+*%*,(%)+-(%*#-(%+)%*,(% ./01*#20%#0321+*#204 !"#$"%!&()* !"#$$%&$()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<( =3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0
    • A PI! +,!$--. !"#$%&#()*+,#-*#%."#&+*/#01*#2223444# &"%0(5"#("65%.0(5"#758*9.059:! 5,(%12-6#7("%8(0("+*()%1+77)%*2%+77%$(3#1(%9:;% *2%)(*/6%*,(%(<(1/*#20%(03#"20-(0* !"#$$%&$()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<( =3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0
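For contrast with <<<...>>>, a sketch of the same launch spelled out through the era's driver API (the module and kernel names are hypothetical; error checking omitted; these entry points were later deprecated in favor of cuLaunchKernel):

#include <cuda.h>

CUdevice dev;       CUcontext ctx;
CUmodule mod;       CUfunction fn;
CUdeviceptr d_a;
int N = 1024, blocksize = 256;

cuInit(0);
cuDeviceGet(&dev, 0);
cuCtxCreate(&ctx, 0, dev);

cuModuleLoad(&mod, "inc_gpu.cubin");       // hypothetical cubin file
cuModuleGetFunction(&fn, mod, "inc_gpu");

cuMemAlloc(&d_a, N * sizeof(int));

cuFuncSetBlockShape(fn, blocksize, 1, 1);  // thread block size
cuParamSetv(fn, 0, &d_a, sizeof(d_a));     // function parameters...
cuParamSeti(fn, sizeof(d_a), N);
cuParamSetSize(fn, sizeof(d_a) + sizeof(int));

cuLaunchGrid(fn, (N + blocksize - 1) / blocksize, 1);  // grid size at launch
cuCtxSynchronize();

cuMemFree(d_a);
cuCtxDestroy(ctx);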
• A stream is a sequence of operations that occur in order, e.g.:
  1. copy data from host to device
  2. execute device function
  3. copy data from device to host

• Different streams can be used to manage concurrency, e.g. overlapping the memory copy from one stream with the function execution from another

• Events are a way of determining the progress of a stream:
  - place an event into a command stream
  - wait for an event to occur
  - measure the time elapsed between two events
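A minimal sketch of the event mechanism used to time a kernel (the kernel and launch parameters are borrowed from the increment example above):

cudaEvent_t start, stop;
float elapsed_ms;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                       // place an event into stream 0
inc_gpu<<<dimGrid, dimBlock>>>(a_d, N);
cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);                      // wait for the event to occur
cudaEventElapsedTime(&elapsed_ms, start, stop);  // time between the two events

cudaEventDestroy(start);
cudaEventDestroy(stop);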