CODE GPU WITH CUDA
SIMT
NVIDIA GPU ARCHITECTURE
Created by Marina Kolpakova (cuda.geek) for Itseez
OUTLINE
Hardware revisions
SIMT architecture
Warp scheduling
Divergence & convergence
Predicated execution
Conditional execution
OUT OF SCOPE
Computer graphics capabilities
HARDWARE REVISIONS
SM (streaming multiprocessor) version – identifies a particular hardware implementation.
Generation   SM      GPU models
Tesla        sm_10   G80, G92(b), G94(b)
             sm_11   G86, G84, G98, G96(b), G94(b), G92(b)
             sm_12   GT218, GT216, GT215
             sm_13   GT200, GT200b
Fermi        sm_20   GF100, GF110
             sm_21   GF104, GF114, GF116, GF108, GF106
Kepler       sm_30   GK104, GK106, GK107
             sm_32   GK20A
             sm_35   GK110, GK208
             sm_37   GK210
Maxwell      sm_50   GM107, GM108
             sm_52   GM204
             sm_53   GM20B
LATENCY VS THROUGHPUT ARCHITECTURES
Modern CPUs and GPUs are both multi-core systems.
CPUs are latency oriented:
Pipelining, out-of-order, superscalar
Caching, on-die memory controllers
Speculative execution, branch prediction
Compute cores occupy only a small part of a die
GPUs are throughput oriented:
Hundreds of simple compute cores
Near-zero-cost scheduling of thousands of threads
Compute cores occupy most of the die
SIMD – SIMT – SMT
Single Instruction Multiple Thread
SIMD (single instruction, multiple data): elements of short vectors are processed in parallel. The problem is represented as short vectors and processed vector by vector. Hardware support for wide arithmetic.
SMT (simultaneous multithreading): instructions from several threads run in parallel. The problem is represented as a set of independent tasks that are assigned to different threads. Hardware support for multi-threading.
SIMT = vector processing + light-weight threading:
A warp is the unit of execution. All of its lanes perform the same instruction each cycle. A warp is 32 lanes wide
Thread scheduling and fast context switching between warps minimize stalls
SIMT
DEPTH OF MULTI-THREADING × WIDTH OF SIMD
1. SIMT is an abstraction over vector hardware:
Threads are grouped into warps (32 threads on NVIDIA hardware)
A thread in a warp is usually called a lane
Vector register file: registers are accessed line by line
A lane loads the laneId-th element of a register
A single program counter (PC) for the whole warp
Only a couple of special registers, such as the PC, can be scalar
2. SIMT hardware is responsible for warp scheduling:
Static on all the latest hardware revisions
Zero overhead on context switching
Scoreboarding of long-latency operations
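The idea of one instruction applied to a line of the vector register file can be sketched on the host in plain C++. This is only an illustrative model, not how the hardware is built: the names VectorReg, lane_bit and warp_iadd are hypothetical, and the loop stands in for 32 lanes that the hardware runs in lockstep.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;  // lanes per warp on NVIDIA hardware

// One vector register: a line with one 32-bit element per lane.
using VectorReg = std::array<std::uint32_t, kWarpSize>;

// Activity-mask bit that belongs to a given lane.
inline std::uint32_t lane_bit(int lane) {
    assert(lane >= 0 && lane < kWarpSize);
    return std::uint32_t{1} << lane;
}

// Hypothetical model of a warp-wide integer add: dst = a + b.
// The instruction is issued once for the warp; every active lane
// reads and writes only its own (laneId-th) element.
inline void warp_iadd(VectorReg& dst, const VectorReg& a, const VectorReg& b,
                      std::uint32_t active_mask) {
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (active_mask & lane_bit(lane)) {
            dst[lane] = a[lane] + b[lane];
        }
    }
}
```

Calling warp_iadd with a mask of 0xF updates lanes 0 to 3 only. The remaining 28 lanes still occupy their slots in the register line but leave them untouched.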
SASS ISA
SIMT is like RISC
Memory instructions are separated from arithmetic
Arithmetic performed only on registers and immediates
SIMT PIPELINE
The warp scheduler manages warps and selects those that are ready to execute
A fetch/decode unit is associated with each warp scheduler
Execution units are SC, SFU (special function units) and LD/ST (load/store units)
Area-/power-efficiency thanks to regularity.
VECTOR REGISTER FILE
Near-zero-cost warp switching requires a big vector register file (RF)
While a warp is resident on an SM, it occupies a portion of the RF
Registers are 32 bits wide. A 64-bit value is stored in a register pair
Fast switching costs register space wasted on duplicated items
Narrow data types are as costly as 32-bit ones: a value narrower than 32 bits still occupies a whole register.
The size of the RF depends on the architecture. Fermi: 128 KB per SM, Kepler: 256 KB per SM, Maxwell: 64 KB per scheduler.
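The register file sizes above translate directly into a bound on how many warps can be resident at once. The following is a back-of-the-envelope sketch: the function names are hypothetical, and it deliberately ignores register allocation granularity and every other residency limit (shared memory, maximum warps per SM), so real occupancy can be lower.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;         // lanes per warp
constexpr std::size_t kBytesPerRegister = 4;  // registers are 32 bits wide

// Bytes of register file one resident warp occupies when every thread
// uses `regs_per_thread` 32-bit registers.
constexpr std::size_t warp_rf_bytes(std::size_t regs_per_thread) {
    return regs_per_thread * kWarpSize * kBytesPerRegister;
}

// Upper bound on resident warps imposed by the register file alone.
// Assumption: allocation granularity and all other limits are ignored.
inline std::size_t max_warps_by_rf(std::size_t rf_bytes,
                                   std::size_t regs_per_thread) {
    assert(regs_per_thread > 0);
    return rf_bytes / warp_rf_bytes(regs_per_thread);
}
```

With the Fermi figure of 128 KB per SM and 32 registers per thread, one warp takes 4 KB of RF, so the register file alone admits at most 32 resident warps. Doubling the registers per thread halves that bound.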
DYNAMIC VS STATIC SCHEDULING
Static scheduling
Instructions are fetched, executed and completed in compiler-generated order
In-order execution
If one instruction stalls, all following instructions stall too
Dynamic scheduling
Instructions are fetched in compiler-generated order
Instructions are executed out of order
A special unit tracks dependencies and reorders instructions
Independent instructions behind a stalled instruction can pass it
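The difference between the two policies can be illustrated with a toy timing model. It is a sketch under strong assumptions, none of which come from the slides: one instruction issues per cycle, each instruction depends on at most one earlier instruction, and the Instr type and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Instr {
    int latency;  // cycles from issue until the result is available
    int dep;      // index of the earlier instruction it reads from, or -1
};

// Static (in-order) issue: a stalled instruction blocks everything behind it.
int cycles_in_order(const std::vector<Instr>& prog) {
    std::vector<int> done(prog.size(), 0);
    int next_issue = 0, total = 0;
    for (std::size_t i = 0; i < prog.size(); ++i) {
        assert(prog[i].dep < static_cast<int>(i));
        const int ready = prog[i].dep >= 0 ? done[prog[i].dep] : 0;
        const int issue = std::max(next_issue, ready);
        done[i] = issue + prog[i].latency;
        next_issue = issue + 1;
        total = std::max(total, done[i]);
    }
    return total;
}

// Dynamic (out-of-order) issue: each cycle, the oldest instruction whose
// input is ready is issued, so independent work can pass a stalled one.
int cycles_out_of_order(const std::vector<Instr>& prog) {
    const int kNotIssued = -1;
    std::vector<int> done(prog.size(), kNotIssued);
    std::size_t issued = 0;
    int total = 0;
    for (int cycle = 0; issued < prog.size(); ++cycle) {
        for (std::size_t i = 0; i < prog.size(); ++i) {
            if (done[i] != kNotIssued) continue;
            const int dep = prog[i].dep;
            assert(dep < static_cast<int>(i));
            const bool ready =
                dep < 0 || (done[dep] != kNotIssued && done[dep] <= cycle);
            if (ready) {
                done[i] = cycle + prog[i].latency;
                total = std::max(total, done[i]);
                ++issued;
                break;  // single issue per cycle
            }
        }
    }
    return total;
}
```

For a 10-cycle load followed by a dependent add and two independent multiplies, the in-order model finishes in 13 cycles and the out-of-order model in 11, because the multiplies issue while the add waits for the load.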
WARP SCHEDULING
The GigaThread engine distributes work among SMs
Work for an SM is sent to a warp scheduler
Once assigned, a warp cannot migrate between schedulers
Each warp has its own lines in the register file, its own PC and its own activity mask
A warp can be in one of the following states:
Executed - performing an operation
Ready - waiting to be executed
Wait - waiting for resources
Resident - waiting for other warps in the same block to complete
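The warp states can be turned into a small selection routine. The sketch below assumes a round-robin policy purely for illustration, since the slides do not say how the hardware chooses among ready warps. WarpState and pick_next_ready are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// The four warp states listed above.
enum class WarpState { Executed, Ready, Wait, Resident };

// Returns the index of the next Ready warp after `last`, scanning
// round-robin, or -1 if no warp is ready. Pass last = -1 to start
// the scan at warp 0.
// Assumption: round-robin is an illustrative policy, not the hardware's.
int pick_next_ready(const std::vector<WarpState>& warps, int last) {
    const int n = static_cast<int>(warps.size());
    assert(last >= -1 && last < n);
    for (int step = 1; step <= n; ++step) {
        const int idx = (last + step) % n;
        if (warps[idx] == WarpState::Ready) return idx;
    }
    return -1;
}
```

Because every warp keeps its own register lines, PC and activity mask, switching to the selected warp needs no state to be saved or restored. That is what makes this kind of per-cycle selection cheap.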
WARP SCHEDULING
Depending on the generation, scheduling is dynamic (Fermi) or static (Kepler, Maxwell)
WARP SCHEDULING (CONT)
Modern warp schedulers support dual issue (sm_21+): an instruction pair is decoded for the active warp each clock
An SM has 2 or 4 warp schedulers, depending on the architecture
Warps belong to blocks. The hardware tracks this relation as well
DIVERGENCE & (RE)CONVERGENCE
Divergence: not all lanes in a warp take the same code path
Convergence is handled via a convergence stack
A convergence stack entry includes:
convergence PC
next-path PC
lane mask (marks the lanes active on that path)
The SSY instruction pushes an entry onto the convergence stack. It is placed before potentially divergent instructions
<INSTR>.S marks the convergence point: the instruction after which all lanes in a warp take the same code path again
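A host-side sketch can show how a stack of such entries serializes the two sides of an if/else and then restores the full lane mask. This is a simplified model rather than the hardware algorithm: the four-instruction program, the PC numbering and the names StackEntry, Step and run_if_else are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stack entry with the three fields listed above.
struct StackEntry {
    int convergence_pc;       // where the paths join again (kept for illustration)
    int next_path_pc;         // first instruction of the path still to run
    std::uint32_t lane_mask;  // lanes active on that path
};

struct Step {
    int pc;
    std::uint32_t mask;
};

// Hypothetical program: pc 0 branch, pc 1 then-path, pc 2 else-path, pc 3 join.
constexpr int kBranch = 0, kThen = 1, kElse = 2, kJoin = 3;

// Returns the sequence of (pc, active lanes) the warp goes through.
std::vector<Step> run_if_else(std::uint32_t active, std::uint32_t cond) {
    assert(active != 0);
    const std::uint32_t taken = active & cond;
    const std::uint32_t not_taken = active & ~cond;

    std::vector<Step> trace;
    trace.push_back(Step{kBranch, active});

    std::vector<StackEntry> stack;
    // Pushed before the branch (the role SSY plays): resume at the join
    // with the full mask.
    stack.push_back(StackEntry{kJoin, kJoin, active});
    // On divergence, defer the else-path and its lanes.
    if (taken != 0 && not_taken != 0) {
        stack.push_back(StackEntry{kJoin, kElse, not_taken});
    }

    int pc = taken != 0 ? kThen : kElse;
    std::uint32_t mask = taken != 0 ? taken : not_taken;
    for (;;) {
        trace.push_back(Step{pc, mask});
        if (pc == kJoin) break;
        // End of a path: pop the deferred path, or the join entry.
        const StackEntry top = stack.back();
        stack.pop_back();
        pc = top.next_path_pc;
        mask = top.lane_mask;
    }
    return trace;
}
```

With four active lanes (mask 0xF) and the condition true for lanes 0 and 1, the trace is branch with 0xF, then-path with 0x3, else-path with 0xC and join with 0xF. When all lanes agree, only one path appears in the trace.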
DIVERGENT CODE EXAMPLE
(void)atomicAdd(&smem[0], src[threadIdx.x]);
/*0050*/ SSY 0x80;
/*0058*/ LDSLK P0, R3, [RZ];
/*0060*/ @P0 IADD R3, R3, R0;
/*0068*/ @P0 STSUL [RZ], R3;
/*0070*/ @!P0 BRA 0x58;
/*0078*/ NOP.S;
Assume warp size == 4
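The loop in this listing can be walked through on the host for a 4-lane warp. The sketch assumes that LDSLK sets P0 only for the single lane that acquires the lock on smem[0], and that the lowest pending lane wins each time. That ordering is an arbitrary choice for the model, and locked_add is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 4;  // the slide's assumption: warp size == 4

// Returns the number of passes through the LDSLK..BRA loop.
int locked_add(std::uint32_t& smem0,
               const std::array<std::uint32_t, kLanes>& src) {
    std::uint32_t pending = (1u << kLanes) - 1;  // lanes that still have to add
    int passes = 0;
    while (pending != 0) {
        ++passes;
        // LDSLK: every pending lane tries to lock smem[0]; one lane gets P0.
        int winner = 0;
        while (!(pending & (1u << winner))) ++winner;  // assumption: lowest lane wins
        assert(winner < kLanes);
        std::uint32_t r3 = smem0;    // value loaded under the lock
        r3 += src[winner];           // @P0 IADD R3, R3, R0
        smem0 = r3;                  // @P0 STSUL [RZ], R3 (store and unlock)
        pending &= ~(1u << winner);  // @!P0 BRA 0x58: only the losers loop again
    }
    return passes;                   // NOP.S: all lanes reconverge here
}
```

Under these assumptions the warp diverges on every pass: one lane leaves the loop and waits at the convergence point while the others retry, so four lanes adding to the same address need four passes.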
PREDICATED & CONDITIONAL EXECUTION
Predicated execution
Frequently used for if-then statements, rarely for if-then-else. The decision is made by a compiler heuristic.
Reduces divergence overhead.
Conditional execution
A compare instruction sets the condition code (CC) register.
CC is a 4-bit state vector (sign, carry, zero, overflow)
No WB (write-back) stage for CC-marked registers
Used in Maxwell to skip unneeded computations for arithmetic operations that the hardware implements with multiple instructions
IMAD R8.CC, R0, 0x4, R3;
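Predicated execution of a short if-then, such as "if (x > limit) x = limit;", can be modelled on the host as two straight-line warp-wide steps with no branch: set a per-lane predicate, then run the guarded instruction for the whole warp. The function names are hypothetical and the model says nothing about the compiler's actual heuristic.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;
using IntReg = std::array<std::int32_t, kWarpSize>;

// Compare step: bit i of the result is the predicate of lane i.
std::uint32_t set_predicate_gt(const IntReg& r, std::int32_t limit) {
    std::uint32_t p = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (r[lane] > limit) p |= (std::uint32_t{1} << lane);
    }
    return p;
}

// Guarded move ("@P0 MOV r, value" in spirit): issued for the whole warp,
// but only lanes with a true predicate write their result back.
void predicated_move(IntReg& r, std::int32_t value, std::uint32_t p0) {
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (p0 & (std::uint32_t{1} << lane)) r[lane] = value;
    }
}
```

Every lane passes through both steps, so no convergence stack entry is needed. The cost is that lanes with a false predicate still spend an issue slot on the guarded instruction, which is why predication suits short bodies.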
FINAL WORDS
SIMT is a RISC-like, throughput-oriented architecture
SIMT combines vector processing and light-weight threading
SIMT instructions are executed per warp
Each warp has its own PC and activity mask
Branching is handled by divergence, predicated execution or conditional execution
THE END
BY CUDA.GEEK / 2013–2015

More Related Content

What's hot

SoC FPGA Technology
SoC FPGA TechnologySoC FPGA Technology
SoC FPGA Technology
Siraj Muhammad
 
Deep learning with FPGA
Deep learning with FPGADeep learning with FPGA
Deep learning with FPGA
Ayush Singh, MS
 
計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?
Shinnosuke Furuya
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
CastLabKAIST
 
Introduction to EDA Tools
Introduction to EDA ToolsIntroduction to EDA Tools
Introduction to EDA Tools
venkatasuman1983
 
Vlsi Synthesis
Vlsi SynthesisVlsi Synthesis
Vlsi Synthesis
SIVA NAGENDRA REDDY
 
Hard ip based SoC design
Hard ip based SoC designHard ip based SoC design
Hard ip based SoC design
Vinchipsytm Vlsitraining
 
Introduction to FPGAs
Introduction to FPGAsIntroduction to FPGAs
Introduction to FPGAs
Sudhanshu Janwadkar
 
Asic vs fpga
Asic vs fpgaAsic vs fpga
Asic vs fpga
Shalini Kamade
 
Vlsi physical design
Vlsi physical designVlsi physical design
Vlsi physical designI World Tech
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
Dr. Swaminathan Kathirvel
 
ASIC DESIGN : PLACEMENT
ASIC DESIGN : PLACEMENTASIC DESIGN : PLACEMENT
ASIC DESIGN : PLACEMENT
helloactiva
 
Spartan-II FPGA (xc2s30)
Spartan-II FPGA (xc2s30)Spartan-II FPGA (xc2s30)
Spartan-II FPGA (xc2s30)
A B Shinde
 
A comparative study of full adder using static cmos logic style
A comparative study of full adder using static cmos logic styleA comparative study of full adder using static cmos logic style
A comparative study of full adder using static cmos logic style
eSAT Publishing House
 
Deterministic Test Pattern Generation ( D-Algorithm of ATPG) (Testing of VLSI...
Deterministic Test Pattern Generation ( D-Algorithm of ATPG) (Testing of VLSI...Deterministic Test Pattern Generation ( D-Algorithm of ATPG) (Testing of VLSI...
Deterministic Test Pattern Generation ( D-Algorithm of ATPG) (Testing of VLSI...
Usha Mehta
 
Pass transistor logic
Pass transistor logicPass transistor logic
Pass transistor logic
student
 
Complex Programmable Logic Device (CPLD) Architecture and Its Applications
Complex Programmable Logic Device (CPLD) Architecture and Its ApplicationsComplex Programmable Logic Device (CPLD) Architecture and Its Applications
Complex Programmable Logic Device (CPLD) Architecture and Its Applications
elprocus
 
DPDK IPSec Security Gateway Application
DPDK IPSec Security Gateway ApplicationDPDK IPSec Security Gateway Application
DPDK IPSec Security Gateway Application
Michelle Holley
 
Placement and routing in full custom physical design
Placement and routing in full custom physical designPlacement and routing in full custom physical design
Placement and routing in full custom physical designDeiptii Das
 
Soc architecture and design
Soc architecture and designSoc architecture and design
Soc architecture and design
Satya Harish
 

What's hot (20)

SoC FPGA Technology
SoC FPGA TechnologySoC FPGA Technology
SoC FPGA Technology
 
Deep learning with FPGA
Deep learning with FPGADeep learning with FPGA
Deep learning with FPGA
 
計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
 
Introduction to EDA Tools
Introduction to EDA ToolsIntroduction to EDA Tools
Introduction to EDA Tools
 
Vlsi Synthesis
Vlsi SynthesisVlsi Synthesis
Vlsi Synthesis
 
Hard ip based SoC design
Hard ip based SoC designHard ip based SoC design
Hard ip based SoC design
 
Introduction to FPGAs
Introduction to FPGAsIntroduction to FPGAs
Introduction to FPGAs
 
Asic vs fpga
Asic vs fpgaAsic vs fpga
Asic vs fpga
 
Vlsi physical design
Vlsi physical designVlsi physical design
Vlsi physical design
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
 
ASIC DESIGN : PLACEMENT
ASIC DESIGN : PLACEMENTASIC DESIGN : PLACEMENT
ASIC DESIGN : PLACEMENT
 
Spartan-II FPGA (xc2s30)
Spartan-II FPGA (xc2s30)Spartan-II FPGA (xc2s30)
Spartan-II FPGA (xc2s30)
 
A comparative study of full adder using static cmos logic style
A comparative study of full adder using static cmos logic styleA comparative study of full adder using static cmos logic style
A comparative study of full adder using static cmos logic style
 
Deterministic Test Pattern Generation ( D-Algorithm of ATPG) (Testing of VLSI...
Deterministic Test Pattern Generation ( D-Algorithm of ATPG) (Testing of VLSI...Deterministic Test Pattern Generation ( D-Algorithm of ATPG) (Testing of VLSI...
Deterministic Test Pattern Generation ( D-Algorithm of ATPG) (Testing of VLSI...
 
Pass transistor logic
Pass transistor logicPass transistor logic
Pass transistor logic
 
Complex Programmable Logic Device (CPLD) Architecture and Its Applications
Complex Programmable Logic Device (CPLD) Architecture and Its ApplicationsComplex Programmable Logic Device (CPLD) Architecture and Its Applications
Complex Programmable Logic Device (CPLD) Architecture and Its Applications
 
DPDK IPSec Security Gateway Application
DPDK IPSec Security Gateway ApplicationDPDK IPSec Security Gateway Application
DPDK IPSec Security Gateway Application
 
Placement and routing in full custom physical design
Placement and routing in full custom physical designPlacement and routing in full custom physical design
Placement and routing in full custom physical design
 
Soc architecture and design
Soc architecture and designSoc architecture and design
Soc architecture and design
 

Viewers also liked

Code GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleCode GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principle
Marina Kolpakova
 
Code GPU with CUDA - Optimizing memory and control flow
Code GPU with CUDA - Optimizing memory and control flowCode GPU with CUDA - Optimizing memory and control flow
Code GPU with CUDA - Optimizing memory and control flow
Marina Kolpakova
 
Code GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory SubsystemCode GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory Subsystem
Marina Kolpakova
 
Code gpu with cuda - CUDA introduction
Code gpu with cuda - CUDA introductionCode gpu with cuda - CUDA introduction
Code gpu with cuda - CUDA introduction
Marina Kolpakova
 
Code GPU with CUDA - Applying optimization techniques
Code GPU with CUDA - Applying optimization techniquesCode GPU with CUDA - Applying optimization techniques
Code GPU with CUDA - Applying optimization techniques
Marina Kolpakova
 
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Pragmatic Optimization in Modern Programming - Mastering Compiler OptimizationsPragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Marina Kolpakova
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Marina Kolpakova
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Marina Kolpakova
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
Marina Kolpakova
 

Viewers also liked (9)

Code GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleCode GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principle
 
Code GPU with CUDA - Optimizing memory and control flow
Code GPU with CUDA - Optimizing memory and control flowCode GPU with CUDA - Optimizing memory and control flow
Code GPU with CUDA - Optimizing memory and control flow
 
Code GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory SubsystemCode GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory Subsystem
 
Code gpu with cuda - CUDA introduction
Code gpu with cuda - CUDA introductionCode gpu with cuda - CUDA introduction
Code gpu with cuda - CUDA introduction
 
Code GPU with CUDA - Applying optimization techniques
Code GPU with CUDA - Applying optimization techniquesCode GPU with CUDA - Applying optimization techniques
Code GPU with CUDA - Applying optimization techniques
 
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Pragmatic Optimization in Modern Programming - Mastering Compiler OptimizationsPragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
 

Similar to Code GPU with CUDA - SIMT

Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, Netbur...
Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, Netbur...Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, Netbur...
Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, Netbur...
Hsien-Hsin Sean Lee, Ph.D.
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
Vipin Varghese
 
Andes RISC-V processor solutions
Andes RISC-V processor solutionsAndes RISC-V processor solutions
Andes RISC-V processor solutions
RISC-V International
 
0507036
05070360507036
0507036
meraz rizel
 
tau 2015 spyrou fpga timing
tau 2015 spyrou fpga timingtau 2015 spyrou fpga timing
tau 2015 spyrou fpga timingTom Spyrou
 
underground cable fault location using aruino,gsm&gps
underground cable fault location using aruino,gsm&gps underground cable fault location using aruino,gsm&gps
underground cable fault location using aruino,gsm&gps
Mohd Sohail
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
Evan Chan
 
Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overview
Sathish Arumugasamy
 
Snapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSnapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 Architecture
Santosh Verma
 
ARM stacks, subroutines, Cortex M3, LPC 214X
ARM  stacks, subroutines, Cortex M3, LPC 214XARM  stacks, subroutines, Cortex M3, LPC 214X
ARM stacks, subroutines, Cortex M3, LPC 214X
Karthik Vivek
 
NIOS II Processor.ppt
NIOS II Processor.pptNIOS II Processor.ppt
NIOS II Processor.ppt
Atef46
 
64bit SMP OS for TILE-Gx many core processor
64bit SMP OS for TILE-Gx many core processor64bit SMP OS for TILE-Gx many core processor
64bit SMP OS for TILE-Gx many core processor
Toru Nishimura
 
General Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareGeneral Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareDaniel Blezek
 
Cisco data center support
Cisco data center supportCisco data center support
Cisco data center supportKrunal Shah
 
unit 1ARM INTRODUCTION.pptx
unit 1ARM INTRODUCTION.pptxunit 1ARM INTRODUCTION.pptx
unit 1ARM INTRODUCTION.pptx
KandavelEee
 
Unit2 arm
Unit2 armUnit2 arm
Unit2 arm
Karthik Vivek
 
Steen_Dissertation_March5
Steen_Dissertation_March5Steen_Dissertation_March5
Steen_Dissertation_March5Steen Larsen
 

Similar to Code GPU with CUDA - SIMT (20)

Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, Netbur...
Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, Netbur...Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, Netbur...
Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, Netbur...
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
Andes RISC-V processor solutions
Andes RISC-V processor solutionsAndes RISC-V processor solutions
Andes RISC-V processor solutions
 
0507036
05070360507036
0507036
 
Final_Report
Final_ReportFinal_Report
Final_Report
 
tau 2015 spyrou fpga timing
tau 2015 spyrou fpga timingtau 2015 spyrou fpga timing
tau 2015 spyrou fpga timing
 
Pipelining1
Pipelining1Pipelining1
Pipelining1
 
underground cable fault location using aruino,gsm&gps
underground cable fault location using aruino,gsm&gps underground cable fault location using aruino,gsm&gps
underground cable fault location using aruino,gsm&gps
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overview
 
Snapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSnapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 Architecture
 
ARM stacks, subroutines, Cortex M3, LPC 214X
ARM  stacks, subroutines, Cortex M3, LPC 214XARM  stacks, subroutines, Cortex M3, LPC 214X
ARM stacks, subroutines, Cortex M3, LPC 214X
 
NIOS II Processor.ppt
NIOS II Processor.pptNIOS II Processor.ppt
NIOS II Processor.ppt
 
64bit SMP OS for TILE-Gx many core processor
64bit SMP OS for TILE-Gx many core processor64bit SMP OS for TILE-Gx many core processor
64bit SMP OS for TILE-Gx many core processor
 
General Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareGeneral Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics Hardware
 
Cisco data center support
Cisco data center supportCisco data center support
Cisco data center support
 
unit 1ARM INTRODUCTION.pptx
unit 1ARM INTRODUCTION.pptxunit 1ARM INTRODUCTION.pptx
unit 1ARM INTRODUCTION.pptx
 
Unit2 arm
Unit2 armUnit2 arm
Unit2 arm
 
Steen_Dissertation_March5
Steen_Dissertation_March5Steen_Dissertation_March5
Steen_Dissertation_March5
 
Aes
AesAes
Aes
 

Recently uploaded

Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
Peter Windle
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
TechSoup
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
David Douglas School District
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 
The Diamond Necklace by Guy De Maupassant.pptx
The Diamond Necklace by Guy De Maupassant.pptxThe Diamond Necklace by Guy De Maupassant.pptx
The Diamond Necklace by Guy De Maupassant.pptx
DhatriParmar
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
Nguyen Thanh Tu Collection
 
Digital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion DesignsDigital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion Designs
chanes7
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Atul Kumar Singh
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
thanhdowork
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
Levi Shapiro
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBCSTRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
kimdan468
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
Peter Windle
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
camakaiclarkmusic
 

Recently uploaded (20)

Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
The Diamond Necklace by Guy De Maupassant.pptx
The Diamond Necklace by Guy De Maupassant.pptxThe Diamond Necklace by Guy De Maupassant.pptx
The Diamond Necklace by Guy De Maupassant.pptx
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
Digital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion DesignsDigital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion Designs
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBCSTRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
 

Code GPU with CUDA - SIMT

  • 1. CODE GPU WITH CUDA SIMT NVIDIA GPU ARCHITECTURE CreatedbyMarinaKolpakova( )forcuda.geek Itseez BACK TO CONTENTS
  • 2. OUTLINE Hardware revisions SIMT architecture Warp scheduling Divergence & convergence Predicated execution Conditional execution
  • 3. OUT OF SCOPE Computer graphics capabilities
  • 4. HARDWARE REVISIONS SM (shading model) – particular hardware implementation. Generation SM GPU models Tesla sm_10 G80 G92(b) G94(b) sm_11 G86 G84 G98 G96(b) G94(b) G92(b) sm_12 GT218 GT216 GT215 sm_13 GT200 GT200b Fermi sm_20 GF100 GF110 sm_21 GF104 GF114 GF116 GF108 GF106 Kepler sm_30 GK104 GK106 GK107 sm_32 GK20A sm_35 GK110 GK208 sm_37 GK210 Maxwell sm_50 GM107 GM108 sm_52 GM204 sm_53 GM20B
  • 5. LATENCY VS THROUGHPUT ARCHITECTURES Modern CPUs and GPUs are both multi-core systems. CPUs are latency oriented: Pipelining, out-of-order, superscalar Caching, on-die memory controllers Speculative execution, branch prediction Compute cores occupy only a small part of a die GPUs are throughput oriented: 100s simple compute cores Zero cost scheduling of 1000s or threads Compute cores occupy most part of a die
  • 6. SIMD – SIMT – SMT Single Instruction Multiple Thread SIMD: elements of short vectors are processed in parallel. Represents problem as short vectors and processes it vector by vector. Hardware support for wide arithmetic. SMT: instructions from several threads are run in parallel. Represents problem as scope of independent tasks and assigns them to different threads. Hardware support for multi- threading. SIMT vector processing + light-weight threading: Warp is a unit of execution. It performs the same instruction each cycle. Warp is 32- lane wide thread scheduling and fast context switching between different warps to minimize stalls
  • 7. SIMT: DEPTH OF MULTI-THREADING × WIDTH OF SIMD
      1. SIMT is an abstraction over vector hardware:
          Threads are grouped into warps (32 threads for NVIDIA)
          A thread in a warp is usually called a lane
          Vector register file. Registers are accessed line by line.
          A lane loads the laneId-th element from a register
          Single program counter (PC) for the whole warp
          Only a couple of special registers, like PC, can be scalar
      2. SIMT HW is responsible for warp scheduling:
          Static for all latest hardware revisions
          Zero overhead on context switching
          Long-latency operation scoreboarding
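The warp/lane grouping on this slide can be sketched as a small host-side C++ model. This is only an illustration: `warpId`, `laneId`, `VectorRegister` and `readLane` are hypothetical names, not CUDA API, and the register "line" is modeled as a plain array.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kWarpSize = 32;  // warp width on NVIDIA hardware, per the slide

// Hypothetical helpers: the warp a thread with linear in-block index `tid`
// belongs to, and its position (lane) inside that warp.
constexpr std::uint32_t warpId(std::uint32_t tid) { return tid / kWarpSize; }
constexpr std::uint32_t laneId(std::uint32_t tid) { return tid % kWarpSize; }

// One vector register of a warp: a "line" of 32 lanes, 32 bits each.
struct VectorRegister {
    std::uint32_t lane[kWarpSize];
};

// A lane reads only its own element of the line.
inline std::uint32_t readLane(const VectorRegister& reg, std::uint32_t lane) {
    assert(lane < kWarpSize);
    return reg.lane[lane];
}
```

For example, thread 70 of a block is lane 6 of warp 2, so it would read element 6 of every vector register that warp owns.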
  • 8. SASS ISA
      SIMT is like RISC
      Memory instructions are separated from arithmetic
      Arithmetic is performed only on registers and immediates
  • 9. SIMT PIPELINE
      Warp scheduler manages warps and selects those that are ready to execute
      Fetch/decode unit is associated with the warp scheduler
      Execution units are SC, SFU, LD/ST
      Area and power efficiency thanks to regularity.
  • 10. VECTOR REGISTER FILE
      ~Zero-cost warp switching requires a big vector register file (RF)
      While a warp is resident on an SM it occupies a portion of the RF
      The GPU's RF is 32-bit. 64-bit values are stored in a register pair
      Fast switching costs register wastage on duplicated items
      Narrow data types are as costly as wide data types.
      Size of the RF depends on architecture. Fermi: 128 KB per SM, Kepler: 256 KB per SM, Maxwell: 64 KB per scheduler.
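The register-file arithmetic behind this slide can be worked through with a rough sketch. The helper names are hypothetical, and the model deliberately ignores allocation granularity and every other per-SM limit, so it gives an upper bound, not what a real occupancy calculator would report.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kWarpSize = 32;
constexpr std::uint32_t kRegisterBits = 32;  // RF registers are 32-bit, per the slide

// Registers one value occupies per lane: a narrow type still takes a whole
// register, and a 64-bit value takes a register pair.
constexpr std::uint32_t registersForValue(std::uint32_t valueBits) {
    return (valueBits + kRegisterBits - 1) / kRegisterBits;
}

// Upper bound on resident warps imposed by the register file alone.
// Simplification: no allocation granularity, no other per-SM limits.
inline std::uint32_t warpsThatFit(std::uint32_t rfBytes, std::uint32_t regsPerThread) {
    assert(regsPerThread > 0);
    const std::uint32_t totalRegisters = rfBytes / (kRegisterBits / 8);
    return totalRegisters / (regsPerThread * kWarpSize);
}
```

With the Fermi figure of 128 KB per SM and an assumed kernel that uses 32 registers per thread, `warpsThatFit(128 * 1024, 32)` gives 32 warps; an 8-bit value costs one register per lane and a 64-bit value costs two.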
  • 11. DYNAMIC VS STATIC SCHEDULING
      Static scheduling
          Instructions are fetched, executed & completed in compiler-generated order
          In-order execution: if one instruction stalls, all following instructions stall too
      Dynamic scheduling
          Instructions are fetched in compiler-generated order
          Instructions are executed out of order
          A special unit tracks dependencies and reorders instructions
          Independent instructions behind a stalled instruction can pass it
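The difference between the two policies can be illustrated with a toy cycle-count model. Everything here is an assumption made for the sketch: a single issue slot per cycle, at most one dependency per instruction, and the hypothetical names `Instr`, `inOrderCycles` and `outOfOrderCycles`. It does not describe any particular GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy instruction: completes `latency` cycles after issue and may depend on
// one earlier instruction (`dep` is its index, or -1 for none).
struct Instr {
    int dep;
    int latency;
};

// Static (in-order) issue: one instruction per cycle, in program order.
// A stalled instruction blocks everything behind it.
inline int inOrderCycles(const std::vector<Instr>& prog) {
    std::vector<int> done(prog.size(), 0);
    int nextSlot = 0, finish = 0;
    for (std::size_t i = 0; i < prog.size(); ++i) {
        assert(prog[i].dep < static_cast<int>(i));
        int start = nextSlot;
        if (prog[i].dep >= 0) start = std::max(start, done[prog[i].dep]);
        done[i] = start + prog[i].latency;
        nextSlot = start + 1;
        finish = std::max(finish, done[i]);
    }
    return finish;
}

// Dynamic (out-of-order) issue: each cycle, issue the oldest instruction whose
// operand is ready, so independent work can pass a stalled instruction.
inline int outOfOrderCycles(const std::vector<Instr>& prog) {
    const std::size_t n = prog.size();
    std::vector<int> done(n, 0);
    std::vector<bool> issued(n, false);
    std::size_t remaining = n;
    int finish = 0;
    for (int cycle = 0; remaining > 0; ++cycle) {
        for (std::size_t i = 0; i < n; ++i) {
            const int dep = prog[i].dep;
            assert(dep < static_cast<int>(i));
            const bool ready = dep < 0 || (issued[dep] && done[dep] <= cycle);
            if (!issued[i] && ready) {
                issued[i] = true;
                done[i] = cycle + prog[i].latency;
                finish = std::max(finish, done[i]);
                --remaining;
                break;  // single issue slot per cycle
            }
        }
    }
    return finish;
}
```

For a 10-cycle load followed by one dependent add and two independent 1-cycle multiplies, the in-order model finishes at cycle 13 and the out-of-order model at cycle 11, because the two multiplies slip past the stalled add.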
  • 12. WARP SCHEDULING
      GigaThread subdivides work between SMs
      Work for an SM is sent to a warp scheduler
      Once assigned, a warp cannot migrate between schedulers
      A warp has its own lines in the register file, PC, and activity mask
      A warp can be in one of the following states:
          Executed - performs an operation
          Ready - waits to be executed
          Wait - waits for resources
          Resident - waits for completion of other warps within the same block
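The four warp states can be written down as a small C++ sketch, together with one possible selection rule. The round-robin policy and the name `pickNextReady` are assumptions made for illustration; the slides do not say how the hardware chooses among ready warps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The four states listed on the slide.
enum class WarpState { Executed, Ready, Wait, Resident };

// Hypothetical round-robin policy: starting after the warp that issued last,
// return the index of the first Ready warp, or -1 if no warp is ready.
inline int pickNextReady(const std::vector<WarpState>& warps, std::size_t lastIssued) {
    const std::size_t n = warps.size();
    assert(n == 0 || lastIssued < n);
    for (std::size_t step = 1; step <= n; ++step) {
        const std::size_t candidate = (lastIssued + step) % n;
        if (warps[candidate] == WarpState::Ready) return static_cast<int>(candidate);
    }
    return -1;
}
```

Because each warp keeps its own register lines, PC and activity mask, switching to the selected warp needs no state to be saved or restored, which is what makes the switch effectively free.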
  • 13. WARP SCHEDULING
      Depending on the generation, scheduling is dynamic (Fermi) or static (Kepler, Maxwell)
  • 14. WARP SCHEDULING (CONT)
      Modern warp schedulers support dual issue (sm_21+): an instruction pair is decoded for the active warp each clock
      An SM has 2 or 4 warp schedulers depending on the architecture
      Warps belong to blocks. Hardware tracks this relation as well
  • 15. DIVERGENCE & (RE)CONVERGENCE
      Divergence: not all lanes in a warp take the same code path
      Convergence is handled via a convergence stack
      A convergence stack entry includes:
          convergence PC
          next-path PC
          lane mask (marks active lanes on that path)
      The SSY instruction pushes an entry onto the convergence stack. It occurs before potentially divergent instructions
      <INSTR>.S indicates a convergence point – the instruction after which all lanes in a warp take the same code path
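A simplified host-side model of the convergence stack for a single `if (c) THEN else ELSE; JOIN` is sketched below. The entry fields follow the slide; the addresses, the `traceIfElse` function and the choice to run the taken path first are assumptions, and real hardware behavior may differ in detail.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Fields named on the slide.
struct StackEntry {
    std::uint32_t convergencePc;
    std::uint32_t nextPathPc;
    std::uint32_t laneMask;
};

// One executed path: where the warp runs and with which lanes active.
struct Step {
    std::uint32_t pc;
    std::uint32_t mask;
};

// Made-up addresses for the toy program.
constexpr std::uint32_t kThenPc = 0x10, kElsePc = 0x20, kJoinPc = 0x30;

inline std::vector<Step> traceIfElse(std::uint32_t active, std::uint32_t taken) {
    assert(active != 0);
    std::vector<StackEntry> stack;
    std::vector<Step> trace;
    // SSY: record where, and with which lanes, the warp reconverges.
    stack.push_back({kJoinPc, kJoinPc, active});
    const std::uint32_t thenMask = active & taken;
    const std::uint32_t elseMask = active & ~taken;
    if (thenMask != 0 && elseMask != 0) {
        // Divergence: park the else path on the stack, run the then path first.
        stack.push_back({kJoinPc, kElsePc, elseMask});
        trace.push_back({kThenPc, thenMask});
    } else if (thenMask != 0) {
        trace.push_back({kThenPc, thenMask});
    } else {
        trace.push_back({kElsePc, elseMask});
    }
    // Each path ends at a .S instruction: pop an entry and continue there.
    while (!stack.empty()) {
        const StackEntry entry = stack.back();
        stack.pop_back();
        trace.push_back({entry.nextPathPc, entry.laneMask});
    }
    return trace;
}
```

With four active lanes (mask 0xF) of which lanes 0 and 2 take the branch (mask 0x5), the trace is THEN with 0x5, ELSE with 0xA, then JOIN with 0xF. When all lanes agree, only one path runs before the join.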
  • 16. DIVERGENT CODE EXAMPLE
      (void)atomicAdd(&smem[0], src[threadIdx.x]);

      /*0050*/ SSY 0x80;
      /*0058*/ LDSLK P0, R3, [RZ];
      /*0060*/ @P0 IADD R3, R3, R0;
      /*0068*/ @P0 STSUL [RZ], R3;
      /*0070*/ @!P0 BRA 0x58;
      /*0078*/ NOP.S;

      Assume warp size == 4
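The retry loop in this listing can be modeled on the host to show why it diverges. The sketch assumes that exactly one contending lane acquires the lock per pass and arbitrarily lets the lowest pending lane win; `atomicAddSameAddress` is a hypothetical name, and the real arbitration order is up to the hardware.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kToyWarpSize = 4;  // "Assume warp size == 4", as on the slide

// Returns the number of passes through the retry loop at 0x58..0x70.
// Assumption: one lane wins the lock per pass (lowest pending lane here).
inline int atomicAddSameAddress(int& smem, const int (&src)[kToyWarpSize]) {
    std::uint32_t pending = (1u << kToyWarpSize) - 1;  // lanes still in the loop
    int passes = 0;
    while (pending != 0) {
        int winner = 0;
        while (((pending >> winner) & 1u) == 0) ++winner;  // LDSLK sets P0 for one lane
        smem += src[winner];                                // @P0 IADD, @P0 STSUL
        pending &= ~(1u << winner);                         // winner falls through to NOP.S
        ++passes;                                           // @!P0 BRA 0x58: the rest retry
    }
    assert(passes <= kToyWarpSize);
    return passes;
}
```

Under that assumption, four lanes adding to the same shared-memory word need four passes: each pass retires one lane, and the lanes that already finished wait at the `NOP.S` convergence point until the last one gets through.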
  • 17. PREDICATED & CONDITIONAL EXECUTION
      Predicated execution
          Frequently used for if-then statements, rarely for if-then-else. The decision is made by a compiler heuristic.
          Optimizes divergence overhead.
      Conditional execution
          A compare instruction sets condition code (CC) registers.
          CC is a 4-bit state vector (sign, carry, zero, overflow)
          No WB stage for CC-marked registers
          Used in Maxwell to skip unneeded computations for arithmetic operations implemented in hardware with multiple instructions
          IMAD R8.CC, R0, 0x4, R3;
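Predicated execution of a short if-then can be sketched as a per-lane write mask. This is a host-side illustration only: the source statement `if (x < 0) x = x + bias;` and the names `predicatedAdd` and `lessThanZero` are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;

struct VectorRegister {
    std::int32_t lane[kWarpSize];
};

// Models `@P IADD dst, a, b`: every lane steps through the instruction, but
// only lanes whose predicate bit is set write the result back.
inline void predicatedAdd(VectorRegister& dst, const VectorRegister& a,
                          const VectorRegister& b, std::uint32_t predicate) {
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if ((predicate >> lane) & 1u) dst.lane[lane] = a.lane[lane] + b.lane[lane];
    }
}

// Predicate for the hypothetical source `if (x < 0) x = x + bias;`,
// one bit per lane.
inline std::uint32_t lessThanZero(const VectorRegister& x) {
    std::uint32_t predicate = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (x.lane[lane] < 0) predicate |= (1u << lane);
    }
    return predicate;
}
```

No branch is taken and nothing is pushed onto the convergence stack: the warp issues the add once, and lanes with a false predicate simply keep their old value.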
  • 18. FINAL WORDS
      SIMT is a RISC-based, throughput-oriented architecture
      SIMT combines vector processing and light-weight threading
      SIMT instructions are executed per warp
      A warp has its own PC and activity mask
      Branching is done by divergence, predicated execution, or conditional execution
  • 19. THE END
      By cuda.geek / 2013–2015