A basic introduction to GPU architecture, based on Kayvon Fatahalian's "From Shader Code to a Teraflop: How GPU Shader Cores Work".
Updated to include the latest GPUs: AMD Tahiti (HD 7970) and NVIDIA Kepler (GTX 680)
1. Introduction to GPU Compute Architecture
Ofer Rosenberg
Based on “From Shader Code to a Teraflop: How GPU Shader Cores Work” by Kayvon Fatahalian, Stanford University, and Mike Houston, Fellow, AMD
2. Intro: some numbers…
Sources:
http://www.anandtech.com/show/6025/radeon-hd-7970-ghz-edition-review-catching-up-to-gtx-680
http://ark.intel.com/products/65722
http://www.techpowerup.com/cpudb/1028/Intel_Xeon_E3-1290V2.html

                   Intel IvyBridge 4C (E3-1290V2)   AMD Radeon HD 7970   Ratio
Cores              4                                32
Frequency          3700 MHz                         1000 MHz
Process            22 nm                            28 nm
Transistor Count   1400M                            4310M                ~3x
Power              87 W                             250 W                ~3x
Compute Power      237 GFLOPS                       4096 GFLOPS          ~17x
3. Content
1. Three major ideas that make GPU processing cores run fast
2. Closer look at real GPU designs
   – NVIDIA GTX 680
   – AMD Radeon 7970
3. The GPU memory hierarchy: moving data to processors
4. Heterogeneous Cores
4. Part 1: throughput processing
• Three key concepts behind how modern GPU processing cores run code
• Knowing these concepts will help you:
  1. Understand the space of GPU core (and throughput CPU core) designs
  2. Optimize shaders/compute kernels
  3. Establish intuition: what workloads might benefit from the design of these architectures?
5. What’s in a GPU?
A GPU is a heterogeneous chip multi-processor (highly tuned for graphics)
[Block diagram: an array of Shader Cores with Tex (texture) units, plus fixed-function blocks (Input Assembly, Rasterizer, Output Blend, Video Decode) and a Work Distributor (HW or SW?)]
6. A diffuse reflectance shader

sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}

Shader programming model: fragments are processed independently, but there is no explicit parallel programming.
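To make the "independent fragments" model concrete, here is a rough CUDA analogue of the same shader (an illustrative sketch, not the deck's code): each thread shades one fragment in plain scalar code, and texture sampling is simplified to an array of pre-fetched texels so the example stays self-contained.

// Hypothetical CUDA version of diffuseShader: one thread per fragment.
__device__ float dot3(float3 a, float3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

__global__ void diffuseShaderKernel(const float3* norm,   // per-fragment normals
                                    const float3* kdTex,  // stand-in for myTex.Sample
                                    float3 lightDir,
                                    float4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                                   // guard stray threads

    float3 kd = kdTex[i];
    float  d  = fminf(fmaxf(dot3(lightDir, norm[i]), 0.0f), 1.0f);  // clamp(dot, 0, 1)
    out[i] = make_float4(kd.x * d, kd.y * d, kd.z * d, 1.0f);
}

Nothing in the kernel says how many fragments run at once; that choice is left entirely to the hardware, which is exactly the point of the slides that follow.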
8.–14. Execute shader
[Animation across slides 8–14: the compiled shader's instructions step one at a time through a minimal core consisting of Fetch/Decode, ALU (Execute), and Execution Context]

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)
15. “CPU-style” cores
[Diagram: Fetch/Decode, ALU (Execute), and Execution Context, surrounded by a big data cache, out-of-order control logic, a fancy branch predictor, and a memory pre-fetcher]
16. Slimming down
Idea #1: Remove the components that help a single instruction stream run fast
[Diagram: the slimmed core keeps only Fetch/Decode, ALU (Execute), and Execution Context]
17. Two cores (two fragments in parallel)
[Diagram: two slimmed cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context, each running the full <diffuseShader> instruction stream on its own fragment]
18. Four cores (four fragments in parallel)
[Diagram: four slimmed cores, each with Fetch/Decode, ALU (Execute), and Execution Context]
22. Add ALUs
Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs: SIMD processing
[Diagram: one Fetch/Decode unit feeding ALUs 1–8, with eight contexts (Ctx) backed by Shared Ctx Data]
23. Modifying the shader
Original compiled shader: processes one fragment using scalar ops on scalar registers

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)
24. Modifying the shader
New compiled shader: processes eight fragments using vector ops on vector registers

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul vec_o0, vec_r0, vec_r3
VEC8_mul vec_o1, vec_r1, vec_r3
VEC8_mul vec_o2, vec_r2, vec_r3
VEC8_mov o3, l(1.0)
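What "VEC8" means at the source level can be sketched in plain C++ (an illustration of the idea, not the deck's actual compiler output; a real compiler would map vec8 onto AVX registers or GPU ALU lanes):

// One instruction stream whose every operation drives eight fragments.
struct vec8 { float v[8]; };

inline vec8 vmul(vec8 a, vec8 b) {              // VEC8_mul
    vec8 r;
    for (int i = 0; i < 8; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

inline vec8 vmadd(vec8 a, vec8 b, vec8 c) {     // VEC8_madd: a*b + c
    vec8 r;
    for (int i = 0; i < 8; ++i) r.v[i] = a.v[i] * b.v[i] + c.v[i];
    return r;
}

inline vec8 vclamp01(vec8 a) {                  // VEC8_clmp to [0, 1]
    vec8 r;
    for (int i = 0; i < 8; ++i)
        r.v[i] = a.v[i] < 0.0f ? 0.0f : (a.v[i] > 1.0f ? 1.0f : a.v[i]);
    return r;
}

One fetch/decode of vmadd now does eight fragments' worth of math, which is exactly the amortization Idea #2 asks for.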
25. Modifying the shader
[Diagram: fragments 1–8 mapped onto ALUs 1–8, all executing <VEC8_diffuseShader> in lockstep from a single Fetch/Decode unit]
26. 128 fragments in parallel
16 cores = 128 ALUs, 16 simultaneous instruction streams
27. 128 [vertices / fragments / primitives / OpenCL work items] in parallel
[Diagram: the same 16-core chip processing vertices, primitives, and fragments]
28. But what about branches?
[Diagram: time (clocks) on the vertical axis, ALUs 1–8 across the top, all lanes running the shader below]

<unconditional shader code>
if (x > 0) {
  y = pow(x, exp);
  y *= Ks;
  refl = y + Ka;
} else {
  x = 0;
  refl = Ka;
}
<resume unconditional shader code>
29. But what about branches?
[Diagram: the lanes evaluate the predicate; the per-lane mask is T T F T F F F F, so lanes 1, 2, and 4 take the "if" path and the rest take the "else" path]
30. But what about branches?
[Diagram: both sides of the branch execute serially, with non-matching lanes masked off]
Not all ALUs do useful work! Worst case: 1/8 peak performance
31. But what about branches?
[Diagram: once both paths complete, all lanes resume the unconditional shader code together]
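The same branch, written as a CUDA kernel (a sketch; the names exp_, Ks, and Ka mirror the slide, everything else is illustrative). Within a SIMD group the hardware runs both sides of the branch one after the other, masking off the lanes whose predicate does not match, so a group whose x values straddle zero pays for both paths:

// Divergent branch: lanes in one warp that disagree on (x > 0)
// serialize the two paths instead of running them concurrently.
__global__ void reflectance(const float* xin, float exp_, float Ks, float Ka,
                            float* reflOut, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = xin[i];
    float refl;
    if (x > 0.0f) {            // lanes with x > 0 execute this path first...
        float y = powf(x, exp_);
        y *= Ks;
        refl = y + Ka;
    } else {                   // ...then the masked-off lanes execute this one
        refl = Ka;
    }
    reflOut[i] = refl;
}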
32. Clarification
SIMD processing does not imply SIMD instructions
• Option 1: explicit vector instructions
  – x86 SSE, AVX, Intel Larrabee
• Option 2: scalar instructions, implicit HW vectorization
  – HW determines instruction stream sharing across ALUs (amount of sharing hidden from software)
  – NVIDIA GeForce (“SIMT” warps), ATI Radeon architectures (“wavefronts”)
In practice: 16 to 64 fragments share an instruction stream.
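Option 2 is what CUDA exposes: the code is written as scalar, per-thread logic, and the hardware picks the sharing width. A minimal sketch of how that width shows up in software (warpSize is 32 on NVIDIA parts; AMD wavefronts are 64 wide):

#include <cstdio>

// Each thread runs the same scalar instruction stream; threads within a
// warp execute it in lockstep. Lane 0 of each warp reports in.
__global__ void whoAmI() {
    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;
    if (lane == 0)
        printf("block %d, warp %d begins at thread %d\n",
               blockIdx.x, warp, threadIdx.x);
}

int main() {
    whoAmI<<<2, 128>>>();        // 2 blocks x 128 threads = 8 warps of 32
    cudaDeviceSynchronize();     // wait so the device printf output is flushed
    return 0;
}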
33. Stalls!
Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
Texture access latency = 100’s to 1000’s of cycles
We’ve removed the fancy caches and logic that help avoid stalls.
34. But we have LOTS of independent fragments.
Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations.
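In CUDA terms, Idea #3 is simply oversubscription: launch far more threads than there are ALUs, so that while one group waits on a memory load the scheduler can issue another group's instructions (a sketch; the kernel is deliberately trivial):

// While one warp stalls on its in[i] load, the scheduler switches to
// another resident warp, so the ALUs stay busy.
__global__ void scale(const float* in, float* out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * in[i];   // load (long latency), then math, then store
}

// Example launch: 1M elements in 256-thread blocks gives ~4096 blocks,
// i.e. thousands of warps to interleave on a core with only ~200 ALUs.
// scale<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);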
35. Hiding shader stalls
[Diagram: one core (Fetch/Decode, ALUs 1–8, eight Ctx slots, Shared Ctx Data) working on Frag 1 … 8, with time (clocks) running downward]
36. Hiding shader stalls
[Diagram: the core's context storage split into four groups (1–4), holding Frag 1 … 8, Frag 9 … 16, Frag 17 … 24, and Frag 25 … 32]
43. Four large contexts (low latency hiding ability)
[Diagram: the same core with its context storage divided into only four large contexts (1–4)]
44. Clarification
Interleaving between contexts can be managed by hardware or software (or both!)
• NVIDIA / AMD Radeon GPUs
  – HW schedules / manages all contexts (lots of them)
  – Special on-chip storage holds fragment state
• Intel Larrabee
  – HW manages four x86 (big) contexts at fine granularity
  – SW scheduling interleaves many groups of fragments on each HW context
  – L1-L2 cache holds fragment state (as determined by SW)
45. Example chip
16 cores
8 mul-add ALUs per core (128 total)
16 simultaneous instruction streams
64 concurrent (but interleaved) instruction streams
512 concurrent fragments
= 256 GFLOPS @ 1 GHz (128 ALUs × 2 FLOPS per mul-add × 1 GHz)
46. Summary: three key ideas
1. Use many “slimmed down cores” to run in parallel
2. Pack cores full of ALUs (by sharing an instruction stream across groups of fragments)
   – Option 1: Explicit SIMD vector instructions
   – Option 2: Implicit sharing managed by hardware
3. Avoid latency stalls by interleaving execution of many groups of fragments
   – When one group stalls, work on another group
47. Part 2: Putting the three ideas into practice: a closer look at real GPUs
NVIDIA GeForce GTX 680
AMD Radeon HD 7970
48. Disclaimer
• The following slides describe “a reasonable way to think” about the architecture of commercial GPUs
• Many factors play a role in actual chip performance
49. NVIDIA GeForce GTX 680 (Kepler)
• NVIDIA-speak:
  – 1536 stream processors (“CUDA cores”)
  – “SIMT execution”
• Generic speak:
  – 8 cores
  – 6 groups of 32-wide SIMD functional units per core (8 × 6 × 32 = 1536 ALUs)
50. NVIDIA GeForce GTX 680 “core”
[Diagram: Fetch/Decode feeding a row of SIMD function units (control shared across 32 units, 1 MUL-ADD per clock), execution contexts (256 KB), and “shared” memory (16+48 KB)]
• Groups of 32 [fragments / vertices / CUDA threads] share an instruction stream
• Up to 64 groups are simultaneously interleaved
• Up to 2048 individual contexts can be stored
Source: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
51. NVIDIA GeForce GTX 680 “core”
[Diagram: six Fetch/Decode units feeding the SIMD function units (control shared across 32 units, 1 MUL-ADD per clock), execution contexts (128 KB), and “shared” memory (16+48 KB)]
• The core contains 192 functional units
• Six groups are selected each clock (decode, fetch, and execute six instruction streams in parallel)
Source: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
52. NVIDIA GeForce GTX 680 “SMX”
[Diagram: six Fetch/Decode units feeding CUDA cores (1 MUL-ADD per clock), execution contexts (128 KB), and “shared” memory (16+48 KB)]
• The SMX contains 192 CUDA cores
• Six warps are selected each clock (decode, fetch, and execute six warps in parallel)
• Up to 64 warps are interleaved, totaling 2048 CUDA threads
Source: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
53. NVIDIA GeForce GTX 680
There are 8 of these SMXs on the GTX 680: that’s 16,384 fragments, or 16,384 CUDA threads! (8 SMXs × 2048 threads each)
54. AMD Radeon HD 7970 (Tahiti)
• AMD-speak:
  – 2048 stream processors
  – “GCN”: Graphics Core Next
• Generic speak:
  – 32 cores
  – 4 groups of 64-wide SIMD functional units per core
55. AMD Radeon HD 7970 “core”
[Diagram: Fetch/Decode feeding SIMD function units (control shared across 64 units, 1 MUL-ADD per clock), four execution contexts (64 KB each), and “shared” memory (64 KB)]
• Groups of 64 [fragments / vertices / OpenCL work items] share an instruction stream
• Up to 40 groups are simultaneously interleaved
• Up to 2560 individual contexts can be stored
Source: http://developer.amd.com/afds/assets/presentations/2620_final.pdf
http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute
56. AMD Radeon HD 7970 “core”
[Diagram: four Fetch/Decode units, SIMD function units (control shared across 64 units, 1 MUL-ADD per clock), four execution contexts (64 KB each), and “shared” memory (64 KB)]
• The core contains 64 functional units
• Four clocks to execute an instruction for all fragments in a group
• Four groups are executed in parallel each clock (decode, fetch, and execute four instruction streams in parallel)
Source: http://developer.amd.com/afds/assets/presentations/2620_final.pdf
http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute
57. AMD Radeon HD 7970 “Compute Unit”
[Diagram: four Fetch/Decode units, SIMD function units (control shared across 64 units, 1 MUL-ADD per clock), four execution contexts (64 KB each), and “shared” memory (64 KB)]
• Groups of 64 [fragments/vertices/etc.] are a wavefront
• Four wavefronts are executed each clock
• Up to 40 wavefronts are interleaved, totaling 2560 threads
Source: http://developer.amd.com/afds/assets/presentations/2620_final.pdf
http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute
58. AMD Radeon HD 7970
There are 32 of these compute units on the HD 7970: that’s 81,920 fragments! (32 CUs × 2560 threads each)
59. The talk thus far: processing data
Part 3: moving data to processors
60. Recall: “CPU-style” core
[Diagram: Fetch/Decode, ALU, and Execution Context, plus OOO exec logic, a branch predictor, and a big data cache]
61. “CPU-style” memory hierarchy
[Diagram: the CPU-style core behind an L1 cache (32 KB), an L2 cache (256 KB), and an L3 cache (8 MB, shared across cores), with 25 GB/sec to memory]
CPU cores run efficiently when data is resident in cache (caches reduce latency, provide high bandwidth)
62. Throughput core (GPU-style)
[Diagram: Fetch/Decode, ALUs 1–8, and execution contexts (256 KB), connected to memory at 280 GB/sec]
More ALUs, no large traditional cache hierarchy: need a high-bandwidth connection to memory
63. Bandwidth is a critical resource
– A high-end GPU (e.g. Radeon HD 7970) has...
  • Over twenty times the compute performance (4.1 TFLOPS) of a quad-core CPU
  • No large cache hierarchy to absorb memory requests
– The GPU memory system is designed for throughput
  • Wide bus (280 GB/sec)
  • Repack/reorder/interleave memory requests to maximize use of the memory bus
  • Still, this is only about ten times the bandwidth available to the CPU
64. Bandwidth thought experiment
Task: element-wise multiply-add over long vectors: D = A × B + C
1. Load input A[i]
2. Load input B[i]
3. Load input C[i]
4. Compute A[i] × B[i] + C[i]
5. Store result into D[i]
Four memory operations (16 bytes) for every MUL-ADD
Radeon HD 7970 can do 2048 MUL-ADDs per clock
Need ~32 TB/sec of bandwidth to keep the functional units busy (2048 × 16 bytes × 1 GHz)
Less than 1% efficiency… but 6x faster than the CPU!
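The whole experiment fits in a three-line CUDA kernel (a sketch, with hypothetical device pointers d_A … d_D):

// D = A * B + C, element-wise. Per element: three 4-byte loads and one
// 4-byte store (16 bytes of traffic) buy a single MUL-ADD (2 FLOPS), an
// arithmetic intensity of 2 / 16 = 0.125 FLOPS per byte. That ratio, not
// the ALU count, is what limits this kernel.
__global__ void muladd(const float* A, const float* B, const float* C,
                       float* D, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        D[i] = A[i] * B[i] + C[i];
}

// muladd<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, d_D, n);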
65. Bandwidth limited!
If processors request data at too high a rate, the memory system cannot keep up.
No amount of latency hiding helps this.
Overcoming bandwidth limits is a common challenge for GPU-compute application developers.
66. Reducing bandwidth requirements
• Request data less often (instead, do more math)
  – “arithmetic intensity”
• Fetch data from memory less often (share/reuse data across fragments)
  – on-chip communication or storage
67. Reducing bandwidth requirements
• Two examples of on-chip storage
  – Texture caches
  – OpenCL “local memory” (CUDA shared memory)
[Diagram: fragments 1–4 sampling overlapping texels; texture caches capture reuse across fragments, not temporal reuse within a single shader program]
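A minimal sketch of the second example, CUDA shared memory used to cut DRAM traffic (illustrative; assumes a launch with blockDim.x == TILE): each block stages a tile of the input once, then every thread's 3-point stencil reuses its neighbours' loads from on-chip storage instead of re-fetching them from memory.

#define TILE 256

// 3-point moving average; without the shared tile, each input element
// would be loaded from DRAM up to three times.
__global__ void stencil3(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2];               // +2 halo slots
    int g = blockIdx.x * blockDim.x + threadIdx.x; // global index
    int l = threadIdx.x + 1;                       // local index past the halo

    tile[l] = (g < n) ? in[g] : 0.0f;              // each thread stages one element
    if (threadIdx.x == 0)                          // left halo
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)             // right halo
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();                               // tile fully staged before any reads

    if (g < n)
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}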
68. Modern GPU memory hierarchy
[Diagram: the throughput core (Fetch/Decode, ALUs 1–8, execution contexts 256 KB) with a read-only texture cache and 64 KB of shared “local” storage or L1 cache, in front of an L2 cache (~1 MB) and memory]
On-chip storage takes load off the memory system.
Many developers are calling for more cache-like storage (particularly for GPU-compute applications)
69. Don’t forget about offload cost…
• PCIe bandwidth/latency
  – 8 GB/s each direction in practice
  – Attempt to pipeline/multi-buffer uploads and downloads
• Dispatch latency
  – O(10) usec to dispatch from CPU to GPU
  – This means the offload cost is O(10M) instructions
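The "pipeline/multi-buffer" advice, sketched with CUDA streams (illustrative; someKernel stands in for the real work, and h_in/h_out are assumed to be pinned with cudaMallocHost so the async copies can overlap):

__global__ void someKernel(float* d, int m) {      // placeholder work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m) d[i] *= 2.0f;
}

// Split the data into chunks; alternate two streams/buffers so chunk
// k+1's upload overlaps chunk k's kernel and download.
void processPipelined(const float* h_in, float* h_out, float* d_buf[2],
                      int n, int chunk)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int off = 0, k = 0; off < n; off += chunk, ++k) {
        int m = (n - off < chunk) ? n - off : chunk;
        cudaStream_t st = s[k % 2];
        float* d = d_buf[k % 2];

        cudaMemcpyAsync(d, h_in + off, m * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        someKernel<<<(m + 255) / 256, 256, 0, st>>>(d, m);
        cudaMemcpyAsync(h_out + off, d, m * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}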
70. Heterogeneous devices to the rescue?
• Tighter integration of CPU- and GPU-style cores
  – Reduce offload cost
  – Reduce memory copies/transfers
  – Power management
• Industry shifting rapidly in this direction
  – AMD Fusion APUs
  – NVIDIA Tegra 3
  – Intel SNB/IVB
  – Apple A4 and A5
  – Qualcomm Snapdragon
  – TI OMAP
  – …