SlideShare a Scribd company logo
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
1
GPU Programming
Lecture 7
Atomic Operations and
Histogramming
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
2
Objective
• To understand atomic operations
– Read-modify-write in parallel computation
– Use of atomic operations in CUDA
– Why atomic operations reduce memory system throughput
– How to avoid atomic operations in some parallel algorithms
• Histogramming as an example application of atomic
operations
– Basic histogram algorithm
– Privatization
A Common Collaboration Pattern
• Multiple bank tellers count the total amount of cash in
the safe
• Each grabs a pile and counts
• Have a central display of the running total
• Whenever someone finishes counting a pile, add the
subtotal of the pile to the running total
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
3
A Common Parallel Coordination
Pattern
• Multiple customer service agents serving customers
• Each customer gets a number
• A central display shows the number of the next
customer who will be served
• When an agent becomes available, he/she calls the
number and he/she adds 1 to the display
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
4
A Common Arbitration Pattern
• Multiple customers booking air tickets
• Each
– Brings up a flight seat map
– Decides on a seat
– Update the the seat map, mark the seat as taken
• A bad outcome
– Multiple passengers ended up booking the same seat
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
5
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
6
Atomic Operations
If Mem[x] was initially 0, what would the value of
Mem[x] be after threads 1 and 2 have completed?
– What does each thread get in their Old variable?
The answer may vary due to data races. To avoid
data races, you should use atomic operations
thread1: thread2:Old  Mem[x]
New  Old + 1
Mem[x]  New
Old  Mem[x]
New  Old + 1
Mem[x]  New
Timing Scenario #1
Time Thread 1 Thread 2
1 (0) Old  Mem[x]
2 (1) New  Old + 1
3 (1) Mem[x]  New
4 (1) Old  Mem[x]
5 (2) New  Old + 1
6 (2) Mem[x]  New
• Thread 1 Old = 0
• Thread 2 Old = 1
• Mem[x] = 2 after the sequence
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
7
Timing Scenario #2
Time Thread 1 Thread 2
1 (0) Old  Mem[x]
2 (1) New  Old + 1
3 (1) Mem[x]  New
4 (1) Old  Mem[x]
5 (2) New  Old + 1
6 (2) Mem[x]  New
• Thread 1 Old = 1
• Thread 2 Old = 0
• Mem[x] = 2 after the sequence
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
8
Timing Scenario #3
Time Thread 1 Thread 2
1 (0) Old  Mem[x]
2 (1) New  Old + 1
3 (0) Old  Mem[x]
4 (1) Mem[x]  New
5 (1) New  Old + 1
6 (1) Mem[x]  New
• Thread 1 Old = 0
• Thread 2 Old = 0
• Mem[x] = 1 after the sequence
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
9
Timing Scenario #4
Time Thread 1 Thread 2
1 (0) Old  Mem[x]
2 (1) New  Old + 1
3 (0) Old  Mem[x]
4 (1) Mem[x]  New
5 (1) New  Old + 1
6 (1) Mem[x]  New
• Thread 1 Old = 0
• Thread 2 Old = 0
• Mem[x] = 1 after the sequence
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
10
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
11
Atomic Operations –
To Ensure Good Outcomes
thread1:
thread2:Old  Mem[x]
New  Old + 1
Mem[x]  New
Old  Mem[x]
New  Old + 1
Mem[x]  New
thread1:
thread2:Old  Mem[x]
New  Old + 1
Mem[x]  New
Old  Mem[x]
New  Old + 1
Mem[x]  New
Or
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
12
Without Atomic Operations
thread1:
thread2: Old  Mem[x]
New  Old + 1
Mem[x]  New
Old  Mem[x]
New  Old + 1
Mem[x]  New
• Both threads receive 0
• Mem[x] becomes 1
Mem[x] initialized to 0
Atomic Operations in General
• Performed by a single ISA instruction on a memory
location address
– Read the old value, calculate a new value, and write the new
value to the location
• The hardware ensures that no other threads can access
the location until the atomic operation is complete
– Any other threads that access the location will typically be
held in a queue until its turn
– All threads perform the atomic operation serially
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
13
Atomic Operations in CUDA
• Function calls that are translated into single
instructions (a.k.a. intrinsics)
– Atomic add, sub, inc, dec, min, max, exch (exchange), CAS
(compare and swap)
– Read CUDA C programming Guide 6.5 for details
• Atomic Add
int atomicAdd(int* address, int val);
reads the 32-bit word old pointed to by address in global or
shared memory, computes (old + val), and stores the result
back to memory at the same address. The function returns old.
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
14
More Atomic Adds in CUDA
• Unsigned 32-bit integer atomic add
unsigned int atomicAdd(unsigned int* address, unsigned int val);
• Unsigned 64-bit integer atomic add
unsigned long long int atomicAdd(unsigned long long int*
address, unsigned long long int val);
• Single-precision floating-point atomic add (capability >
2.0)
– float atomicAdd(float* address, float val);
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
15
Histogramming
• A method for extracting notable features and patterns
from large data sets
– Feature extraction for object recognition in images
– Fraud detection in credit card transactions
– Correlating heavenly object movements in astrophysics
– …
• Basic histograms - for each element in the data set,
use the value to identify a “bin” to increment
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
16
A Histogram Example
• In phrase “Programming Massively Parallel
Processors” build a histogram of frequencies of each
letter
• A(4), C(1), E(1), G(1), …
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
17
• How do you do this in parallel?
– Have each thread to take a section of the input
– For each input letter, use atomic operations to build the
histogram
Iteration #1 – 1st letter in each section
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
18
P R O G R A M M I N G M A V
I
S
S Y
L
E P
Thread 0 Thread 1 Thread 2 Thread 3
A B C D E
1
F G H I J K L M
2
N O P
1
Q R S T U V
Iteration #2 – 2nd letter in each section
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
19
P R O G R A M M I N G M A V
I
S
S Y
L
E P
Thread 0 Thread 1 Thread 2 Thread 3
A
1
B C D E
1
F G H I J K L
1
M
3
N O P
1
Q R
1
S T U V
Iteration #3
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
20
P R O G R A M M I N G M A V
I
S
S Y
L
E P
Thread 0 Thread 1 Thread 2 Thread 3
A
1
B C D E
1
F G H I
1
J K L
1
M
3
N O
1
P
1
Q R
1
S
1
T U V
Iteration #4
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
21
P R O G R A M M I N G M A V
I
S
S Y
L
E P
Thread 0 Thread 1 Thread 2 Thread 3
A
1
B C D E
1
F G
1
H I
1
J K L
1
M
3
N
1
O
1
P
1
Q R
1
S
2
T U V
Iteration #5
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
22
P R O G R A M M I N G M A V
I
S
S Y
L
E P
Thread 0 Thread 1 Thread 2 Thread 3
A
1
B C D E
1
F G
1
H I
1
J K L
1
M
3
N
1
O
1
P
2
Q R
2
S
2
T U V
What is wrong with the algorithm?
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
23
What is wrong with the algorithm?
• Reads from the input array are not coalesced
– Assign inputs to each thread in a strided pattern
– Adjacent threads process adjacent input letters
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
24
P R O G R A M M I N G M A V
I
S
S Y
L
E P
Thread 0 Thread 1 Thread 2 Thread 3
A B C D E F G
1
H I J K L M N O
1
P
1
Q R
1
S T U V
Iteration 2
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
25
P R O G R A M M I N G M A V
I
S
S Y
L
E P
Thread 0 Thread 1 Thread 2 Thread 3
A
1
B C D E F G
1
H I J K L M
2
N O
1
P
1
Q R
2
S T U V
• All threads move to the next section of input
A Histogram Kernel
• The kernel receives a pointer to the input buffer
• Each thread process the input in a strided pattern
__global__ void histo_kernel(unsigned char *buffer,
long size, unsigned int *histo)
{
int i = threadIdx.x + blockIdx.x * blockDim.x;
// stride is total number of threads
int stride = blockDim.x * gridDim.x;
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
26
More on the Histogram Kernel
// All threads handle blockDim.x * gridDim.x
// consecutive elements
while (i < size) {
atomicAdd( &(histo[buffer[i]]), 1);
i += stride;
}
}
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
27
Atomic Operations on DRAM
• An atomic operation
starts with a read, with a
latency of a few hundred
cycles
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
28
Atomic Operations on DRAM
• An atomic operation
starts with a read, with a
latency of a few hundred
cycles
• The atomic operation
ends with a write, with a
latency of a few hundred
cycles
• During this whole time,
no one else can access
the location
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
29
Atomic Operations on DRAM
• Each Load-Modify-Store has two full memory access
delays
– All atomic operations on the same variable (RAM location)
are serialized
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
DRAM delay DRAM delay
transfer delay
internal routing
DRAM delay
transfer delay
internal routing
..
atomic operation N atomic operation N+1
time
30
Latency determines throughput of
atomic operations
• Throughput of an atomic operation is the rate at which
the application can execute an atomic operation on a
particular location.
• The rate is limited by the total latency of the read-
modify-write sequence, typically more than 1000
cycles for global memory (DRAM) locations.
• This means that if many threads attempt to do atomic
operation on the same location (contention), the
memory bandwidth is reduced to < 1/1000!
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
31
You may have a similar experience in
supermarket checkout
• Some customers realize that they missed an item after
they started to check out
• They run to the isle and get the item while the line
waits
– The rate of check is reduced due to the long latency of
running to the isle and back.
• Imagine a store where every customer starts the check
out before they even fetch any of the items
– The rate of the checkout will be 1 / (entire shopping time of
each customer)
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
32
Hardware Improvements (cont.)
• Atomic operations on Fermi L2 cache
– medium latency, but still serialized
– Global to all blocks
– “Free improvement” on Global Memory atomics
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
internal routing
..
atomic operation N atomic operation N+1
time
data transfer data transfer
33
Hardware Improvements
• Atomic operations on Shared Memory
– Very short latency, but still serialized
– Private to each thread block
– Need algorithm work by programmers (more later)
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
internal routing
..
atomic operation N atomic operation N+1
time
data transfer
34
Atomics in Shared Memory Requires
Privatization
• Create private copies of the histo[] array for each
thread block
__global__ void histo_kernel(unsigned char *buffer,
long size, unsigned int *histo)
{
__shared__ unsigned int histo_private[256];
if (threadIdx.x < 256) histo_private[threadidx.x] = 0;
__syncthreads();
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
35
Build Private Histogram
int i = threadIdx.x + blockIdx.x * blockDim.x;
// stride is total number of threads
int stride = blockDim.x * gridDim.x;
while (i < size) {
atomicAdd( &(private_histo[buffer[i]), 1);
i += stride;
}
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
36
Build Final Histogram
// wait for all other threads in the block to finish
__syncthreads();
if (threadIdx.x < 256)
atomicAdd( &(histo[threadIdx.x]),
private_histo[threadIdx.x] );
}
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
37
More on Privatization
• Privatization is a powerful and frequently used
techniques for parallelizing applications
• The operation needs to be associative and
commutative
– Histogram add operation is associative and commutative
• The histogram size needs to be small
– Fits into shared memory
• What if the histogram is too large to privatize?
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-
2012
38
ANY MORE QUESTIONS
© David Kirk/NVIDIA and Wen-mei W. Hwu
ECE408/CS483/ECE498al, University of Illinois, 2007-2012
39

More Related Content

Similar to GPU_Lecture_7.pptx

AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)Fellowship at Vodafone FutureLab
 
MDX challenge 2021 town hall presentation
MDX challenge 2021 town hall presentationMDX challenge 2021 town hall presentation
MDX challenge 2021 town hall presentationChin Yun Yu
 
Critical Path Method/ Program Evaluation and Review Technique
Critical Path Method/ Program Evaluation and Review TechniqueCritical Path Method/ Program Evaluation and Review Technique
Critical Path Method/ Program Evaluation and Review TechniqueJomari Gingo Selibio
 
4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdf
4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdf4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdf
4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdfNAWAZURREHMANAWAN
 
Chap02 data manipulation
Chap02   data manipulationChap02   data manipulation
Chap02 data manipulationZohair Pia
 
Lecture2 cuda spring 2010
Lecture2 cuda spring 2010Lecture2 cuda spring 2010
Lecture2 cuda spring 2010haythem_2015
 
Comtrade file manipulationver1
Comtrade file manipulationver1Comtrade file manipulationver1
Comtrade file manipulationver1Jay Gosalia
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...NVIDIA Taiwan
 
Bw tree presentation
Bw tree presentationBw tree presentation
Bw tree presentationDaeIn Lee
 
Asufe juniors-training session2
Asufe juniors-training session2Asufe juniors-training session2
Asufe juniors-training session2Omar Ahmed
 
Implementing the latest embedded component technology from concept-to-manufac...
Implementing the latest embedded component technology from concept-to-manufac...Implementing the latest embedded component technology from concept-to-manufac...
Implementing the latest embedded component technology from concept-to-manufac...Zuken
 
Comparative analysis of multi stage cordic using micro rotation technique
Comparative analysis of multi stage cordic using micro rotation techniqueComparative analysis of multi stage cordic using micro rotation technique
Comparative analysis of multi stage cordic using micro rotation techniqueIAEME Publication
 
Lecture 1 introduction to microcontroller systems
Lecture 1   introduction to microcontroller systemsLecture 1   introduction to microcontroller systems
Lecture 1 introduction to microcontroller systemsfawaz yaseen
 
Small Computer Network - Student Task
Small Computer Network - Student TaskSmall Computer Network - Student Task
Small Computer Network - Student Taskedsonjacobs
 
Software Based calculations of Electrical Machine Design
Software Based calculations of Electrical Machine DesignSoftware Based calculations of Electrical Machine Design
Software Based calculations of Electrical Machine Designvivatechijri
 

Similar to GPU_Lecture_7.pptx (20)

AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
 
MDX challenge 2021 town hall presentation
MDX challenge 2021 town hall presentationMDX challenge 2021 town hall presentation
MDX challenge 2021 town hall presentation
 
Critical Path Method/ Program Evaluation and Review Technique
Critical Path Method/ Program Evaluation and Review TechniqueCritical Path Method/ Program Evaluation and Review Technique
Critical Path Method/ Program Evaluation and Review Technique
 
4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdf
4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdf4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdf
4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdf
 
Chap02 data manipulation
Chap02   data manipulationChap02   data manipulation
Chap02 data manipulation
 
Lecture2 cuda spring 2010
Lecture2 cuda spring 2010Lecture2 cuda spring 2010
Lecture2 cuda spring 2010
 
Comtrade file manipulationver1
Comtrade file manipulationver1Comtrade file manipulationver1
Comtrade file manipulationver1
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
 
Bw tree presentation
Bw tree presentationBw tree presentation
Bw tree presentation
 
Asufe juniors-training session2
Asufe juniors-training session2Asufe juniors-training session2
Asufe juniors-training session2
 
SmartSensors, Piezotory, IDM8
SmartSensors, Piezotory, IDM8SmartSensors, Piezotory, IDM8
SmartSensors, Piezotory, IDM8
 
Implementing the latest embedded component technology from concept-to-manufac...
Implementing the latest embedded component technology from concept-to-manufac...Implementing the latest embedded component technology from concept-to-manufac...
Implementing the latest embedded component technology from concept-to-manufac...
 
Maquina estado
Maquina estadoMaquina estado
Maquina estado
 
Comparative analysis of multi stage cordic using micro rotation technique
Comparative analysis of multi stage cordic using micro rotation techniqueComparative analysis of multi stage cordic using micro rotation technique
Comparative analysis of multi stage cordic using micro rotation technique
 
Lecture 1 introduction to microcontroller systems
Lecture 1   introduction to microcontroller systemsLecture 1   introduction to microcontroller systems
Lecture 1 introduction to microcontroller systems
 
Small Computer Network - Student Task
Small Computer Network - Student TaskSmall Computer Network - Student Task
Small Computer Network - Student Task
 
Secure 2 Party AES
Secure 2 Party AESSecure 2 Party AES
Secure 2 Party AES
 
Thesis_Misal
Thesis_MisalThesis_Misal
Thesis_Misal
 
Software Based calculations of Electrical Machine Design
Software Based calculations of Electrical Machine DesignSoftware Based calculations of Electrical Machine Design
Software Based calculations of Electrical Machine Design
 
6 osi vimp
6 osi vimp6 osi vimp
6 osi vimp
 

Recently uploaded

KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems ApproachNeo4j
 
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Andrea Goulet
 
IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024vaibhav130304
 
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfmbmh111980
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAlluxio, Inc.
 
Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Soroosh Khodami
 
Breaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfBreaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfMeon Technology
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfkalichargn70th171
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAlluxio, Inc.
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowPeter Caitens
 
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Gáspár Nagy
 
10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdfkalichargn70th171
 
INGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignINGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignNeo4j
 
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1KnowledgeSeed
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024Shane Coughlan
 
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesGraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesNeo4j
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationWave PLM
 

Recently uploaded (20)

KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
 
IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024
 
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
 
Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024
 
Breaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfBreaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdf
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
 
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
 
10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf
 
INGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignINGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by Design
 
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
 
Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024
 
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesGraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
 

GPU_Lecture_7.pptx

  • 1. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 1 GPU Programming Lecture 7 Atomic Operations and Histogramming
  • 2. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 2 Objective • To understand atomic operations – Read-modify-write in parallel computation – Use of atomic operations in CUDA – Why atomic operations reduce memory system throughput – How to avoid atomic operations in some parallel algorithms • Histogramming as an example application of atomic operations – Basic histogram algorithm – Privatization
  • 3. A Common Collaboration Pattern • Multiple bank tellers count the total amount of cash in the safe • Each grabs a pile and counts • Have a central display of the running total • Whenever someone finishes counting a pile, add the subtotal of the pile to the running total © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 3
  • 4. A Common Parallel Coordination Pattern • Multiple customer service agents serving customers • Each customer gets a number • A central display shows the number of the next customer who will be served • When an agent becomes available, he/she calls the number and he/she adds 1 to the display © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 4
  • 5. A Common Arbitration Pattern • Multiple customers booking air tickets • Each – Brings up a flight seat map – Decides on a seat – Update the the seat map, mark the seat as taken • A bad outcome – Multiple passengers ended up booking the same seat © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 5
  • 6. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 6 Atomic Operations If Mem[x] was initially 0, what would the value of Mem[x] be after threads 1 and 2 have completed? – What does each thread get in their Old variable? The answer may vary due to data races. To avoid data races, you should use atomic operations thread1: thread2:Old  Mem[x] New  Old + 1 Mem[x]  New Old  Mem[x] New  Old + 1 Mem[x]  New
  • 7. Timing Scenario #1 Time Thread 1 Thread 2 1 (0) Old  Mem[x] 2 (1) New  Old + 1 3 (1) Mem[x]  New 4 (1) Old  Mem[x] 5 (2) New  Old + 1 6 (2) Mem[x]  New • Thread 1 Old = 0 • Thread 2 Old = 1 • Mem[x] = 2 after the sequence © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 7
  • 8. Timing Scenario #2 Time Thread 1 Thread 2 1 (0) Old  Mem[x] 2 (1) New  Old + 1 3 (1) Mem[x]  New 4 (1) Old  Mem[x] 5 (2) New  Old + 1 6 (2) Mem[x]  New • Thread 1 Old = 1 • Thread 2 Old = 0 • Mem[x] = 2 after the sequence © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 8
  • 9. Timing Scenario #3 Time Thread 1 Thread 2 1 (0) Old  Mem[x] 2 (1) New  Old + 1 3 (0) Old  Mem[x] 4 (1) Mem[x]  New 5 (1) New  Old + 1 6 (1) Mem[x]  New • Thread 1 Old = 0 • Thread 2 Old = 0 • Mem[x] = 1 after the sequence © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 9
  • 10. Timing Scenario #4 Time Thread 1 Thread 2 1 (0) Old  Mem[x] 2 (1) New  Old + 1 3 (0) Old  Mem[x] 4 (1) Mem[x]  New 5 (1) New  Old + 1 6 (1) Mem[x]  New • Thread 1 Old = 0 • Thread 2 Old = 0 • Mem[x] = 1 after the sequence © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 10
  • 11. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 11 Atomic Operations – To Ensure Good Outcomes thread1: thread2:Old  Mem[x] New  Old + 1 Mem[x]  New Old  Mem[x] New  Old + 1 Mem[x]  New thread1: thread2:Old  Mem[x] New  Old + 1 Mem[x]  New Old  Mem[x] New  Old + 1 Mem[x]  New Or
  • 12. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 12 Without Atomic Operations thread1: thread2: Old  Mem[x] New  Old + 1 Mem[x]  New Old  Mem[x] New  Old + 1 Mem[x]  New • Both threads receive 0 • Mem[x] becomes 1 Mem[x] initialized to 0
  • 13. Atomic Operations in General • Performed by a single ISA instruction on a memory location address – Read the old value, calculate a new value, and write the new value to the location • The hardware ensures that no other threads can access the location until the atomic operation is complete – Any other threads that access the location will typically be held in a queue until its turn – All threads perform the atomic operation serially © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 13
  • 14. Atomic Operations in CUDA • Function calls that are translated into single instructions (a.k.a. intrinsics) – Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap) – Read CUDA C programming Guide 6.5 for details • Atomic Add int atomicAdd(int* address, int val); reads the 32-bit word old pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 14
  • 15. More Atomic Adds in CUDA • Unsigned 32-bit integer atomic add unsigned int atomicAdd(unsigned int* address, unsigned int val); • Unsigned 64-bit integer atomic add unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val); • Single-precision floating-point atomic add (capability > 2.0) – float atomicAdd(float* address, float val); © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 15
  • 16. Histogramming • A method for extracting notable features and patterns from large data sets – Feature extraction for object recognition in images – Fraud detection in credit card transactions – Correlating heavenly object movements in astrophysics – … • Basic histograms - for each element in the data set, use the value to identify a “bin” to increment © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 16
  • 17. A Histogram Example • In phrase “Programming Massively Parallel Processors” build a histogram of frequencies of each letter • A(4), C(1), E(1), G(1), … © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 17 • How do you do this in parallel? – Have each thread to take a section of the input – For each input letter, use atomic operations to build the histogram
  • 18. Iteration #1 – 1st letter in each section © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 18 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A B C D E 1 F G H I J K L M 2 N O P 1 Q R S T U V
  • 19. Iteration #2 – 2nd letter in each section © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 19 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A 1 B C D E 1 F G H I J K L 1 M 3 N O P 1 Q R 1 S T U V
  • 20. Iteration #3 © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 20 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A 1 B C D E 1 F G H I 1 J K L 1 M 3 N O 1 P 1 Q R 1 S 1 T U V
  • 21. Iteration #4 © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 21 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A 1 B C D E 1 F G 1 H I 1 J K L 1 M 3 N 1 O 1 P 1 Q R 1 S 2 T U V
  • 22. Iteration #5 © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 22 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A 1 B C D E 1 F G 1 H I 1 J K L 1 M 3 N 1 O 1 P 2 Q R 2 S 2 T U V
  • 23. What is wrong with the algorithm? © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 23
  • 24. What is wrong with the algorithm? • Reads from the input array are not coalesced – Assign inputs to each thread in a strided pattern – Adjacent threads process adjacent input letters © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 24 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A B C D E F G 1 H I J K L M N O 1 P 1 Q R 1 S T U V
  • 25. Iteration 2 © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 25 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A 1 B C D E F G 1 H I J K L M 2 N O 1 P 1 Q R 2 S T U V • All threads move to the next section of input
  • 26. A Histogram Kernel • The kernel receives a pointer to the input buffer • Each thread process the input in a strided pattern __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo) { int i = threadIdx.x + blockIdx.x * blockDim.x; // stride is total number of threads int stride = blockDim.x * gridDim.x; © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 26
  • 27. More on the Histogram Kernel // All threads handle blockDim.x * gridDim.x // consecutive elements while (i < size) { atomicAdd( &(histo[buffer[i]]), 1); i += stride; } } © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 27
  • 28. Atomic Operations on DRAM • An atomic operation starts with a read, with a latency of a few hundred cycles © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 28
  • 29. Atomic Operations on DRAM • An atomic operation starts with a read, with a latency of a few hundred cycles • The atomic operation ends with a write, with a latency of a few hundred cycles • During this whole time, no one else can access the location © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 29
  • 30. Atomic Operations on DRAM • Each Load-Modify-Store has two full memory access delays – All atomic operations on the same variable (RAM location) are serialized © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 DRAM delay DRAM delay transfer delay internal routing DRAM delay transfer delay internal routing .. atomic operation N atomic operation N+1 time 30
  • 31. Latency determines throughput of atomic operations • Throughput of an atomic operation is the rate at which the application can execute an atomic operation on a particular location. • The rate is limited by the total latency of the read- modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations. • This means that if many threads attempt to do atomic operation on the same location (contention), the memory bandwidth is reduced to < 1/1000! © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 31
  • 32. You may have a similar experience in supermarket checkout • Some customers realize that they missed an item after they started to check out • They run to the isle and get the item while the line waits – The rate of check is reduced due to the long latency of running to the isle and back. • Imagine a store where every customer starts the check out before they even fetch any of the items – The rate of the checkout will be 1 / (entire shopping time of each customer) © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 32
  • 33. Hardware Improvements (cont.) • Atomic operations on Fermi L2 cache – medium latency, but still serialized – Global to all blocks – “Free improvement” on Global Memory atomics © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 internal routing .. atomic operation N atomic operation N+1 time data transfer data transfer 33
  • 34. Hardware Improvements • Atomic operations on Shared Memory – Very short latency, but still serialized – Private to each thread block – Need algorithm work by programmers (more later) © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 internal routing .. atomic operation N atomic operation N+1 time data transfer 34
  • 35. Atomics in Shared Memory Requires Privatization • Create private copies of the histo[] array for each thread block __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo) { __shared__ unsigned int histo_private[256]; if (threadIdx.x < 256) histo_private[threadidx.x] = 0; __syncthreads(); © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 35
  • 36. Build Private Histogram int i = threadIdx.x + blockIdx.x * blockDim.x; // stride is total number of threads int stride = blockDim.x * gridDim.x; while (i < size) { atomicAdd( &(private_histo[buffer[i]), 1); i += stride; } © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 36
  • 37. Build Final Histogram // wait for all other threads in the block to finish __syncthreads(); if (threadIdx.x < 256) atomicAdd( &(histo[threadIdx.x]), private_histo[threadIdx.x] ); } © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 37
  • 38. More on Privatization • Privatization is a powerful and frequently used techniques for parallelizing applications • The operation needs to be associative and commutative – Histogram add operation is associative and commutative • The histogram size needs to be small – Fits into shared memory • What if the histogram is too large to privatize? © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 38
  • 39. ANY MORE QUESTIONS © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 39