Submit Search
Upload
GPU_Lecture_7.pptx
•
Download as PPTX, PDF
•
0 likes
•
11 views
S
ssuser93ee14
Follow
gpu lecture research
Read less
Read more
Software
Slideshow view
Report
Share
Slideshow view
Report
Share
1 of 39
Download now
Recommended
Lecture5 cuda-memory-spring-2010
Lecture5 cuda-memory-spring-2010
douglaslyon
“New Methods for Implementation of 2-D Convolution for Convolutional Neural N...
“New Methods for Implementation of 2-D Convolution for Convolutional Neural N...
Edge AI and Vision Alliance
1LocationFixed CostsVariable Costs per unitA=BB=CC=DA$85,000260006.docx
1LocationFixed CostsVariable Costs per unitA=BB=CC=DA$85,000260006.docx
drennanmicah
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
Fellowship at Vodafone FutureLab
Data acquisition and storage in Wireless Sensor Network
Data acquisition and storage in Wireless Sensor Network
Rutvik Pensionwar
Using GPUs for parallel processing
Using GPUs for parallel processing
asm100
Topic 1 introduction
Topic 1 introduction
SangeethaBg
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
NibrasulIslam
Recommended
Lecture5 cuda-memory-spring-2010
Lecture5 cuda-memory-spring-2010
douglaslyon
“New Methods for Implementation of 2-D Convolution for Convolutional Neural N...
“New Methods for Implementation of 2-D Convolution for Convolutional Neural N...
Edge AI and Vision Alliance
1LocationFixed CostsVariable Costs per unitA=BB=CC=DA$85,000260006.docx
1LocationFixed CostsVariable Costs per unitA=BB=CC=DA$85,000260006.docx
drennanmicah
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
Fellowship at Vodafone FutureLab
Data acquisition and storage in Wireless Sensor Network
Data acquisition and storage in Wireless Sensor Network
Rutvik Pensionwar
Using GPUs for parallel processing
Using GPUs for parallel processing
asm100
Topic 1 introduction
Topic 1 introduction
SangeethaBg
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
NibrasulIslam
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
Fellowship at Vodafone FutureLab
MDX challenge 2021 town hall presentation
MDX challenge 2021 town hall presentation
Chin Yun Yu
Critical Path Method/ Program Evaluation and Review Technique
Critical Path Method/ Program Evaluation and Review Technique
Jomari Gingo Selibio
4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdf
4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdf
NAWAZURREHMANAWAN
Chap02 data manipulation
Chap02 data manipulation
Zohair Pia
Lecture2 cuda spring 2010
Lecture2 cuda spring 2010
haythem_2015
Comtrade file manipulationver1
Comtrade file manipulationver1
Jay Gosalia
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
NVIDIA Taiwan
Bw tree presentation
Bw tree presentation
DaeIn Lee
Asufe juniors-training session2
Asufe juniors-training session2
Omar Ahmed
SmartSensors, Piezotory, IDM8
SmartSensors, Piezotory, IDM8
Qatar University- Young Scientists Center (Al-Bairaq)
Implementing the latest embedded component technology from concept-to-manufac...
Implementing the latest embedded component technology from concept-to-manufac...
Zuken
Maquina estado
Maquina estado
Cesar Gil Arrieta
Comparative analysis of multi stage cordic using micro rotation technique
Comparative analysis of multi stage cordic using micro rotation technique
IAEME Publication
Lecture 1 introduction to microcontroller systems
Lecture 1 introduction to microcontroller systems
fawaz yaseen
Small Computer Network - Student Task
Small Computer Network - Student Task
edsonjacobs
Secure 2 Party AES
Secure 2 Party AES
JITENDRA KUMAR PATEL
Thesis_Misal
Thesis_Misal
Chaitanya Misal
Software Based calculations of Electrical Machine Design
Software Based calculations of Electrical Machine Design
vivatechijri
6 osi vimp
6 osi vimp
Vinod Jadhav
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
Neo4j
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Andrea Goulet
More Related Content
Similar to GPU_Lecture_7.pptx
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
Fellowship at Vodafone FutureLab
MDX challenge 2021 town hall presentation
MDX challenge 2021 town hall presentation
Chin Yun Yu
Critical Path Method/ Program Evaluation and Review Technique
Critical Path Method/ Program Evaluation and Review Technique
Jomari Gingo Selibio
4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdf
4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdf
NAWAZURREHMANAWAN
Chap02 data manipulation
Chap02 data manipulation
Zohair Pia
Lecture2 cuda spring 2010
Lecture2 cuda spring 2010
haythem_2015
Comtrade file manipulationver1
Comtrade file manipulationver1
Jay Gosalia
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
NVIDIA Taiwan
Bw tree presentation
Bw tree presentation
DaeIn Lee
Asufe juniors-training session2
Asufe juniors-training session2
Omar Ahmed
SmartSensors, Piezotory, IDM8
SmartSensors, Piezotory, IDM8
Qatar University- Young Scientists Center (Al-Bairaq)
Implementing the latest embedded component technology from concept-to-manufac...
Implementing the latest embedded component technology from concept-to-manufac...
Zuken
Maquina estado
Maquina estado
Cesar Gil Arrieta
Comparative analysis of multi stage cordic using micro rotation technique
Comparative analysis of multi stage cordic using micro rotation technique
IAEME Publication
Lecture 1 introduction to microcontroller systems
Lecture 1 introduction to microcontroller systems
fawaz yaseen
Small Computer Network - Student Task
Small Computer Network - Student Task
edsonjacobs
Secure 2 Party AES
Secure 2 Party AES
JITENDRA KUMAR PATEL
Thesis_Misal
Thesis_Misal
Chaitanya Misal
Software Based calculations of Electrical Machine Design
Software Based calculations of Electrical Machine Design
vivatechijri
6 osi vimp
6 osi vimp
Vinod Jadhav
Similar to GPU_Lecture_7.pptx
(20)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
MDX challenge 2021 town hall presentation
MDX challenge 2021 town hall presentation
Critical Path Method/ Program Evaluation and Review Technique
Critical Path Method/ Program Evaluation and Review Technique
4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdf
4-BlockCipher-DES-CEN451-BSE-Spring2022-17042022-104521am.pdf
Chap02 data manipulation
Chap02 data manipulation
Lecture2 cuda spring 2010
Lecture2 cuda spring 2010
Comtrade file manipulationver1
Comtrade file manipulationver1
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Bw tree presentation
Bw tree presentation
Asufe juniors-training session2
Asufe juniors-training session2
SmartSensors, Piezotory, IDM8
SmartSensors, Piezotory, IDM8
Implementing the latest embedded component technology from concept-to-manufac...
Implementing the latest embedded component technology from concept-to-manufac...
Maquina estado
Maquina estado
Comparative analysis of multi stage cordic using micro rotation technique
Comparative analysis of multi stage cordic using micro rotation technique
Lecture 1 introduction to microcontroller systems
Lecture 1 introduction to microcontroller systems
Small Computer Network - Student Task
Small Computer Network - Student Task
Secure 2 Party AES
Secure 2 Party AES
Thesis_Misal
Thesis_Misal
Software Based calculations of Electrical Machine Design
Software Based calculations of Electrical Machine Design
6 osi vimp
6 osi vimp
Recently uploaded
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
Neo4j
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Andrea Goulet
IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024
vaibhav130304
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
mbmh111980
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
Alluxio, Inc.
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
Canary7-Warehouse Management System
Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024
Soroosh Khodami
Breaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdf
Meon Technology
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
kalichargn70th171
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
Alluxio, Inc.
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
Peter Caitens
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Gáspár Nagy
10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf
kalichargn70th171
INGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by Design
Neo4j
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
KnowledgeSeed
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
Shane Coughlan
Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024
XongoLab Technologies LLP
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
Neo4j
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
Wave PLM
Recently uploaded
(20)
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024
Breaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf
INGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by Design
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
GPU_Lecture_7.pptx
1.
© David Kirk/NVIDIA
and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 1 GPU Programming Lecture 7 Atomic Operations and Histogramming
2.
© David Kirk/NVIDIA
and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 2 Objective • To understand atomic operations – Read-modify-write in parallel computation – Use of atomic operations in CUDA – Why atomic operations reduce memory system throughput – How to avoid atomic operations in some parallel algorithms • Histogramming as an example application of atomic operations – Basic histogram algorithm – Privatization
3.
A Common Collaboration
Pattern • Multiple bank tellers count the total amount of cash in the safe • Each grabs a pile and counts • Have a central display of the running total • Whenever someone finishes counting a pile, add the subtotal of the pile to the running total © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 3
4.
A Common Parallel
Coordination Pattern • Multiple customer service agents serving customers • Each customer gets a number • A central display shows the number of the next customer who will be served • When an agent becomes available, he/she calls the number and he/she adds 1 to the display © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 4
5.
A Common Arbitration
Pattern • Multiple customers booking air tickets • Each – Brings up a flight seat map – Decides on a seat – Update the the seat map, mark the seat as taken • A bad outcome – Multiple passengers ended up booking the same seat © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 5
6.
© David Kirk/NVIDIA
and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 6 Atomic Operations If Mem[x] was initially 0, what would the value of Mem[x] be after threads 1 and 2 have completed? – What does each thread get in their Old variable? The answer may vary due to data races. To avoid data races, you should use atomic operations thread1: thread2:Old Mem[x] New Old + 1 Mem[x] New Old Mem[x] New Old + 1 Mem[x] New
7.
Timing Scenario #1 Time
Thread 1 Thread 2 1 (0) Old Mem[x] 2 (1) New Old + 1 3 (1) Mem[x] New 4 (1) Old Mem[x] 5 (2) New Old + 1 6 (2) Mem[x] New • Thread 1 Old = 0 • Thread 2 Old = 1 • Mem[x] = 2 after the sequence © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 7
8.
Timing Scenario #2 Time
Thread 1 Thread 2 1 (0) Old Mem[x] 2 (1) New Old + 1 3 (1) Mem[x] New 4 (1) Old Mem[x] 5 (2) New Old + 1 6 (2) Mem[x] New • Thread 1 Old = 1 • Thread 2 Old = 0 • Mem[x] = 2 after the sequence © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 8
9.
Timing Scenario #3 Time
Thread 1 Thread 2 1 (0) Old Mem[x] 2 (1) New Old + 1 3 (0) Old Mem[x] 4 (1) Mem[x] New 5 (1) New Old + 1 6 (1) Mem[x] New • Thread 1 Old = 0 • Thread 2 Old = 0 • Mem[x] = 1 after the sequence © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 9
10.
Timing Scenario #4 Time
Thread 1 Thread 2 1 (0) Old Mem[x] 2 (1) New Old + 1 3 (0) Old Mem[x] 4 (1) Mem[x] New 5 (1) New Old + 1 6 (1) Mem[x] New • Thread 1 Old = 0 • Thread 2 Old = 0 • Mem[x] = 1 after the sequence © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 10
11.
© David Kirk/NVIDIA
and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 11 Atomic Operations – To Ensure Good Outcomes thread1: thread2:Old Mem[x] New Old + 1 Mem[x] New Old Mem[x] New Old + 1 Mem[x] New thread1: thread2:Old Mem[x] New Old + 1 Mem[x] New Old Mem[x] New Old + 1 Mem[x] New Or
12.
© David Kirk/NVIDIA
and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 12 Without Atomic Operations thread1: thread2: Old Mem[x] New Old + 1 Mem[x] New Old Mem[x] New Old + 1 Mem[x] New • Both threads receive 0 • Mem[x] becomes 1 Mem[x] initialized to 0
13.
Atomic Operations in
General • Performed by a single ISA instruction on a memory location address – Read the old value, calculate a new value, and write the new value to the location • The hardware ensures that no other threads can access the location until the atomic operation is complete – Any other threads that access the location will typically be held in a queue until its turn – All threads perform the atomic operation serially © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 13
14.
Atomic Operations in
CUDA • Function calls that are translated into single instructions (a.k.a. intrinsics) – Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap) – Read CUDA C programming Guide 6.5 for details • Atomic Add int atomicAdd(int* address, int val); reads the 32-bit word old pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 14
15.
More Atomic Adds
in CUDA • Unsigned 32-bit integer atomic add unsigned int atomicAdd(unsigned int* address, unsigned int val); • Unsigned 64-bit integer atomic add unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val); • Single-precision floating-point atomic add (capability > 2.0) – float atomicAdd(float* address, float val); © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 15
16.
Histogramming • A method
for extracting notable features and patterns from large data sets – Feature extraction for object recognition in images – Fraud detection in credit card transactions – Correlating heavenly object movements in astrophysics – … • Basic histograms - for each element in the data set, use the value to identify a “bin” to increment © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 16
17.
A Histogram Example •
In phrase “Programming Massively Parallel Processors” build a histogram of frequencies of each letter • A(4), C(1), E(1), G(1), … © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 17 • How do you do this in parallel? – Have each thread to take a section of the input – For each input letter, use atomic operations to build the histogram
18.
Iteration #1 –
1st letter in each section © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 18 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A B C D E 1 F G H I J K L M 2 N O P 1 Q R S T U V
19.
Iteration #2 –
2nd letter in each section © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 19 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A 1 B C D E 1 F G H I J K L 1 M 3 N O P 1 Q R 1 S T U V
20.
Iteration #3 © David
Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 20 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A 1 B C D E 1 F G H I 1 J K L 1 M 3 N O 1 P 1 Q R 1 S 1 T U V
21.
Iteration #4 © David
Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 21 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A 1 B C D E 1 F G 1 H I 1 J K L 1 M 3 N 1 O 1 P 1 Q R 1 S 2 T U V
22.
Iteration #5 © David
Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 22 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A 1 B C D E 1 F G 1 H I 1 J K L 1 M 3 N 1 O 1 P 2 Q R 2 S 2 T U V
23.
What is wrong
with the algorithm? © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 23
24.
What is wrong
with the algorithm? • Reads from the input array are not coalesced – Assign inputs to each thread in a strided pattern – Adjacent threads process adjacent input letters © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 24 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A B C D E F G 1 H I J K L M N O 1 P 1 Q R 1 S T U V
25.
Iteration 2 © David
Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 25 P R O G R A M M I N G M A V I S S Y L E P Thread 0 Thread 1 Thread 2 Thread 3 A 1 B C D E F G 1 H I J K L M 2 N O 1 P 1 Q R 2 S T U V • All threads move to the next section of input
26.
A Histogram Kernel •
The kernel receives a pointer to the input buffer • Each thread process the input in a strided pattern __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo) { int i = threadIdx.x + blockIdx.x * blockDim.x; // stride is total number of threads int stride = blockDim.x * gridDim.x; © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 26
27.
More on the
Histogram Kernel // All threads handle blockDim.x * gridDim.x // consecutive elements while (i < size) { atomicAdd( &(histo[buffer[i]]), 1); i += stride; } } © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 27
28.
Atomic Operations on
DRAM • An atomic operation starts with a read, with a latency of a few hundred cycles © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 28
29.
Atomic Operations on
DRAM • An atomic operation starts with a read, with a latency of a few hundred cycles • The atomic operation ends with a write, with a latency of a few hundred cycles • During this whole time, no one else can access the location © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 29
30.
Atomic Operations on
DRAM • Each Load-Modify-Store has two full memory access delays – All atomic operations on the same variable (RAM location) are serialized © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 DRAM delay DRAM delay transfer delay internal routing DRAM delay transfer delay internal routing .. atomic operation N atomic operation N+1 time 30
31.
Latency determines throughput
of atomic operations • Throughput of an atomic operation is the rate at which the application can execute an atomic operation on a particular location. • The rate is limited by the total latency of the read- modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations. • This means that if many threads attempt to do atomic operation on the same location (contention), the memory bandwidth is reduced to < 1/1000! © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 31
32.
You may have
a similar experience in supermarket checkout • Some customers realize that they missed an item after they started to check out • They run to the isle and get the item while the line waits – The rate of check is reduced due to the long latency of running to the isle and back. • Imagine a store where every customer starts the check out before they even fetch any of the items – The rate of the checkout will be 1 / (entire shopping time of each customer) © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 32
33.
Hardware Improvements (cont.) •
Atomic operations on Fermi L2 cache – medium latency, but still serialized – Global to all blocks – “Free improvement” on Global Memory atomics © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 internal routing .. atomic operation N atomic operation N+1 time data transfer data transfer 33
34.
Hardware Improvements • Atomic
operations on Shared Memory – Very short latency, but still serialized – Private to each thread block – Need algorithm work by programmers (more later) © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 internal routing .. atomic operation N atomic operation N+1 time data transfer 34
35.
Atomics in Shared
Memory Requires Privatization • Create private copies of the histo[] array for each thread block __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo) { __shared__ unsigned int histo_private[256]; if (threadIdx.x < 256) histo_private[threadidx.x] = 0; __syncthreads(); © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 35
36.
Build Private Histogram int
i = threadIdx.x + blockIdx.x * blockDim.x; // stride is total number of threads int stride = blockDim.x * gridDim.x; while (i < size) { atomicAdd( &(private_histo[buffer[i]), 1); i += stride; } © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 36
37.
Build Final Histogram //
wait for all other threads in the block to finish __syncthreads(); if (threadIdx.x < 256) atomicAdd( &(histo[threadIdx.x]), private_histo[threadIdx.x] ); } © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 37
38.
More on Privatization •
Privatization is a powerful and frequently used techniques for parallelizing applications • The operation needs to be associative and commutative – Histogram add operation is associative and commutative • The histogram size needs to be small – Fits into shared memory • What if the histogram is too large to privatize? © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007- 2012 38
39.
ANY MORE QUESTIONS ©
David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012 39
Download now