SlideShare a Scribd company logo
Understanding GPU Memory Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
Instructor Notes This lecture begins with an example of how a wide-memory bus is utilized in GPUs The impact of memory coalescing and memory bank conflicts are also described along with examples on current hardware This lecture is important because achieving high memory bandwidth is often the most important requirement for many GPU applications 2 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Topics Introduction to GPU bus addressing Coalescing memory accesses Introduction to memory banks Memory bank conflicts 3 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Example Array X is a pointer to an array of integers (4-bytes each) located at address 0x00001232 A thread wants to access the data at X[0] inttmp = X[0]; 0x00001232 0 0x00001236 1 2 X 0x00001232 3 . . . 4 5 6 0x00001248 7 . . . . 4 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Bus Addressing Assume that the memory bus is 32-bytes (256-bits) wide This is the width on a Radeon 5870 GPU The byte-addressable bus must make accesses that are aligned to the bus width, so the bottom 5 bits are masked off Any access in the range 0x00001220 to 0x0000123F will produce the address 0x00001220  5 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Bus Addressing All data in the range 0x00001220 to 0x0000123F is returned on the bus In this case, 4 bytes are useful and 28 bytes are wasted Memory Controller Send an address Receive data Desired element ? ? ? 0 1 2 3 4 0x00001220 0x00001220 6 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Coalescing Memory Accesses To fully utilize the bus, GPUs combine the accesses of multiple threads into fewer requests when possible Consider the following OpenCL kernel code: Assuming that array X is the same array from the example, the first 16 threads will access addresses 0x00001232 through 0x00001272 If each request was sent out individually, there would be 16 accesses total, with 64 useful bytes of data and 448 wasted bytes Notice that each access in the same 32-byte range would return exactly the same data  inttmp = X[get_global_id(0)];  7 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Coalescing Memory Accesses When GPU threads access data in the same 32-byte range, multiple accesses are combined so that each range is only accessed once Combining accesses is called coalescing For this example, only 3 accesses are required If the start of the array was 256-bit aligned, only two accesses would be required ? ? ? 0 1 2 3 4 0x00001220 5 6 7 8 9 10 11 12 0x00001240 13 14 15 ? ? ? ? ? 0x00001260 8 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Coalescing Memory Accesses Recall that for AMD hardware, 64 threads are part of a wavefront and must execute the same instruction in a SIMD manner For the AMD 5870 GPU, memory accesses of 16 consecutive threads are evaluated together and can be coalesced to fully utilize the bus This unit is called a quarter-wavefront and is the important hardware scheduling unit for memory accesses 9 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Coalescing Memory Accesses Global memory performance for a simple data copying kernel of entirely coalesced and entirely non-coalesced accesses on an ATI Radeon 5870 10 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Coalescing Memory Accesses Global memory performance for a simple data copying kernel of entirely coalesced and entirely non-coalesced accesses on an NVIDIA GTX 285 11 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Memory Banks Memory is made up of banks  Memory banks are the hardware units that actually store data The memory banks targeted by a memory access depend on the address of the data to be read/written Note that on current GPUs, there are more memory banks than can be addressed at once by the global memory bus, so it is possible for different accesses to target different banks Bank response time, not access requests, is the bottleneck Successive data are stored in successive banks (strides of 32-bit words on GPUs) so that a group of threads accessing successive elements will produce no bank conflicts 12 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Bank Conflicts – Local Memory Bank conflicts have the largest negative effect on local memory operations Local memory does not require that accesses are to sequentially increasing elements Accesses from successive threads should target different memory banks Threads accessing sequentially increasing data will fall into this category 13 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Bank Conflicts – Local Memory On AMD, a wavefront that generates bank conflicts stalls until all local memory operations complete The hardware does not hide the stall by switching to another wavefront The following examples show local memory access patterns and whether conflicts are generated For readability, only 8 memory banks are shown 14 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Bank Conflicts – Local Memory If there are no bank conflicts, each bank can return an element without any delays Both of the following patterns will complete without stalls on current GPU hardware 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 Thread Memory Bank Thread Memory Bank 15 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Bank Conflicts – Local Memory If multiple accesses occur to the same bank, then the bank with the most conflicts will determine the latency The following pattern will take 3 times the access latency to complete Conflicts 0 0 1 1 2 1 2 2 3 3 3 4 4 5 5 1 6 6 1 7 7 Thread Memory Bank 16 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Bank Conflicts – Local Memory If all accesses are to the same address, then the bank can perform a broadcast and no delay is incurred The following will only take one access to complete assuming the same data element is accessed 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 Thread Memory Bank 17 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Bank Conflicts – Global Memory Bank conflicts in global memory rely on the same principles, however the global memory bus makes the impact of conflicts more subtle  Since accessing data in global memory requires that an entire bus-line be read, bank conflicts within a work-group have a similar effect as non-coalesced accesses If threads reading from global memory had a bank conflict then by definition it manifest as a non-coalesced access Not all non-coalesced accesses are bank conflicts, however The ideal case for global memory is when different work-groups read from different banks In reality, this is a very low-level optimization and should not be prioritized when first writing a program 18 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Summary GPU memory is different than CPU memory The goal is high throughput instead of low-latency Memory access patterns have a huge impact on bus utilization Low utilization means low performance Having coalesced memory accesses and avoiding bank conflicts are required for high performance code Specific hardware information (such as bus width, number of memory banks, and number of threads that coalesce memory requests) is GPU-specific and can be found in vendor documentation 19 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

More Related Content

What's hot

Google warehouse scale computer
Google warehouse scale computerGoogle warehouse scale computer
Google warehouse scale computer
Tejhaskar Ashok Kumar
 
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerIntroduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerFörderverein Technische Fakultät
 
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
journalBEEI
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
fcassier
 
Parallel computation
Parallel computationParallel computation
Parallel computation
Jayanti Prasad Ph.D.
 
cuTau Leaping
cuTau LeapingcuTau Leaping
cuTau Leaping
Amritesh Srivastava
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
Savith Satheesh
 
Survey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning AlgorithmSurvey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning AlgorithmSahil Kaw
 
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0Sahil Kaw
 
Computer Graphics & Visualization - 06
Computer Graphics & Visualization - 06Computer Graphics & Visualization - 06
Computer Graphics & Visualization - 06
Pankaj Debbarma
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with GpuRohit Khatana
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
cscpconf
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
csandit
 
AI On the Edge: Model Compression
AI On the Edge: Model CompressionAI On the Edge: Model Compression
AI On the Edge: Model Compression
Apache MXNet
 
CUDA
CUDACUDA
Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)
Antonios Katsarakis
 
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONSA SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
cseij
 

What's hot (20)

Google warehouse scale computer
Google warehouse scale computerGoogle warehouse scale computer
Google warehouse scale computer
 
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerIntroduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
 
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
Parallel computation
Parallel computationParallel computation
Parallel computation
 
cuTau Leaping
cuTau LeapingcuTau Leaping
cuTau Leaping
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 
Survey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning AlgorithmSurvey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning Algorithm
 
PowerAlluxio
PowerAlluxioPowerAlluxio
PowerAlluxio
 
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
 
Computer Graphics & Visualization - 06
Computer Graphics & Visualization - 06Computer Graphics & Visualization - 06
Computer Graphics & Visualization - 06
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with Gpu
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 
AI On the Edge: Model Compression
AI On the Edge: Model CompressionAI On the Edge: Model Compression
AI On the Edge: Model Compression
 
CUDA
CUDACUDA
CUDA
 
Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)
 
GPU: Understanding CUDA
GPU: Understanding CUDAGPU: Understanding CUDA
GPU: Understanding CUDA
 
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONSA SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
 
Cuda tutorial
Cuda tutorialCuda tutorial
Cuda tutorial
 

Similar to Lec06 memory

Chapter 8 memory-updated
Chapter 8 memory-updatedChapter 8 memory-updated
Chapter 8 memory-updated
Delowar hossain
 
Project Earl Grey
Project Earl GreyProject Earl Grey
Project Earl Grey
Jaehoon Choi
 
CS304PC:Computer Organization and Architecture Session 29 Memory organization...
CS304PC:Computer Organization and Architecture Session 29 Memory organization...CS304PC:Computer Organization and Architecture Session 29 Memory organization...
CS304PC:Computer Organization and Architecture Session 29 Memory organization...
Asst.prof M.Gokilavani
 
A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...
A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...
A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...
AIRCC Publishing Corporation
 
A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...
A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...
A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...
AIRCC Publishing Corporation
 
A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTTLENECK
A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTTLENECKA SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTTLENECK
A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTTLENECK
AIRCC Publishing Corporation
 
151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...
151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...
151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...
ijcsit
 
Reliable Hydra SSD Architecture for General Purpose Controllers
Reliable Hydra SSD Architecture for General Purpose ControllersReliable Hydra SSD Architecture for General Purpose Controllers
Reliable Hydra SSD Architecture for General Purpose Controllers
IJMER
 
Chapter 8 : Memory
Chapter 8 : MemoryChapter 8 : Memory
Chapter 8 : Memory
Amin Omi
 
lecture-2-3_Memory.pdf,describing memory
lecture-2-3_Memory.pdf,describing memorylecture-2-3_Memory.pdf,describing memory
lecture-2-3_Memory.pdf,describing memory
floraaluoch3
 
Advances in Computer Hardware.pdf
Advances in Computer Hardware.pdfAdvances in Computer Hardware.pdf
Advances in Computer Hardware.pdf
Dr. Manjunatha. P
 
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptxonur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
sivasubramanianManic2
 
What Every Programmer Should Know About Memory
What Every Programmer Should Know About MemoryWhat Every Programmer Should Know About Memory
What Every Programmer Should Know About MemoryYing wei (Joe) Chou
 
Memory management
Memory managementMemory management
Cs8493 unit 3
Cs8493 unit 3Cs8493 unit 3
Cs8493 unit 3
Kathirvel Ayyaswamy
 
Cs8493 unit 3
Cs8493 unit 3Cs8493 unit 3
Cs8493 unit 3
Kathirvel Ayyaswamy
 
CPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesCPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching Techniques
Dilum Bandara
 
Computer organisation ppt
Computer organisation pptComputer organisation ppt
Computer organisation ppt
chandkec
 

Similar to Lec06 memory (20)

Lec04 gpu architecture
Lec04 gpu architectureLec04 gpu architecture
Lec04 gpu architecture
 
Chapter 8 memory-updated
Chapter 8 memory-updatedChapter 8 memory-updated
Chapter 8 memory-updated
 
NUMA
NUMANUMA
NUMA
 
Project Earl Grey
Project Earl GreyProject Earl Grey
Project Earl Grey
 
CS304PC:Computer Organization and Architecture Session 29 Memory organization...
CS304PC:Computer Organization and Architecture Session 29 Memory organization...CS304PC:Computer Organization and Architecture Session 29 Memory organization...
CS304PC:Computer Organization and Architecture Session 29 Memory organization...
 
A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...
A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...
A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...
 
A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...
A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...
A Survey of Different Approaches for Overcoming the Processor - Memory Bottle...
 
A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTTLENECK
A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTTLENECKA SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTTLENECK
A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTTLENECK
 
151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...
151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...
151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...
 
Reliable Hydra SSD Architecture for General Purpose Controllers
Reliable Hydra SSD Architecture for General Purpose ControllersReliable Hydra SSD Architecture for General Purpose Controllers
Reliable Hydra SSD Architecture for General Purpose Controllers
 
Chapter 8 : Memory
Chapter 8 : MemoryChapter 8 : Memory
Chapter 8 : Memory
 
lecture-2-3_Memory.pdf,describing memory
lecture-2-3_Memory.pdf,describing memorylecture-2-3_Memory.pdf,describing memory
lecture-2-3_Memory.pdf,describing memory
 
Advances in Computer Hardware.pdf
Advances in Computer Hardware.pdfAdvances in Computer Hardware.pdf
Advances in Computer Hardware.pdf
 
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptxonur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
 
What Every Programmer Should Know About Memory
What Every Programmer Should Know About MemoryWhat Every Programmer Should Know About Memory
What Every Programmer Should Know About Memory
 
Memory management
Memory managementMemory management
Memory management
 
Cs8493 unit 3
Cs8493 unit 3Cs8493 unit 3
Cs8493 unit 3
 
Cs8493 unit 3
Cs8493 unit 3Cs8493 unit 3
Cs8493 unit 3
 
CPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesCPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching Techniques
 
Computer organisation ppt
Computer organisation pptComputer organisation ppt
Computer organisation ppt
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 

Lec06 memory

  • 1. Understanding GPU Memory Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
  • 2. Instructor Notes This lecture begins with an example of how a wide-memory bus is utilized in GPUs The impact of memory coalescing and memory bank conflicts are also described along with examples on current hardware This lecture is important because achieving high memory bandwidth is often the most important requirement for many GPU applications 2 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 3. Topics Introduction to GPU bus addressing Coalescing memory accesses Introduction to memory banks Memory bank conflicts 3 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 4. Example Array X is a pointer to an array of integers (4-bytes each) located at address 0x00001232 A thread wants to access the data at X[0] inttmp = X[0]; 0x00001232 0 0x00001236 1 2 X 0x00001232 3 . . . 4 5 6 0x00001248 7 . . . . 4 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 5. Bus Addressing Assume that the memory bus is 32-bytes (256-bits) wide This is the width on a Radeon 5870 GPU The byte-addressable bus must make accesses that are aligned to the bus width, so the bottom 5 bits are masked off Any access in the range 0x00001220 to 0x0000123F will produce the address 0x00001220 5 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 6. Bus Addressing All data in the range 0x00001220 to 0x0000123F is returned on the bus In this case, 4 bytes are useful and 28 bytes are wasted Memory Controller Send an address Receive data Desired element ? ? ? 0 1 2 3 4 0x00001220 0x00001220 6 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 7. Coalescing Memory Accesses To fully utilize the bus, GPUs combine the accesses of multiple threads into fewer requests when possible Consider the following OpenCL kernel code: Assuming that array X is the same array from the example, the first 16 threads will access addresses 0x00001232 through 0x00001272 If each request was sent out individually, there would be 16 accesses total, with 64 useful bytes of data and 448 wasted bytes Notice that each access in the same 32-byte range would return exactly the same data inttmp = X[get_global_id(0)]; 7 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 8. Coalescing Memory Accesses When GPU threads access data in the same 32-byte range, multiple accesses are combined so that each range is only accessed once Combining accesses is called coalescing For this example, only 3 accesses are required If the start of the array was 256-bit aligned, only two accesses would be required ? ? ? 0 1 2 3 4 0x00001220 5 6 7 8 9 10 11 12 0x00001240 13 14 15 ? ? ? ? ? 0x00001260 8 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 9. Coalescing Memory Accesses Recall that for AMD hardware, 64 threads are part of a wavefront and must execute the same instruction in a SIMD manner For the AMD 5870 GPU, memory accesses of 16 consecutive threads are evaluated together and can be coalesced to fully utilize the bus This unit is called a quarter-wavefront and is the important hardware scheduling unit for memory accesses 9 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 10. Coalescing Memory Accesses Global memory performance for a simple data copying kernel of entirely coalesced and entirely non-coalesced accesses on an ATI Radeon 5870 10 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 11. Coalescing Memory Accesses Global memory performance for a simple data copying kernel of entirely coalesced and entirely non-coalesced accesses on an NVIDIA GTX 285 11 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 12. Memory Banks Memory is made up of banks Memory banks are the hardware units that actually store data The memory banks targeted by a memory access depend on the address of the data to be read/written Note that on current GPUs, there are more memory banks than can be addressed at once by the global memory bus, so it is possible for different accesses to target different banks Bank response time, not access requests, is the bottleneck Successive data are stored in successive banks (strides of 32-bit words on GPUs) so that a group of threads accessing successive elements will produce no bank conflicts 12 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 13. Bank Conflicts – Local Memory Bank conflicts have the largest negative effect on local memory operations Local memory does not require that accesses are to sequentially increasing elements Accesses from successive threads should target different memory banks Threads accessing sequentially increasing data will fall into this category 13 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 14. Bank Conflicts – Local Memory On AMD, a wavefront that generates bank conflicts stalls until all local memory operations complete The hardware does not hide the stall by switching to another wavefront The following examples show local memory access patterns and whether conflicts are generated For readability, only 8 memory banks are shown 14 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 15. Bank Conflicts – Local Memory If there are no bank conflicts, each bank can return an element without any delays Both of the following patterns will complete without stalls on current GPU hardware 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 Thread Memory Bank Thread Memory Bank 15 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 16. Bank Conflicts – Local Memory If multiple accesses occur to the same bank, then the bank with the most conflicts will determine the latency The following pattern will take 3 times the access latency to complete Conflicts 0 0 1 1 2 1 2 2 3 3 3 4 4 5 5 1 6 6 1 7 7 Thread Memory Bank 16 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 17. Bank Conflicts – Local Memory If all accesses are to the same address, then the bank can perform a broadcast and no delay is incurred The following will only take one access to complete assuming the same data element is accessed 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 Thread Memory Bank 17 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 18. Bank Conflicts – Global Memory Bank conflicts in global memory rely on the same principles, however the global memory bus makes the impact of conflicts more subtle Since accessing data in global memory requires that an entire bus-line be read, bank conflicts within a work-group have a similar effect as non-coalesced accesses If threads reading from global memory had a bank conflict then by definition it manifest as a non-coalesced access Not all non-coalesced accesses are bank conflicts, however The ideal case for global memory is when different work-groups read from different banks In reality, this is a very low-level optimization and should not be prioritized when first writing a program 18 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 19. Summary GPU memory is different than CPU memory The goal is high throughput instead of low-latency Memory access patterns have a huge impact on bus utilization Low utilization means low performance Having coalesced memory accesses and avoiding bank conflicts are required for high performance code Specific hardware information (such as bus width, number of memory banks, and number of threads that coalesce memory requests) is GPU-specific and can be found in vendor documentation 19 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

Editor's Notes

  1. This is the start of an example explaining how GPU’s wide memory buses work.
  2. The desired element is at 0x00001232 as shown 2 slides back.