Javier Cabezas (BSC/Multicoreware)
Isaac Gelado (BSC/Multicoreware)
Lluís Vilanova (BSC)
Nacho Navarro (UPC/BSC)
Wen-Mei Hwu (UIUC/Multicoreware)
NVIDIA GTC, May 17th, 2012
GMAC 2
Easy and efficient programming for CUDA-based systems
NVIDIA GPU Technology Conference – May 17th, 2012
• BSC develops a wide range of High Performance Computing applications
• Alya (Computational Mechanics), RTM (Seismic imaging), Meteorological
modeling, Air quality, …
• Explore the trends in computer architecture (GPUs, FPGAs, …)
• Programming models research group
• OmpSs: task-based programming model that exploits parallelism at different
levels (multi-core, multi-GPU, multi-node, …)
• Accelerators research group
• System support for accelerators
• Collaboration with UIUC and Multicoreware Inc.
└Barcelona Supercomputing Center (BSC)
Introduction
NVIDIA GPU Technology Conference – May, 17th , 2012 3
• Goal: programmability of heterogeneous parallel architectures
• Task-based programming model
• Simple pragma annotations
• Automatic task dependency tracking
• Needs support for a growing number of devices, APIs and platforms
└OmpSs
Introduction
#pragma omp task inout([NB*NB] A)
void spotrf_tile(float *A, long NB);
#pragma omp task input([NB*NB] A, [NB*NB] B) inout([NB*NB] C)
void sgemm_tile(float *A, float *B, float *C, long NB);
#pragma omp task input([NB*NB] A) inout([NB*NB] C)
void ssyrk_tile(float *A, float *C, long NB);
#pragma omp task input([NB*NB] A) inout([NB*NB] C)
void strsm_tile(float *A, float *C, long NB);

for (int j = 0; j < N; j += BS) {
    for (int k = 0; k < j; k += BS)
        for (int i = j + BS; i < N; i += BS)
            sgemm_tile(&A[k][i], &A[k][j], &A[j][i], BS);
    for (int i = 0; i < j; i += BS)
        ssyrk_tile(&A[i][j], &A[j][j], BS);
    spotrf_tile(&A[j][j], BS);
    for (int i = j + BS; i < N; i += BS)
        strsm_tile(&A[j][j], &A[j][i], BS);
}
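The automatic dependency tracking behind these annotations can be sketched with a small model: a task must wait for the last writer of every region it reads, and for all outstanding readers plus the last writer of every region it writes. A host-side sketch (illustrative only; the names and region-keying are made up, this is not the OmpSs runtime):

```cpp
#include <map>
#include <set>
#include <vector>

// Minimal model of input/inout dependency tracking. Regions are keyed by
// their base address for simplicity (no overlap detection).
struct Tracker {
    std::map<const void *, int> last_writer;           // region -> task id
    std::map<const void *, std::vector<int>> readers;  // region -> reader ids
    int next_id = 0;

    // Returns the set of task ids the newly submitted task must wait for.
    std::set<int> submit(const std::vector<const void *> &in,
                         const std::vector<const void *> &inout) {
        std::set<int> deps;
        for (const void *r : in) {                      // read-after-write
            auto w = last_writer.find(r);
            if (w != last_writer.end()) deps.insert(w->second);
        }
        for (const void *r : inout) {                   // WAW and WAR
            auto w = last_writer.find(r);
            if (w != last_writer.end()) deps.insert(w->second);
            for (int t : readers[r]) deps.insert(t);
        }
        int id = next_id++;
        for (const void *r : in) readers[r].push_back(id);
        for (const void *r : inout) { last_writer[r] = id; readers[r].clear(); }
        return deps;
    }
};
```

With this model, an sgemm_tile reading a tile depends on whichever task last wrote it, which is exactly what lets the runtime schedule independent tiles concurrently.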
• GMAC provides an easy system-level programming model for accelerator-based
systems
• Shared-memory model
• Single pointer per allocation
• No explicit copies
• Portability across systems
• Hardware capabilities
• System configuration
• Portability across platforms
• CUDA, OpenCL
• Windows, Linux, MacOS X
└GMAC
Introduction
void vec_add(size_t N)
{
float *a, *b, *c;
gmacMalloc(&a, N * sizeof(float));
gmacMalloc(&b, N * sizeof(float));
gmacMalloc(&c, N * sizeof(float));
// Runs on the CPU
init_array(a, N);
init_array(b, N);
// Runs on the GPU
add_kernel<<<N / 512, 512>>>(c, a, b, N);
}
• Since last GTC, many exciting things have happened
• CUDA 4
• Fermi/Kepler
• UVAS
• Peer-to-peer memory transfers
• Host threads can access any device memory (cudaMemcpy)
• …
• … but most of them were already implemented in GMAC
• So we have kept moving forward
└GMAC
Introduction
• Programmability issues in CUDA systems
• GMAC 2
• Use cases
• Conclusions
Outline
• New memory addressing capabilities
• Private memories
• Copies between memories through PCIe
• Host-mapped memory (Compute Capability 1.1)
• e.g. Output buffers in systems with discrete GPUs
• Avoid unnecessary copies in shared memory
systems (e.g. ION)
• Peer-memory access (Compute Capability 2.0)
• Avoid inter-GPU copies
• Only for GPUs in the same PCIe bus
└A universe of capabilities
Programmability issues in CUDA systems
• New memory addressing capabilities (performance)
└A universe of capabilities
Programmability issues in CUDA systems
[Figure: execution time (ms, log scale) vs. problem size for three benchmarks: a) matrix multiplication (local in, remote out): CopyOutput vs. PeerOutput; b) matrix multiplication (remote in, local out): CopyInput vs. PeerInput; c) GPU 3D stencil computation (25-point): Copy vs. Peer]
• Execution concurrency
• Concurrency between GPU execution and
memory transfers (Compute Capability 1.1)
• Concurrent HostToDevice/DeviceToHost memory
transfers (Compute Capability 2.0)
• Concurrent kernel execution (Compute Capability 2.0)
• Limited to back-to-back execution
• Limited to a single CUDA context
• Complete concurrent kernel execution (Compute Capability 3.5)
└A universe of capabilities
Programmability issues in CUDA systems
• Describing memory accessibility
• Private memory: cudaMalloc/malloc
• GPU → Host shared memory: cudaHostRegister/cudaHostAlloc
• Host memory pinned and mapped on the GPUs’ page tables
• GPU↔GPU shared memory : cudaDeviceEnablePeerAccess
• GPU page tables synchronized
• Context-grain sharing (and only works for contexts on different GPUs)
• Describing dependencies among operations
• CUDA stream: in-order queue to execute implicitly dependent operations
• CUDA event: marker in a stream that enables finer-grain synchronization
└Access to hardware capabilities
Programmability issues in CUDA systems
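The stream and event abstractions above can be modeled as in-order queues plus position markers: recording an event captures how much of a stream's work it covers, and a later wait (possibly from another stream) is satisfied once that much has completed. A tiny host-side sketch (illustrative, not the CUDA runtime):

```cpp
#include <cstddef>
#include <vector>

// A stream is an in-order queue of operations; an event is a marker that
// covers everything enqueued in its stream before it was recorded.
struct Stream { std::vector<int> ops; };  // operation ids, in order
struct Event  { std::size_t pos; };       // ops[0..pos) must complete

inline Event record(const Stream &s) { return Event{s.ops.size()}; }

// A wait on 'e' is satisfied once the recording stream has completed at
// least e.pos operations; work enqueued after record() is not covered.
inline bool satisfied(const Event &e, std::size_t completed) {
    return completed >= e.pos;
}
```

This is the finer granularity the bullet refers to: a dependent stream waits only for the operations before the marker, not for the whole recording stream to drain.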
• Programmability issues in CUDA systems
• GMAC 2
• Use cases
• Conclusions
Outline
• A process provides a single logical address space
• Defines the memory objects that can be accessed in a program
• Implemented on top of multiple virtual address spaces
• An address space defines which memory objects can be accessed from a device
• In contrast to the unified virtual address space offered by CUDA
└Memory model
GMAC 2
• Memory objects can be made accessible to any device
• Single allocation + remote access
• Replication + software-based coherence is used when no hardware access is available
• Each virtual address space (AS) can have different views of an object
• A view is created by mapping an object on a virtual address space
• Properties of the view define the behavior of the coherence protocol (DSM)
└Memory model
GMAC 2
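The objects and views described above can be captured in a minimal model: one logical object, several per-address-space views, each with its own protection, where the protections drive what the coherence protocol may do. A sketch (names are illustrative, not the GMAC source):

```cpp
#include <map>

// Each view of an object carries its own protection; the mix of protections
// across views decides the coherence strategy: read-only views can be
// replicated freely, multiple writable views need ownership transfer.
enum Prot { PROT_READ = 1, PROT_WRITE = 2, PROT_READWRITE = 3 };

struct Object {
    std::map<int, Prot> views;  // AS id -> protection of that AS's view

    void map_view(int as, Prot p) { views[as] = p; }

    // Replication without invalidation traffic is only safe when at most
    // one view can write the object.
    bool replicable() const {
        int writers = 0;
        for (const auto &v : views)
            if (v.second & PROT_WRITE) ++writers;
        return writers <= 1;
    }
};
```

For example, a CPU read-write view plus read-only GPU views (the gmacGlobalMalloc pattern shown later) stays replicable; adding a second writable view does not.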
└Memory model
[Figure: logical, virtual and physical layers across CPU, GPU 1 and GPU 2: a) replication with software-based coherence; b) host-mapped memory; peer memory access]
GMAC 2
• Automatic optimizations
• Pinned memory management for memory transfers
• Eager memory transfers
• Features available in the hardware are used to optimize some scenarios
• Implementation details: restrictions
• CUDA UVAS makes replicated objects have different virtual addresses
• Stored pointers cannot be used on the device
• Mappings work at page granularity
└Memory model
GMAC 2
• A process contains several execution contexts
• The execution context (OS thread) is the basic unit of concurrency
• Each execution context is assigned a default GPU
• Inherited on creation
• Like the CUDA 4 “current” device (cudaSetDevice)
• CUDA code is executed…
• Implicitly: using the default GPU
• Explicitly: on a specific GPU (using the stream parameter of the call)
└Execution model
GMAC 2
• Data accessible on a GPU is implicitly acquired/released at kernel call boundaries
• Provides implicit synchronization on data dependencies
• Dependencies between kernel calls are defined by the programmer using the standard inter-thread synchronization mechanisms
• Explicit consistency is also available for fine tuning
└Execution model
GMAC 2
kernel<<<grid, block>>>(o, p);   // -> releases ownership of o and p
for (size_t i = 0; i < SIZE_C; ++i) {
    q[i]++;
}
gmacThreadSynchronize();         // -> acquires ownership of o and p
└Explicit consistency
GMAC 2
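The acquire/release behavior at kernel boundaries can be modeled as ownership transfer: each object is owned by one device at a time, and a device may only touch objects it owns (in the real system, acquiring also makes the data coherent on that device). A host-side sketch under those assumed semantics:

```cpp
#include <map>

// Object ownership model: device 0 stands for the CPU, 1..n for GPUs.
struct Ownership {
    std::map<const void *, int> owner;  // object -> owning device

    void acquire(const void *obj, int dev) { owner[obj] = dev; }
    void release(const void *obj) { owner.erase(obj); }  // unowned

    // An unowned object can be claimed by anyone; an owned one is only
    // accessible to its owner.
    bool can_access(const void *obj, int dev) const {
        auto it = owner.find(obj);
        return it == owner.end() || it->second == dev;
    }
};
```

In the implicit mode, the runtime performs the release at kernel launch and the re-acquire for the CPU after synchronization; the explicit API shown above lets the programmer do the same by hand for fine tuning.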
• Hides the CUDA abstractions needed for execution concurrency (streams, events…)
• Allows concurrent code execution and memory transfers, and back-to-back kernel execution
• When triggered from different execution contexts (host threads)
└Execution model
GMAC 2
• C/C++
• More flexible and composable API than GMAC
• Allows specifying memory access rights
• Allows explicit consistency management
• POSIX-like semantics
• GMAC primitives are still provided
• Easily implemented on top of the new API
• Single interface for CUDA/OpenCL programs
└GMAC 2 API
GMAC 2
void * gmac::map(void *ptr, size_t count, int prot,
int flags, device_id id)
void * gmac::map(const std::string &path, off_t offset, size_t count,
int prot, int flags, device_id id)
• ptr == NULL creates a new logical object and a view of it for device id
• ptr != NULL creates a view of the specified object for device id
• prot declares the kind of access to the object: read-only, write-only or read-write
• flags specifies special properties for the view
• e.g. gmac::MAP_FIXED: use the address ptr for the new view
└Memory management API
GMAC 2
error gmac::set_gpu(device_id id)
• Sets id as the default GPU used for implicit operations in the current execution context
device_id gmac::get_gpu()
• Gets the id of the default GPU assigned to the current execution context
└Execution model API
GMAC 2
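The per-context default GPU behind set_gpu/get_gpu can be sketched with a thread-local variable. Note the caveat: a plain thread_local models the per-thread part, but the real system also makes a new context inherit its creator's default, which thread_local alone does not (illustrative sketch, not the GMAC implementation):

```cpp
// Per-execution-context default GPU. The initial value 0 is an assumption
// for this sketch; inheritance on thread creation would need extra plumbing.
thread_local int default_gpu = 0;

inline void set_gpu(int id) { default_gpu = id; }
inline int  get_gpu()       { return default_gpu; }
```

Any GMAC call issued without an explicit device (e.g. a kernel launch with no device_id) would consult get_gpu() on the calling thread.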
error func<<<grid, block, shared, device_id>>>(...)
• Executes the kernel func on the GPU with identifier device_id; if no device is specified, the default one is used
• Before executing, transfers the ownership of all GPU-accessible objects to the GPU…
• … and acquires the ownership for the CPU again after kernel execution
• Alternative syntax for OpenCL based on C++11 variadic templates
kernel::launch(config global, config block,
               config offset, device_id)(...)
└Execution model API
GMAC 2
error gmac::acquire(void *ptr, size_t count, int prot, device_id id)
• Ensures that:
1) Device id sees an updated copy of the given data object
2) Device id has the ownership of the object
• prot specifies a weaker or equal access level than in map
error gmac::release(void *ptr, size_t count)
• Releases the ownership of the data object
└Memory coherence/consistency API
GMAC 2
• Previous calls are still available
• gmacMalloc, gmacFree, gmacGlobalMalloc
• Implemented on top of the new calls
• You can mix both (if you know what you are doing)
└Backward compatibility
GMAC 2
• Programmability issues in CUDA systems
• GMAC 2
• Use cases
• Conclusions
Outline
gmacError_t gmacMalloc(void **a, size_t count)
{
    // Map on host memory
    *a = gmac::map(NULL, count,
                   gmac::PROT_READWRITE,
                   0,
                   gmac::cpu_id);
    // Map on the thread’s current GPU, at the same virtual address
    gmac::map(*a, count,
              gmac::PROT_READWRITE,
              gmac::MAP_FIXED,
              gmac::get_gpu());
    return gmacSuccess;
}
└Simple shared allocation
GMAC 2: Use cases
• Share input data across GPUs (aka gmacGlobalMalloc)
float * a = gmac::map(NULL, count,
gmac::PROT_READWRITE,
0,
gmac::cpu_id);
// Map on GPU1
gmac::map(a, count,
gmac::PROT_READ,
gmac::MAP_FIXED, gpu1_id);
// Map on GPU2
gmac::map(a, count,
gmac::PROT_READ,
gmac::MAP_FIXED, gpu2_id);
└Multi-GPU data replication
GMAC 2: Use cases
• Partition data structures among GPUs
// Map on host memory
float * a = gmac::map(NULL, count,
gmac::PROT_READWRITE,
0,
gmac::cpu_id);
// Map on GPU1 memory
gmac::map(a, count/2,
gmac::PROT_READWRITE,
gmac::MAP_FIXED, gpu1_id);
// Map on GPU2 memory
gmac::map(a + count/2, count/2,
gmac::PROT_READWRITE,
gmac::MAP_FIXED, gpu2_id);
└Data partitioning
GMAC 2: Use cases
• Accessing boundaries across GPU memories (peer memory access)
a = gmac::map(NULL, domain_size,
              gmac::PROT_READWRITE,
              0, gpu1_id);
b = gmac::map(NULL, domain_size,
              gmac::PROT_READWRITE,
              0, gpu2_id);
// Map GPU2’s leading boundary read-only on GPU1
gmac::map(b, boundary_size,
          gmac::PROT_READ,
          gmac::MAP_FIXED, gpu1_id);
// Map GPU1’s trailing boundary read-only on GPU2
gmac::map(a + domain_size - boundary_size,
          boundary_size,
          gmac::PROT_READ,
          gmac::MAP_FIXED, gpu2_id);
└Peer-to-peer memory access
GMAC 2: Use cases
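The offset arithmetic in this use case can be checked with a host-side model of the halo exchange: each GPU's working set is its own domain plus the neighbour's boundary, and a stencil at the seam must see consistent data. An illustrative C++ sketch (not GMAC code; the function name and stencil are made up):

```cpp
#include <cstddef>
#include <vector>

// Model of GPU1's side of the halo setup: its own 1D domain plus a read-only
// copy of the first 'halo' elements of GPU2's domain (with peer access those
// elements would alias GPU2's memory instead of being copied). Applies a
// 3-point averaging stencil; the last elements read into the halo.
std::vector<float> stencil_with_halo(const std::vector<float> &dom1,
                                     const std::vector<float> &dom2,
                                     std::size_t halo) {
    std::vector<float> work(dom1);                       // own domain
    work.insert(work.end(), dom2.begin(), dom2.begin() + halo);  // halo
    std::vector<float> out(dom1.size());
    for (std::size_t i = 0; i < dom1.size(); ++i) {
        float l = (i == 0) ? work[i] : work[i - 1];      // clamp left edge
        out[i] = (l + work[i] + work[i + 1]) / 3.0f;     // seam reads halo
    }
    return out;
}
```

The seam element only produces the right value because the neighbour's boundary is visible, which is exactly what the read-only peer mapping above provides without an explicit copy.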
• Input from file
a = gmac::map("input.data", 0, count,
              gmac::PROT_READ,
              0, gpu1_id);
• Optimize writing output to file (host-mapped memory)
a = gmac::map("output.data", 0, count,
              gmac::PROT_WRITE,
              0, gpu1_id);
└File I/O
GMAC 2: Use cases
• Programmability issues in CUDA systems
• GMAC 2
• Use cases
• Conclusions
Outline
• Eases programming of CUDA-based systems
• Offers a flexible API for declaring complex data-sharing schemes
• Transparently exploits the capabilities of the underlying hardware
• Solves the shortcomings of the previous GMAC version
• Backward compatibility
• We need support in the CUDA driver to fully implement replicated objects
(while keeping the same memory address)
└GMAC 2…
Conclusions
• HAL (Hardware Abstraction Layer)
• Generic hardware abstractions
• DSM (Distributed Shared Memory)
• Coherence among address spaces using
release consistency
• ULAS (Unified Logical Address Space)
• Provides a single flat logical space for all memories
• System-level programming model
└Design
GMAC 2
• Questions?
Thanks for your attention
Backup slides
Programmability issues in CUDA systems
• GPUs are still I/O devices in the OS
• No enforcement of system-level policies
• No standard memory management routines
• No shared memory across processes
• No GPU memory swapping
• Copies to/from I/O devices (GPU Direct is an ad-hoc solution for InfiniBand)
• No standard code execution routines
• No scheduling
└Exposing hardware capabilities
GMAC 2
• Advanced code management and execution
error_t pm::load_module(void *ptr, device_id id)
error_t pm::load_module(const std::string &path, device_id id)
kernel_t pm::get_kernel(const std::string &path, device_id id)
error_t pm::execute(kernel_t k, config c, arg_list a, device_id id)
└Programming model
Backup slides
• Physical description layer: description of the components in the system
• Processing units (e.g., GPUs, CPUs)
• Physical Memories (e.g., CPU memory, GPU memory)
• Physical address spaces
• Aggregation of memories that are directly accessible from a processing unit
• CUDA: GPU Direct 2, Host-mapped memory
• OpenCL: clEnqueueMapBuffer
└GMAC 2: Hardware Abstraction Layer
GMAC 2
• System abstractions
• Virtual address space
• struct mm_struct
• Object: physical memory
• Collection of struct page
• View: mapping of an object in a vAS
• struct vm_area_struct
• Virtual processing unit: execution context to be executed on a processing unit
• struct task_struct
└Hardware Abstraction Layer
GMAC 2
• Platform-independent
• CUDA, OpenCL
• Fat pointers
• Transparent inter-device copies
hal::copy(hal::ptr dst, hal::ptr src, size_t count)
• I/O devices
• Event-based API
• Dependency tracking and synchronization
• Timing and tracing
└Hardware Abstraction Layer
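A fat pointer of the kind hal::copy takes can be sketched as an (address space, address) pair, which is enough for a single copy routine to pick the right transfer path. An illustrative sketch (not the actual hal::ptr):

```cpp
// A fat pointer pairs a raw address with the address space it is valid in,
// so one copy() entry point can choose host memcpy, host-to-device,
// device-to-host, or device-to-device (peer or staged) without extra
// arguments. AS id 0 stands for the host, 1..n for GPUs.
struct fat_ptr {
    int as;      // address space id
    void *addr;  // address within that space
};

enum CopyKind { HOST_TO_HOST, HOST_TO_DEV, DEV_TO_HOST, DEV_TO_DEV };

// Classification step a copy() implementation would run first; the real HAL
// would then dispatch to the platform call for that kind.
inline CopyKind classify(fat_ptr dst, fat_ptr src) {
    if (dst.as == 0 && src.as == 0) return HOST_TO_HOST;
    if (src.as == 0) return HOST_TO_DEV;
    if (dst.as == 0) return DEV_TO_HOST;
    return DEV_TO_DEV;
}
```

This is what makes inter-device copies "transparent" at the API level: the caller never names a direction, the fat pointers carry it.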
GMAC 2
• Links memory areas:
dsm::link(hal::ptr dst, hal::ptr src, size_t count, int flags);
• Acquire/release consistency
dsm::acquire(hal::ptr p, size_t count, int flags);
dsm::release(hal::ptr p, size_t count);
• Depending on the conditions, optimizations are enabled
• Eager transfer
└Distributed Shared Memory
GMAC 2
• Available since the introduction of CUDA 4.0, but…
• Not on 32-bit machines
• Data structures must reside in pinned memory
└Unified Virtual Address Space
D3 d12 a-new-meaning-for-efficiency-and-performance
mistercteam
 
Advancements in-tiled-rendering
Advancements in-tiled-renderingAdvancements in-tiled-rendering
Advancements in-tiled-rendering
mistercteam
 
Getting the-best-out-of-d3 d12
Getting the-best-out-of-d3 d12Getting the-best-out-of-d3 d12
Getting the-best-out-of-d3 d12
mistercteam
 

More from mistercteam (17)

Preliminary xsx die_fact_finding
Preliminary xsx die_fact_findingPreliminary xsx die_fact_finding
Preliminary xsx die_fact_finding
 
20150207 howes-gpgpu8-dark secrets
20150207 howes-gpgpu8-dark secrets20150207 howes-gpgpu8-dark secrets
20150207 howes-gpgpu8-dark secrets
 
201210 howes-hsa and-the_modern_gpu
201210 howes-hsa and-the_modern_gpu201210 howes-hsa and-the_modern_gpu
201210 howes-hsa and-the_modern_gpu
 
3 673 (1)
3 673 (1)3 673 (1)
3 673 (1)
 
3 boyd direct3_d12 (1)
3 boyd direct3_d12 (1)3 boyd direct3_d12 (1)
3 boyd direct3_d12 (1)
 
5 baker oxide (1)
5 baker oxide (1)5 baker oxide (1)
5 baker oxide (1)
 
The technology behind_the_elemental_demo_16x9-1248544805
The technology behind_the_elemental_demo_16x9-1248544805The technology behind_the_elemental_demo_16x9-1248544805
The technology behind_the_elemental_demo_16x9-1248544805
 
Lecture14
Lecture14Lecture14
Lecture14
 
01 intro-bps-2011
01 intro-bps-201101 intro-bps-2011
01 intro-bps-2011
 
Gdce 2010 dx11
Gdce 2010 dx11Gdce 2010 dx11
Gdce 2010 dx11
 
Hpg2011 papers kazakov
Hpg2011 papers kazakovHpg2011 papers kazakov
Hpg2011 papers kazakov
 
Dx11 performancereloaded
Dx11 performancereloadedDx11 performancereloaded
Dx11 performancereloaded
 
Mantle programming-guide-and-api-reference
Mantle programming-guide-and-api-referenceMantle programming-guide-and-api-reference
Mantle programming-guide-and-api-reference
 
D3 d12 a-new-meaning-for-efficiency-and-performance
D3 d12 a-new-meaning-for-efficiency-and-performanceD3 d12 a-new-meaning-for-efficiency-and-performance
D3 d12 a-new-meaning-for-efficiency-and-performance
 
D3 d12 a-new-meaning-for-efficiency-and-performance
D3 d12 a-new-meaning-for-efficiency-and-performanceD3 d12 a-new-meaning-for-efficiency-and-performance
D3 d12 a-new-meaning-for-efficiency-and-performance
 
Advancements in-tiled-rendering
Advancements in-tiled-renderingAdvancements in-tiled-rendering
Advancements in-tiled-rendering
 
Getting the-best-out-of-d3 d12
Getting the-best-out-of-d3 d12Getting the-best-out-of-d3 d12
Getting the-best-out-of-d3 d12
 

Recently uploaded

GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 

Recently uploaded (20)

GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 

S0333 gtc2012-gmac-programming-cuda

  • 1. GMAC 2: Easy and efficient programming for CUDA-based systems
    Javier Cabezas (BSC/Multicoreware), Isaac Gelado (BSC/Multicoreware), Lluís Vilanova (BSC), Nacho Navarro (UPC/BSC), Wen-Mei Hwu (UIUC/Multicoreware)
    NVIDIA GTC, May 17th, 2012
  • 2. Introduction └ Barcelona Supercomputing Center (BSC)
    • BSC develops a wide range of High Performance Computing applications
      • Alya (computational mechanics), RTM (seismic imaging), meteorological modeling, air quality, …
    • Explores the trends in computer architecture (GPUs, FPGAs, …)
    • Programming models research group
      • OmpSs: task-based programming model that exploits parallelism at different levels (multi-core, multi-GPU, multi-node, …)
    • Accelerators research group
      • System support for accelerators
      • Collaboration with UIUC and Multicoreware Inc.
  • 3. Introduction └ OmpSs
    • Goal: programmability of heterogeneous parallel architectures
    • Task-based programming model
      • Simple pragma annotations
      • Automatic task dependency tracking
    • Needs support for a growing number of devices, APIs and platforms

      #pragma omp task inout([NB*NB] A)
      void spotrf_tile(float *A, long NB);
      #pragma omp task input([NB*NB] A, [NB*NB] B) inout([NB*NB] C)
      void sgemm_tile(float *A, float *B, float *C, u_long NB);
      #pragma omp task input([NB*NB] A) inout([NB*NB] C)
      void ssyrk_tile(float *A, float *C, long NB);

      for (int j = 0; j < N; j += BS) {
          for (int k = 0; k < j; k += BS)
              for (int i = j + BS; i < N; i += BS)
                  sgemm_tile(BS, N, &A[k][i], &A[k][j], &A[j][i]);
          for (int i = 0; i < j; i += BS)
              ssyrk_tile(BS, N, &A[i][j], &A[j][j]);
          spotrf_tile(BS, N, &A[j][j]);
          for (int i = j + BS; i < N; i += BS)
              strsm_tile(BS, N, &A[j][j], &A[j][i]);
      }
  • 4. Introduction └ GMAC
    • GMAC provides an easy system-level programming model for accelerator-based systems
      • Shared-memory model
        • Single pointer per allocation
        • No explicit copies
      • Portability across systems
        • Hardware capabilities
        • System configuration
      • Portability across platforms
        • CUDA, OpenCL
        • Windows, Linux, Mac OS X

      void vec_add(size_t N) {
          float *a, *b, *c;
          gmacMalloc(&a, N * sizeof(float));
          gmacMalloc(&b, N * sizeof(float));
          gmacMalloc(&c, N * sizeof(float));
          // Runs on the CPU
          init_array(a, N);
          init_array(b, N);
          // Runs on the GPU
          add_kernel<<<N / 512, 512>>>(c, a, b, N);
      }
  • 5. Introduction └ GMAC
    • Since the last GTC, many exciting things have happened
      • CUDA 4
      • Fermi/Kepler
      • UVAS
      • Peer-to-peer memory transfers
      • Host threads can access any device memory (cudaMemcpy)
      • …
    • … but most of them were already implemented in GMAC
      • So we have been going forward
  • 6. Outline
    • Programmability issues in CUDA systems
    • GMAC 2
    • Use cases
    • Conclusions
  • 7. Programmability issues in CUDA systems └ A universe of capabilities
    • New memory addressing capabilities
      • Private memories
        • Copies between memories go through PCIe
      • Host-mapped memory (Compute Capability 1.1)
        • e.g., output buffers in systems with discrete GPUs
        • Avoids unnecessary copies in shared-memory systems (e.g., ION)
      • Peer memory access (Compute Capability 2.0)
        • Avoids inter-GPU copies
        • Only for GPUs on the same PCIe bus
    [Diagram: CPU (DDR) and two GPUs (GDDR) connected through PCIe]
  • 8. Programmability issues in CUDA systems └ A universe of capabilities
    • New memory addressing capabilities (performance)
    [Plots: execution time (ms, log scale) vs. problem size — a) matrix multiplication (local in, remote out): CopyOutput vs. PeerOutput; b) matrix multiplication (remote in, local out): CopyInput vs. PeerInput; c) GPU 3D stencil computation (25-point): Copy vs. Peer]
  • 9. Programmability issues in CUDA systems └ A universe of capabilities
    • Execution concurrency
      • Concurrency between GPU execution and memory transfers (Compute Capability 1.1)
      • Concurrent HostToDevice/DeviceToHost memory transfers (Compute Capability 2.0)
      • Concurrent kernel execution (Compute Capability 2.0)
        • Limited to back-to-back execution
        • Limited to a single CUDA context
      • Complete concurrent kernel execution (Compute Capability 3.5)
  • 10. Programmability issues in CUDA systems └ Access to hardware capabilities
    • Describing memory accessibility
      • Private memory: cudaMalloc/malloc
      • GPU → host shared memory: cudaHostRegister/cudaHostAlloc
        • Host memory is pinned and mapped into the GPUs’ page tables
      • GPU ↔ GPU shared memory: cudaDeviceEnablePeerAccess
        • GPU page tables are synchronized
        • Context-grain sharing (and only works for contexts on different GPUs)
    • Describing dependencies among operations
      • CUDA stream: in-order queue that executes implicitly dependent operations
      • CUDA event: marker in a stream that allows finer-grain synchronization
  • 11. Outline
    • Programmability issues in CUDA systems
    • GMAC 2
    • Use cases
    • Conclusions
  • 12. GMAC 2 └ Memory model
    • A process provides a single logical address space
      • Defines the memory objects that can be accessed in a program
      • Implemented on top of multiple virtual address spaces
    • An address space defines which memory objects can be accessed from a device
      • In contrast to the unified virtual address space offered by CUDA
    [Diagram: one process spanning the CPU, GPU 1 and GPU 2 address spaces]
  • 13. GMAC 2 └ Memory model
    • Memory objects can be made accessible to any device
      • Single allocation + remote access
      • Replication + software-based coherence is used when no hardware access is available
    • Each virtual address space (AS) can have different views of an object
      • A view is created by mapping an object onto a virtual address space
      • The properties of the view define the behavior of the coherence protocol (DSM)
    [Diagram: one object with three views — R/W in AS 1, R/W in AS 2, R in AS 3]
  • 14. GMAC 2 └ Memory model
    [Diagrams across the physical/virtual/logical levels of the CPU, GPU 1 and GPU 2 memories: a) replication with software-based coherence; b) host-mapped memory; peer memory access]
  • 15. GMAC 2 └ Memory model
    • Automatic optimizations
      • Pinned memory management for memory transfers
      • Eager memory transfers
      • Features available in the hardware are used to optimize some scenarios
    • Implementation details: restrictions
      • The CUDA UVAS makes replicated objects have different virtual addresses
        • Stored pointers cannot be used on the device
      • Mappings work at page granularity
  • 16. GMAC 2 └ Execution model
    • A process contains several execution contexts
      • The execution context (OS thread) is the basic unit of concurrency
    • Each execution context is assigned a default GPU
      • Inherited on creation
      • Like the CUDA 4 “current” device (cudaSetDevice)
    • CUDA code is executed…
      • Implicitly: on the default GPU
      • Explicitly: on a specific GPU (using the stream parameter of the call)
  • 17. GMAC 2 └ Execution model
    • Data accessible on a GPU is implicitly acquired/released at kernel-call boundaries
      • Provides implicit synchronization on data dependencies
    • Dependencies between kernel calls are defined by the programmer using the standard inter-thread synchronization mechanisms
    • Explicit consistency management is also available for fine tuning
  • 18. GMAC 2 └ Explicit consistency
      kernel<<<grid, block>>>(o, p);   // -> releases ownership of o and p
      for (size_t i = 0; i < SIZE_C; ++i) {
          q[i]++;                      // q stays owned by the CPU
      }
      gmacThreadSynchronize();         // -> acquires ownership of o and p
  • 19. GMAC 2 └ Execution model
    • Hides the CUDA abstractions needed for execution concurrency (streams, events, …)
    • Allows concurrent code execution and memory transfers, and back-to-back kernel execution
      • When triggered from different execution contexts (host threads)
  • 20. GMAC 2 └ GMAC 2 API
    • C/C++
    • More flexible and composable API than the original GMAC
      • Allows specifying memory access rights
      • Allows explicit consistency management
      • POSIX-like semantics
    • The original GMAC primitives are still provided
      • Easily implemented on top of the new API
    • Single interface for CUDA/OpenCL programs
  • 21. GMAC 2 └ Memory management API
      void *gmac::map(void *ptr, size_t count, int prot, int flags, device_id id)
      void *gmac::map(const std::string &path, off_t offset, size_t count, int prot, int flags, device_id id)
    • ptr == NULL creates a new logical object and a view of the object for device id
    • ptr != NULL creates a view of the specified object for device id
    • prot declares the kind of access to the object: RO, W, RW
    • flags specifies special properties for the view
      • e.g., gmac::MAP_FIXED: use the address ptr for the new view
  • 22. GMAC 2 └ Execution model API
      error gmac::set_gpu(device_id id)
    • Sets id as the default GPU used for implicit operations in the current execution context
      device_id gmac::get_gpu()
    • Returns the id of the default GPU assigned to the current execution context
  • 23. GMAC 2 └ Execution model API
      error func<<<grid, block, shared, device_id>>>(...)
    • Executes the kernel func on the GPU with identifier device_id; if no device is specified, the default one is used
    • Before executing, transfers the ownership of all the GPU-accessible objects to the GPU…
    • … and acquires the ownership for the CPU again after kernel execution
    • Alternative syntax for OpenCL based on C++11 variadic templates:
      kernel.launch(config global, config block, config offset, device_id)(...)
  • 24. GMAC 2 └ Memory coherence/consistency API
      error gmac::acquire(void *ptr, size_t count, int prot, device_id id)
    • Ensures that:
      1) device id sees an updated copy of the given data object
      2) device id has the ownership of the object
    • prot specifies an access level weaker than or equal to the one given in map
      error gmac::release(void *ptr, size_t count)
    • Releases the ownership of the data object
  • 25. GMAC 2 └ Backward compatibility
    • The previous calls are still available
      • gmacMalloc, gmacFree, gmacGlobalMalloc
      • Implemented on top of the new calls
    • You can mix both (if you know what you are doing)
  • 26. Outline
    • Programmability issues in CUDA systems
    • GMAC 2
    • Use cases
    • Conclusions
  • 27. GMAC 2: Use cases └ Simple shared allocation
      gmacError_t gmacMalloc(void **a, size_t count) {
          // Map on host memory
          *a = gmac::map(NULL, count, gmac::PROT_READWRITE, 0, gmac::cpu_id);
          // Map on the thread’s current GPU
          gmac::map(*a, count, gmac::PROT_READWRITE, gmac::MAP_FIXED, gmac::get_gpu());
          return gmacSuccess;
      }
  • 28. GMAC 2: Use cases └ Multi-GPU data replication
    • Share input data across GPUs (aka gmacGlobalMalloc)
      float *a = gmac::map(NULL, count, gmac::PROT_READWRITE, 0, gmac::cpu_id);
      // Map on GPU1
      gmac::map(a, count, gmac::PROT_READ, gmac::MAP_FIXED, gpu1_id);
      // Map on GPU2
      gmac::map(a, count, gmac::PROT_READ, gmac::MAP_FIXED, gpu2_id);
  • 29. GMAC 2: Use cases └ Data partitioning
    • Partition data structures among GPUs
      // Map on host memory
      float *a = gmac::map(NULL, count, gmac::PROT_READWRITE, 0, gmac::cpu_id);
      // Map the first half on GPU1 memory
      gmac::map(a, count / 2, gmac::PROT_READWRITE, gmac::MAP_FIXED, gpu1_id);
      // Map the second half on GPU2 memory
      gmac::map(a + count / 2, count / 2, gmac::PROT_READWRITE, gmac::MAP_FIXED, gpu2_id);
  • 30. GMAC 2: Use cases └ Peer-to-peer memory access
    • Accessing boundaries across GPU memories (peer memory access)
      a = gmac::map(NULL, domain_size, gmac::PROT_READWRITE, gmac::MAP_FIXED, gpu1_id);
      b = gmac::map(NULL, domain_size, gmac::PROT_READWRITE, gmac::MAP_FIXED, gpu2_id);
      // Make each boundary readable from the neighboring GPU
      gmac::map(b, boundary_size, gmac::PROT_READ, gmac::MAP_FIXED, gpu1_id);
      gmac::map(a + domain_size - boundary_size, boundary_size, gmac::PROT_READ, gmac::MAP_FIXED, gpu2_id);
  • 31. GMAC 2: Use cases └ File I/O
    • Input from a file
      a = gmac::map("input.data", 0, count, gmac::PROT_READ, 0, gpu1_id);
    • Optimized writing of output to a file (host-mapped memory)
      a = gmac::map("output.data", 0, count, gmac::PROT_WRITE, 0, gpu1_id);
  • 32. Outline
    • Programmability issues in CUDA systems
    • GMAC 2
    • Use cases
    • Conclusions
  • 33. Conclusions └ GMAC 2…
    • Eases programming of CUDA-based systems
    • Offers a flexible API for declaring complex data schemes
    • Transparently exploits the capabilities of the underlying hardware
    • Solves the shortcomings of the previous GMAC version
    • Backward compatibility
    • We need support in the CUDA driver to fully implement replicated objects (while keeping the same memory address)
  • 34. GMAC 2 └ Design
    • HAL (Hardware Abstraction Layer)
      • Generic hardware abstractions
    • DSM (Distributed Shared Memory)
      • Coherence among address spaces using release consistency
    • ULAS (Unified Logical Address Space)
      • Provides a single flat logical space for all memories
    • System-level programming model
    [Diagram: layered stack — the programming model on top of the ULAS and DSM, which sit on top of the HAL]
  • 35. Thanks for your attention
    • Questions?
  • 36. Backup slides
  • 37. Programmability issues in CUDA systems └ Exposing hardware capabilities
    • GPUs are still I/O devices in the OS
      • No enforcement of system-level policies
    • No standard memory management routines
      • No shared memory across processes
      • No GPU memory swapping
      • Copies to/from I/O devices (GPU Direct is an ad-hoc solution for InfiniBand)
    • No standard code execution routines
      • No scheduling
  • 38. GMAC 2 └ Programming model
    • Advanced code management and execution
      error_t pm::load_module(void *ptr, device_id id)
      error_t pm::load_module(const std::string &path, device_id id)
      kernel_t pm::get_kernel(const std::string &path, device_id id)
      error_t pm::execute(kernel_t k, config c, arg_list a, device_id id)
  • 39. Backup slides └ GMAC 2: Hardware Abstraction Layer
    • Physical description layer: description of the components in the system
      • Processing units (e.g., GPUs, CPUs)
      • Physical memories (e.g., CPU memory, GPU memory)
      • Physical address spaces
        • Aggregation of memories that are directly accessible from a processing unit
        • CUDA: GPU Direct 2, host-mapped memory
        • OpenCL: clEnqueueMapBuffer
  • 40. GMAC 2 └ Hardware Abstraction Layer
    • System abstractions (Linux kernel analogues in parentheses)
      • Virtual address space (struct mm_struct)
      • Object: physical memory (a collection of struct page)
      • View: mapping of an object in a virtual address space (struct vm_area_struct)
      • Virtual processing unit: execution context to be executed on a processing unit (struct task_struct)
  • 41. GMAC 2 └ Hardware Abstraction Layer
    • Platform-independent
      • CUDA, OpenCL
    • Fat pointers
      • Transparent inter-device copies
        hal::copy(hal::ptr dst, hal::ptr src, size_t count)
      • I/O devices
    • Event-based API
      • Dependency tracking and synchronization
      • Timing and tracing
  • 42. GMAC 2 └ Distributed Shared Memory
    • Links memory areas:
      dsm::link(hal::ptr dst, hal::ptr src, size_t count, int flags);
    • Acquire/release consistency
      dsm::acquire(hal::ptr p, size_t count, int flags);
      dsm::release(hal::ptr p, size_t count);
    • Depending on the conditions, optimizations are enabled
      • Eager transfers
  • 43. GMAC 2 └ Unified Virtual Address Space
    • Available since the introduction of CUDA 4.0… but:
      • Not on 32-bit machines
      • Data structures must reside in pinned memory