Dark Secrets:
Heterogeneous Memory Models
Lee Howes
2015-02-07, Qualcomm Technologies Inc.
Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a
wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and
development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT, and QWI. References
to “Qualcomm” may mean Qualcomm Incorporated, or subsidiaries or business units within the Qualcomm corporate structure, as applicable.
For more information on Qualcomm, visit us at:
www.qualcomm.com & www.qualcomm.com/blog
Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may
be trademarks or registered trademarks of their respective owners.
Agenda
− Introduction – why memory consistency
− Current restrictions
− Basics of OpenCL 2.0
− Heterogeneity in OpenCL 2.0
− Heterogeneous memory ordering
− Summary

Introduction – why memory consistency
Weak memory models
Many programming languages have coped with weak memory models
These are fixed in various ways:
− Platform-specific rules
− Conservative compilation behavior
A clear memory model allows developers to understand their program's behavior!
− Or part of it, anyway. It's not the only requirement, but it is a start.
Many current models are based on the Data-Race-Free (DRF) work
− Special operations are used to maintain order
− Between special operations, reordering is valid for efficiency
Better memory models
Many languages have tried to formalize their memory models
− Java, C++, C…
− Why?
OpenCL and HSA are no exceptions
− Both have developed models based on the DRF work, via C11/C++11
Ben Gaster, Derek Hower and I have collaborated on HRF-Relaxed
− An extension of DRF models for heterogeneous systems
− To appear in ACM TACO
Unfortunately, even a stronger memory model doesn't solve all your problems…
Current restrictions
Weaknesses in the OpenCL 1.x memory model
Three specific issues:
− Coarse-grained access to data
− Separate addressing between devices and the host process
− Weak and poorly defined ordering controls
Any of these can be worked around with vendor extensions or device knowledge
Coarse host->device synchronization
[Diagram: a host, an OpenCL device, and a data buffer]
− Allocate: the host controls the data
− Run command: a single running command owns the entire allocation
− Map: the host can access the data using map/unmap operations
− Unmap: ownership returns to the device before the next command runs
Separate addressing
Both host and device view the data through the same struct:

struct Foo {
  Foo *ptr;
  int val;
};

The host updates memory with a pointer:

data[n].ptr = &data[n+1];
data[n+1].val = 3;

The device tries to read it – but what is the value of a?

int a = data[n].ptr->val;

Unfortunately, the address of the data may change between host and device, so the pointer the host stored need not be valid on the device.
Bounds on visibility
Controlling memory ordering is challenging
[Diagram: work-items 0–3 in WorkGroup 0 and WorkGroup 1, sharing memory]
Within a group we can synchronize:
− A work-item in WorkGroup 0 writes to memory
− A barrier synchronizes the active threads and their constituent work-items – physically, this probably means within a single core
− Another work-item in WorkGroup 0 then reads the value
Between groups we can't:
− A work-item in WorkGroup 1 reads – is the write visible?
Bounds on visibility
What if we use fences?
[Diagram: work-items in WorkGroup 0 and WorkGroup 1, sharing memory]
− A work-item in WorkGroup 0 writes
− A fence ensures the write completes for that work-item
− A fence at the other end, in WorkGroup 1
− Ensure the read is after the fence – the work-item in WorkGroup 1 reads
Meaning of fences
When did the fence happen?
Seeing a write that was made after a fence probably guarantees that the fence completed
− The spec is not very clear on this
There is no coherence guarantee
− Need the write ever complete?
− If it doesn't complete, who can see it, and who can know that the fence happened?
Can the flag be updated without a race?
− For that matter, what is a race?
Spliet et al.: KMA
− Weak ordering differences between platforms due to the poorly defined model

{{
  data[n] = value;    ||  if (flag) {
  fence(…);           ||    fence(…);
  flag = trigger;     ||    value = data[n];
                      ||  }
}}
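For comparison, here is the same two-column example written against C++11, whose DRF-based fence semantics OpenCL 2.0 inherits. This is a sketch, not the OpenCL API: the names `produce`, `consume`, and `run_fence_example` are mine. The release fence before the flag store pairs with the acquire fence after the flag load, so a consumer that observes the trigger must also observe the data.

```cpp
#include <atomic>
#include <thread>

static int data_slot = 0;           // plays the role of data[n]
static std::atomic<int> flag{0};    // plays the role of flag

void produce(int value) {
    data_slot = value;                                    // plain (non-atomic) store
    std::atomic_thread_fence(std::memory_order_release);  // fence(...)
    flag.store(1, std::memory_order_relaxed);             // flag = trigger
}

int consume() {
    while (flag.load(std::memory_order_relaxed) != 1) {}  // wait for the trigger
    std::atomic_thread_fence(std::memory_order_acquire);  // fence at the other end
    return data_slot;  // the fences synchronize: this sees the produced value
}

int run_fence_example() {
    std::thread producer([] { produce(42); });
    int result = consume();  // this thread acts as the consumer
    producer.join();
    return result;
}
```

In C++ (unlike OpenCL 1.x) the answer to "when did the fence happen?" is well defined: the fences synchronize exactly when the acquire side reads a value written after the release fence.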
Basics of OpenCL 2.0
Sharing virtual addresses
With shared virtual memory, host and device again view the data through the same struct Foo. The host updates memory with a pointer:

data[n].ptr = &data[n+1];
data[n+1].val = 3;

The device tries to read it – but what is the value of a?
Now the address does not change!

int a = data[n].ptr->val;
assert(a == 3);
Sharing data – when does the value change?
Coarse-grained SVM: unmap on the host, event dependency on the device
− Host: data[n].ptr = &data[n+1]; data[n+1].val = 3; followed by clUnmapMemObject
− Device: int a = data[n].ptr->val; after waiting on event e
− The granularity of a data race covers the whole buffer
Fine-grained SVM: event dependency on the device – caches will flush as necessary
− Host: data[n].ptr = &data[n+1]; data[n+1].val = 3;
− Device: int a = data[n].ptr->val; after waiting on event e
− Data will be merged – data races are at byte granularity
Sharing data – when does the value change?
Fine-grained SVM with atomic support: no dispatch-level ordering necessary
− Host: data[n].ptr = &data[n+1]; atomic-store data[n+1].val = 3;
− Device: int a = atomic-load data[n].ptr->val;
− Byte-level races are avoided by using the atomics in the memory model
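A hedged C++11 sketch of this fine-grained-with-atomics pattern between two host threads (the function names and the choice to make the pointer field itself atomic are mine, for a minimal self-contained example): the producer fills in val and then publishes the pointer with a release store; a consumer that load-acquires the pointer is guaranteed to see val == 3 with no dispatch-level ordering at all.

```cpp
#include <atomic>
#include <thread>

// Mirrors the slides' struct Foo, with the pointer made atomic so it can
// carry the release/acquire ordering.
struct Foo {
    std::atomic<Foo*> ptr{nullptr};
    int val = 0;
};

Foo shared_data[2];  // shared_data[0] is "data[n]", shared_data[1] is "data[n+1]"

void host_side() {
    shared_data[1].val = 3;                                       // plain store
    shared_data[0].ptr.store(&shared_data[1],
                             std::memory_order_release);          // publish
}

int device_side() {
    Foo* p;
    // Spin until the pointer is published; the acquire load orders the
    // subsequent read of val after the host's store of 3.
    while ((p = shared_data[0].ptr.load(std::memory_order_acquire)) == nullptr) {}
    return p->val;  // guaranteed to be 3
}

int run_publication_example() {
    std::thread host(host_side);
    int a = device_side();
    host.join();
    return a;
}
```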
Sharing virtual addresses
This is an ease-of-programming concern
− Complex apps with complex data structures can work with their data more effectively
− Less work to package and repackage data
It can also improve performance
− Less overhead in converting pointers to offsets and back
− Less overhead in repacking data
− Lower overhead for data copies when appropriate hardware support is present
The OpenCL 2.0 memory model
SC-for-Data-Race-Free by default – release consistency for flexibility
Most memory operations are entirely unordered
− The compiler can reorder
− The caches can reorder
Unordered accesses to the same location are races
− Behaviour is undefined
Ordering operations (atomics) may update the same location
− Ordering operations order other operations relative to the ordering operation
Ordering operations default to sequentially consistent semantics
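The "SC by default, release consistency for flexibility" split is visible in the C++11-style atomics API that OpenCL 2.0's atomics mirror: omitting the memory-order argument means sequentially consistent, while explicit arguments opt in to the weaker, cheaper orderings. A single-threaded sketch of the API shape (the function name is mine):

```cpp
#include <atomic>

std::atomic<int> x{0};

int demo_orders() {
    x.store(1);                                // defaults to memory_order_seq_cst
    x.store(2, std::memory_order_release);     // explicit release: weaker, cheaper
    int a = x.load();                          // defaults to memory_order_seq_cst
    int b = x.load(std::memory_order_acquire); // acquire: pairs with a release
    int c = x.load(std::memory_order_relaxed); // relaxed: atomic, but no ordering
    return a + b + c;  // single-threaded here, so every load reads 2
}
```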
So what can we safely do with synchronization?
Take a spin wait…
Commonly tried on OpenCL 1.x
− Not at all portable!
− Due to weak memory ordering and the lack of forward-progress guarantees
Is the situation any better for OpenCL 2.0?
− Yes, memory ordering is now under control!
− Is forward progress?
So what can we safely do with synchronization?
A CPU thread waits on a work-item – does it work?

OpenCL work-item:  // do the work, then
                   store-release value to flag
CPU thread:        while (load-acquire flag != value) {}
                   // do work dependent on flag – happens-before holds

Yes and no
− Conceptually similar to a CPU waiting on an event – i.e. for all work-items to complete
− Another app could occupy the device, graphics could consume the device, or the work may interfere with graphics
− There is a risk of the whole device being occupied elsewhere, so work on the assumption that this context owns the device
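The store-release / load-acquire spin-wait above, written as a C++11 host-side sketch of the one case the deck says works (one thread standing in for the OpenCL work-item, the other for the waiting CPU thread; the names are mine):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> done_flag{0};
int work_result = 0;

void work_item(int value) {
    work_result = value * 2;                          // do the work, then signal:
    done_flag.store(value, std::memory_order_release);  // store-release value to flag
}

int cpu_thread(int value) {
    while (done_flag.load(std::memory_order_acquire) != value) {}  // spin wait
    // The release store synchronizes-with the acquire load that reads it,
    // so the work-item's writes happen-before this read.
    return work_result;
}

int run_spin_wait_example() {
    std::thread t([] { work_item(21); });
    int r = cpu_thread(21);
    t.join();
    return r;
}
```

On the host this is safe because OS threads make independent forward progress – exactly the guarantee the following slides show is missing between work-items.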
So what can we safely do with synchronization?
A work-item on one core waits for a work-item on another core – does it work?
(Same store-release / load-acquire pattern, between work-items on core 0 and core 1.)
Probably
− The spec doesn't guarantee this
− How do you know what core your work-item is on?
− All you know is which work-group it is in
So what can we safely do with synchronization?
A work-item on one SIMD thread waits for a work-item on another SIMD thread – does it work?
Sometimes
− On some architectures this is fine
− On other architectures, if both SIMD threads are on the same core: starvation
− Thread 0 may never run to satisfy thread 1's spin wait
So what can we safely do with synchronization?
A work-item waits for a work-item in the same SIMD thread – does it work?
No (well, sometimes yes, but rarely)
− Fairly widely understood, but sometimes hard for new developers
− If you think about the mapping to SIMD it is fairly obvious
− A single program counter can't be in two places at once – though some architectures can track multiple
So what can we safely do with synchronization?
Work-items in the same work-group
Maybe, maybe not
− It depends entirely on where the work-items are mapped within the group
− Same SIMD thread – no
− Different threads – maybe
− The developer can often tell, but it isn't portable and the compiler can easily break it
So what can we safely do with synchronization?
Work-items in different work-groups
Maybe, maybe not
− It depends entirely on where the work-groups are placed on the device
− Two work-groups on the same core – you have the thread-to-thread case
− Two work-groups on different cores – it probably works
− There is no way to control the mapping!
So what can we safely do with synchronization?
Overall, a fairly poor situation
Realistically
− Spin waits on other OpenCL work-items are just not portable
− This leaves very limited use of the memory model
So what can you do?
− Communicate that work has passed a certain point
− Update shared data buffers with flags
− Use lock-free FIFO data structures to share data
− OpenCL 2.0's sub-group extension provides limited but important forward-progress guarantees
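One of the safe options listed above – a lock-free FIFO – can be sketched as a C++11 single-producer/single-consumer ring buffer (my own minimal design, not taken from the deck). The producer publishes each element with a release store to the tail index; the consumer's acquire load of the tail guarantees it sees the element's data. No spin wait is required: both sides can poll and fail without blocking the other.

```cpp
#include <atomic>
#include <cstddef>

template <typename T, size_t N>
class SpscFifo {
    T buf_[N];
    std::atomic<size_t> head_{0};  // next slot to pop (consumer-owned)
    std::atomic<size_t> tail_{0};  // next slot to push (producer-owned)
public:
    bool push(const T& v) {        // producer side only
        size_t t = tail_.load(std::memory_order_relaxed);
        size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);  // publish the element
        return true;
    }
    bool pop(T& out) {             // consumer side only
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);  // free the slot
        return true;
    }
};
```

With one slot reserved to distinguish full from empty, a SpscFifo<int, 4> holds up to three elements. An OpenCL 2.0 version would use the same acquire/release pattern with atomics at an appropriate scope.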
Heterogeneity in OpenCL 2.0
Lowering implementation cost
Even acquire-release consistency can be expensive
In particular, always synchronizing the whole system is expensive
− A discrete GPU does not want to always make data visible across the PCIe interface
− A single core shuffling data in its local cache does not want to interfere with DRAM-consuming throughput tasks
OpenCL 2.0 optimizes this using the concept of synchronization scopes
Hierarchical memory synchronization
Synchronization is expensive, and not all communication is global – so bound it
[Diagram: an OpenCL system containing CPU, GPU and DSP devices; each GPU core runs work-groups (WG), which contain sub-groups (SG), which contain work-items (WI)]
Scopes form a hierarchy:
− Sub-group scope
− Work-group scope
− Device scope
− All-SVM-devices scope
Release to the appropriate scope, acquire from the matching scope:
− store-release work-group-scope x pairs with load-acquire work-group-scope x within a work-group
− store-release device-scope x pairs with load-acquire device-scope x across the device
If the scopes do not reach far enough – for example, work-group scope used between work-groups on different cores – this is a race
Scoped synchronization
This allows aggressive hardware optimization of coherence traffic
− GPU coherence in particular is often expensive – GPUs generate a lot of memory traffic
The memory model defines synchronization rules in terms of scopes
− Insufficient scope is a race
− Non-matching scopes race (in the OpenCL 2.0 model – this isn't strictly necessary)
Address space orderings
Four address spaces in OpenCL 2.0
− Constant and private are not relevant for communication
− Global and local maintain separate orders in the memory model
Synchronization, acquire/release behavior etc. apply only to local OR global, not both
− The global release->acquire ordering below does not order the updates to a!

SIMD thread 0:  local-store a = 2
                global-store-release value to flag
SIMD thread 1:  while (global-load-acquire flag != value) {}
                assert(local-load a == 2)   // not guaranteed: only a local-happens-before edge would order it
Multiple orderings in the presence of generic pointers
Take an example like this:
− void updateRelease(int *flag, int *data);
If I lock via the flag, do I safely update data or not?
− The function takes pointers with no address space
− The ordering depends on the address space
− The address space depends on the types of the pointers at the call site, or even earlier!
There are ways to get the fence flags right, but it is messy and care must be taken
Heterogeneous memory ordering
Limits of sequential consistency
Sequential consistency aims to be a simple, easy-to-understand model
− Behave as if the ordering operations are simply interleaved
In the OpenCL model, sequential consistency is guaranteed only in specific cases:
− All memory_order_seq_cst operations have the scope memory_scope_all_svm_devices and all affected memory locations are contained in system allocations or fine-grain SVM buffers with atomics support
− All memory_order_seq_cst operations have the scope memory_scope_device and no affected memory locations are located in system-allocated regions or fine-grain SVM buffers with atomics support
Consider what this means…
− You start modifying your app to have more fine-grained sharing of some structures
− Suddenly your atomics are not sequentially consistent at all!
− What about SC operations to local memory?
Is such a limit necessary?
First, scopes…
These are data-race-free memory models
They only guarantee ordering in the absence of races
− So we only actually order things we can observe! Order can be relaxed between atomics.
[Diagram: two GPU cores, each running work-groups containing sub-groups and work-items]
− In one part of the execution, work-group-scope SC operations are ordered SC within their work-group – let's assume this is valid
− Elsewhere, another set of work-group-scope SC operations is likewise ordered SC
− Any access from one work-group across to the other is a race: it is equivalent to a non-atomic operation, and therefore invalid
Making sequential consistency a partial order
In Hower et al.'s work on Heterogeneous-Race-Free memory models this is made explicit
Sequential consistency can be maintained with scopes
An access beyond its valid scope
− Is unordered
− Is not observable
− So this is still valid sequential consistency: everything observable, i.e. race-free, is SC
Extending to coarse-grained memory
We can apply SC semantics here for the same reason
− Actions on coarse-grained memory are not visible to other clients
− However – coarse buffers don't fit cleanly in a hierarchical model
In Gaster et al. (to appear in ACM TACO) we use the concept of observability as a memory model extension
− "At any given point in time a given location will be available in a particular set of scope instances out to some maximum instance and by some set of actors. Only memory operations that are inclusive with that maximal scope instance will observe changes to those locations."
Observability can migrate using API actions
− Map, unmap, and event dependencies
Observability
[Diagram: an OpenCL system with CPU, GPU and DSP devices and a memory allocation]
− Initial state: observability is bounded at device scope on the GPU
− Map operation: observability moves to device scope on the CPU
− Unmap: transfers the dependence back to an OpenCL device
The point
We can cleanly put scopes and coarse-grained memory in the same memory consistency model
− It is more complicated than, say, C++
− It is practical to formalize, and hopefully easy to explain
There is no need for quirky corner cases
We merely rethink slightly
− Instead of a total order S over all SC operations, we have, for each agent a, a total order Sa over all SC operations observable by that agent, such that all these orders are consistent with each other
− This is a relatively small step from the DRF idea, which is effectively that non-atomic operations are not observable, and thus do not need to be strictly ordered
Joining address spaces
Separate orders for local and global memory are likely to be painful for programmers
Do we need them?
− We have formalized the concept (also in the TACO paper) using multiple happens-before orders that subset the memory operations
− We also describe a bridging-synchronization-order as a formal way to allow join points
− It is messy as a formalization, and the harm to the developer is probably significant
Joining address spaces
However!
− Hardware really does have different instructions and different timings for accessing different memories
− Can all hardware efficiently synchronize these memory interfaces?
[Diagram: a processing core with entirely different interfaces to local memory, the L1 cache, the texture cache, and an L2 cache backed by DRAM]
− We will have to see how this plays out
− Consider this a warning to take care if you try to use this aspect of OpenCL 2.0
Summary
Things are improving
Like mainstream languages, heterogeneous programming models are adopting firm memory models
Without fundamental execution-model guarantees their usefulness is limited
We are making progress on both of these counts
Thank you
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
Sagar Dolas
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
Sagar Dolas
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
lucenerevolution
 

20150207 howes-gpgpu8-dark secrets

7
Weakness in the OpenCL 1.x memory model
Three specific issues:
− Coarse-grained access to data
− Separate addressing between devices and the host process
− Weak and poorly defined ordering controls
Any of these can be worked around with vendor extensions or device knowledge

8
Coarse host->device synchronization
The host controls the data
[Diagram: the host allocates a data buffer for the OpenCL device]

9
Coarse host->device synchronization
A single running command owns the entire allocation
[Diagram: a command running on the OpenCL device takes ownership of the buffer]

10
Coarse host->device synchronization
The host can access the data using map/unmap operations
[Diagram: the host maps the buffer back]

13
Separate addressing
We can update memory with a pointer:
  data[n].ptr = &data[n+1];
  data[n+1].val = 3;
[Diagram: host and OpenCL device each see Data[n] and Data[n+1] as struct Foo { Foo *ptr; int val; }]

14
Separate addressing
We try to read it – but what is the value of a?
  int a = data[n].ptr->val;
[Diagram: the host follows the pointer stored by the device]

15
Separate addressing
Unfortunately, the address of the data may change between address spaces
  int a = data[n].ptr->val;
[Diagram: the device-written pointer is not valid at the host's mapping of Data[n+1]]
16
Bounds on visibility
Controlling memory ordering is challenging
[Diagram: work-items 0–3 in work-group 0 and in work-group 1, sharing one memory]

17
Bounds on visibility
Within a group we can synchronize
Physically, this probably means within a single core
[Diagram: a work-item in work-group 0 writes to memory]

18
Bounds on visibility
Within a group we can synchronize
Barrier operations synchronize the active threads and their constituent work-items
[Diagram: work-group 0 executes a barrier]

19
Bounds on visibility
Within a group we can synchronize
[Diagram: another work-item in work-group 0 reads the written value]

20
Bounds on visibility
Between groups we can't
[Diagram: a work-item in work-group 1 attempts the read – "Read?"]

21
Bounds on visibility
What if we use fences?
[Diagram: a work-item in work-group 0 writes to memory]

22
Bounds on visibility
What if we use fences?
Ensure the write completes for a given work-item
[Diagram: the writing work-item executes a fence]

23
Bounds on visibility
What if we use fences?
Fence at the other end
[Diagram: a work-item in work-group 1 executes a fence]

24
Bounds on visibility
Within a group we can synchronize
Ensure the read is after the fence
[Diagram: the work-item in work-group 1 reads the value]

25
Meaning of fences
When did the fence happen?
  {{ data[n] = value; fence(…); flag = trigger;
  || if(flag) { fence(…); value = data[n]; }}
Probably: seeing a write that was written after a fence guarantees that the fence completed
− The spec is not very clear on this
There is no coherence guarantee
− Need the write ever complete?
− If it doesn't complete, who can see it, and who can know that the fence happened?
Can the flag be updated without a race?
− For that matter, what is a race?
Spliet et al.: KMA
− Weak ordering differences between platforms due to a poorly defined model
26
Basics of OpenCL 2.0
Introduction
Current restrictions
Basics of OpenCL 2.0
Heterogeneity in OpenCL 2.0
Heterogeneous memory ordering
Summary

27
Sharing virtual addresses
We can update memory with a pointer:
  data[n].ptr = &data[n+1];
  data[n+1].val = 3;
[Diagram: host and OpenCL device each see Data[n] and Data[n+1] as struct Foo { Foo *ptr; int val; }]

28
Sharing virtual addresses
We try to read it – but what is the value of a?
  int a = data[n].ptr->val;

29
Sharing virtual addresses
Now the address does not change!
  int a = data[n].ptr->val;
  assert(a == 3);

30
Sharing data – when does the value change? Coarse grained
Unmap on the host, event dependency on the device
[Diagram: the host writes the pointer and value, unmaps the buffer, and the device kernel, ordered by event e, reads int a = data[n].ptr->val]

31
Sharing data – when does the value change? Coarse grained
The granularity of a data race covers the whole buffer

32
Sharing data – when does the value change? Fine grained
Event dependency on the device – caches will flush as necessary
[Diagram: the host writes the pointer and value; only event e orders the device's read]

33
Sharing data – when does the value change? Fine grained
Data will be merged – data races are at byte granularity

34
Sharing data – when does the value change? Fine grained with atomic support
No dispatch-level ordering necessary
[Diagram: the host writes data[n].ptr = &data[n+1] and atomic-stores data[n+1].val = 3; the device atomic-loads data[n].ptr->val]

35
Sharing data – when does the value change? Fine grained with atomic support
Races at the byte level are avoided using the atomics in the memory model
36
Sharing virtual addresses
This is an ease-of-programming concern
− Complex apps with complex data structures can work with data more effectively
− Less work to package and repackage data
It can also improve performance
− Less overhead in updating pointers to convert them to offsets
− Less overhead in repacking data
− Lower overhead for data copies when appropriate hardware support is present

37
Sharing virtual addresses
SC-for-Data-Race-Free by default – release consistency for flexibility
Most memory operations are entirely unordered
− The compiler can reorder
− The caches can reorder
Unordered operations on the same location are races
− Behaviour is undefined
Ordering operations (atomics) may update the same location
− Ordering operations order other operations relative to the ordering operation
Ordering operations default to sequentially consistent semantics
  • 38. 38 Commonly tried on OpenCL 1.x − Not at all portable! − Due to: weak memory ordering, lack of forward progress Is the situation any better for OpenCL 2.0? − Yes, memory ordering is now under control! − Is forward progress? So what can we safely do with synchronization Take a spin wait…
  • 39. 39 Yes and no − Conceptually similar to CPU waiting on an event – ie all work-items to complete − Other app could occupy device, graphics could consume device, work may interfere with graphics − Risk of whole device being occupied elsewhere so work on the assumption that this context owns the device So what can we safely do with synchronization CPU thread waits on work-item – works? CPU threadOpenCL Work-item // Do work dependent on flag Store-release value to flag while(load-acquire flag != value) {} // Do work dependent on flag Happens-before
  • 40. 40 Probably − The spec doesn’t guarantee this − How do you know what core your work-item is on? − All you know is which work-group it is in. So what can we safely do with synchronization Work-item on one core waits for work-item on another core – works? Core 1 OpenCL Work-item Core 0 OpenCL Work-item // Do work dependent on flag Store-release value to flag while(load-acquire flag != value) {} // Do work dependent on flag Happens-before
  • 41. 41 Sometimes − On some architectures this is fine − On other architectures if both SIMD threads are on the same core: starvation − Thread 0 may never run to satisfy thread 1’s spin wait So what can we safely do with synchronization Work-item on one thread waits for work-item on another thread – works? {SIMD} thead 1 OpenCL Work-item {SIMD} thread 0 OpenCL Work-item // Do work dependent on flag Store-release value to flag while(load-acquire flag != value) {} // Do work dependent on flag Happens-before
  • 42. 42 No (well, sometimes yes, but rarely) − Fairly widely understood, but sometimes hard for new developers − If you think about the mapping to SIMD it is fairly obvious − A single program counter can’t be in two places at once – some architectures can track multiple So what can we safely do with synchronization Work-item waits for work-item in the same SIMD thread – works? {SIMD} thead 0 OpenCL Work-item 1 {SIMD} thread 0 OpenCL Work-item 0 // Do work dependent on flag Store-release value to flag while(load-acquire flag != value) {} // Do work dependent on flag Happens-before
  • 43. 43 Maybe, maybe not − It depends entirely on where the work-items are mapped in the group − Same thread – no − Different threads – maybe − The developer can often tell, but it isn’t portable and the compiler can easily break it So what can we safely do with synchronization Work-items in a work-group Work-group 0 OpenCL Work-item Work-group 0 OpenCL Work-item // Do work dependent on flag Store-release value to flag while(load-acquire flag != value) {} // Do work dependent on flag Happens-before
  • 44. 44 Maybe, maybe not − It depends entirely on where the work-groups are placed on the device − Two work-groups on the same core – you have the thread to thread case − Two work-groups on different cores – it probably works − No way to control the mapping! So what can we safely do with synchronization Work-groups Work-group 1 OpenCL Work-item Work-group 0 OpenCL Work-item // Do work dependent on flag Store-release value to flag while(load-acquire flag != value) {} // Do work dependent on flag Happens-before
  • 45. 45 So what can we safely do with synchronization? Overall, a fairly poor situation Realistically − Spin waits on other OpenCL work-items are just not portable − This leaves very limited use of the memory model So what can you do? − Communicating that work has passed a certain point − Updating shared data buffers with flags − Lock-free FIFO data structures to share data − OpenCL 2.0's sub-group extension provides limited but important forward-progress guarantees
  • 46. 46 Heterogeneity in OpenCL 2.0 Introduction Current restrictions Basics of OpenCL 2.0 Heterogeneity in OpenCL 2.0 Heterogeneous memory ordering Summary
  • 47. 47 Even acquire-release consistency can be expensive In particular, always synchronizing the whole system is expensive − A discrete GPU does not want to always make data visible across the PCIe interface − A single core shuffling data in local cache does not want to interfere with DRAM-consuming throughput tasks OpenCL 2 optimizes this using the concept of synchronization scopes Lowering implementation cost
  • 48. 48 Hierarchical memory synchronization Synchronization is expensive − Not all communication is global – so bound it [Diagram: OpenCL System → CPU / GPU / DSP devices → cores → work-groups (WG) → sub-groups (SG) → work-items (WI)]
  • 49. 49 Hierarchical memory synchronization Scopes! Sub-group scope [Diagram: the OpenCL System → devices → cores → WG → SG → WI hierarchy, sub-group level marked]
  • 50. 50 Hierarchical memory synchronization Scopes! Sub-group scope; Work-group scope [Same diagram, work-group level marked]
  • 51. 51 Hierarchical memory synchronization Scopes! Sub-group scope; Work-group scope; Device scope [Same diagram, device level marked]
  • 52. 52 Hierarchical memory synchronization Scopes! Sub-group scope; Work-group scope; Device scope; All-SVM-Devices scope [Same diagram, system level marked]
  • 53. 53 Hierarchical memory synchronization Release to the appropriate scope, acquire from the matching scope [Diagram: within one work-group, one work-item performs `store-release work-group-scope x` and another performs `load-acquire work-group-scope x`]
  • 54. 54 Hierarchical memory synchronization Release to the appropriate scope, acquire from the matching scope [Diagram: work-items in different work-groups synchronize via `store-release device-scope x` / `load-acquire device-scope x`]
  • 55. 55 Hierarchical memory synchronization If scopes do not reach far enough, this is a race [Diagram: `store-release work-group-scope x` in one work-group paired with `load-acquire work-group-scope x` in a different work-group – the scopes do not cover both work-items, so it is a race]
  • 56. 56 Scoped synchronization Allows aggressive hardware optimization of coherence traffic − GPU coherence in particular is often expensive – GPUs generate a lot of memory traffic The memory model defines synchronization rules in terms of scopes − Insufficient scope is a race − Non-matching scopes race (a restriction the OpenCL 2.0 model imposes even where it isn't strictly necessary)
  • 57. 57 Address space orderings Four address spaces in OpenCL 2.0 − Constant and private are not relevant for communication − Global and local maintain separate orders in the memory model Synchronization, acquire/release behavior, etc. apply only to local OR global, not both − The global release→acquire order below does not order the updates to a! [Diagram: {SIMD} thread 0 (work-item) does `local-store a = 2` then `global-store-release value to flag`; {SIMD} thread 1 (work-item) spins `while (global-load-acquire flag != value) {}` then `assert local-load a == 2` – no local-happens-before edge exists, so the assert can fail]
  • 58. 58 Multiple orderings in the presence of generic pointers Take an example like this: − void updateRelease(int *flag, int *data); If I lock the flag, do I safely update data or not? − The function takes pointers with no address space − The ordering depends on the address space − The address space depends on the type of the pointers at the call site, or even earlier! There are ways to get at the fence flags, but it is messy and care must be taken
  • 59. 59 Heterogeneous memory ordering Introduction Current restrictions Basics of OpenCL 2.0 Heterogeneity in OpenCL 2.0 Heterogeneous memory ordering Summary
  • 60. 60 Limits of sequential consistency Sequential consistency aims to be a simple, easy-to-understand model − Behave as if the ordering operations are simply interleaved In the OpenCL model, sequential consistency is guaranteed only in specific cases: − All memory_order_seq_cst operations have the scope memory_scope_all_svm_devices and all affected memory locations are contained in system allocations or fine-grain SVM buffers with atomics support − All memory_order_seq_cst operations have the scope memory_scope_device and all affected memory locations are not located in system allocated regions or fine-grain SVM buffers with atomics support Consider what this means… − You start modifying your app to have more fine-grained sharing of some structures − Suddenly your atomics are not sequentially consistent at all! − What about SC operations to local memory?
  • 61. 61 Is such a limit necessary? First, scopes… These are data-race-free memory models They only guarantee ordering in the absence of races − So we only actually order things we can observe! Order can be relaxed between atomics. [Diagram: a GPU device with two cores, each running work-groups (WG), sub-groups (SG), and work-items (WI)]
  • 62. 62 Is such a limit necessary? First, scopes… In one part of the execution, we have SC operations at device scope − Let's assume this is valid [Diagram: one work-group performing WG-scope SC operations, ordered SC]
  • 63. 63 Is such a limit necessary? First, scopes… Elsewhere we have another set of SC operations [Diagram: a second work-group performing its own WG-scope SC operations, ordered SC]
  • 64. 64 Is such a limit necessary? First, scopes… Any access from one work-group across to another is a race − It is equivalent to a non-atomic operation − Therefore it is invalid [Diagram: an access from one work-group's WG-scope SC operations into the other's – a race]
  • 65. 65 Making sequential consistency a partial order In Hower et al.'s work on Heterogeneous-Race-Free memory models this is made explicit Sequential consistency can be maintained with scopes Access to invalid scope − Is unordered − Is not observable − So this is still valid sequential consistency: everything observable, i.e. race-free, is SC
  • 66. 66 Extending to coarse-grained memory We can apply SC semantics here for the same reason − Actions on coarse-grained memory are not visible to other clients − However – coarse buffers don't fit cleanly in a hierarchical model In Gaster et al. (to appear in ACM TACO) we use the concept of observability as a memory model extension − “At any given point in time a given location will be available in a particular set of scope instances out to some maximum instance and by some set of actors. Only memory operations that are inclusive with that maximal scope instance will observe changes to those locations.” Observability can migrate using API actions − Map, unmap, and event dependencies
  • 67. 67 Observability Initial state [Diagram: OpenCL System with CPU, GPU, and DSP devices; a memory allocation whose observability bounds are device scope on the GPU]
  • 68. 68 Observability Map operation [Same diagram: observability moves to device scope on the CPU]
  • 69. 69 Observability Unmap [Same diagram: unmap will transfer dependence back to an OpenCL device]
  • 70. 70 We can cleanly put scopes and coarse memory in the same memory consistency model − It is more complicated than, say, C++ − It is practical to formalize, and hopefully easy to explain There is no need for quirky corner cases We merely rethink slightly − Instead of a total order S for all SC operations we have a total order Sa for each agent a of all SC operations observable by that agent such that all these orders are consistent with each other. − This is a relatively small step from the DRF idea, which is effectively that non-atomic operations are not observable, and thus do not need to be strictly ordered. The point
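The per-agent reading of S in the slide can be stated compactly. This is my notation for the idea, a sketch rather than a quotation from the TACO paper:

```latex
% Instead of one total order S over all SC operations, each agent a
% gets a total order S_a over the SC operations it can observe, and
% these per-agent orders must agree where they overlap.
\forall a \in \mathit{Agents}:\;
  S_a \ \text{is a total order on}\ \{\, op \in SC \mid \mathit{observable}(op, a) \,\}
\]
\[
\forall a, b \in \mathit{Agents}:\;
  S_a\ \text{and}\ S_b\ \text{agree on}\ \mathrm{dom}(S_a) \cap \mathrm{dom}(S_b)
```

The DRF analogy the slide draws is then direct: non-atomic operations are simply outside every agent's observable set, so they never need a place in any S_a.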
  • 71. 71 Separate orders for local and global memory are likely to be painful for programmers Do we need them? − We have formalized the concept (also in the TACO paper) using multiple happens-before orders that subset memory operations − We also describe bridging-synchronization-order as a formal way to allow join points − It is messy as a formalization. The harm to the developer is probably significant. Joining address spaces
  • 72. 72 Joining address spaces However! − Hardware really does have different instructions and different timing to access different memories − Can all hardware efficiently synchronize these memory interfaces? [Diagram: a processing core with separate paths to local memory, a texture cache, and the L1 cache → L2 cache → DRAM hierarchy]
  • 73. 73 Joining address spaces [Same diagram, annotated: entirely different interfaces from the processing core]
  • 74. 74 Joining address spaces − We will have to see how this plays out − Consider this a warning: take care if you try to use this aspect of OpenCL 2.0
  • 75. 75 Summary Introduction Current restrictions Basics of OpenCL 2.0 Heterogeneity in OpenCL 2.0 Heterogeneous memory ordering Summary
  • 76. 76 Things are improving Like mainstream languages, heterogeneous programming models are adopting firm memory models. Without fundamental execution-model guarantees their usefulness is limited. We are making progress on both counts.
  • 77. 77 Thank you Follow us on: