Memory ordering
2014
issue.hsu@gmail.com
SYNCHRONIZATION
2
Background
• Synchronization of multithreaded programs
– Mutex (mutual exclusion)
• Ensuring that no two processes or threads are in their critical section
at the same time
– Here, a critical section refers to a period of time when the process
accesses a shared resource, such as shared memory
3
Background
– Semaphore
• A mutex is essentially the same thing as a binary semaphore, and
sometimes uses the same basic implementation
• However, the term "mutex" is used to describe a construct which
prevents two processes from accessing a shared resource
concurrently
• The term "binary semaphore" is used to describe a construct which
limits access to a single resource
• In many cases a mutex has a concept of an “owner”
– the process which locked the mutex is the only process allowed to
unlock it. In contrast, semaphores generally do not have this
restriction
– Semaphore vs. mutex
• http://www.kernel.org/doc/Documentation/mutex-design.txt
4
Synchronization
and mutex
Common synchronization methods
5
Reference:
http://msdn.microsoft.com/en-us/library/ms810047.aspx
Windows mutex mechanisms
Type of mutex | IRQL considerations | Recursion and thread details
Interrupt spin lock | Acquisition raises IRQL to DIRQ and returns previous IRQL to caller. | Not recursive. Release on same thread as acquire.
Spin lock | Acquisition raises IRQL to DISPATCH_LEVEL and returns previous IRQL to caller. | Not recursive. Release on same thread as acquire.
Queued spin lock | Acquisition raises IRQL to DISPATCH_LEVEL and stores previous IRQL in lock owner handle. | Not recursive. Release on same thread as acquire.
Fast mutex | Acquisition raises IRQL to APC_LEVEL and stores previous IRQL in lock. | Not recursive. Release on same thread as acquire.
Kernel mutex (a kernel dispatcher object) | Enters critical region upon acquisition and leaves critical region upon release. | Recursive. Release on same thread as acquire.
Synchronization event (a kernel dispatcher object) | Acquisition does not change IRQL. Wait at IRQL <= APC_LEVEL and signal at IRQL <= DISPATCH_LEVEL. | Not recursive. Release on the same thread or on a different thread.
Unsafe fast mutex | Acquisition does not change IRQL. Acquire and release at IRQL <= APC_LEVEL. | Not recursive. Release on same thread as acquire.
Synchronization method | Description | Windows mechanisms
Interlocked operations | Provides atomic logical, arithmetic, and list manipulation operations that are both thread-safe and multiprocessor-safe. | InterlockedXxx and ExInterlockedXxx routines
Mutexes | Provides (mutually) exclusive access to memory. | Spin locks, fast mutexes, kernel mutexes, synchronization events
Shared/exclusive lock | Allows one thread to write or many threads to read the protected data. | Executive resources
Counted semaphore | Allows a fixed number of acquisitions. | Semaphores
What is wrong with Mutexes?
• Mutexes are perfectly fine, but you have a problem if there is lock
contention
– If you want your algorithm to be fast, you want to use the
available cores as much as possible instead of letting them sleep
– A thread can hold a mutex and be de-scheduled (because of a cache miss or
because its time slice is over); then all the threads that want to acquire
this mutex will be blocked
– And if you have a lot of blocking, the OS also needs to do more
context switches which are expensive because they clear the
caches
6
Reference:
http://woboq.com/blog/introduction-to-lockfree-programming.html
What is wrong with Mutexes?
• Problems with locking
– Deadlock
– Priority Inversion
• Low-priority processes hold a lock required by a higher priority process
– Convoying
• All the other processes slow to the speed of the slowest one
– Async-signal-safety
• Signal handlers can’t use lock-based primitives
– Kill-tolerant availability
• What happens if threads are killed/crash while holding locks
– Pre-emption tolerance
• What happens if you’re pre-empted holding a lock
– Overall performance
7
Reference:
http://www.cs.cmu.edu/~410-s05/lectures/L31_LockFree.pdf
So how can we do it without locking?
• Lock-free Programming
– Thread-safe access to shared data without the use of
synchronization primitives such as mutexes
– Practical with hardware support
• Modern CPUs have something called atomic operations
• The use of shared memory and an atomic instruction provides the
mutual exclusion
8
Atomic operation
• Atomic operation
– Processors have instructions that can be used to implement lock-free and wait-free algorithms
• Atomic read-write
• Atomic swap, also called XCHG
• Test-and-set
• Fetch-and-add
• Compare-and-swap (CAS)
– Compare and Exchange (CMPXCHG) instruction in the x86 and
Itanium architectures
– ABA problem
» http://woboq.com/blog/introduction-to-lockfree-programming.html
9
Reference:
http://en.wikipedia.org/wiki/Atomic_operation
http://en.wikipedia.org/wiki/Read-modify-write
Atomic operation
• Load-Link/Store-Conditional
– The LDREX and STREX instructions in ARM split the operation of
atomically updating memory into two separate steps. Together, they provide
atomic updates in conjunction with exclusive monitors that track exclusive
memory accesses. Load-Exclusive and Store-Exclusive must only access
memory regions marked as Normal
– For example
» LDREX R1, [R0] performs a Load-Exclusive from the address in R0, places the value into
R1 and updates the exclusive monitor(s).
» STREX R2, R1, [R0] performs a Store-Exclusive operation to the address in R0,
conditionally storing the value from R1 and indicating success or failure in R2.
10
Reference:
http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/ch01s02s01.html
http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/CJAGCFAF.html
Exclusive accesses to memory locations
marked as Non-shareable are checked
only against this local monitor. Exclusive
accesses to memory locations marked as
Shareable are checked against both the
local monitor and the global monitor.
Atomic operation
• GCC Built-in functions for atomic memory access
– http://gcc.gnu.org/onlinedocs/gcc-4.6.3/gcc/Atomic-Builtins.html
• Atomic operations supported in Linux Kernel
– https://www.kernel.org/doc/Documentation/atomic_ops.txt
• Atomic operations supported in C11/C++11
– C11 defines a new _Atomic() type specifier. You can declare an
atomic integer like this:
_Atomic(int) counter;
– C++11 moves this declaration into the standard library:
#include <atomic>
std::atomic<int> counter;
11
Reference:
http://www.informit.com/articles/article.aspx?p=1832575
Atomic operation
• Are atomic operations enough?
• Linux-v3.7.8/arch/arm/include/asm/atomic.h
12
static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
{
	unsigned long oldval, res;

	smp_mb();

	do {
		__asm__ __volatile__("@ atomic_cmpxchg\n"
		"ldrex	%1, [%3]\n"
		"mov	%0, #0\n"
		"teq	%1, %4\n"
		"strexeq %0, %5, [%3]\n"
		    : "=&r" (res), "=&r" (oldval), "+Qo" (ptr->counter)
		    : "r" (&ptr->counter), "Ir" (old), "r" (new)
		    : "cc");
	} while (res);

	smp_mb();

	return oldval;
}
Reference:
http://lxr.linux.no/#linux+v3.7.8/arch/arm/include/asm/atomic.h#L115
Before talking about memory barriers,
let's look at memory ordering first.
Memory barrier
MEMORY ORDERING
CONCEPT
13
Memory ordering
• Memory ordering - memory access ordering
– Program order
• the order of the program’s object code as seen by the CPU, which might differ from
the order in the source code due to compiler optimizations
– Execution order
• It can differ from program order due to both compiler and CPU implementation
optimizations
– Perceived order
• It can differ from the execution order due to caching, interconnect, and memory-system optimizations
• Why memory reordering
– Performance!
14
Reference:
http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
http://preshing.com/20120930/weak-vs-strong-memory-models
Memory consistency models
• Memory models – memory consistency models
• Sequential consistency
– all reads and all writes are in-order
• Relaxed consistency
– Some types of reordering are allowed
• Loads can be reordered after loads (for better working of cache coherency,
better scaling)
• Loads can be reordered after stores
• Stores can be reordered after stores
• Stores can be reordered after loads
• Weak consistency
– Reads and Writes are arbitrarily reordered, limited only by explicit
memory barriers
15
Weak VS. Strong memory model
16
Reference:
http://preshing.com/20120930/weak-vs-strong-memory-models
Memory ordering in some architectures
17
SPARC TSO = total store order (default)
SPARC RMO = relaxed memory order (not supported on recent CPUs)
SPARC PSO = partial store order (not supported on recent CPUs)
Type Alpha ARMv7 PA-RISC POWER
SPARC
RMO
SPARC
PSO
SPARC
TSO
x86
x86
oostore
AMD64 IA-64 zSeries
Loads reordered after loads Y Y Y Y Y Y Y
Loads reordered after stores Y Y Y Y Y Y Y
Stores reordered after stores Y Y Y Y Y Y Y Y
Stores reordered after loads Y Y Y Y Y Y Y Y Y Y Y Y
Atomic reordered with loads Y Y Y Y Y
Atomic reordered with stores Y Y Y Y Y Y
Dependent loads reordered Y
Incoherent Instruction cache pipeline Y Y Y Y Y Y Y Y Y Y
Reference:
http://en.wikipedia.org/wiki/Memory_ordering
Types of Memory Barrier
• #LoadLoad
• #StoreStore
• #LoadStore
• #StoreLoad
– A StoreLoad barrier ensures that all stores performed before the barrier are visible to other
processors, and that all loads performed after the barrier receive the latest value that is visible at the
time of the barrier
18
Reference:
http://preshing.com/20120710/memory-barriers-are-like-source-control-operations
// Thread 1: publish
Value = x; // Publish some data
STORESTORE_FENCE(); // Prevent reordering of stores
IsPublished = 1; // Set shared flag to indicate availability of data

// Thread 2: consume
if (IsPublished) // Load and check shared flag
{
  LOADLOAD_FENCE(); // Prevent reordering of loads
  return Value; // Load published value
}
Memory barrier in compiler
• GCC compiler memory barrier
– These barriers prevent a compiler from reordering instructions;
they do not prevent reordering by the CPU.
• GCC support for hardware memory barriers
– This builtin issues a full memory barrier.
19
Reference:
http://en.wikipedia.org/wiki/Memory_ordering
http://gcc.gnu.org/onlinedocs/gcc-4.6.3/gcc/Atomic-Builtins.html
asm volatile("" ::: "memory");
or
__asm__ __volatile__ ("" ::: "memory");
__sync_synchronize();
Memory barriers in Linux kernel
• General barrier
– barrier()
• Compiler barrier only. The compiler will not reorder memory accesses from one side of this
statement to the other. This has no effect on the order that the processor actually executes
the generated instructions.
• Mandatory barriers
– mb()
• A full system memory barrier. All memory operations before the mb() in the instruction
stream will be committed before any operations after the mb() are committed. This ordering
will be visible to all bus masters in the system. It will also ensure the order in which
accesses from a single processor reach slave devices.
– rmb()
• Like mb(), but only guarantees ordering between read accesses. That is, all read
operations before an rmb() will be committed before any read operations after the rmb().
– wmb()
• Like mb(), but only guarantees ordering between write accesses. That is, all write
operations before a wmb() will be committed before any write operations after the wmb().
20
Reference:
http://blogs.arm.com/software-enablement/448-memory-access-ordering-part-2-barriers-and-the-linux-kernel/
http://www.kernel.org/doc/Documentation/memory-barriers.txt
Memory barriers in Linux kernel
• SMP conditional barriers
– smp_mb()
• Similar to mb(), but only guarantees ordering between cores/processors within an
SMP system. All memory accesses before the smp_mb() will be visible to all cores
within the SMP system before any accesses after the smp_mb().
– smp_rmb()
• Like smp_mb(), but only guarantees ordering between read accesses.
– smp_wmb()
• Like smp_mb(), but only guarantees ordering between write accesses.
– SMP barriers are a subset of mandatory barriers, not a superset.
• An SMP barrier cannot replace a mandatory barrier, but a mandatory barrier can
replace an SMP barrier.
• Implicit barriers
– Locking constructs in the kernel act as implicit SMP barriers, in the same way
as pthread synchronization operations do in user space.
– I/O accessor macros (readb(), iowrite32()) for the ARM architecture act as
explicit memory barriers when the kernel is compiled with
CONFIG_ARM_DMA_MEM_BUFFERABLE. This was added in linux-2.6.35.
• arch/arm/include/asm/io.h
• arch/arm/mm/Kconfig
21
Reference:
https://www.kernel.org/doc/Documentation/memory-barriers.txt
MEMORY ORDERING IN ARM
22
Memory ordering in ARM Architecture
• Memory types
– Normal memory
• Normal memory is effectively for all of your data and executable code
• This memory type permits speculative reads, merging of accesses and repeating of
reads without side effects
• Accesses to Normal memory can always be buffered, and in most situations they
are also cached - but they can be configured to be uncached
• There is no implicit ordering of Normal memory accesses
– Device memory and Strongly-ordered memory
• Used with memory mapped peripherals or other control registers
• Processors implementing the LPAE treat Device and Strongly-ordered memory
regions identically
• ARMv7-A processors that do not implement the LPAE can set device memory to be
Shareable or Non-shareable
• Accesses to these types of memory must happen exactly the number of times that
executing the program suggests they should
• There is no guarantee about ordering between memory accesses to different
devices, or usually between accesses of different memory types
23
Reference:
http://blogs.arm.com/software-enablement/594-memory-access-ordering-part-3-memory-access-ordering-in-the-arm-architecture/
Memory ordering in ARM Architecture
• Arrangement of ARM memory types
– Normal
• Shareable or Non-shareable
• Cacheable or Non-cacheable
– Device (w/o LPAE)
• Shareable or Non-shareable
– Device (w LPAE)
• Always shareable
– Strongly-ordered
• Always shareable
• Accesses must wait for the slave's acknowledgement
24
ARM ® Architecture Reference
Manual
ARMv7-A and ARMv7-R edition
Memory ordering in ARM Architecture
• Figure A3-5 shows the memory ordering between two explicit accesses A1 and A2,
where A1 occurs before A2 in program order
– "<" : accesses must arrive at any particular memory-mapped peripheral or block of
memory in program order, that is, A1 must arrive before A2. There are no ordering
restrictions on when accesses arrive at different peripherals or blocks of memory.
– "–" : accesses can arrive at any memory-mapped peripheral or block of memory in
any order.
25
Memory ordering in ARM Architecture
• Barriers
– Barriers were introduced progressively into the ARM architecture
• Some ARMv5 processors, such as the ARM926EJ-S, implemented a Drain Write
Buffer cp15 operation, which halted execution until any buffered writes had drained
into the external memory system
• With the introduction of the ARMv6 memory model, this operation was redefined in
more architectural terms and became the Data Synchronization Barrier
– ARMv6 also introduced the new Data Memory Barrier and Flush Prefetch Buffer
cp15 operations
• ARMv7 evolved the memory model somewhat, extending the meaning of the
barriers - and the Flush Prefetch Buffer operation was renamed the Instruction
Synchronization Barrier
• ARMv7 also allocated dedicated instruction encodings for the barrier operations
– Use of the cp15 operations is now deprecated and software targeting ARMv7 or
later should use the DMB, DSB and ISB mnemonics.
• And finally, ARMv7 extended the Shareability concept to cover both Inner-shareable
and Outer-shareable domains
– This together with AMBA4 ACE gives us barriers that propagate into the memory
system
26
Memory ordering in ARM Architecture
– Instruction Synchronization Barrier (ISB)
• The ISB ensures that any subsequent instructions are fetched anew
from cache, so that privilege and access permissions are checked
against the current MMU configuration
– It is used to ensure any previously executed context-changing
operations will have completed by the time the ISB completes
• Access type and domain are not really relevant for this barrier
– It is not used in any of the Linux memory barrier primitives, but
appears in memory management, cache control and context
switching code
27
Memory ordering in ARM Architecture
– Data Memory Barrier (DMB)
• DMB prevents reordering of data accesses instructions across itself
– All data accesses by this processor/core before the DMB will be
visible to all other masters within the specified shareability domain
before any of the data accesses after it
– It also ensures that any explicit preceding data/unified cache
maintenance operations have completed before any subsequent
data accesses are executed
– The DMB instruction takes two optional parameters: an operation
type (stores only - 'ST' - or loads and stores) and a domain
– The default operation type is loads and stores and the default
domain is System
• In the Linux kernel, the DMB instruction is used for the smp_*mb()
macros
28
Memory ordering in ARM Architecture
– Data Synchronization Barrier (DSB)
• DSB enforces the same ordering as the Data Memory Barrier
– But it also blocks execution of any further instructions until
synchronization is complete
– It also waits until all cache and branch predictor maintenance
operations have completed for the specified shareability domain
– If the access type is load and store then it also waits for any TLB
maintenance operations to complete.
• In the Linux kernel, the DSB instruction is used for the *mb() macros.
29
Domain | Abbreviation | Description
Non-shareable | NSH | A domain consisting only of the local agent. Accesses that never need to be synchronized with other cores, processors or devices. Not normally used in SMP systems.
Inner Shareable | ISH | A domain potentially shared by multiple agents, but usually not all agents in the system. A system can have multiple Inner Shareable domains. An operation that affects one Inner Shareable domain does not affect other Inner Shareable domains in the system.
Outer Shareable | OSH | A domain almost certainly shared by multiple agents, and quite likely consisting of several Inner Shareable domains. An operation that affects an Outer Shareable domain also implicitly affects all Inner Shareable domains within it. For processors such as the Cortex-A15 MPCore that implement the LPAE, all Device memory accesses are considered Outer Shareable. For other processors, the shareability attribute can be explicitly set (to Shareable or Non-shareable).
Full system | SY | An operation on the full system affects all agents in the system: all Non-shareable regions, all Inner Shareable regions and all Outer Shareable regions. Simple peripherals such as UARTs, and several more complex ones, do not normally need to be placed in a restricted shareability domain.
Memory ordering in ARM Architecture
• Shareability domains
– Shareability domains define "zones" within the bus topology within which memory
accesses are to be kept consistent (taking place in a predictable way) and
potentially coherent (with hardware support)
– Outside of this domain, observers might not see the same order of memory
accesses as inside it
30
Reference:
http://infocenter.arm.com/help/topic/com.arm.doc.dui0489c/CIHGHHIE.html
ARMv7
Memory ordering in ARM Architecture
31
ARMv8: allocated values for the data barriers (DMB/DSB)
Memory ordering in ARM Architecture
• The shareability domains example
32
4 cores per cluster,
2 clusters per chip
ACQUIRE AND RELEASE
33
Memory model supported in C++11
• C++ Memory model
– Sequential consistent/acquire-release/relaxed
• http://en.cppreference.com/w/cpp/atomic/memory_order
• http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
34
Acquire and Release Semantics
• ARMv8 AArch64/AArch32 supports Load-Acquire/Store-Release
instructions
– The Load-Acquire/Store-Release instructions can remove the requirement to use
the explicit DMB memory barrier instruction
35
Reference:
http://preshing.com/20120913/acquire-and-release-semantics
http://www.arm.com/files/downloads/ARMv8_Architecture.pdf
Acquire semantics is a property which can only apply to
operations which read from shared memory. The operation is
then considered a read-acquire. Acquire semantics prevent
memory reordering of the read-acquire with any read or write
operation which follows it in program order.
Release semantics is a property which can only apply to
operations which write to shared memory. The operation is then
considered a write-release. Release semantics prevent memory
reordering of the write-release with any read or write operation
which precedes it in program order.
Acquire and Release Semantics
• A demo example
36
//Shared global variables
int A = 0;
int Ready = 0;
//Thread 1
A = 42;
Ready = 1;
//Thread 2
int r1 = Ready;
int r2 = A;
//Possible results of r1, r2
r1 = 0, r2 = 0
r1 = 0, r2 = 42
r1 = 1, r2 = 0
r1 = 1, r2 = 42
//Shared global variables
int A = 0;
Atomic<int> Ready = 0;
//Thread 1
A = 42;
Ready.store(1,
memory_order_release);
//Thread 2
int r1 =
Ready.load(memory_ord
er_acquire);
int r2 = A;
//Possible results of r1, r2
r1 = 0, r2 = 0
r1 = 0, r2 = 42
r1 = 1, r2 = 42
Acquire and Release Semantics
• A Write-Release Can Synchronize-With a Read-Acquire
37
// Shared declarations (from the referenced article)
struct Message { clock_t tick; const char* str; void* param; };
Message g_payload;
std::atomic<int> g_guard(0);

// Thread 1
void SendTestMessage(void* param)
{
    // Copy to shared memory using non-atomic stores.
    g_payload.tick = clock();
    g_payload.str = "TestMessage";
    g_payload.param = param;
    // Perform an atomic write-release to indicate that the message is ready.
    g_guard.store(1, std::memory_order_release);
}

// Thread 2
bool TryReceiveMessage(Message& result)
{
    // Perform an atomic read-acquire to check whether the message is ready.
    int ready = g_guard.load(std::memory_order_acquire);
    if (ready != 0)
    {
        // Yes. Copy from shared memory using non-atomic loads.
        result.tick = g_payload.tick;
        result.str = g_payload.str;
        result.param = g_payload.param;
        return true;
    }
    // No.
    return false;
}
Reference:
http://preshing.com/20130823/the-synchronizes-with-relation/
VOLATILE
38
Volatile vs. memory-order/atomic
• What does the volatile keyword mean?
39
Reference:
http://www.drdobbs.com/parallel/volatile-vs-volatile/212701484
Volatile vs. memory-order/atomic
• C programmers have often taken volatile to mean that the variable could
be changed outside of the current thread of execution
– as a result, they are sometimes tempted to use it in kernel code
when shared data structures are being used
– In other words, they have been known to treat volatile types as a sort
of easy atomic variable, which they are not
– The use of volatile in kernel code is almost never correct
• The key point to understand with regard to volatile is that its purpose is to
suppress optimization, which is almost never what one really wants to do
• In the kernel, one must protect shared data structures against unwanted
concurrent access, which is very much a different task
• Like volatile, the kernel primitives which make concurrent access to data
safe (spinlocks, mutexes, memory barriers, etc.) are designed to prevent
unwanted optimization. If they are being used properly, there will be no
need to use volatile as well
40
Reference:
https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
Volatile vs. memory-order/atomic
• To safely write lock-free code that communicates between threads without using
locks
– prefer to use ordered atomic variables
– Java/.NET volatile, C++0x atomic<T>, and C-compatible atomic_T
• To safely communicate with special hardware or other memory that has unusual
semantics
– use un-optimizable variables: ISO C/C++ volatile
– Remember that reads and writes of these variables are not necessarily
atomic
• To protect shared data structures against unwanted concurrent access in kernel
code
– use kernel concurrent access primitives, like spinlocks, mutexes, memory
barriers
• Finally, to express a variable that both has unusual semantics and has any or all
of the atomicity and/or ordering guarantees needed for lock-free coding
– only the ISO C++11 Standard provides a direct way to spell it: volatile
atomic<T>
41
USAGE OF MEMORY BARRIER
42
Usage of memory barrier instructions
• In what situations might I need to insert memory barrier instructions?
– Mutexes
43
Reference:
http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf
http://infocenter.arm.com/help/topic/com.arm.doc.faqs/ka14041.html
LOCKED EQU 1
UNLOCKED EQU 0
lock_mutex
; Is mutex locked?
LDREX r1, [r0] ; Check if locked
CMP r1, #LOCKED ; Compare with "locked"
WFEEQ ; Mutex is locked, go into standby
BEQ lock_mutex ; On waking re-check the mutex
; Attempt to lock mutex
MOV r1, #LOCKED
STREX r2, r1, [r0] ; Attempt to lock mutex
CMP r2, #0x0 ; Check whether store completed
BNE lock_mutex ; If store failed, try again
DMB ; Required before accessing protected resource
BX lr
unlock_mutex
DMB ; Ensure accesses to protected resource have completed
MOV r1, #UNLOCKED ; Write "unlocked" into lock field
STR r1, [r0]
DSB ; Ensure update of the mutex occurs before other CPUs wake
SEV ; Send event to other CPUs, wakes any CPU waiting on using WFE
BX lr
Usage of memory barrier instructions
– Memory Remapping
• Consider a situation where your reset handler/boot code lives in Flash memory (ROM),
which is aliased to address 0x0 to ensure that your program boots correctly from the vector
table, which normally resides at the bottom of memory (see left-hand-side memory map).
• After you have initialized your system, you may wish to turn off the Flash memory alias so
that you can use the bottom portion of memory for RAM (see right-hand-side memory
map). The following code (running from the permanent Flash memory region) disables the
Flash alias, before calling a memory block copying routine (e.g., memcpy) to copy some
data into the bottom portion of memory (RAM).
44
MOV r0, #0
MOV r1, #REMAP_REG
STR r0, [r1] ; Disable Flash alias
DMB ; Ensure above str completion with DMB
BL block_copy_routine() ; Block copy code into RAM
DSB ; Ensure block copy is completed with DSB
ISB ; Ensure pipeline flush with ISB
BL copied_routine() ; Execute copied routine (now in RAM)
Usage of memory barrier instructions
– Self-modifying code
– If the memory region targeted by the block copy routine is marked as 'cacheable', the
instruction cache must be invalidated so that the processor does not execute any stale
'cached' code.
– For "write-back" regions, the data cache must be cleaned before the instruction cache
is invalidated.
45
Overlay_manager
; ...
BL block_copy ; Copy new routine from ROM to RAM
B relocated_code ; Branch to new routine
Overlay_manager
; ...
BL block_copy ; Copy new routine from ROM to RAM
DSB ; Ensure block copy has completed
data_cache_clean ; Clean the cache so that the new routine is written out to memory
DSB ; Ensure data cache clean has completed
icache_and_pb_invalidate ; Invalidate the instruction cache and branch predictor so that the
; old routine is no longer cached
DSB ; Ensure invalidate has completed
ISB ; Flush pipeline to ensure processor fetches new instructions
B relocated_code ; Branch to new routine
FPGA/Reconfigurable computing (HPRC)
rinnocente
 
ARM AAE - Architecture
ARM AAE - ArchitectureARM AAE - Architecture
ARM AAE - Architecture
Anh Dung NGUYEN
 
ARM AAE - Intrustion Sets
ARM AAE - Intrustion SetsARM AAE - Intrustion Sets
ARM AAE - Intrustion Sets
Anh Dung NGUYEN
 
Review Multicore processing based on ARM architecture
Review Multicore processing based on ARM architectureReview Multicore processing based on ARM architecture
Review Multicore processing based on ARM architecture
Mohammad Reza Khalifeh Mahmoodi
 
Zynq mp勉強会資料
Zynq mp勉強会資料Zynq mp勉強会資料
Zynq mp勉強会資料
一路 川染
 
Hardware accelerated Virtualization in the ARM Cortex™ Processors
Hardware accelerated Virtualization in the ARM Cortex™ ProcessorsHardware accelerated Virtualization in the ARM Cortex™ Processors
Hardware accelerated Virtualization in the ARM Cortex™ Processors
The Linux Foundation
 
Linux on ARM 64-bit Architecture
Linux on ARM 64-bit ArchitectureLinux on ARM 64-bit Architecture
Linux on ARM 64-bit Architecture
Ryo Jin
 
Q4.11: ARM Architecture
Q4.11: ARM ArchitectureQ4.11: ARM Architecture
Q4.11: ARM Architecture
Linaro
 
Docker and Go: why did we decide to write Docker in Go?
Docker and Go: why did we decide to write Docker in Go?Docker and Go: why did we decide to write Docker in Go?
Docker and Go: why did we decide to write Docker in Go?
Jérôme Petazzoni
 
Interpreter, Compiler, JIT from scratch
Interpreter, Compiler, JIT from scratchInterpreter, Compiler, JIT from scratch
Interpreter, Compiler, JIT from scratch
National Cheng Kung University
 
cache memory
cache memorycache memory

Viewers also liked (20)

ARM AAE - Memory Systems
ARM AAE - Memory SystemsARM AAE - Memory Systems
ARM AAE - Memory Systems
 
Intel vmcs-shadowing-paper
Intel vmcs-shadowing-paperIntel vmcs-shadowing-paper
Intel vmcs-shadowing-paper
 
arm-cortex-a8
arm-cortex-a8arm-cortex-a8
arm-cortex-a8
 
Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overview
 
Lock free programming - pro tips devoxx uk
Lock free programming - pro tips devoxx ukLock free programming - pro tips devoxx uk
Lock free programming - pro tips devoxx uk
 
ARM AAE - Developing Code for ARM
ARM AAE - Developing Code for ARMARM AAE - Developing Code for ARM
ARM AAE - Developing Code for ARM
 
Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overview
 
ARM AAE - System Issues
ARM AAE - System IssuesARM AAE - System Issues
ARM AAE - System Issues
 
FPGA/Reconfigurable computing (HPRC)
FPGA/Reconfigurable computing (HPRC)FPGA/Reconfigurable computing (HPRC)
FPGA/Reconfigurable computing (HPRC)
 
ARM AAE - Architecture
ARM AAE - ArchitectureARM AAE - Architecture
ARM AAE - Architecture
 
ARM AAE - Intrustion Sets
ARM AAE - Intrustion SetsARM AAE - Intrustion Sets
ARM AAE - Intrustion Sets
 
Review Multicore processing based on ARM architecture
Review Multicore processing based on ARM architectureReview Multicore processing based on ARM architecture
Review Multicore processing based on ARM architecture
 
Zynq mp勉強会資料
Zynq mp勉強会資料Zynq mp勉強会資料
Zynq mp勉強会資料
 
Hardware accelerated Virtualization in the ARM Cortex™ Processors
Hardware accelerated Virtualization in the ARM Cortex™ ProcessorsHardware accelerated Virtualization in the ARM Cortex™ Processors
Hardware accelerated Virtualization in the ARM Cortex™ Processors
 
Linux on ARM 64-bit Architecture
Linux on ARM 64-bit ArchitectureLinux on ARM 64-bit Architecture
Linux on ARM 64-bit Architecture
 
Q4.11: ARM Architecture
Q4.11: ARM ArchitectureQ4.11: ARM Architecture
Q4.11: ARM Architecture
 
Docker and Go: why did we decide to write Docker in Go?
Docker and Go: why did we decide to write Docker in Go?Docker and Go: why did we decide to write Docker in Go?
Docker and Go: why did we decide to write Docker in Go?
 
Interpreter, Compiler, JIT from scratch
Interpreter, Compiler, JIT from scratchInterpreter, Compiler, JIT from scratch
Interpreter, Compiler, JIT from scratch
 
cache memory
cache memorycache memory
cache memory
 
Memory hierarchy
Memory hierarchyMemory hierarchy
Memory hierarchy
 

Similar to Memory model

Pune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCD
Prashant Rane
 
General Purpose GPU Computing
General Purpose GPU ComputingGeneral Purpose GPU Computing
General Purpose GPU Computing
GlobalLogic Ukraine
 
Linux kernel development_ch9-10_20120410
Linux kernel development_ch9-10_20120410Linux kernel development_ch9-10_20120410
Linux kernel development_ch9-10_20120410huangachou
 
Linux kernel development chapter 10
Linux kernel development chapter 10Linux kernel development chapter 10
Linux kernel development chapter 10huangachou
 
jvm/java - towards lock-free concurrency
jvm/java - towards lock-free concurrencyjvm/java - towards lock-free concurrency
jvm/java - towards lock-free concurrency
Arvind Kalyan
 
Synchronization linux
Synchronization linuxSynchronization linux
Synchronization linuxSusant Sahani
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_party
Open Party
 
PFQ@ PAM12
PFQ@ PAM12PFQ@ PAM12
PFQ@ PAM12
Nicola Bonelli
 
[若渴計畫] Studying Concurrency
[若渴計畫] Studying Concurrency[若渴計畫] Studying Concurrency
[若渴計畫] Studying Concurrency
Aj MaChInE
 
Freckle
FreckleFreckle
Parallel Computing - Lec 3
Parallel Computing - Lec 3Parallel Computing - Lec 3
Parallel Computing - Lec 3
Shah Zaib
 
Os
OsOs
Profiler Guided Java Performance Tuning
Profiler Guided Java Performance TuningProfiler Guided Java Performance Tuning
Profiler Guided Java Performance Tuning
osa_ora
 
Concurrent/ parallel programming
Concurrent/ parallel programmingConcurrent/ parallel programming
Concurrent/ parallel programming
Tausun Akhtary
 
Bglrsession4
Bglrsession4Bglrsession4
Linux Device Driver parallelism using SMP and Kernel Pre-emption
Linux Device Driver parallelism using SMP and Kernel Pre-emptionLinux Device Driver parallelism using SMP and Kernel Pre-emption
Linux Device Driver parallelism using SMP and Kernel Pre-emption
Hemanth Venkatesh
 
Lec 9-os-review
Lec 9-os-reviewLec 9-os-review
Lec 9-os-review
Mothi R
 
Icg hpc-user
Icg hpc-userIcg hpc-user
Icg hpc-usergdburton
 

Similar to Memory model (20)

Pune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCD
 
General Purpose GPU Computing
General Purpose GPU ComputingGeneral Purpose GPU Computing
General Purpose GPU Computing
 
Linux kernel development_ch9-10_20120410
Linux kernel development_ch9-10_20120410Linux kernel development_ch9-10_20120410
Linux kernel development_ch9-10_20120410
 
Linux kernel development chapter 10
Linux kernel development chapter 10Linux kernel development chapter 10
Linux kernel development chapter 10
 
jvm/java - towards lock-free concurrency
jvm/java - towards lock-free concurrencyjvm/java - towards lock-free concurrency
jvm/java - towards lock-free concurrency
 
Synchronization linux
Synchronization linuxSynchronization linux
Synchronization linux
 
Thread
ThreadThread
Thread
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_party
 
PFQ@ PAM12
PFQ@ PAM12PFQ@ PAM12
PFQ@ PAM12
 
gcdtmp
gcdtmpgcdtmp
gcdtmp
 
[若渴計畫] Studying Concurrency
[若渴計畫] Studying Concurrency[若渴計畫] Studying Concurrency
[若渴計畫] Studying Concurrency
 
Freckle
FreckleFreckle
Freckle
 
Parallel Computing - Lec 3
Parallel Computing - Lec 3Parallel Computing - Lec 3
Parallel Computing - Lec 3
 
Os
OsOs
Os
 
Profiler Guided Java Performance Tuning
Profiler Guided Java Performance TuningProfiler Guided Java Performance Tuning
Profiler Guided Java Performance Tuning
 
Concurrent/ parallel programming
Concurrent/ parallel programmingConcurrent/ parallel programming
Concurrent/ parallel programming
 
Bglrsession4
Bglrsession4Bglrsession4
Bglrsession4
 
Linux Device Driver parallelism using SMP and Kernel Pre-emption
Linux Device Driver parallelism using SMP and Kernel Pre-emptionLinux Device Driver parallelism using SMP and Kernel Pre-emption
Linux Device Driver parallelism using SMP and Kernel Pre-emption
 
Lec 9-os-review
Lec 9-os-reviewLec 9-os-review
Lec 9-os-review
 
Icg hpc-user
Icg hpc-userIcg hpc-user
Icg hpc-user
 

More from Yi-Hsiu Hsu

Glow introduction
Glow introductionGlow introduction
Glow introduction
Yi-Hsiu Hsu
 
TensorRT survey
TensorRT surveyTensorRT survey
TensorRT survey
Yi-Hsiu Hsu
 
Yocto Project introduction
Yocto Project introductionYocto Project introduction
Yocto Project introduction
Yi-Hsiu Hsu
 
Understand more about C
Understand more about CUnderstand more about C
Understand more about C
Yi-Hsiu Hsu
 
RISC-V Introduction
RISC-V IntroductionRISC-V Introduction
RISC-V Introduction
Yi-Hsiu Hsu
 
GCC for ARMv8 Aarch64
GCC for ARMv8 Aarch64GCC for ARMv8 Aarch64
GCC for ARMv8 Aarch64
Yi-Hsiu Hsu
 

More from Yi-Hsiu Hsu (6)

Glow introduction
Glow introductionGlow introduction
Glow introduction
 
TensorRT survey
TensorRT surveyTensorRT survey
TensorRT survey
 
Yocto Project introduction
Yocto Project introductionYocto Project introduction
Yocto Project introduction
 
Understand more about C
Understand more about CUnderstand more about C
Understand more about C
 
RISC-V Introduction
RISC-V IntroductionRISC-V Introduction
RISC-V Introduction
 
GCC for ARMv8 Aarch64
GCC for ARMv8 Aarch64GCC for ARMv8 Aarch64
GCC for ARMv8 Aarch64
 

Recently uploaded

A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
vrstrong314
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
WSO2
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 

Recently uploaded (20)

A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 

Memory model

  • 3. Background • Synchronization of multithreaded program – Mutex (mutual exclusion) • Ensuring that no two processes or threads are in their critical section at the same time – Here, a critical section refers to a period of time when the process accesses a shared resource, such as shared memory
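The critical-section idea above can be sketched in user space with a POSIX mutex (a sketch assuming pthreads; the helper names `worker` and `run_counter_demo` are illustrative, not from the slides). Each increment is a read-modify-write made indivisible by holding the lock, so the final count is exactly nthreads × iters:

```c
#include <pthread.h>

/* Sketch: N threads increment a shared counter; the mutex makes each
   read-modify-write a critical section. Helper names are illustrative. */
static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long iters = (long)arg;
    for (long i = 0; i < iters; i++) {
        pthread_mutex_lock(&counter_lock);   /* enter critical section */
        counter++;                           /* access the shared resource */
        pthread_mutex_unlock(&counter_lock); /* leave critical section */
    }
    return NULL;
}

long run_counter_demo(int nthreads, long iters)
{
    pthread_t tid[16];
    counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, (void *)iters);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return counter; /* nthreads * iters when properly serialized */
}
```

Without the lock/unlock pair, increments from different threads could interleave and updates would be lost.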
  • 4. Background – Semaphore • A mutex is essentially the same thing as a binary semaphore, and sometimes uses the same basic implementation • However, the term "mutex" is used to describe a construct which prevents two processes from accessing a shared resource concurrently • The term "binary semaphore" is used to describe a construct which limits access to a single resource • In many cases a mutex has a concept of an “owner” – the process which locked the mutex is the only process allowed to unlock it. In contrast, semaphores generally do not have this restriction – Semaphore vs. mutex • http://www.kernel.org/doc/Documentation/mutex-design.txt
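The binary-vs-counting distinction above can be illustrated with POSIX semaphores (a sketch assuming Linux `sem_init`; the helper names are illustrative). A binary semaphore, initialized to 1, admits one holder at a time; a counting semaphore initialized to N admits up to N. `sem_trywait` returns 0 on acquisition and -1 when it would block, which lets a single thread probe the count:

```c
#include <semaphore.h>

/* Sketch: probe semaphore capacity with sem_trywait (0 = acquired,
   -1 = would block). Helper names are illustrative. */
int binary_semaphore_admits(void)
{
    sem_t s;
    sem_init(&s, 0, 1);            /* binary: count starts at 1 */
    int first  = sem_trywait(&s);  /* acquired */
    int second = sem_trywait(&s);  /* already held: fails */
    sem_post(&s);
    sem_destroy(&s);
    return (first == 0) && (second != 0);
}

int counting_semaphore_admits(unsigned n)
{
    sem_t s;
    sem_init(&s, 0, n);            /* counting: up to n holders */
    unsigned acquired = 0;
    while (sem_trywait(&s) == 0)   /* acquire until the count is exhausted */
        acquired++;
    for (unsigned i = 0; i < acquired; i++)
        sem_post(&s);
    sem_destroy(&s);
    return (int)acquired;
}
```

Note what the sketch does not show: nothing stops a thread other than the acquirer from calling `sem_post`, which is exactly the ownership restriction a mutex adds.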
  • 5. Synchronization and mutex
    Common synchronization methods
    Reference: http://msdn.microsoft.com/en-us/library/ms810047.aspx

    Windows mutex mechanisms
    Type of mutex | IRQL considerations | Recursion and thread details
    Interrupt spin lock | Acquisition raises IRQL to DIRQL and returns previous IRQL to caller. | Not recursive. Release on same thread as acquire.
    Spin lock | Acquisition raises IRQL to DISPATCH_LEVEL and returns previous IRQL to caller. | Not recursive. Release on same thread as acquire.
    Queued spin lock | Acquisition raises IRQL to DISPATCH_LEVEL and stores previous IRQL in lock owner handle. | Not recursive. Release on same thread as acquire.
    Fast mutex | Acquisition raises IRQL to APC_LEVEL and stores previous IRQL in lock. | Not recursive. Release on same thread as acquire.
    Kernel mutex (a kernel dispatcher object) | Enters critical region upon acquisition and leaves critical region upon release. | Recursive. Release on same thread as acquire.
    Synchronization event (a kernel dispatcher object) | Acquisition does not change IRQL. Wait at IRQL <= APC_LEVEL and signal at IRQL <= DISPATCH_LEVEL. | Not recursive. Release on the same thread or on a different thread.
    Unsafe fast mutex | Acquisition does not change IRQL. Acquire and release at IRQL <= APC_LEVEL. | Not recursive. Release on same thread as acquire.

    Synchronization method | Description | Windows mechanisms
    Interlocked operations | Provides atomic logical, arithmetic, and list manipulation operations that are both thread-safe and multiprocessor safe. | InterlockedXxx and ExInterlockedXxx routines
    Mutexes | Provides (mutually) exclusive access to memory. | Spin locks, fast mutexes, kernel mutexes, synchronization events
    Shared/exclusive lock | Allows one thread to write or many threads to read the protected data. | Executive resources
    Counted semaphore | Allows a fixed number of acquisitions. | Semaphores
  • 6. What is wrong with Mutexes? • Mutexes are perfectly fine, but you have a problem if there is lock contention – If you want your algorithm to be fast, you want to use the available cores as much as possible instead of letting them sleep – A thread can hold a mutex and be de-scheduled by the CPU (because of a cache miss or because its time slice is over); then all the threads that want to acquire this mutex will be blocked – And if you have a lot of blocking, the OS also needs to do more context switches, which are expensive because they clear the caches Reference: http://woboq.com/blog/introduction-to-lockfree-programming.html
  • 7. What is wrong with Mutexes? • Problems with locking – Deadlock – Priority Inversion • Low-priority processes hold a lock required by a higher-priority process – Convoying • All the other processes slow to the speed of the slowest one – Async-signal-safety • Signal handlers can’t use lock-based primitives – Kill-tolerant availability • What happens if threads are killed/crash while holding locks – Pre-emption tolerance • What happens if you’re pre-empted holding a lock – Overall performance Reference: http://www.cs.cmu.edu/~410-s05/lectures/L31_LockFree.pdf
  • 8. So how can we do it without locking? • Lock-free Programming – Thread-safe access to shared data without the use of synchronization primitives such as mutexes – Practical with hardware support • Modern CPUs have something called atomic operations • The use of shared memory and an atomic instruction provides the mutual exclusion
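As a minimal sketch of the idea (assuming C11 `<stdatomic.h>` and pthreads; `run_lockfree_demo` is an illustrative name, not from the slides), the shared counter from the mutex discussion can instead be updated with a single atomic fetch-and-add, so no thread ever blocks on a lock:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Sketch: the multi-threaded counter again, but lock-free — each update
   is one atomic read-modify-write. Helper names are illustrative. */
static _Atomic long lf_counter;

static void *lf_worker(void *arg)
{
    long iters = (long)arg;
    for (long i = 0; i < iters; i++)
        atomic_fetch_add(&lf_counter, 1); /* one atomic RMW, no mutex */
    return NULL;
}

long run_lockfree_demo(int nthreads, long iters)
{
    pthread_t tid[16];
    atomic_store(&lf_counter, 0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, lf_worker, (void *)iters);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return atomic_load(&lf_counter);
}
```

A de-scheduled thread here holds nothing: the other threads keep making progress, which is precisely the property the mutex version lacked.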
  • 9. Atomic operation • Atomic operation – Processors have instructions that can be used to implement lock-free and wait-free algorithms • Atomic read-write • Atomic swap, also called XCHG • Test-and-set • Fetch-and-add • Compare-and-swap (CAS) – Compare and Exchange (CMPXCHG) instruction in the x86 and Itanium architectures – ABA problem » http://woboq.com/blog/introduction-to-lockfree-programming.html Reference: http://en.wikipedia.org/wiki/Atomic_operation http://en.wikipedia.org/wiki/Read-modify-write
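The CAS primitive listed above is typically used in a retry loop; here is a sketch that builds fetch-and-add out of C11 `atomic_compare_exchange_weak` (the helper names are illustrative). On failure the call reloads `expected` with the current value, so the loop simply recomputes and retries:

```c
#include <stdatomic.h>

/* Sketch of the classic CAS retry loop. */
long cas_add(_Atomic long *p, long delta)
{
    long expected = atomic_load(p);
    /* If *p still equals expected, replace it with expected + delta;
       otherwise expected is refreshed and we try again. */
    while (!atomic_compare_exchange_weak(p, &expected, expected + delta))
        ;                  /* lost the race: retry with the reloaded value */
    return expected;       /* value before the add, like fetch-and-add */
}

long cas_demo(void)
{
    _Atomic long v = 5;
    cas_add(&v, 3);        /* 5 -> 8 */
    cas_add(&v, -2);       /* 8 -> 6 */
    return atomic_load(&v);
}
```

Note this loop only compares values, which is where the ABA problem mentioned above comes from: a value that went A → B → A looks unchanged to the CAS.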
  • 10. Atomic operation • Load-Link/Store-Conditional – The LDREX and STREX instructions in ARM split the operation of atomically updating memory into two separate steps. Together, they provide atomic updates in conjunction with exclusive monitors that track exclusive memory accesses. Load-Exclusive and Store-Exclusive must only access memory regions marked as Normal – For example » LDREX R1, [R0] performs a Load-Exclusive from the address in R0, places the value into R1 and updates the exclusive monitor(s). » STREX R2, R1, [R0] performs a Store-Exclusive operation to the address in R0, conditionally storing the value from R1 and indicating success or failure in R2. Reference: http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/ch01s02s01.html http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/CJAGCFAF.html Exclusive accesses to memory locations marked as Non-shareable are checked only against this local monitor. Exclusive accesses to memory locations marked as Shareable are checked against both the local monitor and the global monitor.
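Put together, the two instructions form the canonical retry loop — sketched here as an atomic increment in the style of ARM's examples. If another observer touched the location between the exclusive load and the exclusive store, the monitor is cleared, STREX reports failure, and the sequence retries:

```
loop:   LDREX   r1, [r0]        @ load-exclusive: read the value, mark [r0] exclusive
        ADD     r1, r1, #1      @ compute the new value
        STREX   r2, r1, [r0]    @ store-exclusive: r2 = 0 on success, 1 on failure
        CMP     r2, #0          @ did the exclusive store succeed?
        BNE     loop            @ no: another agent intervened, retry
```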
  • 11. Atomic operation • GCC Built-in functions for atomic memory access – http://gcc.gnu.org/onlinedocs/gcc-4.6.3/gcc/Atomic-Builtins.html • Atomic operations supported in Linux Kernel – https://www.kernel.org/doc/Documentation/atomic_ops.txt • Atomic operations supported in C11/C++11 – C11 defines a new _Atomic() type specifier. You can declare an atomic integer like this: _Atomic(int) counter; – C++11 moves this declaration into the standard library: #include <atomic> std::atomic<int> counter; Reference: http://www.informit.com/articles/article.aspx?p=1832575
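The C11 declaration above in use (a minimal sketch; `c11_counter_demo` is an illustrative name). A point worth noting: `++` applied to an `_Atomic` object is itself an atomic read-modify-write, equivalent to `atomic_fetch_add(&x, 1)`:

```c
#include <stdatomic.h>

/* Sketch: operations on an _Atomic int are indivisible RMWs. */
static _Atomic int counter11;

int c11_counter_demo(void)
{
    atomic_store(&counter11, 0);
    for (int i = 0; i < 10; i++)
        counter11++;                   /* ten atomic increments */
    atomic_fetch_add(&counter11, 5);   /* explicit form of the same RMW */
    return atomic_load(&counter11);
}
```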
  • 12. Atomic operation • Is atomic operation enough? • Linux-v3.7.8/arch/arm/include/asm/atomic.h

    static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
    {
        unsigned long oldval, res;

        smp_mb();

        do {
            __asm__ __volatile__("@ atomic_cmpxchg\n"
            "ldrex %1, [%3]\n"
            "mov %0, #0\n"
            "teq %1, %4\n"
            "strexeq %0, %5, [%3]\n"
                : "=&r" (res), "=&r" (oldval), "+Qo" (ptr->counter)
                : "r" (&ptr->counter), "Ir" (old), "r" (new)
                : "cc");
        } while (res);

        smp_mb();

        return oldval;
    }

    Reference: http://lxr.linux.no/#linux+v3.7.8/arch/arm/include/asm/atomic.h#L115
    Before talking about memory barrier, let’s see memory ordering first.
  • 13. Memory barrier
  • 14. Memory ordering • Memory ordering - memory access ordering – Program order • the order of the program’s object code as seen by the CPU, which might differ from the order in the source code due to compiler optimizations – Execution order • It can differ from program order due to both compiler and CPU implementation optimizations – Perceived order • It can differ from the execution order due to caching, interconnect, and memory-system optimizations • Why memory reordering – Performance! Reference: http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf http://preshing.com/20120930/weak-vs-strong-memory-models
  • 15. Memory consistency models • Memory models – memory consistency models • Sequential consistency – all reads and all writes are in-order • Relaxed consistency – Some types of reordering are allowed • Loads can be reordered after loads (for better working of cache coherency, better scaling) • Loads can be reordered after stores • Stores can be reordered after stores • Stores can be reordered after loads • Weak consistency – Reads and Writes are arbitrarily reordered, limited only by explicit memory barriers
  • 16. Weak vs. Strong memory model Reference: http://preshing.com/20120930/weak-vs-strong-memory-models
  • 17. Memory ordering in some architectures
    SPARC TSO = total-store order (default)
    SPARC RMO = relaxed-memory order (not supported on recent CPUs)
    SPARC PSO = partial store order (not supported on recent CPUs)

    Reorderings permitted, by architecture:
    Loads reordered after loads: Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, x86 oostore, IA-64
    Loads reordered after stores: Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, x86 oostore, IA-64
    Stores reordered after stores: Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, SPARC PSO, x86 oostore, IA-64
    Stores reordered after loads: all of the listed architectures
    Atomic reordered with loads: Alpha, ARMv7, POWER, SPARC RMO, IA-64
    Atomic reordered with stores: Alpha, ARMv7, POWER, SPARC RMO, SPARC PSO, IA-64
    Dependent loads reordered: Alpha
    Incoherent instruction cache pipeline: most of the listed architectures (10 of 12)

    Reference: http://en.wikipedia.org/wiki/Memory_ordering
  • 18. Types of Memory Barrier • #LoadLoad • #StoreStore • #LoadStore • #StoreLoad – A StoreLoad barrier ensures that all stores performed before the barrier are visible to other processors, and that all loads performed after the barrier receive the latest value that is visible at the time of the barrier

    if (IsPublished)        // Load and check shared flag
    {
        LOADLOAD_FENCE();   // Prevent reordering of loads
        return Value;       // Load published value
    }

    Value = x;              // Publish some data
    STORESTORE_FENCE();
    IsPublished = 1;        // Set shared flag to indicate availability of data

    Reference: http://preshing.com/20120710/memory-barriers-are-like-source-control-operations
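The publish/consume pattern above can be written with C11 fences standing in for the hypothetical LOADLOAD_FENCE/STORESTORE_FENCE macros (a sketch; function names are illustrative). The release fence keeps the data store from being reordered after the flag store; the acquire fence keeps the data load from being reordered before the flag load:

```c
#include <stdatomic.h>

/* Sketch of the publish pattern with C11 fences. */
static int value;
static _Atomic int is_published;

void publish(int x)
{
    value = x;                                 /* publish some data */
    atomic_thread_fence(memory_order_release); /* plays the #StoreStore role */
    atomic_store_explicit(&is_published, 1, memory_order_relaxed);
}

int try_consume(int *out)
{
    if (atomic_load_explicit(&is_published, memory_order_relaxed)) {
        atomic_thread_fence(memory_order_acquire); /* plays the #LoadLoad role */
        *out = value;                              /* load published value */
        return 1;
    }
    return 0;
}

int publish_demo(void)
{
    int out = -1;
    publish(42);
    return try_consume(&out) ? out : -1;
}
```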
  • 19. Memory barrier in compiler • GCC compiler memory barrier – These barriers prevent a compiler from reordering instructions; they do not prevent reordering by the CPU.

    asm volatile("" ::: "memory");
    or
    __asm__ __volatile__ ("" ::: "memory");

  • GCC support for hardware memory barriers – This builtin issues a full memory barrier.

    __sync_synchronize (...);

    Reference: http://en.wikipedia.org/wiki/Memory_ordering http://gcc.gnu.org/onlinedocs/gcc-4.6.3/gcc/Atomic-Builtins.html
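Both forms above can be dropped into ordinary code; this sketch stores a data word, applies the compiler-only barrier, sets a flag, then issues the full barrier (GCC-style inline asm assumed; names are illustrative). Executed single-threaded, the function simply returns the sum — the barriers constrain ordering, not results:

```c
/* Sketch: a compiler-only barrier followed by a full hardware barrier.
   Names are illustrative. */
static int data_word;
static int flag_word;

int barrier_demo(void)
{
    data_word = 7;
    __asm__ __volatile__("" ::: "memory"); /* compiler may not reorder across this */
    flag_word = 1;
    __sync_synchronize();                  /* full compiler + hardware barrier */
    return data_word + flag_word;
}
```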
  • 20. Memory barriers in Linux kernel • General barrier – barrier() • Compiler barrier only. The compiler will not reorder memory accesses from one side of this statement to the other. This has no effect on the order that the processor actually executes the generated instructions. • Mandatory barriers – mb() • A full system memory barrier. All memory operations before the mb() in the instruction stream will be committed before any operations after the mb() are committed. This ordering will be visible to all bus masters in the system. It will also ensure the order in which accesses from a single processor reaches slave devices. – rmb() • Like mb(), but only guarantees ordering between read accesses. That is, all read operations before an rmb() will be committed before any read operations after the rmb(). – wmb() • Like mb(), but only guarantees ordering between write accesses. That is, all write operations before a wmb() will be committed before any write operations after the wmb(). Reference: http://blogs.arm.com/software-enablement/448-memory-access-ordering-part-2-barriers-and-the-linux-kernel/ http://www.kernel.org/doc/Documentation/memory-barriers.txt
  • 21. Memory barriers in Linux kernel • SMP conditional barriers – smp_mb() • Similar to mb(), but only guarantees ordering between cores/processors within an SMP system. All memory accesses before the smp_mb() will be visible to all cores within the SMP system before any accesses after the smp_mb(). – smp_rmb() • Like smp_mb(), but only guarantees ordering between read accesses. – smp_wmb() • Like smp_mb(), but only guarantees ordering between write accesses. – SMP barriers are a subset of mandatory barriers, not a superset. • An SMP barrier cannot replace a mandatory barrier, but a mandatory barrier can replace an SMP barrier. • Implicit barriers – Locking constructs in the kernel act as implicit SMP barriers, in the same way as pthread synchronization operations do in user space. – I/O accessor macros (readb(), iowrite32()) for the ARM architecture act as explicit memory barriers when the kernel is compiled with CONFIG_ARM_DMA_MEM_BUFFERABLE. This was added in linux-2.6.35. • arch/arm/include/asm/io.h • arch/arm/mm/Kconfig 21 Reference: https://www.kernel.org/doc/Documentation/memory-barriers.txt
  • 23. Memory ordering in ARM Architecture • Memory types – Normal memory • Normal memory is effectively for all of your data and executable code • This memory type permits speculative reads, merging of accesses and repeating of reads without side effects • Accesses to Normal memory can always be buffered, and in most situations they are also cached - but they can be configured to be uncached • There is no implicit ordering of Normal memory accesses – Device memory and Strongly-ordered memory • Used with memory mapped peripherals or other control registers • Processors implementing the LPAE treat Device and Strongly-ordered memory regions identically • ARMv7-A processors that do not implement the LPAE can set device memory to be Shareable or Non-shareable • Accesses to these types of memory must happen exactly the number of times that executing the program suggests they should • There is no guarantee about ordering between memory accesses to different devices, or usually between accesses of different memory types 23 Reference: http://blogs.arm.com/software-enablement/594-memory-access-ordering-part-3-memory-access-ordering-in-the-arm-architecture/
  • 24. Memory ordering in ARM Architecture • Attributes of ARM Memory Types – Normal • Shareable or Non-shareable • Cacheable or Non-cacheable – Device (w/o LPAE) • Shareable or Non-shareable – Device (w/ LPAE) • Always shareable – Strongly-ordered • Always shareable • Must wait for the slave's access ACK 24 ARM® Architecture Reference Manual ARMv7-A and ARMv7-R edition
  • 25. Memory ordering in ARM Architecture • Figure A3-5 shows the memory ordering between two explicit accesses A1 and A2, where A1 occurs before A2 in program order – "<" means the accesses must arrive at any particular memory-mapped peripheral or block of memory in program order, that is, A1 must arrive before A2; there are no ordering restrictions on when accesses arrive at different peripherals or blocks of memory – "–" means the accesses can arrive at any memory-mapped peripheral or block of memory in any order 25
  • 26. Memory ordering in ARM Architecture • Barriers – Barriers were introduced progressively into the ARM architecture • Some ARMv5 processors, such as the ARM926EJ-S, implemented a Drain Write Buffer cp15 operation, which halted execution until any buffered writes had drained into the external memory system • With the introduction of the ARMv6 memory model, this operation was redefined in more architectural terms and became the Data Synchronization Barrier – ARMv6 also introduced the new Data Memory Barrier and Flush Prefetch Buffer cp15 operations • ARMv7 evolved the memory model somewhat, extending the meaning of the barriers - and the Flush Prefetch Buffer operation was renamed the Instruction Synchronization Barrier • ARMv7 also allocated dedicated instruction encodings for the barrier operations – Use of the cp15 operations is now deprecated and software targeting ARMv7 or later should use the DMB, DSB and ISB mnemonics. • And finally, ARMv7 extended the Shareability concept to cover both Inner-shareable and Outer-shareable domains – This together with AMBA4 ACE gives us barriers that propagate into the memory system 26
  • 27. Memory ordering in ARM Architecture – Instruction Synchronization Barrier (ISB) • The ISB ensures that any subsequent instructions are fetched anew from cache so that privilege and access are checked with the current MMU configuration – It is used to ensure any previously executed context-changing operations will have completed by the time the ISB completes • Access type and domain are not really relevant for this barrier – It is not used in any of the Linux memory barrier primitives, but appears in memory management, cache control and context switching code 27
  • 28. Memory ordering in ARM Architecture – Data Memory Barrier (DMB) • DMB prevents reordering of data access instructions across itself – All data accesses by this processor/core before the DMB will be visible to all other masters within the specified shareability domain before any of the data accesses after it – It also ensures that any explicit preceding data/unified cache maintenance operations have completed before any subsequent data accesses are executed – The DMB instruction takes two optional parameters: an operation type (stores only - 'ST' - or loads and stores) and a domain – The default operation type is loads and stores and the default domain is System • In the Linux kernel, the DMB instruction is used for the smp_*mb() macros 28
  • 29. Memory ordering in ARM Architecture – Data Synchronization Barrier (DSB) • DSB enforces the same ordering as the Data Memory Barrier – But it also blocks execution of any further instructions until synchronization is complete – It also waits until all cache and branch predictor maintenance operations have completed for the specified shareability domain – If the access type is load and store then it also waits for any TLB maintenance operations to complete. • In the Linux kernel, the DSB instruction is used for the *mb() macros. 29
  • 30. Memory ordering in ARM Architecture • Shareability domains – Shareability domains define "zones" within the bus topology within which memory accesses are to be kept consistent (taking place in a predictable way) and potentially coherent (with hardware support) – Outside of this domain, observers might not see the same order of memory accesses as inside it – Non-shareable (NSH): a domain consisting only of the local agent; accesses that never need to be synchronized with other cores, processors or devices; not normally used in SMP systems – Inner Shareable (ISH): a domain potentially shared by multiple agents, but usually not all agents in the system; a system can have multiple Inner Shareable domains, and an operation that affects one Inner Shareable domain does not affect other Inner Shareable domains in the system – Outer Shareable (OSH): a domain almost certainly shared by multiple agents, and quite likely consisting of several Inner Shareable domains; an operation that affects an Outer Shareable domain also implicitly affects all Inner Shareable domains within it; for processors such as the Cortex-A15 MPCore that implement the LPAE, all Device memory accesses are considered Outer Shareable, while for other processors the shareability attribute can be explicitly set (to shareable or non-shareable) – Full system (SY): an operation on the full system affects all agents in the system: all Non-shareable regions, all Inner Shareable regions and all Outer Shareable regions; simple peripherals such as UARTs, and several more complex ones, do not normally need to be placed in a restricted shareability domain 30 Reference: http://infocenter.arm.com/help/topic/com.arm.doc.dui0489c/CIHGHHIE.html (ARMv7)
  • 31. Memory ordering in ARM Architecture 31 Allocated values for the data barriers (DMB/DSB)ARMv8
  • 32. Memory ordering in ARM Architecture • The shareability domains example 32 4 cores per cluster, 2 clusters per chip
  • 34. Memory model supported in C++11 • C++ Memory model – Sequentially consistent / acquire-release / relaxed • http://en.cppreference.com/w/cpp/atomic/memory_order • http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html 34
  • 35. Acquire and Release Semantics • ARMv8 AArch64/AArch32 support load-acquire/store-release instructions – The Load-Acquire/Store-Release instructions can remove the requirement to use the explicit DMB memory barrier instruction 35 Reference: http://preshing.com/20120913/acquire-and-release-semantics http://www.arm.com/files/downloads/ARMv8_Architecture.pdf Acquire semantics is a property which can only apply to operations which read from shared memory. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation which follows it in program order. Release semantics is a property which can only apply to operations which write to shared memory. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order.
  • 36. Acquire and Release Semantics • A demo example 36 //Shared global variables int A = 0; int Ready = 0; //Thread 1 A = 42; Ready = 1; //Thread 2 int r1 = Ready; int r2 = A; //Possible results: r1 == 0 with r2 == 0 or 42; r1 == 1 with r2 == 0 or 42 //Shared global variables int A = 0; atomic<int> Ready = 0; //Thread 1 A = 42; Ready.store(1, memory_order_release); //Thread 2 int r1 = Ready.load(memory_order_acquire); int r2 = A; //Possible results: r1 == 0 with r2 == 0 or 42; r1 == 1 with r2 == 42
  • 37. Acquire and Release Semantics • A Write-Release Can Synchronize-With a Read-Acquire 37 // Thread 1 void SendTestMessage(void* param) { // Copy to shared memory using non-atomic stores. g_payload.tick = clock(); g_payload.str = "TestMessage"; g_payload.param = param; // Perform an atomic write-release to indicate that the message is ready. g_guard.store(1, std::memory_order_release); } // Thread 2 bool TryReceiveMessage(Message& result) { // Perform an atomic read-acquire to check whether the message is ready. int ready = g_guard.load(std::memory_order_acquire); if (ready != 0) { // Yes. Copy from shared memory using non-atomic loads. result.tick = g_payload.tick; result.str = g_payload.str; result.param = g_payload.param; return true; } // No. return false; } Reference: http://preshing.com/20130823/the-synchronizes-with-relation/
  • 39. Volatile vs. memory-order/atomic • What does the volatile keyword mean? 39 Reference: http://www.drdobbs.com/parallel/volatile-vs-volatile/212701484
  • 40. Volatile vs. memory-order/atomic • C programmers have often taken volatile to mean that the variable could be changed outside of the current thread of execution – as a result, they are sometimes tempted to use it in kernel code when shared data structures are being used – In other words, they have been known to treat volatile types as a sort of easy atomic variable, which they are not – The use of volatile in kernel code is almost never correct • The key point to understand with regard to volatile is that its purpose is to suppress optimization, which is almost never what one really wants to do • In the kernel, one must protect shared data structures against unwanted concurrent access, which is very much a different task • Like volatile, the kernel primitives which make concurrent access to data safe (spinlocks, mutexes, memory barriers, etc.) are designed to prevent unwanted optimization. If they are being used properly, there will be no need to use volatile as well 40 Reference: https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
  • 41. Volatile vs. memory-order/atomic • To safely write lock-free code that communicates between threads without using locks – prefer to use ordered atomic variables – Java/.NET volatile, C++0x atomic<T>, and C-compatible atomic_T • To safely communicate with special hardware or other memory that has unusual semantics – use un-optimizable variables: ISO C/C++ volatile – Remember that reads and writes of these variables are not necessarily atomic • To protect shared data structures against unwanted concurrent access in kernel code – use kernel concurrent access primitives, like spinlocks, mutexes, memory barriers • Finally, to express a variable that both has unusual semantics and has any or all of the atomicity and/or ordering guarantees needed for lock-free coding – only the ISO C++11 Standard provides a direct way to spell it: volatile atomic<T> 41
  • 42. USAGE OF MEMORY BARRIER 42
  • 43. Usage of memory barrier instructions • In what situations might I need to insert memory barrier instructions? – Mutexes 43 Reference: http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf http://infocenter.arm.com/help/topic/com.arm.doc.faqs/ka14041.html LOCKED EQU 1 UNLOCKED EQU 0 lock_mutex ; Is mutex locked? LDREX r1, [r0] ; Check if locked CMP r1, #LOCKED ; Compare with "locked" WFEEQ ; Mutex is locked, go into standby BEQ lock_mutex ; On waking re-check the mutex ; Attempt to lock mutex MOV r1, #LOCKED STREX r2, r1, [r0] ; Attempt to lock mutex CMP r2, #0x0 ; Check whether store completed BNE lock_mutex ; If store failed, try again DMB ; Required before accessing protected resource BX lr unlock_mutex DMB ; Ensure accesses to protected resource have completed MOV r1, #UNLOCKED ; Write "unlocked" into lock field STR r1, [r0] DSB ; Ensure update of the mutex occurs before other CPUs wake SEV ; Send event to other CPUs, wakes any CPU waiting on WFE BX lr
  • 44. Usage of memory barrier instructions – Memory Remapping • Consider a situation where your reset handler/boot code lives in Flash memory (ROM), which is aliased to address 0x0 to ensure that your program boots correctly from the vector table, which normally resides at the bottom of memory (see left-hand-side memory map). • After you have initialized your system, you may wish to turn off the Flash memory alias so that you can use the bottom portion of memory for RAM (see right-hand-side memory map). The following code (running from the permanent Flash memory region) disables the Flash alias, before calling a memory block copying routine (e.g., memcpy) to copy some data into the bottom portion of memory (RAM). 44 MOV r0, #0 MOV r1, #REMAP_REG STR r0, [r1] ; Disable Flash alias BL block_copy_routine() ; Block copy code into RAM BL copied_routine() ; Execute copied routine (now in RAM) DMB ; Ensure above str completion with DMB DSB ; Ensure block copy is completed with DSB ISB ; Ensure pipeline flush with ISB Question
  • 45. Usage of memory barrier instructions – Self-modifying code – If the memory you are performing the block copying routine on is marked as 'cacheable' the instruction cache will need to be invalidated so that the processor does not execute any other 'cached' code. – For "write-back" regions the data cache must be cleaned before the instruction cache invalidate. 45 Overlay_manager ; ... BL block_copy ; Copy new routine from ROM to RAM B relocated_code ; Branch to new routine Overlay_manager ; ... BL block_copy ; Copy new routine from ROM to RAM data_cache_clean ; Clean the cache so that the new routine is written out to memory icache_and_pb_invalidate ; Invalidate the instruction cache and branch predictor so that the ; old routine is no longer cached B relocated_code ; Branch to new routine DSB ; Ensure block copy has completed ISB ; Flush pipeline to ensure processor fetches new instructions DSB ; Ensure data cache clean has completed DSB ; Ensure invalidate has completed ISB ; Flush pipeline to ensure processor fetches new instructions

Editor's Notes

  1. http://lists.infradead.org/pipermail/linux-arm-kernel/2010-July/019912.html http://home.deib.polimi.it/silvano/FilePDF/ARC-MULTIMEDIA/ARM_IEEEComputer_July2005_01463106.pdf
  2. There can be an incoherent instruction cache pipeline, which prevents self-modifying code from being executed without special ICache flush/reload instructions.
  3. Why do device I/O access operations need memory barriers?
  4. Device: need not wait for the slave's access ACK. Strongly-ordered: must wait for the slave's access ACK.
  5. Write-through: memory non-cacheable. Write-back write-allocate: memory cacheable.