Memory ordering
2014
issue.hsu@gmail.com
SYNCHRONIZATION
2
Background
• Synchronization of multithreaded programs
– Mutex (mutual exclusion)
• Ensuring that no two processes or threads are in their critical section
at the same time
– Here, a critical section refers to a period of time when the process
accesses a shared resource, such as shared memory
3
Background
– Semaphore
• A mutex is essentially the same thing as a binary semaphore, and
sometimes uses the same basic implementation
• However, the term "mutex" is used to describe a construct which
prevents two processes from accessing a shared resource
concurrently
• The term "binary semaphore" is used to describe a construct which
limits access to a single resource
• In many cases a mutex has a concept of an “owner”
– the process which locked the mutex is the only process allowed to
unlock it. In contrast, semaphores generally do not have this
restriction
– Semaphore vs. mutex
• http://www.kernel.org/doc/Documentation/mutex-design.txt
4
Synchronization
and mutex
Common synchronization methods
5
Reference:
http://msdn.microsoft.com/e
n-us/library/ms810047.aspx
Windows mutex mechanisms
Type of mutex IRQL considerations Recursion and thread details
Interrupt spin lock Acquisition raises IRQL to DIRQ and returns
previous IRQL to caller.
Not recursive. Release on same
thread as acquire.
Spin lock Acquisition raises IRQL to
DISPATCH_LEVEL and returns previous
IRQL to caller.
Not recursive. Release on same
thread as acquire.
Queued spin lock Acquisition raises IRQL to
DISPATCH_LEVEL and stores previous
IRQL in lock owner handle.
Not recursive. Release on same
thread as acquire.
Fast mutex Acquisition raises IRQL to APC_LEVEL and
stores previous IRQL in lock.
Not recursive. Release on same
thread as acquire.
Kernel mutex (a
kernel dispatcher
object)
Enters critical region upon acquisition and
leaves critical region upon release.
Recursive. Release on same
thread as acquire.
Synchronization
event (a kernel
dispatcher object)
Acquisition does not change IRQL. Wait at
IRQL <= APC_LEVEL and signal at IRQL
<= DISPATCH_LEVEL.
Not recursive. Release on the
same thread or on a different
thread.
Unsafe fast mutex Acquisition does not change IRQL. Acquire
and release at IRQL <= APC_LEVEL.
Not recursive. Release on same
thread as acquire.
Synchronization
method
Description Windows mechanisms
Interlocked
operations
Provides atomic logical,
arithmetic, and list
manipulation operations that
are both thread-safe and
multiprocessor safe.
InterlockedXxx and
ExInterlockedXxx routines
Mutexes Provides (mutually) exclusive
access to memory.
Spin locks, fast mutexes,
kernel mutexes,
synchronization events
Shared/exclusive
lock
Allows one thread to write or
many threads to read the
protected data.
Executive resources
Counted semaphore Allows a fixed number of
acquisitions.
Semaphores
What is wrong with Mutexes?
• Mutexes are perfectly fine, but you have a problem if there is lock
contention
– If you want your algorithm to be fast, you want to use the
available cores as much as possible instead of letting them sleep
– A thread can hold a mutex and be de-scheduled by the CPU
(because of a cache miss or because its time slice is over); all the
threads that want to acquire this mutex are then blocked
– And if you have a lot of blocking, the OS also needs to do more
context switches which are expensive because they clear the
caches
6
Reference:
http://woboq.com/blog/introduction
-to-lockfree-programming.html
What is wrong with Mutexes?
• Problems with locking
– Deadlock
– Priority Inversion
• Low-priority processes hold a lock required by a higher priority process
– Convoying
• All the other processes slow to the speed of the slowest one
– Async-signal-safety
• Signal handlers can’t use lock-based primitives
– Kill-tolerant availability
• What happens if threads are killed/crash while holding locks
– Pre-emption tolerance
• What happens if you’re pre-empted holding a lock
– Overall performance
7
Reference:
http://www.cs.cmu.edu/~410-
s05/lectures/L31_LockFree.pdf
So how can we do it without locking?
• Lock-free Programming
– Thread-safe access to shared data without the use of
synchronization primitives such as mutexes
– Practical with hardware support
• Modern CPUs have something called atomic operations
• The use of shared memory and an atomic instruction provides the
mutual exclusion
8
Atomic operation
• Atomic operation
– Processors have instructions that can be used to implement lock-
free and wait-free algorithms
• Atomic read-write
• Atomic swap, also called XCHG
• Test-and-set
• Fetch-and-add
• Compare-and-swap (CAS)
– Compare and Exchange (CMPXCHG) instruction in the x86 and
Itanium architectures
– ABA problem
» http://woboq.com/blog/introduction-to-lockfree-programming.html
9
Reference:
http://en.wikipedia.org/wiki/Atomic_operation
http://en.wikipedia.org/wiki/Read-modify-write
Atomic operation
• Load-Link/Store-Conditional
– The LDREX and STREX instructions in ARM split the operation of
atomically updating memory into two separate steps. Together, they provide
atomic updates in conjunction with exclusive monitors that track exclusive
memory accesses. Load-Exclusive and Store-Exclusive must only access
memory regions marked as Normal
– For example
» LDREX R1, [R0] performs a Load-Exclusive from the address in R0, places the value into
R1 and updates the exclusive monitor(s).
» STREX R2, R1, [R0] performs a Store-Exclusive operation to the address in R0,
conditionally storing the value from R1 and indicating success or failure in R2.
10
Reference:
http://infocenter.arm.com/help/topic/co
m.arm.doc.dht0008a/ch01s02s01.html
http://infocenter.arm.com/help/topic/co
m.arm.doc.dht0008a/CJAGCFAF.html
Exclusive accesses to memory locations
marked as Non-shareable are checked
only against this local monitor. Exclusive
accesses to memory locations marked as
Shareable are checked against both the
local monitor and the global monitor.
Atomic operation
• GCC Built-in functions for atomic memory access
– http://gcc.gnu.org/onlinedocs/gcc-4.6.3/gcc/Atomic-Builtins.html
• Atomic operations supported in Linux Kernel
– https://www.kernel.org/doc/Documentation/atomic_ops.txt
• Atomic operations supported in C11/C++11
– C11 defines a new _Atomic() type specifier. You can declare an
atomic integer like this:
_Atomic(int) counter;
– C++11 moves this declaration into the standard library:
#include <atomic>
std::atomic<int> counter;
11
Reference:
http://www.informit.com/articles
/article.aspx?p=1832575
Atomic operation
• Is atomic operation enough?
• Linux-v3.7.8/arch/arm/include/asm/atomic.h
12
static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
{
unsigned long oldval, res;
smp_mb();
do {
__asm__ __volatile__("@ atomic_cmpxchg\n"
"ldrex %1, [%3]\n"
"mov %0, #0\n"
"teq %1, %4\n"
"strexeq %0, %5, [%3]\n"
: "=&r" (res), "=&r" (oldval), "+Qo" (ptr->counter)
: "r" (&ptr->counter), "Ir" (old), "r" (new)
: "cc");
} while (res);
smp_mb();
return oldval;
}
Reference:
http://lxr.linux.no/#linux+v3.7.8/arch/ar
m/include/asm/atomic.h#L115
Before talking about memory
barrier, let’s see memory ordering
first.
Memory barrier
MEMORY ORDERING
CONCEPT
13
Memory ordering
• Memory ordering - memory access ordering
– Program order
• the order of the program’s object code as seen by the CPU, which might differ from
the order in the source code due to compiler optimizations
– Execution order
• It can differ from program order due to both compiler and CPU implementation
optimizations
– Perceived order
• It can differ from the execution order due to caching, interconnect, and memory-
system optimizations
• Why memory reordering
– Performance!
14
Reference:
http://www.rdrop.com/users/paulmck/sca
lability/paper/ordering.2007.09.19a.pdf
http://preshing.com/20120930/weak-vs-
strong-memory-models
Memory consistency models
• Memory models – memory consistency models
• Sequential consistency
– all reads and all writes are in-order
• Relaxed consistency
– Some types of reordering are allowed
• Loads can be reordered after loads (for better working of cache coherency,
better scaling)
• Loads can be reordered after stores
• Stores can be reordered after stores
• Stores can be reordered after loads
• Weak consistency
– Reads and Writes are arbitrarily reordered, limited only by explicit
memory barriers
15
Weak vs. strong memory models
16
Reference:
http://preshing.com/20120930/
weak-vs-strong-memory-models
Memory ordering in some architectures
17
SPARC TSO = total-store order (default)
SPARC RMO = relaxed-memory order (not supported on recent
CPUs)
SPARC PSO = partial store order (not supported on recent CPUs)
Type Alpha ARMv7 PA-RISC POWER
SPARC
RMO
SPARC
PSO
SPARC
TSO
x86
x86
oostore
AMD64 IA-64 zSeries
Loads reordered after loads Y Y Y Y Y Y Y
Loads reordered after stores Y Y Y Y Y Y Y
Stores reordered after stores Y Y Y Y Y Y Y Y
Stores reordered after loads Y Y Y Y Y Y Y Y Y Y Y Y
Atomic reordered with loads Y Y Y Y Y
Atomic reordered with stores Y Y Y Y Y Y
Dependent loads reordered Y
Incoherent Instruction cache pipeline Y Y Y Y Y Y Y Y Y Y
Reference:
http://en.wikipedia.org/wiki/Memory_ordering
Types of Memory Barrier
• #LoadLoad
• #StoreStore
• #LoadStore
• #StoreLoad
– A StoreLoad barrier ensures that all stores performed before the barrier are visible to other
processors, and that all loads performed after the barrier receive the latest value that is visible at the
time of the barrier
18
Reference:
http://preshing.com/20120710/memory-
barriers-are-like-source-control-operations
if (IsPublished) // Load and check shared flag
{
LOADLOAD_FENCE(); // Prevent reordering of loads
return Value; // Load published value
}
Value = x; // Publish some data
STORESTORE_FENCE();
IsPublished = 1; // Set shared flag to indicate availability of data
Memory barrier in compiler
• GCC compiler memory barrier
– These barriers prevent a compiler from reordering instructions,
they do not prevent reordering by CPU.
• GCC support for hardware memory barriers
– This builtin issues a full memory barrier.
19
Reference:
http://en.wikipedia.org/wiki/Memory_ordering
http://gcc.gnu.org/onlinedocs/gcc-
4.6.3/gcc/Atomic-Builtins.html
asm volatile("" ::: "memory");
or
__asm__ __volatile__ ("" ::: "memory");
__sync_synchronize();
Memory barriers in Linux kernel
• General barrier
– barrier()
• Compiler barrier only. The compiler will not reorder memory accesses from one side of this
statement to the other. This has no effect on the order that the processor actually executes
the generated instructions.
• Mandatory barriers
– mb()
• A full system memory barrier. All memory operations before the mb() in the instruction
stream will be committed before any operations after the mb() are committed. This ordering
will be visible to all bus masters in the system. It will also ensure the order in which
accesses from a single processor reaches slave devices.
– rmb()
• Like mb(), but only guarantees ordering between read accesses. That is, all read
operations before an rmb() will be committed before any read operations after the rmb().
– wmb()
• Like mb(), but only guarantees ordering between write accesses. That is, all write
operations before a wmb() will be committed before any write operations after the wmb().
20
Reference:
http://blogs.arm.com/software-
enablement/448-memory-access-ordering-
part-2-barriers-and-the-linux-kernel/
http://www.kernel.org/doc/Documentation/
memory-barriers.txt
Memory barriers in Linux kernel
• SMP conditional barriers
– smp_mb()
• Similar to mb(), but only guarantees ordering between cores/processors within an
SMP system. All memory accesses before the smp_mb() will be visible to all cores
within the SMP system before any accesses after the smp_mb().
– smp_rmb()
• Like smp_mb(), but only guarantees ordering between read accesses.
– smp_wmb()
• Like smp_mb(), but only guarantees ordering between write accesses.
– SMP barriers are a subset of mandatory barriers, not a superset.
• An SMP barrier cannot replace a mandatory barrier, but a mandatory barrier can
replace an SMP barrier.
• Implicit barriers
– Locking constructs in the kernel act as implicit SMP barriers, in the same way
as pthread synchronization operations do in user space.
– I/O accessor macros (readb(), iowrite32()) for the ARM architecture act as
explicit memory barriers when kernel is compiled with
CONFIG_ARM_DMA_MEM_BUFFERABLE. This was added in linux-2.6.35.
• arch/arm/include/asm/io.h
• arch/arm/mm/Kconfig
21
Reference:
https://www.kernel.org/doc/Documentatio
n/memory-barriers.txt
MEMORY ORDERING IN ARM
22
Memory ordering in ARM Architecture
• Memory types
– Normal memory
• Normal memory is effectively for all of your data and executable code
• This memory type permits speculative reads, merging of accesses and repeating of
reads without side effects
• Accesses to Normal memory can always be buffered, and in most situations they
are also cached - but they can be configured to be uncached
• There is no implicit ordering of Normal memory accesses
– Device memory and Strongly-ordered memory
• Used with memory mapped peripherals or other control registers
• Processors implementing the LPAE treat Device and Strongly-ordered memory
regions identically
• ARMv7-A processors that do not implement the LPAE can set device memory to be
Shareable or Non-shareable
• Accesses to these types of memory must happen exactly the number of times that
executing the program suggests they should
• There is no guarantee about ordering between memory accesses to different
devices, or usually between accesses of different memory types
23
Reference:
http://blogs.arm.com/software-enablement/594-
memory-access-ordering-part-3-memory-access-
ordering-in-the-arm-architecture/
Memory ordering in ARM Architecture
• Summary of ARM memory types
– Normal
• Shareable or Non-shareable
• Cacheable or Non-cacheable
– Device (without LPAE)
• Shareable or Non-shareable
– Device (with LPAE)
• Always Shareable
– Strongly-ordered
• Always Shareable
• Accesses must wait for the slave's acknowledgement
24
ARM ® Architecture Reference
Manual
ARMv7-A and ARMv7-R edition
Memory ordering in ARM Architecture
• Figure A3-5 shows the memory ordering between two explicit accesses A1 and A2,
where A1 occurs before A2 in program order
– "<": Accesses must arrive at any particular memory-mapped peripheral or block of
memory in program order, that is, A1 must arrive before A2. There are no ordering
restrictions on when accesses arrive at different peripherals or blocks of memory.
– "–": Accesses can arrive at any memory-mapped peripheral or block of memory in
any order.
25
Memory ordering in ARM Architecture
• Barriers
– Barriers were introduced progressively into the ARM architecture
• Some ARMv5 processors, such as the ARM926EJ-S, implemented a Drain Write
Buffer cp15 operation, which halted execution until any buffered writes had drained
into the external memory system
• With the introduction of the ARMv6 memory model, this operation was redefined in
more architectural terms and became the Data Synchronization Barrier
– ARMv6 also introduced the new Data Memory Barrier and Flush Prefetch Buffer
cp15 operations
• ARMv7 evolved the memory model somewhat, extending the meaning of the
barriers - and the Flush Prefetch Buffer operation was renamed the Instruction
Synchronization Barrier
• ARMv7 also allocated dedicated instruction encodings for the barrier operations
– Use of the cp15 operations is now deprecated and software targeting ARMv7 or
later should use the DMB, DSB and ISB mnemonics.
• And finally, ARMv7 extended the Shareability concept to cover both Inner-shareable
and Outer-shareable domains
– This together with AMBA4 ACE gives us barriers that propagate into the memory
system
26
Memory ordering in ARM Architecture
– Instruction Synchronization Barrier (ISB)
• The ISB ensures that any subsequent instructions are fetched anew
from cache so that privilege and access permissions are checked
against the current MMU configuration
– It is used to ensure that any previously executed context-changing
operations will have completed by the time the ISB completes
• Access type and domain are not really relevant for this barrier
– It is not used in any of the Linux memory barrier primitives, but
appears in memory management, cache control and context
switching code
27
Memory ordering in ARM Architecture
– Data Memory Barrier (DMB)
• DMB prevents reordering of data accesses instructions across itself
– All data accesses by this processor/core before the DMB will be
visible to all other masters within the specified shareability domain
before any of the data accesses after it
– It also ensures that any explicit preceding data/unified cache
maintenance operations have completed before any subsequent
data accesses are executed
– The DMB instruction takes two optional parameters: an operation
type (stores only - 'ST' - or loads and stores) and a domain
– The default operation type is loads and stores and the default
domain is System
• In the Linux kernel, the DMB instruction is used for the smp_*mb()
macros
28
Memory ordering in ARM Architecture
– Data Synchronization Barrier (DSB)
• DSB enforces the same ordering as the Data Memory Barrier
– But it also blocks execution of any further instructions until
synchronization is complete
– It also waits until all cache and branch predictor maintenance
operations have completed for the specified shareability domain
– If the access type is load and store then it also waits for any TLB
maintenance operations to complete.
• In the Linux kernel, the DSB instruction is used for the *mb() macros.
29
Domain / Abbreviation / Description
Non-shareable (NSH)
A domain consisting only of the local agent. Accesses that never need to be synchronized with other
cores, processors or devices. Not normally used in SMP systems.
Inner Shareable (ISH)
A domain potentially shared by multiple agents, but usually not all agents in the system.
A system can have multiple Inner Shareable domains. An operation that affects one Inner Shareable
domain does not affect other Inner Shareable domains in the system.
Outer Shareable (OSH)
A domain almost certainly shared by multiple agents, and quite likely consisting of several Inner
Shareable domains. An operation that affects an Outer Shareable domain also implicitly affects all
Inner Shareable domains within it.
For processors such as the Cortex-A15 MPCore that implement the LPAE, all Device memory
accesses are considered Outer Shareable. For other processors, the shareability attribute can be
explicitly set (to Shareable or Non-shareable).
Full system (SY)
An operation on the full system affects all agents in the system: all Non-shareable regions, all Inner
Shareable regions and all Outer Shareable regions. Simple peripherals such as UARTs, and several
more complex ones, do not normally need to be placed in a restricted shareability domain.
Memory ordering in ARM Architecture
• Shareability domains
– Shareability domains define "zones" within the bus topology within which memory
accesses are to be kept consistent (taking place in a predictable way) and
potentially coherent (with hardware support)
– Outside of this domain, observers might not see the same order of memory
accesses as inside it
30
Reference:
http://infocenter.arm.com/help/topic/com.arm.doc.dui0489c/CIHGHHIE.html
ARMv7
Memory ordering in ARM Architecture
31
Allocated values for the data barriers (DMB/DSB)
ARMv8
Memory ordering in ARM Architecture
• The shareability domains example
32
4 cores per cluster,
2 clusters per chip
ACQUIRE AND RELEASE
33
Memory model supported in C++11
• C++ Memory model
– Sequential consistent/acquire-release/relaxed
• http://en.cppreference.com/w/cpp/atomic/memory_order
• http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
34
Acquire and Release Semantics
• ARMv8 AArch64/AArch32 support load-acquire/store-release
instructions
– The Load-Acquire/Store-Release instructions can remove the requirement to use
the explicit DMB memory barrier instruction
35
Reference:
http://preshing.com/20120913/acq
uire-and-release-semantics
http://www.arm.com/files/downloa
ds/ARMv8_Architecture.pdf
Acquire semantics is a property which can only apply to
operations which read from shared memory. The operation is
then considered a read-acquire. Acquire semantics prevent
memory reordering of the read-acquire with any read or write
operation which follows it in program order.
Release semantics is a property which can only apply to
operations which write to shared memory. The operation is then
considered a write-release. Release semantics prevent memory
reordering of the write-release with any read or write operation
which precedes it in program order.
Acquire and Release Semantics
• A demo example
36
//Shared global variables
int A = 0;
int Ready = 0;
//Thread 1
A = 42;
Ready = 1;
//Thread 2
int r1 = Ready;
int r2 = A;
//Possible results of r1, r2
r1 = 0, r2 = 0
r1 = 0, r2 = 42
r1 = 1, r2 = 0
r1 = 1, r2 = 42
//Shared global variables
int A = 0;
Atomic<int> Ready = 0;
//Thread 1
A = 42;
Ready.store(1,
memory_order_release);
//Thread 2
int r1 =
Ready.load(memory_ord
er_acquire);
int r2 = A;
//Possible results of r1, r2
r1 = 0, r2 = 0
r1 = 0, r2 = 42
r1 = 1, r2 = 42
Acquire and Release Semantics
• A Write-Release Can Synchronize-With a Read-Acquire
37
// Thread 1
void SendTestMessage(void* param)
{
// Copy to shared memory using non-atomic stores.
g_payload.tick = clock();
g_payload.str = "TestMessage";
g_payload.param = param;
// Perform an atomic write-release to indicate that the message is ready.
g_guard.store(1, std::memory_order_release);
}
// Thread 2
bool TryReceiveMessage(Message& result)
{
// Perform an atomic read-acquire to check whether the message is ready.
int ready = g_guard.load(std::memory_order_acquire);
if (ready != 0)
{
// Yes. Copy from shared memory using non-atomic loads.
result.tick = g_payload.tick;
result.str = g_msg_str;
result.param = g_payload.param;
return true;
}
// No.
return false;
}
Reference:
http://preshing.com/20130823/
the-synchronizes-with-relation/
VOLATILE
38
Volatile vs. memory-order/atomic
• What does the volatile keyword mean?
39
Reference:
http://www.drdobbs.com/parallel/vola
tile-vs-volatile/212701484
Volatile vs. memory-order/atomic
• C programmers have often taken volatile to mean that the variable could
be changed outside of the current thread of execution
– as a result, they are sometimes tempted to use it in kernel code
when shared data structures are being used
– In other words, they have been known to treat volatile types as a sort
of easy atomic variable, which they are not
– The use of volatile in kernel code is almost never correct
• The key point to understand with regard to volatile is that its purpose is to
suppress optimization, which is almost never what one really wants to do
• In the kernel, one must protect shared data structures against unwanted
concurrent access, which is very much a different task
• Like volatile, the kernel primitives which make concurrent access to data
safe (spinlocks, mutexes, memory barriers, etc.) are designed to prevent
unwanted optimization. If they are being used properly, there will be no
need to use volatile as well
40
Reference:
https://www.kernel.org/doc/Document
ation/volatile-considered-harmful.txt
Volatile vs. memory-order/atomic
• To safely write lock-free code that communicates between threads without using
locks
– prefer to use ordered atomic variables
– Java/.NET volatile, C++0x atomic<T>, and C-compatible atomic_T
• To safely communicate with special hardware or other memory that has unusual
semantics
– use un-optimizable variables: ISO C/C++ volatile
– Remember that reads and writes of these variables are not necessarily
atomic
• To protect shared data structures against unwanted concurrent access in kernel
code
– use kernel concurrent access primitives, like spinlocks, mutexes, memory
barriers
• Finally, to express a variable that both has unusual semantics and has any or all
of the atomicity and/or ordering guarantees needed for lock-free coding
– only the ISO C++11 Standard provides a direct way to spell it: volatile
atomic<T>
41
USAGE OF MEMORY BARRIER
42
Usage of memory barrier
instructions
• In what situations might I need to insert memory barrier instructions?
– Mutexes
43
Reference:
http://infocenter.arm.com/help/topic/
com.arm.doc.genc007826/Barrier_Lit
mus_Tests_and_Cookbook_A08.pdf
http://infocenter.arm.com/help/topic/
com.arm.doc.faqs/ka14041.html
LOCKED EQU 1
UNLOCKED EQU 0
lock_mutex
; Is mutex locked?
LDREX r1, [r0] ; Check if locked
CMP r1, #LOCKED ; Compare with "locked"
WFEEQ ; Mutex is locked, go into standby
BEQ lock_mutex ; On waking re-check the mutex
; Attempt to lock mutex
MOV r1, #LOCKED
STREX r2, r1, [r0] ; Attempt to lock mutex
CMP r2, #0x0 ; Check whether store completed
BNE lock_mutex ; If store failed, try again
DMB ; Required before accessing protected resource
BX lr
unlock_mutex
DMB ; Ensure accesses to protected resource have completed
MOV r1, #UNLOCKED ; Write "unlocked" into lock field
STR r1, [r0]
DSB ; Ensure update of the mutex occurs before other CPUs wake
SEV ; Send event to other CPUs, wakes any CPU waiting on using WFE
BX lr
Usage of memory barrier instructions
– Memory Remapping
• Consider a situation where your reset handler/boot code lives in Flash memory (ROM),
which is aliased to address 0x0 to ensure that your program boots correctly from the vector
table, which normally resides at the bottom of memory (see left-hand-side memory map).
• After you have initialized your system, you may wish to turn off the Flash memory alias so
that you can use the bottom portion of memory for RAM (see right-hand-side memory
map). The following code (running from the permanent Flash memory region) disables the
Flash alias, before calling a memory block copying routine (e.g., memcpy) to copy some
data to the bottom portion of memory (RAM).
44
MOV r0, #0
MOV r1, #REMAP_REG
STR r0, [r1] ; Disable Flash alias
BL block_copy_routine() ; Block copy code into RAM
BL copied_routine() ; Execute copied routine (now in RAM)
DMB ; Ensure above str completion with DMB
DSB ; Ensure block copy is completed with DSB
ISB ; Ensure pipeline flush with ISB
Question
Usage of memory barrier instructions
– Self-modifying code
– If the memory you are performing the block copying routine on is marked as 'cacheable'
the instruction cache will need to be invalidated so that the processor does not execute
any other 'cached' code.
– For "write-back" regions the data cache must be cleaned before the instruction cache
invalidate.
45
Overlay_manager
; ...
BL block_copy ; Copy new routine from ROM to RAM
DSB ; Ensure block copy has completed
ISB ; Flush pipeline to ensure processor fetches new instructions
B relocated_code ; Branch to new routine

Overlay_manager
; ...
BL block_copy ; Copy new routine from ROM to RAM
data_cache_clean ; Clean the cache so that the new routine is written out to memory
DSB ; Ensure data cache clean has completed
icache_and_pb_invalidate ; Invalidate the instruction cache and branch predictor so that the
; old routine is no longer cached
DSB ; Ensure invalidate has completed
ISB ; Flush pipeline to ensure processor fetches new instructions
B relocated_code ; Branch to new routine

Memory model

  • 1.
  • 2.
  • 3.
    Background •Synchronization of multithread program –Mutex (mutual exclusion) • Ensuring that no two processes or threads are in their critical section at the same time – Here, a critical section refers to a period of time when the process accesses a shared resource, such as shared memory 3
  • 4.
    Background – Semaphore • Amutex is essentially the same thing as a binary semaphore, and sometimes uses the same basic implementation • However, the term "mutex" is used to describe a construct which prevents two processes from accessing a shared resource concurrently • The term "binary semaphore" is used to describe a construct which limits access to a single resource • In many cases a mutex has a concept of an “owner” – the process which locked the mutex is the only process allowed to unlock it. In contrast, semaphores generally do not have this restriction – Semaphore vs. mutex • http://www.kernel.org/doc/Documentation/mutex-design.txt 4
  • 5.
    Synchronization and mutex Common synchronizationmethods 5 Reference: http://msdn.microsoft.com/e n-us/library/ms810047.aspx Windows mutex mechanisms Type of mutex IRQL considerations Recursion and thread details Interrupt spin lock Acquisition raises IRQL to DIRQ and returns previous IRQL to caller. Not recursive. Release on same thread as acquire. Spin lock Acquisition raises IRQL to DISPATCH_LEVEL and returns previous IRQL to caller. Not recursive. Release on same thread as acquire. Queued spin lock Acquisition raises IRQL to DISPATCH_LEVEL and stores previous IRQL in lock owner handle. Not recursive. Release on same thread as acquire. Fast mutex Acquisition raises IRQL to APC_LEVEL and stores previous IRQL in lock. Not recursive. Release on same thread as acquire. Kernel mutex (a kernel dispatcher object) Enters critical region upon acquisition and leaves critical region upon release. Recursive. Release on same thread as acquire. Synchronization event (a kernel dispatcher object) Acquisition does not change IRQL. Wait at IRQL <= APC_LEVEL and signal at IRQL <= DISPATCH_LEVEL. Not recursive. Release on the same thread or on a different thread. Unsafe fast mutex Acquisition does not change IRQL. Acquire and release at IRQL <= APC_LEVEL. Not recursive. Release on same thread as acquire. Synchronization method Description Windows mechanisms Interlocked operations Provides atomic logical, arithmetic, and list manipulation operations that are both thread-safe and multiprocessor safe. InterlockedXxx and ExInterlockedXxx routines Mutexes Provides (mutually) exclusive access to memory. Spin locks, fast mutexes, kernel mutexes, synchronization events Shared/exclusive lock Allows one thread to write or many threads to read the protected data. Executive resources Counted semaphore Allows a fixed number of acquisitions. Semaphores
  • 6.
    What is wrongwith Mutexes? • Mutexes are perfectly fine, but you have a problem if there is lock contention – If you want your algorithm to be fast, you want to use the available cores as much as possible instead of letting them sleep – A thread can hold a mutex and be de-scheduled by the CPU (because of a cache miss or its time slice is over), then all the threads that want to acquire this mutex will be blocked – And if you have a lot of blocking, the OS also needs to do more context switches which are expensive because they clear the caches 6 Reference: http://woboq.com/blog/introduction -to-lockfree-programming.html
  • 7.
    What is wrongwith Mutexes? • Problems with locking – Deadlock – Priority Inversion • Low-priority processes hold a lock required by a higher priority process – Convoying • All the other processes slow to the speed of the slowest one – Async-signal-safety • Signal handlers can’t use lock-based primitives – Kill-tolerant availability • What happens if threads are killed/crash while holding locks – Pre-emption tolerance • What happens if you’re pre-empted holding a lock – Overall performance 7 Reference: http://www.cs.cmu.edu/~410- s05/lectures/L31_LockFree.pdf
  • 8.
    So how canwe do it without locking? • Lock-free Programming – Thread-safe access to shared data without the use of synchronization primitives such as mutexes – Practical with hardware support • Modern CPUs have something called atomic operations • The use of shared memory and an atomic instruction provides the mutual exclusion 8
  • 9.
    Atomic operation • Atomicoperation – Processors have instructions that can be used to implement lock- free and wait-free algorithms • Atomic read-write • Atomic swap, also called XCHG • Test-and-set • Fetch-and-add • Compare-and-swap (CAS) – Compare and Exchange (CMPXCHG) instruction in the x86 and Itanium architectures – ABA problem » http://woboq.com/blog/introduction-to-lockfree-programming.html 9 Reference: http://en.wikipedia.org/wiki/Atomic_operation http://en.wikipedia.org/wiki/Read-modify-write
  • 10.
Atomic operation • Load-Link/Store-Conditional – The LDREX and STREX instructions in ARM split the operation of atomically updating memory into two separate steps. Together, they provide atomic updates in conjunction with exclusive monitors that track exclusive memory accesses. Load-Exclusive and Store-Exclusive must only access memory regions marked as Normal – For example » LDREX R1, [R0] performs a Load-Exclusive from the address in R0, places the value into R1 and updates the exclusive monitor(s). » STREX R2, R1, [R0] performs a Store-Exclusive operation to the address in R0, conditionally storing the value from R1 and indicating success or failure in R2. 10 Reference: http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/ch01s02s01.html http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/CJAGCFAF.html Exclusive accesses to memory locations marked as Non-shareable are checked only against this local monitor. Exclusive accesses to memory locations marked as Shareable are checked against both the local monitor and the global monitor.
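Portable C has no direct LL/SC intrinsic, but on ARMv7 compilers typically lower `atomic_compare_exchange_weak` to an LDREX/STREX pair, which is why the *weak* variant (allowed to fail spuriously, like a failed STREX) belongs in a loop. A hedged sketch:

```c
#include <stdatomic.h>

/* Atomic increment written the way an LDREX/STREX loop works:
 * load the old value, try to publish old + 1, and retry if the
 * exclusive store was "lost" to another observer. */
int fetch_inc_llsc_style(atomic_int *p)
{
    int old = atomic_load_explicit(p, memory_order_relaxed);
    /* compare_exchange_weak may fail spuriously -- exactly the
     * behaviour of STREX reporting failure -- so always loop. */
    while (!atomic_compare_exchange_weak(p, &old, old + 1)) {
        /* `old` has been refreshed with the current value. */
    }
    return old;
}
```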
  • 11.
Atomic operation • GCC Built-in functions for atomic memory access – http://gcc.gnu.org/onlinedocs/gcc-4.6.3/gcc/Atomic-Builtins.html • Atomic operations supported in Linux Kernel – https://www.kernel.org/doc/Documentation/atomic_ops.txt • Atomic operations supported in C11/C++11 – C11 defines a new _Atomic() type specifier. You can declare an atomic integer like this: _Atomic(int) counter; – C++11 moves this declaration into the standard library: #include <atomic> std::atomic<int> counter; 11 Reference: http://www.informit.com/articles/article.aspx?p=1832575
  • 12.
Atomic operation • Is an atomic operation enough? • Linux-v3.7.8/arch/arm/include/asm/atomic.h 12

static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
{
	unsigned long oldval, res;

	smp_mb();

	do {
		__asm__ __volatile__("@ atomic_cmpxchg\n"
		"ldrex	%1, [%3]\n"
		"mov	%0, #0\n"
		"teq	%1, %4\n"
		"strexeq %0, %5, [%3]\n"
		    : "=&r" (res), "=&r" (oldval), "+Qo" (ptr->counter)
		    : "r" (&ptr->counter), "Ir" (old), "r" (new)
		    : "cc");
	} while (res);

	smp_mb();

	return oldval;
}

Reference: http://lxr.linux.no/#linux+v3.7.8/arch/arm/include/asm/atomic.h#L115 Before talking about memory barriers, let’s look at memory ordering first. Memory barrier
  • 13.
  • 14.
Memory ordering • Memory ordering - memory access ordering – Program order • the order of the program’s object code as seen by the CPU, which might differ from the order in the source code due to compiler optimizations – Execution order • It can differ from program order due to both compiler and CPU implementation optimizations – Perceived order • It can differ from the execution order due to caching, interconnect, and memory-system optimizations • Why memory reordering? – Performance! 14 Reference: http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf http://preshing.com/20120930/weak-vs-strong-memory-models
  • 15.
Memory consistency models • Memory models – memory consistency models • Sequential consistency – all reads and all writes are in-order • Relaxed consistency – Some types of reordering are allowed • Loads can be reordered after loads (for better working of cache coherency, better scaling) • Loads can be reordered after stores • Stores can be reordered after stores • Stores can be reordered after loads • Weak consistency – Reads and writes are arbitrarily reordered, limited only by explicit memory barriers 15
  • 16.
Weak vs. Strong memory model 16 Reference: http://preshing.com/20120930/weak-vs-strong-memory-models
  • 17.
Memory ordering in some architectures 17 SPARC TSO = total-store order (default) SPARC RMO = relaxed-memory order (not supported on recent CPUs) SPARC PSO = partial store order (not supported on recent CPUs) Type Alpha ARMv7 PA-RISC POWER SPARC RMO SPARC PSO SPARC TSO x86 x86 oostore AMD64 IA-64 zSeries Loads reordered after loads Y Y Y Y Y Y Y Loads reordered after stores Y Y Y Y Y Y Y Stores reordered after stores Y Y Y Y Y Y Y Y Stores reordered after loads Y Y Y Y Y Y Y Y Y Y Y Y Atomic reordered with loads Y Y Y Y Y Atomic reordered with stores Y Y Y Y Y Y Dependent loads reordered Y Incoherent instruction cache pipeline Y Y Y Y Y Y Y Y Y Y Reference: http://en.wikipedia.org/wiki/Memory_ordering
  • 18.
Types of Memory Barrier • #LoadLoad • #StoreStore • #LoadStore • #StoreLoad – A StoreLoad barrier ensures that all stores performed before the barrier are visible to other processors, and that all loads performed after the barrier receive the latest value that is visible at the time of the barrier 18 Reference: http://preshing.com/20120710/memory-barriers-are-like-source-control-operations if (IsPublished) // Load and check shared flag { LOADLOAD_FENCE(); // Prevent reordering of loads return Value; // Load published value } Value = x; // Publish some data STORESTORE_FENCE(); IsPublished = 1; // Set shared flag to indicate availability of data
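The publish/consume pattern above can be sketched with C11 fences: `atomic_thread_fence(memory_order_release)` plays the role of the StoreStore fence and the acquire fence plays the LoadLoad fence. The variable names follow the slide; the rest is illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int Value;                  /* the payload     */
static atomic_bool IsPublished;    /* the shared flag */

void publish(int x)
{
    Value = x;                     /* publish some data */
    /* StoreStore: the write to Value must become visible
     * before the flag is seen as set. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&IsPublished, true, memory_order_relaxed);
}

bool try_consume(int *out)
{
    if (atomic_load_explicit(&IsPublished, memory_order_relaxed)) {
        /* LoadLoad: do not read Value before the flag check. */
        atomic_thread_fence(memory_order_acquire);
        *out = Value;
        return true;
    }
    return false;
}
```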
  • 19.
Memory barrier in compiler • GCC compiler memory barrier – These barriers prevent a compiler from reordering instructions; they do not prevent reordering by the CPU. • GCC support for hardware memory barriers – This builtin issues a full memory barrier. 19 Reference: http://en.wikipedia.org/wiki/Memory_ordering http://gcc.gnu.org/onlinedocs/gcc-4.6.3/gcc/Atomic-Builtins.html asm volatile("" ::: "memory"); or __asm__ __volatile__ ("" ::: "memory"); __sync_synchronize();
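As a hedged sketch, the two kinds of barrier look like this in GCC-flavoured C (the wrapper names are invented for illustration):

```c
/* Compiler-only barrier: forbids the compiler from moving
 * memory accesses across this point, but emits no instruction,
 * so the CPU may still reorder. */
static inline void compiler_barrier(void)
{
    __asm__ __volatile__("" ::: "memory");
}

/* Full hardware barrier: GCC emits a real fence instruction
 * (e.g. MFENCE on x86, DMB on ARM). */
static inline void hardware_barrier(void)
{
    __sync_synchronize();
}
```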
  • 20.
Memory barriers in Linux kernel • General barrier – barrier() • Compiler barrier only. The compiler will not reorder memory accesses from one side of this statement to the other. This has no effect on the order in which the processor actually executes the generated instructions. • Mandatory barriers – mb() • A full system memory barrier. All memory operations before the mb() in the instruction stream will be committed before any operations after the mb() are committed. This ordering will be visible to all bus masters in the system. It will also ensure the order in which accesses from a single processor reach slave devices. – rmb() • Like mb(), but only guarantees ordering between read accesses. That is, all read operations before an rmb() will be committed before any read operations after the rmb(). – wmb() • Like mb(), but only guarantees ordering between write accesses. That is, all write operations before a wmb() will be committed before any write operations after the wmb(). 20 Reference: http://blogs.arm.com/software-enablement/448-memory-access-ordering-part-2-barriers-and-the-linux-kernel/ http://www.kernel.org/doc/Documentation/memory-barriers.txt
  • 21.
Memory barriers in Linux kernel • SMP conditional barriers – smp_mb() • Similar to mb(), but only guarantees ordering between cores/processors within an SMP system. All memory accesses before the smp_mb() will be visible to all cores within the SMP system before any accesses after the smp_mb(). – smp_rmb() • Like smp_mb(), but only guarantees ordering between read accesses. – smp_wmb() • Like smp_mb(), but only guarantees ordering between write accesses. – SMP barriers are a subset of mandatory barriers, not a superset. • An SMP barrier cannot replace a mandatory barrier, but a mandatory barrier can replace an SMP barrier. • Implicit barriers – Locking constructs in the kernel act as implicit SMP barriers, in the same way as pthread synchronization operations do in user space. – I/O accessor macros (readb(), iowrite32()) for the ARM architecture act as explicit memory barriers when the kernel is compiled with CONFIG_ARM_DMA_MEM_BUFFERABLE. This was added in linux-2.6.35. • arch/arm/include/asm/io.h • arch/arm/mm/Kconfig 21 Reference: https://www.kernel.org/doc/Documentation/memory-barriers.txt
  • 22.
  • 23.
Memory ordering in ARM Architecture • Memory types – Normal memory • Normal memory is effectively for all of your data and executable code • This memory type permits speculative reads, merging of accesses and repeating of reads without side effects • Accesses to Normal memory can always be buffered, and in most situations they are also cached - but they can be configured to be uncached • There is no implicit ordering of Normal memory accesses – Device memory and Strongly-ordered memory • Used with memory-mapped peripherals or other control registers • Processors implementing the LPAE treat Device and Strongly-ordered memory regions identically • ARMv7-A processors that do not implement the LPAE can set device memory to be Shareable or Non-shareable • Accesses to these types of memory must happen exactly the number of times that executing the program suggests they should • There is no guarantee about ordering between memory accesses to different devices, or usually between accesses of different memory types 23 Reference: http://blogs.arm.com/software-enablement/594-memory-access-ordering-part-3-memory-access-ordering-in-the-arm-architecture/
  • 24.
Memory ordering in ARM Architecture • Arrangement of ARM memory types – Normal • Shareable or Non-shareable • Cacheable or Non-cacheable – Device (w/o LPAE) • Shareable or Non-shareable – Device (w/ LPAE) • Always shareable – Strongly-ordered • Always shareable • Has to wait for the slave’s access ACK 24 ARM® Architecture Reference Manual, ARMv7-A and ARMv7-R edition
  • 25.
Memory ordering in ARM Architecture • Figure A3-5 shows the memory ordering between two explicit accesses A1 and A2, where A1 occurs before A2 in program order – “<”: accesses must arrive at any particular memory-mapped peripheral or block of memory in program order, that is, A1 must arrive before A2. There are no ordering restrictions on when accesses arrive at different peripherals or blocks of memory. – “–”: accesses can arrive at any memory-mapped peripheral or block of memory in any order. 25
  • 26.
Memory ordering in ARM Architecture • Barriers – Barriers were introduced progressively into the ARM architecture • Some ARMv5 processors, such as the ARM926EJ-S, implemented a Drain Write Buffer cp15 operation, which halted execution until any buffered writes had drained into the external memory system • With the introduction of the ARMv6 memory model, this operation was redefined in more architectural terms and became the Data Synchronization Barrier – ARMv6 also introduced the new Data Memory Barrier and Flush Prefetch Buffer cp15 operations • ARMv7 evolved the memory model somewhat, extending the meaning of the barriers - and the Flush Prefetch Buffer operation was renamed the Instruction Synchronization Barrier • ARMv7 also allocated dedicated instruction encodings for the barrier operations – Use of the cp15 operations is now deprecated, and software targeting ARMv7 or later should use the DMB, DSB and ISB mnemonics. • And finally, ARMv7 extended the Shareability concept to cover both Inner-shareable and Outer-shareable domains – This, together with AMBA4 ACE, gives us barriers that propagate into the memory system 26
  • 27.
Memory ordering in ARM Architecture – Instruction Synchronization Barrier (ISB) • The ISB ensures that any subsequent instructions are fetched anew from cache, so that privilege and access permissions are checked against the current MMU configuration – It is used to ensure that any previously executed context-changing operations will have completed by the time the ISB completes • Access type and domain are not really relevant for this barrier – It is not used in any of the Linux memory barrier primitives, but appears in memory management, cache control and context switching code 27
  • 28.
Memory ordering in ARM Architecture – Data Memory Barrier (DMB) • DMB prevents reordering of data access instructions across itself – All data accesses by this processor/core before the DMB will be visible to all other masters within the specified shareability domain before any of the data accesses after it – It also ensures that any explicit preceding data/unified cache maintenance operations have completed before any subsequent data accesses are executed – The DMB instruction takes two optional parameters: an operation type (stores only - 'ST' - or loads and stores) and a domain – The default operation type is loads and stores, and the default domain is System • In the Linux kernel, the DMB instruction is used for the smp_*mb() macros 28
  • 29.
Memory ordering in ARM Architecture – Data Synchronization Barrier (DSB) • DSB enforces the same ordering as the Data Memory Barrier – But it also blocks execution of any further instructions until synchronization is complete – It also waits until all cache and branch predictor maintenance operations have completed for the specified shareability domain – If the access type is load and store, then it also waits for any TLB maintenance operations to complete • In the Linux kernel, the DSB instruction is used for the *mb() macros. 29
  • 30.
Domain Abbreviation Description Non-shareable NSH A domain consisting only of the local agent. Accesses that never need to be synchronized with other cores, processors or devices. Not normally used in SMP systems. Inner Shareable ISH A domain potentially shared by multiple agents, but usually not all agents in the system. A system can have multiple Inner Shareable domains. An operation that affects one Inner Shareable domain does not affect other Inner Shareable domains in the system. Outer Shareable OSH A domain almost certainly shared by multiple agents, and quite likely consisting of several Inner Shareable domains. An operation that affects an Outer Shareable domain also implicitly affects all Inner Shareable domains within it. For processors such as the Cortex-A15 MPCore that implement the LPAE, all Device memory accesses are considered Outer Shareable. For other processors, the shareability attribute can be explicitly set (to shareable or non-shareable). Full system SY An operation on the full system affects all agents in the system: all Non-shareable regions, all Inner Shareable regions and all Outer Shareable regions. Simple peripherals such as UARTs, and several more complex ones, do not normally need to be placed in a restricted shareability domain. Memory ordering in ARM Architecture • Shareability domains – Shareability domains define "zones" within the bus topology within which memory accesses are to be kept consistent (taking place in a predictable way) and potentially coherent (with hardware support) – Outside of this domain, observers might not see the same order of memory accesses as inside it 30 Reference: http://infocenter.arm.com/help/topic/com.arm.doc.dui0489c/CIHGHHIE.html ARMv7
  • 31.
Memory ordering in ARM Architecture 31 Allocated values for the data barriers (DMB/DSB), ARMv8
  • 32.
Memory ordering in ARM Architecture • An example of shareability domains 32 4 cores per cluster, 2 clusters per chip
  • 33.
  • 34.
Memory model supported in C++11 • C++ memory model – Sequentially consistent/acquire-release/relaxed • http://en.cppreference.com/w/cpp/atomic/memory_order • http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html 34
  • 35.
Acquire and Release Semantics • ARMv8 AArch64/AArch32 support load-acquire/store-release instructions – The Load-Acquire/Store-Release instructions can remove the requirement to use the explicit DMB memory barrier instruction 35 Reference: http://preshing.com/20120913/acquire-and-release-semantics http://www.arm.com/files/downloads/ARMv8_Architecture.pdf Acquire semantics is a property which can only apply to operations which read from shared memory. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation which follows it in program order. Release semantics is a property which can only apply to operations which write to shared memory. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order.
  • 36.
Acquire and Release Semantics • A demo example 36 //Shared global variables int A = 0; int Ready = 0; //Thread 1 A = 42; Ready = 1; //Thread 2 int r1 = Ready; int r2 = A; //Possible results: (r1 = 0, r2 = 0), (r1 = 0, r2 = 42), (r1 = 1, r2 = 0), (r1 = 1, r2 = 42) //Shared global variables int A = 0; atomic<int> Ready = 0; //Thread 1 A = 42; Ready.store(1, memory_order_release); //Thread 2 int r1 = Ready.load(memory_order_acquire); int r2 = A; //Possible results: (r1 = 0, r2 = 0), (r1 = 0, r2 = 42), (r1 = 1, r2 = 42); (r1 = 1, r2 = 0) is no longer possible
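A hedged, runnable C11 version of the demo above (using pthreads; the function names are invented): with the release/acquire pair, a consumer that observes Ready == 1 is guaranteed to also observe A == 42.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <assert.h>

static int A;
static atomic_int Ready;

static void *producer(void *arg)
{
    (void)arg;
    A = 42;                                                 /* plain store   */
    atomic_store_explicit(&Ready, 1, memory_order_release); /* write-release */
    return 0;
}

static void *consumer(void *arg)
{
    (void)arg;
    /* Spin until the flag is visible; the acquire load makes the
     * earlier plain store to A visible as well. */
    while (atomic_load_explicit(&Ready, memory_order_acquire) == 0)
        ;
    assert(A == 42);   /* guaranteed by the release/acquire pairing */
    return 0;
}

int run_demo(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, 0, producer, 0);
    pthread_create(&t2, 0, consumer, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return A;
}
```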
  • 37.
Acquire and Release Semantics • A Write-Release Can Synchronize-With a Read-Acquire 37 // Thread 1 void SendTestMessage(void* param) { // Copy to shared memory using non-atomic stores. g_payload.tick = clock(); g_payload.str = "TestMessage"; g_payload.param = param; // Perform an atomic write-release to indicate that the message is ready. g_guard.store(1, std::memory_order_release); } // Thread 2 bool TryReceiveMessage(Message& result) { // Perform an atomic read-acquire to check whether the message is ready. int ready = g_guard.load(std::memory_order_acquire); if (ready != 0) { // Yes. Copy from shared memory using non-atomic loads. result.tick = g_payload.tick; result.str = g_payload.str; result.param = g_payload.param; return true; } // No. return false; } Reference: http://preshing.com/20130823/the-synchronizes-with-relation/
  • 38.
  • 39.
Volatile vs. memory-order/atomic • What does the volatile keyword mean? 39 Reference: http://www.drdobbs.com/parallel/volatile-vs-volatile/212701484
  • 40.
Volatile vs. memory-order/atomic • C programmers have often taken volatile to mean that the variable could be changed outside of the current thread of execution – as a result, they are sometimes tempted to use it in kernel code when shared data structures are being used – In other words, they have been known to treat volatile types as a sort of easy atomic variable, which they are not – The use of volatile in kernel code is almost never correct • The key point to understand with regard to volatile is that its purpose is to suppress optimization, which is almost never what one really wants to do • In the kernel, one must protect shared data structures against unwanted concurrent access, which is very much a different task • Like volatile, the kernel primitives which make concurrent access to data safe (spinlocks, mutexes, memory barriers, etc.) are designed to prevent unwanted optimization. If they are being used properly, there will be no need to use volatile as well 40 Reference: https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
  • 41.
Volatile vs. memory-order/atomic • To safely write lock-free code that communicates between threads without using locks – prefer to use ordered atomic variables – Java/.NET volatile, C++0x atomic<T>, and C-compatible atomic_T • To safely communicate with special hardware or other memory that has unusual semantics – use un-optimizable variables: ISO C/C++ volatile – Remember that reads and writes of these variables are not necessarily atomic • To protect shared data structures against unwanted concurrent access in kernel code – use kernel concurrent access primitives, like spinlocks, mutexes and memory barriers • Finally, to express a variable that both has unusual semantics and has any or all of the atomicity and/or ordering guarantees needed for lock-free coding – only the ISO C++11 Standard provides a direct way to spell it: volatile atomic<T> 41
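To make the distinction concrete, here is a hedged sketch: both variables below end up with the same value when used from one thread, but only the `_Atomic` one may be safely incremented from several threads, because the `volatile` increment is still a non-atomic load, add, and store.

```c
#include <stdatomic.h>

/* volatile: the compiler must not optimize the accesses away,
 * but ++ is still three separate steps (load, add, store). */
static volatile int vol_counter;

/* _Atomic: ++ compiles to a single atomic read-modify-write. */
static _Atomic int atomic_counter;

void bump_both(void)
{
    vol_counter++;     /* racy if called from multiple threads */
    atomic_counter++;  /* safe from multiple threads           */
}
```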
  • 42.
USAGE OF MEMORY BARRIER 42
  • 43.
Usage of memory barrier instructions • In what situations might I need to insert memory barrier instructions? – Mutexes 43 Reference: http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf http://infocenter.arm.com/help/topic/com.arm.doc.faqs/ka14041.html

LOCKED   EQU 1
UNLOCKED EQU 0

lock_mutex
    ; Is mutex locked?
    LDREX   r1, [r0]        ; Check if locked
    CMP     r1, #LOCKED     ; Compare with "locked"
    WFEEQ                   ; Mutex is locked, go into standby
    BEQ     lock_mutex      ; On waking, re-check the mutex
    ; Attempt to lock mutex
    MOV     r1, #LOCKED
    STREX   r2, r1, [r0]    ; Attempt to lock mutex
    CMP     r2, #0x0        ; Check whether store completed
    BNE     lock_mutex      ; If store failed, try again
    DMB                     ; Required before accessing protected resource
    BX      lr

unlock_mutex
    DMB                     ; Ensure accesses to protected resource have completed
    MOV     r1, #UNLOCKED   ; Write "unlocked" into lock field
    STR     r1, [r0]
    DSB                     ; Ensure update of the mutex occurs before other CPUs wake
    SEV                     ; Send event to other CPUs; wakes any CPU waiting with WFE
    BX      lr
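A hedged C11 analogue of this ARM mutex (the function names are invented): `atomic_flag_test_and_set` with acquire ordering plays the LDREX/STREX loop plus the first DMB, and the release-ordered clear plays the DMB before the unlocking store.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* test_and_set returns the previous value: true means the
     * lock was already held, so keep spinning.  Acquire ordering
     * keeps accesses to the protected resource after the lock. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;  /* a real implementation would yield or WFE here */
}

void spin_unlock(void)
{
    /* Release ordering makes all accesses inside the critical
     * section visible before the lock is seen as free. */
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

/* Returns true if the lock was free and is now held. */
bool spin_trylock(void)
{
    return !atomic_flag_test_and_set_explicit(&lock, memory_order_acquire);
}
```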
  • 44.
Usage of memory barrier instructions – Memory Remapping • Consider a situation where your reset handler/boot code lives in Flash memory (ROM), which is aliased to address 0x0 to ensure that your program boots correctly from the vector table, which normally resides at the bottom of memory (see left-hand-side memory map). • After you have initialized your system, you may wish to turn off the Flash memory alias so that you can use the bottom portion of memory for RAM (see right-hand-side memory map). The following code (running from the permanent Flash memory region) disables the Flash alias, before calling a memory block copying routine (e.g., memcpy) to copy some data to the bottom portion of memory (RAM). 44

MOV r0, #0
MOV r1, #REMAP_REG
STR r0, [r1]              ; Disable Flash alias
BL  block_copy_routine()  ; Block copy code into RAM
BL  copied_routine()      ; Execute copied routine (now in RAM)

DMB ; Ensure completion of the STR above with DMB
DSB ; Ensure the block copy is completed with DSB
ISB ; Ensure a pipeline flush with ISB

Question
  • 45.
Usage of memory barrier instructions – Self-modifying code – If the memory you are performing the block copying routine on is marked as 'cacheable', the instruction cache will need to be invalidated so that the processor does not execute any other 'cached' code. – For "write-back" regions the data cache must be cleaned before the instruction cache invalidate. 45

Overlay_manager
    ; ...
    BL block_copy      ; Copy new routine from ROM to RAM
    B  relocated_code  ; Branch to new routine

Overlay_manager
    ; ...
    BL block_copy               ; Copy new routine from ROM to RAM
    data_cache_clean            ; Clean the cache so that the new routine is written out to memory
    icache_and_pb_invalidate    ; Invalidate the instruction cache and branch predictor so that the
                                ; old routine is no longer cached
    B  relocated_code           ; Branch to new routine

DSB ; Ensure block copy has completed
ISB ; Flush pipeline to ensure processor fetches new instructions
DSB ; Ensure data cache clean has completed
DSB ; Ensure invalidate has completed
ISB ; Flush pipeline to ensure processor fetches new instructions

Editor's Notes

  • #13 http://lists.infradead.org/pipermail/linux-arm-kernel/2010-July/019912.html http://home.deib.polimi.it/silvano/FilePDF/ARC-MULTIMEDIA/ARM_IEEEComputer_July2005_01463106.pdf
  • #18 There can be incoherent instruction cache pipeline, which prevent self-modifying code to be executed without special ICache flush/reload instructions.
  • #22 Why do device I/O access operations need memory barriers?
  • #25 Device: need not to wait for slave’s access ACK Strongly-ordered: have to wait slave’s access ACK
  • #46 Write through, memory non-cacheable Write back write allocate, memory cacheable