Here are the key points about the C++11 memory model and ordering:
- The C++ memory model aims to balance performance and correctness for concurrent programs: it permits aggressive optimization while giving well-defined semantics to race-free programs (a data race is undefined behavior).
- Operations on atomic types carry memory-ordering properties that restrict how they may be reordered with respect to operations observed by other threads.
- A release fence prevents earlier writes from being reordered after the fence; an acquire fence prevents later reads from being reordered before the fence.
- For the code snippet shown, a thread that reads flag must be guaranteed to see the write to data. This requires an acquire fence after loading flag, so that the subsequent load of data cannot be reordered before the load of flag.
So the correct answer is that it needs an acquire fence after loading flag.
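A minimal sketch of this pattern (the names data and flag, the stored values, and the pairing of std::atomic_thread_fence with relaxed atomic accesses are assumptions for illustration; the original snippet is not reproduced here):

    #include <atomic>
    #include <thread>

    int data = 0;                  // plain, non-atomic payload
    std::atomic<int> flag{0};      // guard variable

    void writer() {
        data = 42;                                            // write the payload
        std::atomic_thread_fence(std::memory_order_release);  // release fence: the write to data cannot move below it
        flag.store(1, std::memory_order_relaxed);             // publish
    }

    void reader() {
        while (flag.load(std::memory_order_relaxed) != 1) {}  // spin until published
        std::atomic_thread_fence(std::memory_order_acquire);  // acquire fence: the read of data cannot move above it
        int r = data;                                         // guaranteed to observe 42
        (void)r;
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }

The release fence synchronizes with the acquire fence because the relaxed load of flag reads the value written by the relaxed store that follows the release fence; that synchronization is what makes the write to data visible to the reader.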
4. Memory model (Consistency Model)
• "The memory model specifies the allowed behavior of multithreaded programs executing with shared memory."[1]
• "Consistency (the memory model) provides rules about loads and stores and how they act upon memory."[1]
A contract between software and hardware.
5. Does the computer execute the program you wrote?
NO!
Source code -> compiler -> processor -> caches -> execution
7. How dare they change my code!
• The program that actually executes is not literally the program you wrote.
• Transformations are applied to improve performance,
• as long as they have the same observable effects.
8. Optimizations
// original:
Z = 3
Y = 2
X = 1
// use X, Y, Z

// after reordering (same effect):
X = 1
Y = 2
Z = 3
// use X, Y, Z

// original:
X = 1
Y = 2
X = 3
// use X and Y

// after dead-store elimination (X = 1 is never read):
Y = 2
X = 3
// use X and Y
// original: column-wise traversal, poor cache locality
for (i = 0; i < cols; ++i)
    for (j = 0; j < rows; ++j)
        a[j*cols + i] += 42;

// after loop interchange: row-wise traversal, cache friendly
for (j = 0; j < rows; ++j)
    for (i = 0; i < cols; ++i)
        a[j*cols + i] += 42;
Optimizations are ubiquitous: the compiler and the processor will do whatever they see fit to your code to improve performance.
9. Memory model from HW's perspective
Shared-memory support in multicore computer systems is the source of all these difficulties.
10. Memory architecture
• The effect of a memory operation.
[Diagram: a single core attached to memory, versus Core1, Core2, and Core3 sharing one memory]
Accesses to memory are serialized.
11. Cache (and store buffer)
[Diagram: Core 1 and Core 2, each with a store buffer and a cache, in front of shared memory]
Two issues arise:
a. Coherence (invisible to software).
b. Consistency.
How should stores and loads to memory be ordered?
core 1:
S1: store data = d1
S2: store flag = d2
core 2:
L1: load r1 = flag
B1: if (r1 != d2) goto L1
L2: load r2 = data
Key point:
Writes are not automatically visible to other cores.
Reads and writes are not necessarily performed in program order.
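A minimal C++11 sketch of this two-core pattern (a sketch under assumptions: the names g_data and g_flag are illustrative, and every operation is deliberately relaxed to expose the problem):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> g_data{0};
std::atomic<int> g_flag{0};

void producer() {
    g_data.store(42, std::memory_order_relaxed);  // S1
    g_flag.store(1, std::memory_order_relaxed);   // S2: may become visible before S1
}

void consumer() {
    while (g_flag.load(std::memory_order_relaxed) != 1) { } // L1/B1: spin on the flag
    // L2: with relaxed ordering, printing 0 here is a permitted outcome.
    std::printf("data = %d\n", g_data.load(std::memory_order_relaxed));
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}

On x86 this will almost always print 42, but on weakly ordered hardware (and under compiler reordering) the relaxed version is allowed to print 0.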
12. Program order & memory order
• Program order: the order of operations as written in the program.
  What the programmer wants.
• Memory order: the order in which the corresponding operations are performed with respect to memory.
  The observed order.
13. Sequential consistency
Program order is the same as memory order for every single thread:
If L(a) <p L(b) then L(a) <m L(b)
If L(a) <p S(b) then L(a) <m S(b)
If S(a) <p S(b) then S(a) <m S(b)
If S(a) <p L(b) then S(a) <m L(b)
Every load gets its value from the last store before it in memory order.
[Diagram: the stores and loads of all cores interleaved into one total order]
Simple & easy to program with.
Performance optimizations, however, are constrained.
14. Total store order (TSO)
Also known as "processor consistency"; used in x86/64, SPARC, etc.
If L(a) <p L(b) then L(a) <m L(b)
If L(a) <p S(b) then L(a) <m S(b)
If S(a) <p S(b) then S(a) <m S(b)
If S(a) <p L(b), then S(a) <m L(b) is NOT guaranteed: a later load may be performed while the earlier store still sits in the store buffer.
Every load gets its value from the last store before it in memory order or in program order (store-buffer forwarding).
[Diagram: a store delayed in the store buffer while a later load completes]
Need a fence to accomplish SC.
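A sketch of the classic store-buffering litmus test that motivates this (the names x, y, r1, r2 are illustrative, not from the slides):

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t0() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // full fence between store and load
    r1 = y.load(std::memory_order_relaxed);
}

void t1() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}

Without the fences, r1 == 0 && r2 == 0 is an observable outcome on TSO machines: each core's load may complete while its own store is still in the store buffer. With the two seq_cst fences that outcome is forbidden.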
15. Memory fence
• Independent memory operations can effectively be performed in an arbitrary order.
• We need a way to instruct the compiler and processor to restrict that reordering.
A memory fence is a per-CPU intervention:
• A fence is not guaranteed to have any effect on other CPUs.
• A fence does not guarantee in what order other CPUs will see the operations.
16. Release consistency
• Provides 2 types of operations (fences):
a) acquire operation: memory operations after it are not allowed to move up across it.
b) release operation: memory operations before it are not allowed to move down across it.
Key observations:
An acquire operation indicates the start of a critical section.
A release operation indicates the end of a critical section.
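A minimal spinlock sketch showing how acquire and release bracket a critical section (an illustration, not code from the slides):

#include <atomic>

class spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Acquire: operations inside the critical section cannot move above this.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin until the holder releases the lock
        }
    }
    void unlock() {
        // Release: operations inside the critical section cannot move below this.
        locked.store(false, std::memory_order_release);
    }
};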
17. Memory model from SW's perspective
[Diagram: the software memory model layered above the hardware models: x86/64, PowerPC, ARM]
The other part of the contract, for software to obey.
18. Ordering
Down to earth: it is all about the side effects of your program's execution with respect to memory interaction.
a) Memory operations in program order are not necessarily performed in the same memory order.
b) Fences are used to prevent unwanted reordering.
19. How does ordering matter?
1: load(g_y)
2: load(g_x)
3: store(g_x)
4: store(g_y)
Non-deterministic reordering makes a program nearly impossible to reason about.
21. Is reordering that bad?
Yes. No.
It depends.
As long as we don't observe the reordering, it does not matter what happens underneath!
Hardware loves to reorder in order to optimize performance.
Software, however, needs SC to reason about correctness.
22. SC-DRF
• Fully sequential consistency: the ideal world.
  Execute exactly the code you wrote; what most programmers expect.
• SC-DRF: sequential consistency for data-race-free programs; the reality.
A compromise between software and hardware!
As long as you don't write code containing data races, the hardware guarantees you the illusion of fully sequential consistency.
23. Race condition
• A memory location is simultaneously accessed by two or more threads, and at least one thread is a writer.
• Key point: think of the access as a transaction, which needs
1) atomicity: no torn reads or torn writes;
2) visibility: side effects must propagate from thread to thread.
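A small sketch contrasting a racy counter with an atomic one (the names unsafe_counter and safe_counter are illustrative):

#include <atomic>
#include <thread>

int unsafe_counter = 0;            // plain int: concurrent ++ is a data race (undefined behavior)
std::atomic<int> safe_counter{0};  // atomic read-modify-write: no torn reads or writes

void worker() {
    for (int i = 0; i < 100000; ++i) {
        ++unsafe_counter;                                      // data race
        safe_counter.fetch_add(1, std::memory_order_relaxed);  // atomic increment
    }
}

int main() {
    std::thread a(worker), b(worker);
    a.join(); b.join();
    // safe_counter is exactly 200000; unsafe_counter can be anything.
}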
24. Critical section
• Race conditions occur only when we have to manipulate shared variables.
• Create a critical region to serialize the accesses:
  a way to implement a transaction.
25. Critical section
Accesses to shared variables execute inside the section, bracketed by an acquire fence at entry and a release fence at exit.
Good fences make good neighbors.
Reordering within the critical section? Fine, as long as operations don't move out of the section.
A full fence would work, but acquire and release operations are better because they constrain less.
26. C++11 atomic
• Operations on atomic types are performed atomically; they are synchronization operations.
• The user can specify the memory ordering for every load & store.
template <class T> struct atomic {
    bool is_lock_free() const noexcept;
    void store(T, memory_order = memory_order_seq_cst) noexcept;
    T load(memory_order = memory_order_seq_cst) const noexcept;
    T exchange(T, memory_order = memory_order_seq_cst) noexcept;
    bool compare_exchange_weak(T&, T, memory_order, memory_order) noexcept;
    bool compare_exchange_strong(T&, T, memory_order, memory_order) noexcept;
    bool compare_exchange_weak(T&, T, memory_order = memory_order_seq_cst) noexcept;
    bool compare_exchange_strong(T&, T, memory_order = memory_order_seq_cst) noexcept;
};
Synchronization operations specify how assignments in one thread become visible to another.
[C++ standard: 1.10.5]
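A sketch of the canonical compare-exchange loop built on this interface (the function atomic_double is a hypothetical example):

#include <atomic>

std::atomic<int> value{0};

void atomic_double() {
    int expected = value.load();
    // compare_exchange_weak may fail spuriously, so it is used in a loop;
    // on failure it reloads the currently stored value into `expected`.
    while (!value.compare_exchange_weak(expected, expected * 2)) {
        // retry with the freshly observed value in `expected`
    }
}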
27. C++11 memory order
namespace std {
    typedef enum memory_order {
        memory_order_relaxed, // no ordering constraint.
        memory_order_consume, // a weaker version of the acquire semantic.
        memory_order_acquire, // a load using this order is an acquire operation.
        memory_order_release, // a store using this order is a release operation.
        memory_order_acq_rel, // both, for RMW operations: e.g., exchange().
        memory_order_seq_cst  // sequential consistency: like memory_order_acq_rel,
                              // plus a single total order on all memory_order_seq_cst operations.
    } memory_order;
}
Note: these constraints are applied to reads and writes performed on the same memory location.
28. Acquire/release and Consume/release
atomic<int> guard(0);
int pay_load = 0;
// thread 0
pay_load = 1;
guard.store(1, memory_order_release);
// thread 1
int pay;
int g = guard.load(memory_order_acquire);
if (g) pay = pay_load;

atomic<int*> guard(nullptr);
int pay_load = 0;
// thread 0
pay_load = 1;
guard.store(&pay_load, memory_order_release);
// thread 1
int pay;
int* g = guard.load(memory_order_consume);
if (g) pay = *g;

g must carry a dependency into pay = *g: a data dependency.
On most weakly ordered architectures, memory ordering between data-dependent instructions is preserved; in such a case an explicit memory fence is not necessary.[7]
(Note: in practice, mainstream compilers currently implement memory_order_consume by promoting it to memory_order_acquire.)
29. memory_order_seq_cst
• Orders memory operations the same way as release and acquire,
• plus establishes a single total order on all memory_order_seq_cst operations.
Suppose x, y are atomic variables initialized to 0.[6]
Thread 1:
x = 1;
Thread 2:
y = 1;
Thread 3:
if (y == 1 && x == 0) cout << "y first";
Thread 4:
if (y == 0 && x == 1) cout << "x first";
It must not be possible for both messages to print.
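A runnable sketch of that four-thread example. With the default memory_order_seq_cst there is a single total order over all four operations, so at most one message prints; with only acquire/release, both could print on some architectures:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};

int main() {
    std::thread t1([]{ x.store(1); });  // seq_cst by default
    std::thread t2([]{ y.store(1); });
    std::thread t3([]{ if (y.load() == 1 && x.load() == 0) std::puts("y first"); });
    std::thread t4([]{ if (y.load() == 0 && x.load() == 1) std::puts("x first"); });
    t1.join(); t2.join(); t3.join(); t4.join();
}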
30. C++11 memory fence
extern "C" void atomic_thread_fence(memory_order order) noexcept;
• It is different from a traditional hardware fence.
• It is better thought of as a way to perform synchronization.

data.store(3, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
flag.store(1, std::memory_order_relaxed);
flag2.store(2, std::memory_order_relaxed);

The code above is NOT equivalent to the following:

data.store(3, std::memory_order_relaxed);
flag.store(1, std::memory_order_release);
flag2.store(2, std::memory_order_relaxed);

A release fence prevents all preceding memory operations from being reordered past all subsequent writes, so in the first version data.store() stays before both flag.store() and flag2.store(). In the second version only flag.store() is a release operation, so flag2.store() is allowed to be reordered before data.store().

// other memory operations preceding the fence.
std::atomic_thread_fence(std::memory_order_release);
flag.store(1, std::memory_order_relaxed);

An acquire fence prevents all subsequent memory operations from being reordered before any preceding read.

int f = flag.load(std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_acquire);
// other memory operations.
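Putting the two fences together gives the standard message-passing idiom (a sketch; data and flag are assumed to be std::atomic<int> initialized to 0):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data{0}, flag{0};

void writer() {
    data.store(3, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release); // keeps data.store above flag.store
    flag.store(1, std::memory_order_relaxed);
}

void reader() {
    while (flag.load(std::memory_order_relaxed) != 1) { }
    std::atomic_thread_fence(std::memory_order_acquire); // pairs with the release fence
    assert(data.load(std::memory_order_relaxed) == 3);   // guaranteed to see the write
}

int main() {
    std::thread w(writer), r(reader);
    w.join(); r.join();
}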
31. Quiz
Hint:
a. Need an acquire before the load of g_y in foo1().
b. Need an acquire before the load of g_x in foo2().
Can we accomplish that?
Acquire/release are pairwise operations.
State which memory orders are needed to prevent the reordering.
32. Quiz: Peterson's algo again.
atomic<int> g_victim;
atomic<bool> g_flag[2];
void lock1()
{
    g_flag[0].store(true, ?);
    g_victim.store(0, ?);
    while (g_flag[1].load(?) && g_victim.load(?) == 0);
    // lock acquired.
}
void unlock1()
{
    g_flag[0].store(false, ?);
}
Thread 0
Store(g_flag[0])
Store(g_victim)
Load(g_flag[1])
Load(g_victim)
Thread 1
Store(g_flag[1])
Store(g_victim)
Load(g_flag[0])
Load(g_victim)
atomic<int> g_victim;
atomic<bool> g_flag[2];
void lock1()
{
    g_flag[0].store(true, memory_order_relaxed);
    g_victim.exchange(0, memory_order_acq_rel);
    while (g_flag[1].load(memory_order_acquire)
           && g_victim.load(memory_order_relaxed) == 0);
    // lock acquired.
}
void unlock1()
{
    g_flag[0].store(false, memory_order_release);
}
Atomic read-modify-write operations shall always read the last value (in the modification order) written
before the write associated with the read-modify-write operation.[standard §29.3.12]
33. A few terms: synchronizes-with
• An operation A synchronizes-with an operation B if:
1) A is a store to some atomic variable m, with an ordering of std::memory_order_release or std::memory_order_seq_cst;
2) B is a load from the same variable m, with an ordering of std::memory_order_acquire or std::memory_order_seq_cst;
3) and B reads the value stored by A.
Thread 1:
Data = 42
Flag = 1
Thread 2:
R1 = Flag
if (R1 == 1) use Data
34. A few terms: dependency-ordered before
An operation A is dependency-ordered before an operation B if:
1) A is a store to some atomic variable m, with an ordering of std::memory_order_release or std::memory_order_seq_cst;
2) B is a load from the same variable m, with an ordering of std::memory_order_consume;
3) and B reads the value stored by the release sequence headed by A.
Thread 1:
Data = 42
Flag = &Data
Thread 2:
R1 = Flag
if (R1) use *R1
35. A few terms: happens-before
An evaluation A happens-before B if:
• A is sequenced-before B (the order of evaluations within a single thread),
• or A synchronizes-with B,
• or A is dependency-ordered before B,
• or through concatenations of the above 3 relationships, with 2 exceptions. [standard 1.10.11]
Happens-before indicates visibility.
36. volatile
• A compiler-level semantic only:
  the compiler guarantees no reordering and no elimination of accesses to this variable;
  other threads get no such guarantee;
  it has nothing to do with inter-thread synchronization.
• volatile accesses are not atomic operations.
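A sketch contrasting volatile with std::atomic for thread signaling (the names v_ready, a_ready, and payload are illustrative):

#include <atomic>

volatile bool v_ready = false;    // forces the compiler to emit each access,
                                  // but gives no atomicity and no inter-thread ordering
std::atomic<bool> a_ready{false}; // the correct tool for signaling between threads

int payload = 0;

void publish() {
    payload = 42;
    // v_ready = true;  // wrong: no happens-before edge, and a data race with the reader
    a_ready.store(true, std::memory_order_release); // publishes payload
}

void consume_payload() {
    while (!a_ready.load(std::memory_order_acquire)) { }
    // payload == 42 is guaranteed to be visible here.
}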