SlideShare a Scribd company logo
1 of 68
Download to read offline
An Introduction to the
Formalised Memory Model
for Linux Kernel
SeongJae Park <sj38.park@gmail.com>
I, SeongJae Park
● SeongJae Park <sj38.park@gmail.com>
● Contributing to the Linux kernel just for fun and profit since 2012
● Developing GCMA (https://github.com/sjp38/linux.gcma)
● Maintaining Korean translation of Linux kernel memory barrier document
○ The translation has merged into mainline since v4.9-rc1
○ https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/ko_KR/memory-barriers.txt?h=v4.9-rc1
Programmers in Multi-core Land
● Processor vendors changed their mind to increase number of cores instead of
clock speed a decade ago
○ Now, multi-core system is prevalent
○ Even octa-core mobile phone in your pocket, maybe?
● As a result, the free lunch is over;
parallel programming is essential for high performance and scalability
http://www.gotw.ca/images/CPU.png
GIVE UP :’(
Writing Correct Parallel Program is Hard
● Compilers and processors are heavily optimized for Instructions Per Cycle,
not programmer perspective goals such as response time or throughput of
meaningful (in people’s context) progress
● Nature of parallelism is counter-intuitive
○ Time is relative, before and after is ambiguous, even simultaneous available
● C language developed with Uni-Processor assumption
■ “Et tu, C?”
CPU 0 CPU 1
X = 1;
Y = 1;
X = 2;
Y = 2;
assert(Y == 2 && X == 1)
CPU 1 assertion can be true in Linux Kernel
TL; DR
● Memory operations can be reordered, merged, or discarded in any way
unless it violates memory model defined behavior
○ In short, ``We’re all mad here’’ in parallel land
● Knowing memory model is important to write correct, fast, scalable parallel
program
https://ih1.redbubble.net/image.218483193.6460/sticker,220x200-pad,220x200,ffffff.u2.jpg
Reordering for Better IPC[*]
[*]
IPC: Instructions Per Cycle
Simple Program Execution Sequence
● Programmer writes program in C-like human readable language
● Compiler translates human readable code into assembly language
● Assembler generates executable binary from the assembly code
● Processor executes instruction sequence in the binary
○ Execution result is guaranteed to be same with sequential execution;
In other words, the execution itself is not guaranteed to be sequential
#include <stdio.h>
int main(void)
{
printf("hello worldn");
return 0;
}
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
00000000: 7f45 4c46 0201 0100 0000
0000 0000 0000 .ELF............
00000010: 0200 3e00 0100 0000 3004
4000 0000 0000 ..>.....0.@.....
00000020: 4000 0000 0000 0000 d819
0000 0000 0000 @...............
00000030: 0000 0000 4000 3800 0900
4000
AssemblerCompiler
Instruction Level Parallelism (ILP)
● Pipelining introduces instruction level parallelism
○ Each instruction is splitted up into a sequence of steps;
Each step can be executed in parallel, instructions can be processed concurrently
fetch decode execute
fetch decode execute
fetch decode execute
If not pipelined, 3 cycles per instruction
3-depth pipeline can retire 3 instructions in 5 cycle: 1.7 cycles per instruction
Instruction 1
Instruction 2
Instruction 3
Dependent Instructions Harm ILP
● If an instruction is dependent to result of previous instruction,
it should wait until the previous one finishes execution
○ E.g., a = b + c;
d = a + b;
fetch decode execute
fetch decode execute
fetch decode execute
In this case, instruction 2 depends on result of instruction 1
(e.g., first instruction modifies opcode of next instruction)
7 cycles for 3 instructions: 2.3 cycles per instruction
Instruction 1
Instruction 2
Instruction 3
Instruction Reordering Helps Performance
● By reordering dependent instructions to be located in far away, total execution
time can be shorten
● If the reordering is guaranteed to not change the result of the instruction
sequence, it would be helpful for better performance
fetch decode execute
fetch decode execute
fetch decode execute
Instruction 1
Instruction 3
Instruction 2
instruction 2 depends on result of instruction 1
(e.g., first instruction modifies opcode of next instruction)
By reordering instruction 2 and 3, total execution time can be shorten
6 cycles for 3 instructions: 2 cycles per instruction
Reordering is Legal, Encouraged Behavior, But...
● If program causality is guaranteed, any reordering is legal
● Processors and compilers can make reordering of instructions for better IPC
● The program causality in C is defined with single processor environment
● IPC focused reordering doesn’t aware programmer perspective performance
goals such as throughput or latency
● On Multi-processor system, reordering could harm not only correctness, but
also performance
Counter-intuitive Nature of
Parallelism
Time is Relative (E = MC2
)
● Each CPU generates their events in their time, observes effects of events in
relative time
● It is impossible to define absolute order of two concurrent events;
Only relative observation order is possible
CPU 1 CPU 2 CPU 3 CPU 4
Generated
event 1
Generated
event 2
Observed
event 1
followed by
event 2
I observed
event 2
followed by
event 1
Event Bus
Relative Event Propagation of Hierarchical Memory
● Most system equip hierarchical memory for better performance and space
● Propagation speed of an event to a given core can be influenced by specific
sub-layer of memory
If CPU 0 Message Queue is busy, CPU 2 can observe an event from
CPU 0 (event A) after an event of CPU 1 (event B)
though CPU 1 observed event A before generating event B
CPU 0 CPU 1
Cache
CPU 0
Message
Queue
CPU 1
Message
Queue
Memory
CPU 2 CPU 3
Cache
CPU 2
Message
Queue
CPU 3
Message
Queue
Bus
Relative Event Propagation of Hierarchical Memory
● Most system equip hierarchical memory for better performance and space
● Propagation speed of an event to a given core can be influenced by specific
sub-layer of memory
If CPU 0 Message Queue is busy, CPU 2 can observe an event from
CPU 0 (event A) after an event of CPU 1 (event B)
though CPU 1 observed event A before generating event B
CPU 0 CPU 1
Cache
CPU 0
Message
Queue
CPU 1
Message
Queue
Memory
CPU 2 CPU 3
Cache
CPU 2
Message
Queue
CPU 3
Message
Queue
Bus
Generate
Event A;
Event A
Relative Event Propagation of Hierarchical Memory
● Most system equip hierarchical memory for better performance and space
● Propagation speed of an event to a given core can be influenced by specific
sub-layer of memory
If CPU 0 Message Queue is busy, CPU 2 can observe an event from
CPU 0 (event A) after an event of CPU 1 (event B)
though CPU 1 observed event A before generating event B
CPU 0 CPU 1
Cache
CPU 0
Message
Queue
CPU 1
Message
Queue
Memory
CPU 2 CPU 3
Cache
CPU 2
Message
Queue
CPU 3
Message
Queue
Bus
Generate
Event A;
Seen Event A;
Generate
Event B;
Event BEvent A
Relative Event Propagation of Hierarchical Memory
● Most system equip hierarchical memory for better performance and space
● Propagation speed of an event to a given core can be influenced by specific
sub-layer of memory
If CPU 0 Message Queue is busy, CPU 2 can observe an event from
CPU 0 (event A) after an event of CPU 1 (event B)
though CPU 1 observed event A before generating event B
CPU 0 CPU 1
Cache
CPU 0
Message
Queue
CPU 1
Message
Queue
Memory
CPU 2 CPU 3
Cache
CPU 2
Message
Queue
CPU 3
Message
Queue
Bus
Generate
Event A;
Seen Event A;
Generate
Event B;
Seen Event B;
Event A
Event B
Busy… ;;;
Relative Event Propagation of Hierarchical Memory
● Most system equip hierarchical memory for better performance and space
● Propagation speed of an event to a given core can be influenced by specific
sub-layer of memory
If CPU 0 Message Queue is busy, CPU 2 can observe an event from
CPU 0 (event A) after an event of CPU 1 (event B)
though CPU 1 observed event A before generating event B
CPU 0 CPU 1
Cache
CPU 0
Message
Queue
CPU 1
Message
Queue
Memory
CPU 2 CPU 3
Cache
CPU 2
Message
Queue
CPU 3
Message
Queue
Bus
Generate
Event A;
Seen Event A;
Generate
Event B;
Seen Event B;
Seen Event A;
Event B
Event A
Cache Coherency is Per-CacheLine
● It is well known that cache coherency protocol helps system memory
consistency
● In actual, it guarantees sequential consistency, but per-cache-line only
● Every CPU will eventually agree about global order of each cache-line,
but some CPU can aware the change faster than others,
order of changes between different cache-lines is not guaranteed
http://img06.deviantart.net/0663/i/2005/112/b/6/schrodinger__s_cat___2_by_firefoxcentral.jpg
System with Store Buffer and Invalidation Queue
● Store Buffer and Invalidation Queue deliver effect of event but does not
guarantee order of observation on each CPU
CPU 0
Cache
Store
Buffer
Invalidation
Queue
Memory
CPU 1
Cache
Store
Buffer
Invalidation
Queue
Bus
C-language and
Multi-Processor
C-language Doesn’t Know Multi-Processor
● By the time of initial C-language development, multi-processor was rare
● As a result, C-language has only few guarantees about memory operations
on multi-processor
● Undefined behavior is allowed for undefined case
https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/The_C_Programming
_Language,_First_Edition_Cover_(2).svg/2000px-The_C_Programming_Languag
e,_First_Edition_Cover_(2).svg.png
Compiler Optimizes Code
● Clever compilers try hard (really hard) to optimize code for high IPC
(again, not for programmer perspective goals)
○ Converts small, private function to inline code
○ Reorder memory access code to minimize dependency
○ Simplify unnecessarily complex loops, ...
● Optimization uses term `Undefined behavior` as they want
○ It’s legal, but sometimes do insane things in programmer’s perspective
● Memory access reordering of compiler based on C-standard, which doesn’t
aware multi-processor system, can generate unintended program
● Linux kernel uses compiler directives and volatile keyword to enforce memory
ordering
● C11 has much more improvement, though
Memory Models
Memory Consistency Model (a.k.a Memory Model)
● Defines what values can be obtained by the code’s load instructions
● Each programming environment including Instruction Set Architecture,
Programming language, Operating system defines own memory model
○ Modern language memory models (e.g., Golang, Rust, Java, C11, …) aware multi-processor
http://www.sciencemag.org/sites/default/files/styles/article_main_large/public/Memory.jpg?itok=4FmHo7M5
Each ISA Provides Specific Memory Model
● Some architectures have stricter ordering enforcement rule than others
● PA-RISC CPUs are strictest, Alpha is weakest
● Because Linux kernel supports multiple architectures, it defines its memory
model based on weakest one, the Alpha
https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2015.01.31a.pdf
Synchronization Primitives
● Because reordering and asynchronous effect propagation is legal,
synchronization primitives are necessary to write human intuitive program
● Most memory model provides synchronization primitives including atomic
read-modify-write instructions and memory barriers.
https://s-media-cache-ak0.pinimg.com/236x/42/bc/55/42bc55a6d7e5affe2d0dbe9c872a3df9.jpg
Atomic Operations
● Atomic operations are configured with multiple sub operations
○ E.g., compare-and-swap, fetch-and-add, test-and-set
● Atomic operations have mutual exclusiveness
○ Middle state of atomic operation execution cannot be seen by others
○ It can be thought of as small critical section that protected by a global lock
● Almost every hardware supports basic atomic operations
http://www.scienceclarified.com/photos/atomic-mass-3024.jpg
Memory Barriers
● To allow synchronization of memory operations, memory model provides
enforcement primitives, namely, memory barriers
● In general, memory barriers guarantee
effects of memory operations issued before it
to be propagated to other components (e.g., processor) in the system
before memory operations issued after the barrier
CPU 1 CPU 2 CPU 3
READ A;
WRITE B;
<BARRIER>;
READ C;
READ A,
WRITE B,
than READ C
occurred
WRITE B,
READ A,
than READ C
occurred
READ A and WRITE B can be reordered but READ C is guaranteed to
be ordered after {READ A, WRITE B}
Cost of Synchronization Primitives
● Generally, synchronization primitives are expensive, unscalable
● Performance between synchronization primitives are different, though
● Correct selection and efficient use of synchronization primitives are important!
Performance of various synchronization primitives
LKMM: Linux Kernel
Memory Model
Linux Kernel Memory Model
● The memory model for Linux kernel programming environment
○ Defines what values can be obtained, given a piece of Linux kernel code, for specific load
instructions in the code
● Linux kernel original memory model is necessary because
○ It uses C99 but C99 memory model is oblivious of multi-processor environment;
C11 memory model aware of multi-processor, but Torvalds doesn’t want it
○ It supports multiple architectures with different ISA-level memory model
https://github.com/sjp38/perfbook
LKMM: Linux Kernel Memory Model
● Designed for weakest memory model architecture, Alpha
○ Almost every combination of reordering is possible, doesn’t provide address dependency
● Atomic instructions
○ atomix_xchg(), atomic_inc_return(), atomic_dec_return(), …
○ Most machines have counterpart instructions,
but kernel atomic RMW primitives provide general guarantees at Kernel level
● Memory barriers
○ Compiler barriers: WRITE_ONCE(), READ_ONCE(), barrier(), ...
○ CPU barriers: mb(), wmb(), rmb(), smp_mb(), smp_wmb(), smp_rmb(), …
○ Semantic barriers: ACQUIRE operations, RELEASE operations, …
○ For detail, refer to https://www.kernel.org/doc/Documentation/memory-barriers.txt
● Because different memory ordering primitive has different cost, only
necessary ordering primitives should be used in necessary case for high
performance and scalability
Informal LKMM
● Originally, LKMM was just an informal text, ‘memory-barriers.txt’
● It explains about the Linux Kernel Memory Model in English
(There is Korean translation, too: https://www.kernel.org/doc/Documentation/translations/ko_KR/memory-barriers.txt)
● To use the LKMM to prove your code, you should use Feynman Algorithm
○ Write down your code
○ Think real hard with the ‘memory-barriers.txt’
○ Write down your provement
○ (Hard and unreliable, of course!)
https://www.kernel.org/doc/Documentation/memory-barriers.txt
Formal LKMM: Help Arrives at Last
● It is formal, executable memory model
○ It receives C-like simple code as input
○ The code containing parallel code snippets and a question: can this result happen?
● Based on herd7 and klitmus7
○ LKMM extends herd7 and klitmus7 to support LKMM ordering primitives in code
○ Herd7 simulates in user mode, klitmus7 runs in real kernel mode
● Few limitations exist, of course
https://i.pinimg.com/originals/a9/5f/cd/a95fcd3519fe3222f07d59b0c1536305.png
LKMM Demonstration
● Installation
○ LKMM is merged in Linux source tree at tools/memory-model;
Just pull the linux source code
○ Install herdtools7 (https://github.com/herd/herdtools7)
● Usage
○ Using herd7 user mode simulation
$ herd7 -conf linux-kernel.cfg <your-litmus-test file>
○ Using klitmus7 based real kernel mode execution
$ mkdir mymodules
$ klitmus7 -o mymodules <your-litmus-test file>
$ cd mymodules ; make
$ sudo sh run.sh
● That’s it! Now you can prove your parallel code for all Linux environments!
Summary
● Nature of Parallel Land is counter-intuitive
○ Cannot define order of events without interaction
○ Ordering rule is different for different environment
○ Memory model defines their ordering rule
○ In short, they’re all mad here
● For human-intuitive and correct program, interaction is necessary
○ Almost every environment provides memory ordering primitives including atomic instructions
and memory barriers, which is expensive in common
○ Memory model defines what result can occur and cannot with given code snippet
● Formal Linux kernel memory model is available
○ Linux kernel provides its memory model based on weakest memory ordering rule architecture
it supports, the Alpha, C99, and its original ordering primitives including RCU
○ Formal LKMM using herd7 is merged to mainstream; now you can prove your parallel code!
Thanks
This work by SeongJae Park is licensed under the
Creative Commons Attribution-ShareAlike 3.0 Unported
License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
Case Studies
Memory Operation Reordering
● Memory Operation Reordering is totally LEGAL unless it breaks causality
● Both of CPU and Compiler can do it, even in Single Processor
CPU 0 CPU 1 CPU 2
A = 1;
B = 1;
while (B == 0) {}
C = 1;
Z = C;
X = A;
assert(z == 0 || x == 1)
Memory Operation Reordering
● Memory Operation Reordering is totally LEGAL unless it breaks causality
● Both of CPU and Compiler can do it, even in Single Processor
CPU 0 CPU 1 CPU 2
A = 1;
B = 1;
while (B == 0) {}
C = 1;
Z = C;
X = A;
assert(z == 0 || x == 1)
:)
Memory Operation Reordering
● Memory Operation Reordering is totally LEGAL unless it breaks causality
● Both of CPU and Compiler can do it, even in Single Processor
CPU 0 CPU 1 CPU 2
A = 1;
B = 1;
while (B == 1) {}
C = 1;
Z = C;
X = A;
assert(z == 0 || x == 1)
:)
Memory Operation Reordering
● Memory Operation Reordering is totally LEGAL unless it breaks causality
● Both of CPU and Compiler can do it, even in Single Processor
CPU 0 CPU 1 CPU 2
B = 1;
A = 1;
while (B == 0) {}
C = 1;
Z = C;
X = A;
assert(z == 0 || x == 1)
Memory Operation Reordering
● Memory Operation Reordering is totally LEGAL unless it breaks causality
● Both of CPU and Compiler can do it, even in Single Processor
CPU 0 CPU 1 CPU 2
B = 1;
A = 1;
while (B == 0) {}
C = 1;
X = A;
Z = C;
assert(z == 0 || x == 1)
?????
Memory Operation Reordering
● Memory Operation Reordering is totally LEGAL unless it breaks causality
● Both of CPU and Compiler can do it, even in Single Processor
● Memory barrier enforces operations specified before it appear as happened to
operations specified after it
CPU 0 CPU 1 CPU 2
A = 1;
wmb();
B = 1;
while (B == 0) {}
mb();
C = 1;
Z = C;
rmb();
X = A;
assert(z == 0 || x == 1)
Memory Operation Reordering
● Memory Operation Reordering is totally LEGAL unless it breaks causality
● Both of CPU and Compiler can do it, even in Single Processor
● Memory barrier enforces operations specified before it appear as happened to
operations specified after it
● In some architecture, even Transitivity is not guaranteed
○ Transitivity: B happened after A; C happened after B; then C happened after A
CPU 0 CPU 1 CPU 2
A = 1;
wmb();
B = 1;
while (B == 0) {}
mb();
C = 1;
Z = C;
rmb();
X = A;
assert(z == 0 || x == 1)
Transitivity for Scheduler and Workers
Scheduler and each workers made consensus about order
Scheduler
Worker A
Worker B
Worker Z
...
...
What time
is it now?
Night!
Night!
Night!
...
Transitivity between Scheduler and Worker
Scheduler and each workers made consensus about order
Scheduler
Worker A
Worker B
Worker Z
...
... Yay!
...
Worker Z, all
workers agreed
that it’s night. Do
bedmaking!
Transitivity between Scheduler and Worker
Scheduler and each workers made consensus about order
But, worker B and worker Z didn’t made consensus
Scheduler
Worker A
Worker B
Worker Z
...
... !!??
...
Worker Z, I’m in
afternoon! I didn’t
tell you it’s night!
Compiler Memory Barrier
Code Example
Compiler Reordering Avoidance
● Compiler can remove loop entirely
C code Assembly language code
static int the_var;
void loop(void)
{
int i;
for (i = 0; i < 1000; i++)
the_var++;
}
loop:
.LFB106:
.cfi_startproc
addl $1000, the_var(%rip)
ret
.cfi_endproc
.LFE106:
Compiler Reordering Avoidance
● ACCESS_ONCE() is a compiler memory barrier implementation of Linux kernel
● Store to the_var could not be seen by others
C code Assembly language code
static int the_var;
void loop(void)
{
int i;
for (i = 0; ACCESS_ONCE(i) < 1000; i++)
the_var++;
}
loop:
...
movl the_var(%rip), %ecx
.L175:
...
addl $1, %eax
...
cmpl $999, %edx
jle .L175
movl %esi, the_var(%rip)
.L170:
rep ret
Compiler Reordering Avoidance
● Still, store to `the_var` not issued for every iteration
C code Assembly language code
static int the_var;
void loop(void)
{
int i;
for (i = 0; ACCESS_ONCE(i) < 1000; i++)
the_var++;
}
loop:
...
movl the_var(%rip), %ecx
.L175:
...
addl $1, %eax
...
cmpl $999, %edx
jle .L175
movl %esi, the_var(%rip)
.L170:
rep ret
Compiler Reordering Avoidance
● volatile enforces compiler to issue memory operation as programmer want
(Note that it is not enforced to do DRAM access)
● However, repetitive LOAD may harm performance
C code Assembly language code
static volatile int the_var;
void loop(void)
{
int i;
for (i = 0; ACCESS_ONCE(i) < 1000; i++)
the_var++;
}
loop:
...
.L174:
movl the_var(%rip), %edx
...
addl $1, %edx
movl %edx, the_var(%rip)
...
cmpl $999, %edx
jle .L174
.L170:
rep ret
.cfi_endproc
Compiler Reordering Avoidance
● Complete memory barrier can help the case
● Does memory access once and uses register for loop condition check
C code Assembly language code
static int the_var;
void loop(void)
{
int i;
for (i = 0; i < 1000; i++)
the_var++;
barrier();
}
loop:
.LFB106:
...
.L172:
addl $1, the_var(%rip)
subl $1, %eax
jne .L172
rep ret
.cfi_endproc
CPU Memory Barrier
Code Example
Progress perception
● Code does issue LOAD and STORE, but…
● see_progress() can see no progress because change made by a processor
propagates to other processor eventually, not immediately
C code Assembly language code
static int prgrs;
void do_progress(void)
{
prgrs++;
}
void see_progress(void)
{
static int last_prgrs;
static int seen;
static int nr_seen;
seen = prgrs;
if (seen > last_prgrs)
nr_seen++;
last_prgrs = seen;
}
do_progress:
...
addl $1, prgrs(%rip)
ret
...
see_progress:
...
movl prgrs(%rip), %eax
...
jle .L193
addl $1, nr_seen.5542(%rip)
.L193:
movl %eax, last_prgrs.5540(%rip)
ret
.cfi_endproc
Progress perception
● Read barrier and write barrier helps the situation
C code Assembly language code
static int prgrs;
void do_progress(void)
{
prgrs++;
smp_wmb();
}
void see_progress(void)
{
static int last_prgrs;
static int seen;
static int nr_seen;
smp_rmb();
seen = prgrs;
if (seen > last_prgrs)
nr_seen++;
last_prgrs = seen;
}
do_progress:
...
addl $1, prgrs(%rip)
...
sfence
ret
see_progress:
...
lfence
...
movl prgrs(%rip), %eax
...
jle .L193
addl $1, nr_seen.5542(%rip)
.L193:
movl %eax, last_prgrs.5540(%rip)
Memory Ordering of X86
Neither Loads Nor Stores Are Reordered with Likes
CPU 0 CPU 1
STORE 1 X
STORE 1 Y
R1 = LOAD Y
R2 = LOAD X
R1 == 1 && R2 == 0 impossible
Stores Are Not Reordered With Earlier Loads
CPU 0 CPU 1
R1 = LOAD X
STORE 1 Y
R2 = LOAD Y
STORE 1 X
R1 == 1 && R2 == 1 impossible
Loads May Be Reordered with Earlier Stores to
Different Locations
CPU 0 CPU 1
STORE 1 X
R1 = LOAD Y
STORE 1 Y
R2 = LOAD X
R1 == 0 && R2 == 0 possible
Intra-Processor Forwarding Is Allowed
CPU 0 CPU 1
STORE 1 X
R1 = LOAD X
R2 = LOAD Y
STORE 1 Y
R3 = LOAD Y
R4 = LOAD X
R2 == 0 && R4 == 0 possible
Stores Are Transitively Visible
CPU 0 CPU 1 CPU 2
STORE 1 X R1 = LOAD X
STORE 1 Y
R2 = LOAD Y
R3 = LOAD X
R1 == 1 && R2 == 1 && R3 == 0 impossible
Stores Are Seen in a Consistent Order by Others
CPU 0 CPU 1 CPU 2 CPU 3
STORE 1 X STORE 1 Y R1 = LOAD X
R2 = LOAD Y
R3 = LOAD Y
R4 = LOAD X
R1 == 0 && R2 == 0 && R3 == 1 && R4 == 0 impossible
X86 Memory Ordering Summary
● LOAD after LOAD never reordered
● STORE after STORE never reordered
● STORE after LOAD never reordered
● STOREs are transitively visible
● STOREs are seen in consistent order by others
● Intra-processor STORE forwarding is possible
● LOAD from different location after STORE may be reordered
● In short, quite reasonably strong enough
● For more detail, refer to `Intel Architecture Software Developer’s Manual`
Summary
● Nature of Parallel Land is counter-intuitive
○ Cannot define order of events without interaction
○ Ordering rule is different for different environment
○ Memory model defines their ordering rule
○ In short, they’re all mad here
● For human-intuitive and correct program, interaction is necessary
○ Every memory model provides synchronization primitives like atomic instruction and memory
barrier, etc
○ Such interaction is expensive in common
● Linux kernel memory model is based on weakest memory model, Alpha
○ Kernel programmers should assume Alpha when writing architecture independent code
○ Because of the expensive cost of synchronization primitives, programmer should use only
necessary primitives on necessary location

More Related Content

What's hot

Practical SystemTAP basics: Perl memory profiling
Practical SystemTAP basics: Perl memory profilingPractical SystemTAP basics: Perl memory profiling
Practical SystemTAP basics: Perl memory profilingLubomir Rintel
 
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farmKernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farmAnne Nicolas
 
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens AxboeKernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens AxboeAnne Nicolas
 
Get Lower Latency and Higher Throughput for Java Applications
Get Lower Latency and Higher Throughput for Java ApplicationsGet Lower Latency and Higher Throughput for Java Applications
Get Lower Latency and Higher Throughput for Java ApplicationsScyllaDB
 
Task migration using CRIU
Task migration using CRIUTask migration using CRIU
Task migration using CRIURohit Jnagal
 
Concurrency bug identification through kernel panic log (english)
Concurrency bug identification through kernel panic log (english)Concurrency bug identification through kernel panic log (english)
Concurrency bug identification through kernel panic log (english)Sneeker Yeh
 
Process scheduling
Process schedulingProcess scheduling
Process schedulingHao-Ran Liu
 
OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...
OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...
OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...ScyllaDB
 
Kernel Recipes 2015 - So you want to write a Linux driver framework
Kernel Recipes 2015 - So you want to write a Linux driver frameworkKernel Recipes 2015 - So you want to write a Linux driver framework
Kernel Recipes 2015 - So you want to write a Linux driver frameworkAnne Nicolas
 
From printk to QEMU: Xen/Linux Kernel debugging
From printk to QEMU: Xen/Linux Kernel debuggingFrom printk to QEMU: Xen/Linux Kernel debugging
From printk to QEMU: Xen/Linux Kernel debuggingThe Linux Foundation
 
Introduction to netlink in linux kernel (english)
Introduction to netlink in linux kernel (english)Introduction to netlink in linux kernel (english)
Introduction to netlink in linux kernel (english)Sneeker Yeh
 
µCLinux on Pluto 6 Project presentation
µCLinux on Pluto 6 Project presentationµCLinux on Pluto 6 Project presentation
µCLinux on Pluto 6 Project presentationedlangley
 
Kernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel TLV
 
Linux kernel memory allocators
Linux kernel memory allocatorsLinux kernel memory allocators
Linux kernel memory allocatorsHao-Ran Liu
 
NetBSD and Linux for Embedded Systems
NetBSD and Linux for Embedded SystemsNetBSD and Linux for Embedded Systems
NetBSD and Linux for Embedded SystemsMahendra M
 
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)Anne Nicolas
 
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry code
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry codeKernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry code
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry codeAnne Nicolas
 
TRex Realistic Traffic Generator - Stateless support
TRex  Realistic Traffic Generator  - Stateless support TRex  Realistic Traffic Generator  - Stateless support
TRex Realistic Traffic Generator - Stateless support Hanoch Haim
 
Vanquishing Latency Outliers in the Lightbits LightOS Software Defined Storag...
Vanquishing Latency Outliers in the Lightbits LightOS Software Defined Storag...Vanquishing Latency Outliers in the Lightbits LightOS Software Defined Storag...
Vanquishing Latency Outliers in the Lightbits LightOS Software Defined Storag...ScyllaDB
 

What's hot (20)

Practical SystemTAP basics: Perl memory profiling
Practical SystemTAP basics: Perl memory profilingPractical SystemTAP basics: Perl memory profiling
Practical SystemTAP basics: Perl memory profiling
 
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farmKernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farm
 
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens AxboeKernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
 
Get Lower Latency and Higher Throughput for Java Applications
Get Lower Latency and Higher Throughput for Java ApplicationsGet Lower Latency and Higher Throughput for Java Applications
Get Lower Latency and Higher Throughput for Java Applications
 
Task migration using CRIU
Task migration using CRIUTask migration using CRIU
Task migration using CRIU
 
Concurrency bug identification through kernel panic log (english)
Concurrency bug identification through kernel panic log (english)Concurrency bug identification through kernel panic log (english)
Concurrency bug identification through kernel panic log (english)
 
Process scheduling
Process schedulingProcess scheduling
Process scheduling
 
OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...
OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...
OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in t...
 
Kernel Recipes 2015 - So you want to write a Linux driver framework
Kernel Recipes 2015 - So you want to write a Linux driver frameworkKernel Recipes 2015 - So you want to write a Linux driver framework
Kernel Recipes 2015 - So you want to write a Linux driver framework
 
From printk to QEMU: Xen/Linux Kernel debugging
From printk to QEMU: Xen/Linux Kernel debuggingFrom printk to QEMU: Xen/Linux Kernel debugging
From printk to QEMU: Xen/Linux Kernel debugging
 
Introduction to netlink in linux kernel (english)
Introduction to netlink in linux kernel (english)Introduction to netlink in linux kernel (english)
Introduction to netlink in linux kernel (english)
 
Kgdb kdb modesetting
Kgdb kdb modesettingKgdb kdb modesetting
Kgdb kdb modesetting
 
µCLinux on Pluto 6 Project presentation
µCLinux on Pluto 6 Project presentationµCLinux on Pluto 6 Project presentation
µCLinux on Pluto 6 Project presentation
 
Kernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel Proc Connector and Containers
Kernel Proc Connector and Containers
 
Linux kernel memory allocators
Linux kernel memory allocatorsLinux kernel memory allocators
Linux kernel memory allocators
 
NetBSD and Linux for Embedded Systems
NetBSD and Linux for Embedded SystemsNetBSD and Linux for Embedded Systems
NetBSD and Linux for Embedded Systems
 
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)
 
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry code
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry codeKernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry code
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry code
 
TRex Realistic Traffic Generator - Stateless support
TRex  Realistic Traffic Generator  - Stateless support TRex  Realistic Traffic Generator  - Stateless support
TRex Realistic Traffic Generator - Stateless support
 
Vanquishing Latency Outliers in the Lightbits LightOS Software Defined Storag...
Vanquishing Latency Outliers in the Lightbits LightOS Software Defined Storag...Vanquishing Latency Outliers in the Lightbits LightOS Software Defined Storag...
Vanquishing Latency Outliers in the Lightbits LightOS Software Defined Storag...
 

Similar to An Introduction to the Formalised Memory Model for Linux Kernel

Hardware Assisted Latency Investigations
Hardware Assisted Latency InvestigationsHardware Assisted Latency Investigations
Hardware Assisted Latency InvestigationsScyllaDB
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kevin Lynch
 
SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016Koan-Sin Tan
 
Module 3-cpu-scheduling
Module 3-cpu-schedulingModule 3-cpu-scheduling
Module 3-cpu-schedulingHesham Elmasry
 
OS scheduling and The anatomy of a context switch
OS scheduling and The anatomy of a context switchOS scheduling and The anatomy of a context switch
OS scheduling and The anatomy of a context switchDaniel Ben-Zvi
 
MIPS IMPLEMENTATION.pptx
MIPS IMPLEMENTATION.pptxMIPS IMPLEMENTATION.pptx
MIPS IMPLEMENTATION.pptxJEEVANANTHAMG6
 
Operating system Q/A
Operating system Q/AOperating system Q/A
Operating system Q/AAbdul Munam
 
Beneath the Linux Interrupt handling
Beneath the Linux Interrupt handlingBeneath the Linux Interrupt handling
Beneath the Linux Interrupt handlingBhoomil Chavda
 
linux monitoring and performance tunning
linux monitoring and performance tunning linux monitoring and performance tunning
linux monitoring and performance tunning iman darabi
 
Q2.12: Implications of Per CPU switching in a big.LITTLE system
Q2.12: Implications of Per CPU switching in a big.LITTLE systemQ2.12: Implications of Per CPU switching in a big.LITTLE system
Q2.12: Implications of Per CPU switching in a big.LITTLE systemLinaro
 
Exploiting Llinux Environment
Exploiting Llinux EnvironmentExploiting Llinux Environment
Exploiting Llinux EnvironmentEnrico Scapin
 
Nvidia tegra K1 Presentation
Nvidia tegra K1 PresentationNvidia tegra K1 Presentation
Nvidia tegra K1 PresentationANURAG SEKHSARIA
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processorscsandit
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORScscpconf
 
BSides Denver: Stealthy, hypervisor-based malware analysis
BSides Denver: Stealthy, hypervisor-based malware analysisBSides Denver: Stealthy, hypervisor-based malware analysis
BSides Denver: Stealthy, hypervisor-based malware analysisTamas K Lengyel
 
Kernel bug hunting
Kernel bug huntingKernel bug hunting
Kernel bug huntingAndrea Righi
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPFAlex Maestretti
 

Similar to An Introduction to the Formalised Memory Model for Linux Kernel (20)

Linux Internals - Part II
Linux Internals - Part IILinux Internals - Part II
Linux Internals - Part II
 
Optimizing Linux Servers
Optimizing Linux ServersOptimizing Linux Servers
Optimizing Linux Servers
 
Hardware Assisted Latency Investigations
Hardware Assisted Latency InvestigationsHardware Assisted Latency Investigations
Hardware Assisted Latency Investigations
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
 
SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016
 
Module 3-cpu-scheduling
Module 3-cpu-schedulingModule 3-cpu-scheduling
Module 3-cpu-scheduling
 
OS scheduling and The anatomy of a context switch
OS scheduling and The anatomy of a context switchOS scheduling and The anatomy of a context switch
OS scheduling and The anatomy of a context switch
 
MIPS IMPLEMENTATION.pptx
MIPS IMPLEMENTATION.pptxMIPS IMPLEMENTATION.pptx
MIPS IMPLEMENTATION.pptx
 
Operating system Q/A
Operating system Q/AOperating system Q/A
Operating system Q/A
 
Beneath the Linux Interrupt handling
Beneath the Linux Interrupt handlingBeneath the Linux Interrupt handling
Beneath the Linux Interrupt handling
 
linux monitoring and performance tunning
linux monitoring and performance tunning linux monitoring and performance tunning
linux monitoring and performance tunning
 
Q2.12: Implications of Per CPU switching in a big.LITTLE system
Q2.12: Implications of Per CPU switching in a big.LITTLE systemQ2.12: Implications of Per CPU switching in a big.LITTLE system
Q2.12: Implications of Per CPU switching in a big.LITTLE system
 
Exploiting Llinux Environment
Exploiting Llinux EnvironmentExploiting Llinux Environment
Exploiting Llinux Environment
 
Nvidia tegra K1 Presentation
Nvidia tegra K1 PresentationNvidia tegra K1 Presentation
Nvidia tegra K1 Presentation
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
BSides Denver: Stealthy, hypervisor-based malware analysis
BSides Denver: Stealthy, hypervisor-based malware analysisBSides Denver: Stealthy, hypervisor-based malware analysis
BSides Denver: Stealthy, hypervisor-based malware analysis
 
Kernel bug hunting
Kernel bug huntingKernel bug hunting
Kernel bug hunting
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPF
 
Containers > VMs
Containers > VMsContainers > VMs
Containers > VMs
 

More from SeongJae Park

Biscuit: an operating system written in go
Biscuit:  an operating system written in goBiscuit:  an operating system written in go
Biscuit: an operating system written in goSeongJae Park
 
Design choices of golang for high scalability
Design choices of golang for high scalabilityDesign choices of golang for high scalability
Design choices of golang for high scalabilitySeongJae Park
 
Let the contribution begin (EST futures)
Let the contribution begin  (EST futures)Let the contribution begin  (EST futures)
Let the contribution begin (EST futures)SeongJae Park
 
Porting golang development environment developed with golang
Porting golang development environment developed with golangPorting golang development environment developed with golang
Porting golang development environment developed with golangSeongJae Park
 
An introduction to_golang.avi
An introduction to_golang.aviAn introduction to_golang.avi
An introduction to_golang.aviSeongJae Park
 
Develop Android/iOS app using golang
Develop Android/iOS app using golangDevelop Android/iOS app using golang
Develop Android/iOS app using golangSeongJae Park
 
Develop Android app using Golang
Develop Android app using GolangDevelop Android app using Golang
Develop Android app using GolangSeongJae Park
 
Sw install with_without_docker
Sw install with_without_dockerSw install with_without_docker
Sw install with_without_dockerSeongJae Park
 
Git inter-snapshot public
Git  inter-snapshot publicGit  inter-snapshot public
Git inter-snapshot publicSeongJae Park
 
(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.avi(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.aviSeongJae Park
 
Deep dark-side of git: How git works internally
Deep dark-side of git: How git works internallyDeep dark-side of git: How git works internally
Deep dark-side of git: How git works internallySeongJae Park
 
Deep dark side of git - prologue
Deep dark side of git - prologueDeep dark side of git - prologue
Deep dark side of git - prologueSeongJae Park
 
DO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCSDO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCSSeongJae Park
 
Experimental android hacking using reflection
Experimental android hacking using reflectionExperimental android hacking using reflection
Experimental android hacking using reflectionSeongJae Park
 
Let the contribution begin
Let the contribution beginLet the contribution begin
Let the contribution beginSeongJae Park
 
Touch Android Without Touching
Touch Android Without TouchingTouch Android Without Touching
Touch Android Without TouchingSeongJae Park
 
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기   dev festx korea 2012 presentationAOSP에 컨트리뷰션 하기   dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentationSeongJae Park
 

More from SeongJae Park (19)

Biscuit: an operating system written in go
Biscuit:  an operating system written in goBiscuit:  an operating system written in go
Biscuit: an operating system written in go
 
Design choices of golang for high scalability
Design choices of golang for high scalabilityDesign choices of golang for high scalability
Design choices of golang for high scalability
 
Let the contribution begin (EST futures)
Let the contribution begin  (EST futures)Let the contribution begin  (EST futures)
Let the contribution begin (EST futures)
 
Porting golang development environment developed with golang
Porting golang development environment developed with golangPorting golang development environment developed with golang
Porting golang development environment developed with golang
 
An introduction to_golang.avi
An introduction to_golang.aviAn introduction to_golang.avi
An introduction to_golang.avi
 
Develop Android/iOS app using golang
Develop Android/iOS app using golangDevelop Android/iOS app using golang
Develop Android/iOS app using golang
 
Develop Android app using Golang
Develop Android app using GolangDevelop Android app using Golang
Develop Android app using Golang
 
Sw install with_without_docker
Sw install with_without_dockerSw install with_without_docker
Sw install with_without_docker
 
Git inter-snapshot public
Git  inter-snapshot publicGit  inter-snapshot public
Git inter-snapshot public
 
(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.avi(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.avi
 
Deep dark-side of git: How git works internally
Deep dark-side of git: How git works internallyDeep dark-side of git: How git works internally
Deep dark-side of git: How git works internally
 
Deep dark side of git - prologue
Deep dark side of git - prologueDeep dark side of git - prologue
Deep dark side of git - prologue
 
DO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCSDO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCS
 
Experimental android hacking using reflection
Experimental android hacking using reflectionExperimental android hacking using reflection
Experimental android hacking using reflection
 
ash
ashash
ash
 
Hacktime for adk
Hacktime for adkHacktime for adk
Hacktime for adk
 
Let the contribution begin
Let the contribution beginLet the contribution begin
Let the contribution begin
 
Touch Android Without Touching
Touch Android Without TouchingTouch Android Without Touching
Touch Android Without Touching
 
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기   dev festx korea 2012 presentationAOSP에 컨트리뷰션 하기   dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentation
 

Recently uploaded

Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxnada99848
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 

Recently uploaded (20)

Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptx
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 

An Introduction to the Formalised Memory Model for Linux Kernel

  • 1. An Introduction to the Formalised Memory Model for Linux Kernel SeongJae Park <sj38.park@gmail.com>
  • 2. I, SeongJae Park ● SeongJae Park <sj38.park@gmail.com> ● Contributing to the Linux kernel just for fun and profit since 2012 ● Developing GCMA (https://github.com/sjp38/linux.gcma) ● Maintaining Korean translation of Linux kernel memory barrier document ○ The translation has merged into mainline since v4.9-rc1 ○ https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/ko_KR/memory-barriers.txt?h=v4.9-rc1
  • 3. Programmers in Multi-core Land ● Processor vendors changed their mind to increase number of cores instead of clock speed a decade ago ○ Now, multi-core system is prevalent ○ Even octa-core mobile phone in your pocket, maybe? ● As a result, the free lunch is over; parallel programming is essential for high performance and scalability http://www.gotw.ca/images/CPU.png GIVE UP :’(
  • 4. Writing Correct Parallel Program is Hard ● Compilers and processors are heavily optimized for Instructions Per Cycle, not programmer perspective goals such as response time or throughput of meaningful (in people’s context) progress ● Nature of parallelism is counter-intuitive ○ Time is relative, before and after is ambiguous, even simultaneous available ● C language developed with Uni-Processor assumption ■ “Et tu, C?” CPU 0 CPU 1 X = 1; Y = 1; X = 2; Y = 2; assert(Y == 2 && X == 1) CPU 1 assertion can be true in Linux Kernel
  • 5. TL; DR ● Memory operations can be reordered, merged, or discarded in any way unless it violates memory model defined behavior ○ In short, ``We’re all mad here’’ in parallel land ● Knowing memory model is important to write correct, fast, scalable parallel program https://ih1.redbubble.net/image.218483193.6460/sticker,220x200-pad,220x200,ffffff.u2.jpg
  • 6. Reordering for Better IPC[*] [*] IPC: Instructions Per Cycle
  • 7. Simple Program Execution Sequence ● Programmer writes program in C-like human readable language ● Compiler translates human readable code into assembly language ● Assembler generates executable binary from the assembly code ● Processor executes instruction sequence in the binary ○ Execution result is guaranteed to be same with sequential execution; In other words, the execution itself is not guaranteed to be sequential #include <stdio.h> int main(void) { printf("hello worldn"); return 0; } main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp 00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............ 00000010: 0200 3e00 0100 0000 3004 4000 0000 0000 ..>.....0.@..... 00000020: 4000 0000 0000 0000 d819 0000 0000 0000 @............... 00000030: 0000 0000 4000 3800 0900 4000 AssemblerCompiler
  • 8. Instruction Level Parallelism (ILP) ● Pipelining introduces instruction level parallelism ○ Each instruction is splitted up into a sequence of steps; Each step can be executed in parallel, instructions can be processed concurrently fetch decode execute fetch decode execute fetch decode execute If not pipelined, 3 cycles per instruction 3-depth pipeline can retire 3 instructions in 5 cycle: 1.7 cycles per instruction Instruction 1 Instruction 2 Instruction 3
  • 9. Dependent Instructions Harm ILP ● If an instruction is dependent to result of previous instruction, it should wait until the previous one finishes execution ○ E.g., a = b + c; d = a + b; fetch decode execute fetch decode execute fetch decode execute In this case, instruction 2 depends on result of instruction 1 (e.g., first instruction modifies opcode of next instruction) 7 cycles for 3 instructions: 2.3 cycles per instruction Instruction 1 Instruction 2 Instruction 3
  • 10. Instruction Reordering Helps Performance ● By reordering dependent instructions to be located in far away, total execution time can be shorten ● If the reordering is guaranteed to not change the result of the instruction sequence, it would be helpful for better performance fetch decode execute fetch decode execute fetch decode execute Instruction 1 Instruction 3 Instruction 2 instruction 2 depends on result of instruction 1 (e.g., first instruction modifies opcode of next instruction) By reordering instruction 2 and 3, total execution time can be shorten 6 cycles for 3 instructions: 2 cycles per instruction
  • 11. Reordering is Legal, Encouraged Behavior, But... ● If program causality is guaranteed, any reordering is legal ● Processors and compilers can make reordering of instructions for better IPC ● The program causality in C is defined with single processor environment ● IPC focused reordering doesn’t aware programmer perspective performance goals such as throughput or latency ● On Multi-processor system, reordering could harm not only correctness, but also performance
  • 13. Time is Relative (E = MC2 ) ● Each CPU generates their events in their time, observes effects of events in relative time ● It is impossible to define absolute order of two concurrent events; Only relative observation order is possible CPU 1 CPU 2 CPU 3 CPU 4 Generated event 1 Generated event 2 Observed event 1 followed by event 2 I observed event 2 followed by event 1 Event Bus
  • 14. Relative Event Propagation of Hierarchical Memory ● Most system equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus
  • 15. Relative Event Propagation of Hierarchical Memory ● Most system equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus Generate Event A; Event A
  • 16. Relative Event Propagation of Hierarchical Memory ● Most system equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus Generate Event A; Seen Event A; Generate Event B; Event BEvent A
  • 17. Relative Event Propagation of Hierarchical Memory ● Most system equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus Generate Event A; Seen Event A; Generate Event B; Seen Event B; Event A Event B Busy… ;;;
  • 18. Relative Event Propagation of Hierarchical Memory ● Most system equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus Generate Event A; Seen Event A; Generate Event B; Seen Event B; Seen Event A; Event B Event A
  • 19. Cache Coherency is Per-CacheLine ● It is well known that cache coherency protocol helps system memory consistency ● In actual, it guarantees sequential consistency, but per-cache-line only ● Every CPU will eventually agree about global order of each cache-line, but some CPU can aware the change faster than others, order of changes between different cache-lines is not guaranteed http://img06.deviantart.net/0663/i/2005/112/b/6/schrodinger__s_cat___2_by_firefoxcentral.jpg
  • 20. System with Store Buffer and Invalidation Queue ● Store Buffer and Invalidation Queue deliver effect of event but does not guarantee order of observation on each CPU CPU 0 Cache Store Buffer Invalidation Queue Memory CPU 1 Cache Store Buffer Invalidation Queue Bus
  • 22. C-language Doesn’t Know Multi-Processor ● By the time of initial C-language development, multi-processor was rare ● As a result, C-language has only few guarantees about memory operations on multi-processor ● Undefined behavior is allowed for undefined case https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/The_C_Programming _Language,_First_Edition_Cover_(2).svg/2000px-The_C_Programming_Languag e,_First_Edition_Cover_(2).svg.png
  • 23. Compiler Optimizes Code ● Clever compilers try hard (really hard) to optimize code for high IPC (again, not for programmer perspective goals) ○ Converts small, private function to inline code ○ Reorder memory access code to minimize dependency ○ Simplify unnecessarily complex loops, ... ● Optimization uses term `Undefined behavior` as they want ○ It’s legal, but sometimes do insane things in programmer’s perspective ● Memory access reordering of compiler based on C-standard, which doesn’t aware multi-processor system, can generate unintended program ● Linux kernel uses compiler directives and volatile keyword to enforce memory ordering ● C11 has much more improvement, though
  • 25. Memory Consistency Model (a.k.a Memory Model) ● Defines what values can be obtained by the code’s load instructions ● Each programming environment including Instruction Set Architecture, Programming language, Operating system defines own memory model ○ Modern language memory models (e.g., Golang, Rust, Java, C11, …) aware multi-processor http://www.sciencemag.org/sites/default/files/styles/article_main_large/public/Memory.jpg?itok=4FmHo7M5
  • 26. Each ISA Provides Specific Memory Model ● Some architectures have stricter ordering enforcement rule than others ● PA-RISC CPUs are strictest, Alpha is weakest ● Because Linux kernel supports multiple architectures, it defines its memory model based on weakest one, the Alpha https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2015.01.31a.pdf
  • 27. Synchronization Primitives ● Because reordering and asynchronous effect propagation is legal, synchronization primitives are necessary to write human intuitive program ● Most memory model provides synchronization primitives including atomic read-modify-write instructions and memory barriers. https://s-media-cache-ak0.pinimg.com/236x/42/bc/55/42bc55a6d7e5affe2d0dbe9c872a3df9.jpg
  • 28. Atomic Operations ● Atomic operations are configured with multiple sub operations ○ E.g., compare-and-swap, fetch-and-add, test-and-set ● Atomic operations have mutual exclusiveness ○ Middle state of atomic operation execution cannot be seen by others ○ It can be thought of as small critical section that protected by a global lock ● Almost every hardware supports basic atomic operations http://www.scienceclarified.com/photos/atomic-mass-3024.jpg
  • 29. Memory Barriers ● To allow synchronization of memory operations, memory model provides enforcement primitives, namely, memory barriers ● In general, memory barriers guarantee effects of memory operations issued before it to be propagated to other components (e.g., processor) in the system before memory operations issued after the barrier CPU 1 CPU 2 CPU 3 READ A; WRITE B; <BARRIER>; READ C; READ A, WRITE B, than READ C occurred WRITE B, READ A, than READ C occurred READ A and WRITE B can be reordered but READ C is guaranteed to be ordered after {READ A, WRITE B}
  • 30. Cost of Synchronization Primitives ● Generally, synchronization primitives are expensive, unscalable ● Performance between synchronization primitives are different, though ● Correct selection and efficient use of synchronization primitives are important! Performance of various synchronization primitives
  • 32. Linux Kernel Memory Model ● The memory model for Linux kernel programming environment ○ Defines what values can be obtained, given a piece of Linux kernel code, for specific load instructions in the code ● Linux kernel original memory model is necessary because ○ It uses C99 but C99 memory model is oblivious of multi-processor environment; C11 memory model aware of multi-processor, but Torvalds doesn’t want it ○ It supports multiple architectures with different ISA-level memory model https://github.com/sjp38/perfbook
  • 33. LKMM: Linux Kernel Memory Model ● Designed for weakest memory model architecture, Alpha ○ Almost every combination of reordering is possible, doesn’t provide address dependency ● Atomic instructions ○ atomix_xchg(), atomic_inc_return(), atomic_dec_return(), … ○ Most machines have counterpart instructions, but kernel atomic RMW primitives provide general guarantees at Kernel level ● Memory barriers ○ Compiler barriers: WRITE_ONCE(), READ_ONCE(), barrier(), ... ○ CPU barriers: mb(), wmb(), rmb(), smp_mb(), smp_wmb(), smp_rmb(), … ○ Semantic barriers: ACQUIRE operations, RELEASE operations, … ○ For detail, refer to https://www.kernel.org/doc/Documentation/memory-barriers.txt ● Because different memory ordering primitive has different cost, only necessary ordering primitives should be used in necessary case for high performance and scalability
  • 34. Informal LKMM ● Originally, LKMM was just an informal text, ‘memory-barriers.txt’ ● It explains about the Linux Kernel Memory Model in English (There is Korean translation, too: https://www.kernel.org/doc/Documentation/translations/ko_KR/memory-barriers.txt) ● To use the LKMM to prove your code, you should use Feynman Algorithm ○ Write down your code ○ Think real hard with the ‘memory-barriers.txt’ ○ Write down your provement ○ (Hard and unreliable, of course!) https://www.kernel.org/doc/Documentation/memory-barriers.txt
  • 35. Formal LKMM: Help Arrives at Last ● It is formal, executable memory model ○ It receives C-like simple code as input ○ The code containing parallel code snippets and a question: can this result happen? ● Based on herd7 and klitmus7 ○ LKMM extends herd7 and klitmus7 to support LKMM ordering primitives in code ○ Herd7 simulates in user mode, klitmus7 runs in real kernel mode ● Few limitations exist, of course https://i.pinimg.com/originals/a9/5f/cd/a95fcd3519fe3222f07d59b0c1536305.png
  • 36. LKMM Demonstration ● Installation ○ LKMM is merged in Linux source tree at tools/memory-model; Just pull the linux source code ○ Install herdtools7 (https://github.com/herd/herdtools7) ● Usage ○ Using herd7 user mode simulation $ herd7 -conf linux-kernel.cfg <your-litmus-test file> ○ Using klitmus7 based real kernel mode execution $ mkdir mymodules $ klitmus7 -o mymodules <your-litmus-test file> $ cd mymodules ; make $ sudo sh run.sh ● That’s it! Now you can prove your parallel code for all Linux environments!
  • 37. Summary ● Nature of Parallel Land is counter-intuitive ○ Cannot define order of events without interaction ○ Ordering rule is different for different environment ○ Memory model defines their ordering rule ○ In short, they’re all mad here ● For human-intuitive and correct program, interaction is necessary ○ Almost every environment provides memory ordering primitives including atomic instructions and memory barriers, which is expensive in common ○ Memory model defines what result can occur and cannot with given code snippet ● Formal Linux kernel memory model is available ○ Linux kernel provides its memory model based on weakest memory ordering rule architecture it supports, the Alpha, C99, and its original ordering primitives including RCU ○ Formal LKMM using herd7 is merged to mainstream; now you can prove your parallel code!
  • 39. This work by SeongJae Park is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
  • 41. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor CPU 0 CPU 1 CPU 2 A = 1; B = 1; while (B == 0) {} C = 1; Z = C; X = A; assert(z == 0 || x == 1)
  • 42. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor CPU 0 CPU 1 CPU 2 A = 1; B = 1; while (B == 0) {} C = 1; Z = C; X = A; assert(z == 0 || x == 1) :)
  • 43. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor CPU 0 CPU 1 CPU 2 A = 1; B = 1; while (B == 1) {} C = 1; Z = C; X = A; assert(z == 0 || x == 1) :)
  • 44. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor CPU 0 CPU 1 CPU 2 B = 1; A = 1; while (B == 0) {} C = 1; Z = C; X = A; assert(z == 0 || x == 1)
  • 45. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor CPU 0 CPU 1 CPU 2 B = 1; A = 1; while (B == 0) {} C = 1; X = A; Z = C; assert(z == 0 || x == 1) ?????
  • 46. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor ● Memory barrier enforces operations specified before it appear as happened to operations specified after it CPU 0 CPU 1 CPU 2 A = 1; wmb(); B = 1; while (B == 0) {} mb(); C = 1; Z = C; rmb(); X = A; assert(z == 0 || x == 1)
  • 47. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor ● Memory barrier enforces operations specified before it appear as happened to operations specified after it ● In some architecture, even Transitivity is not guaranteed ○ Transitivity: B happened after A; C happened after B; then C happened after A CPU 0 CPU 1 CPU 2 A = 1; wmb(); B = 1; while (B == 0) {} mb(); C = 1; Z = C; rmb(); X = A; assert(z == 0 || x == 1)
  • 48. Transitivity for Scheduler and Workers Scheduler and each workers made consensus about order Scheduler Worker A Worker B Worker Z ... ... What time is it now? Night! Night! Night! ...
  • 49. Transitivity between Scheduler and Worker Scheduler and each workers made consensus about order Scheduler Worker A Worker B Worker Z ... ... Yay! ... Worker Z, all workers agreed that it’s night. Do bedmaking!
  • 50. Transitivity between Scheduler and Worker Scheduler and each workers made consensus about order But, worker B and worker Z didn’t made consensus Scheduler Worker A Worker B Worker Z ... ... !!?? ... Worker Z, I’m in afternoon! I didn’t tell you it’s night!
  • 52. Compiler Reordering Avoidance ● Compiler can remove loop entirely C code Assembly language code static int the_var; void loop(void) { int i; for (i = 0; i < 1000; i++) the_var++; } loop: .LFB106: .cfi_startproc addl $1000, the_var(%rip) ret .cfi_endproc .LFE106:
  • 53. Compiler Reordering Avoidance ● ACCESS_ONCE() is a compiler memory barrier implementation of Linux kernel ● Store to the_var could not be seen by others C code Assembly language code static int the_var; void loop(void) { int i; for (i = 0; ACCESS_ONCE(i) < 1000; i++) the_var++; } loop: ... movl the_var(%rip), %ecx .L175: ... addl $1, %eax ... cmpl $999, %edx jle .L175 movl %esi, the_var(%rip) .L170: rep ret
  • 54. Compiler Reordering Avoidance ● Still, store to `the_var` not issued for every iteration C code Assembly language code static int the_var; void loop(void) { int i; for (i = 0; ACCESS_ONCE(i) < 1000; i++) the_var++; } loop: ... movl the_var(%rip), %ecx .L175: ... addl $1, %eax ... cmpl $999, %edx jle .L175 movl %esi, the_var(%rip) .L170: rep ret
  • 55. Compiler Reordering Avoidance ● volatile enforces compiler to issue memory operation as programmer want (Note that it is not enforced to do DRAM access) ● However, repetitive LOAD may harm performance C code Assembly language code static volatile int the_var; void loop(void) { int i; for (i = 0; ACCESS_ONCE(i) < 1000; i++) the_var++; } loop: ... .L174: movl the_var(%rip), %edx ... addl $1, %edx movl %edx, the_var(%rip) ... cmpl $999, %edx jle .L174 .L170: rep ret .cfi_endproc
  • 56. Compiler Reordering Avoidance ● Complete memory barrier can help the case ● Does memory access once and uses register for loop condition check C code Assembly language code static int the_var; void loop(void) { int i; for (i = 0; i < 1000; i++) the_var++; barrier(); } loop: .LFB106: ... .L172: addl $1, the_var(%rip) subl $1, %eax jne .L172 rep ret .cfi_endproc
  • 58. Progress perception ● Code does issue LOAD and STORE, but… ● see_progress() can see no progress because change made by a processor propagates to other processor eventually, not immediately C code Assembly language code static int prgrs; void do_progress(void) { prgrs++; } void see_progress(void) { static int last_prgrs; static int seen; static int nr_seen; seen = prgrs; if (seen > last_prgrs) nr_seen++; last_prgrs = seen; } do_progress: ... addl $1, prgrs(%rip) ret ... see_progress: ... movl prgrs(%rip), %eax ... jle .L193 addl $1, nr_seen.5542(%rip) .L193: movl %eax, last_prgrs.5540(%rip) ret .cfi_endproc
  • 59. Progress perception ● Read barrier and write barrier helps the situation C code Assembly language code static int prgrs; void do_progress(void) { prgrs++; smp_wmb(); } void see_progress(void) { static int last_prgrs; static int seen; static int nr_seen; smp_rmb(); seen = prgrs; if (seen > last_prgrs) nr_seen++; last_prgrs = seen; } do_progress: ... addl $1, prgrs(%rip) ... sfence ret see_progress: ... lfence ... movl prgrs(%rip), %eax ... jle .L193 addl $1, nr_seen.5542(%rip) .L193: movl %eax, last_prgrs.5540(%rip)
  • 61. Neither Loads Nor Stores Are Reordered with Likes CPU 0 CPU 1 STORE 1 X STORE 1 Y R1 = LOAD Y R2 = LOAD X R1 == 1 && R2 == 0 impossible
  • 62. Stores Are Not Reordered With Earlier Loads CPU 0 CPU 1 R1 = LOAD X STORE 1 Y R2 = LOAD Y STORE 1 X R1 == 1 && R2 == 1 impossible
  • 63. Loads May Be Reordered with Earlier Stores to Different Locations CPU 0 CPU 1 STORE 1 X R1 = LOAD Y STORE 1 Y R2 = LOAD X R1 == 0 && R2 == 0 possible
  • 64. Intra-Processor Forwarding Is Allowed CPU 0 CPU 1 STORE 1 X R1 = LOAD X R2 = LOAD Y STORE 1 Y R3 = LOAD Y R4 = LOAD X R2 == 0 && R4 == 0 possible
  • 65. Stores Are Transitively Visible CPU 0 CPU 1 CPU 2 STORE 1 X R1 = LOAD X STORE 1 Y R2 = LOAD Y R3 = LOAD X R1 == 1 && R2 == 1 && R3 == 0 impossible
  • 66. Stores Are Seen in a Consistent Order by Others CPU 0 CPU 1 CPU 2 CPU 3 STORE 1 X STORE 1 Y R1 = LOAD X R2 = LOAD Y R3 = LOAD Y R4 = LOAD X R1 == 0 && R2 == 0 && R3 == 1 && R4 == 0 impossible
  • 67. X86 Memory Ordering Summary ● LOAD after LOAD never reordered ● STORE after STORE never reordered ● STORE after LOAD never reordered ● STOREs are transitively visible ● STOREs are seen in consistent order by others ● Intra-processor STORE forwarding is possible ● LOAD from different location after STORE may be reordered ● In short, quite reasonably strong enough ● For more detail, refer to `Intel Architecture Software Developer’s Manual`
  • 68. Summary ● Nature of Parallel Land is counter-intuitive ○ Cannot define order of events without interaction ○ Ordering rule is different for different environment ○ Memory model defines their ordering rule ○ In short, they’re all mad here ● For human-intuitive and correct program, interaction is necessary ○ Every memory model provides synchronization primitives like atomic instruction and memory barrier, etc ○ Such interaction is expensive in common ● Linux kernel memory model is based on weakest memory model, Alpha ○ Kernel programmers should assume Alpha when writing architecture independent code ○ Because of the expensive cost of synchronization primitives, programmer should use only necessary primitives on necessary location