5. Hardware Optimizations
- Write Buffer
• On a write, a processor simply inserts the write operation into the
write buffer and proceeds without waiting for the write to complete
• This effectively hides the latency of write operations
• With Dekker-style flags, each processor can buffer its own flag write and still read the other's flag as 0, so P1 and P2 can both end up in their critical sections at the same time (sketched below)
Sarita V. Adve, Kourosh Gharachorloo, “Shared Memory Consistency Models: A Tutorial”
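A minimal C11/pthreads sketch of this Dekker-style flag pattern (my illustration, not the tutorial's figure; names such as flag1/in_cs1 are made up). With relaxed atomics, each store may sit in a write buffer while the following load goes ahead, so both threads can end up "in the critical section":

/* Dekker-style flags with relaxed atomics: the write-buffer outcome
 * (both threads read 0) is allowed, so mutual exclusion can fail. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int flag1, flag2;        /* both initially 0 */
static int in_cs1, in_cs2;

static void *p1(void *arg)
{
    atomic_store_explicit(&flag1, 1, memory_order_relaxed);
    if (atomic_load_explicit(&flag2, memory_order_relaxed) == 0)
        in_cs1 = 1;                    /* P1 enters its critical section */
    return NULL;
}

static void *p2(void *arg)
{
    atomic_store_explicit(&flag2, 1, memory_order_relaxed);
    if (atomic_load_explicit(&flag1, memory_order_relaxed) == 0)
        in_cs2 = 1;                    /* P2 enters its critical section */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* With write buffering, in_cs1 == 1 && in_cs2 == 1 is possible. */
    printf("in_cs1=%d in_cs2=%d\n", in_cs1, in_cs2);
    return 0;
}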
6. Hardware Optimizations
- Overlapped Writes
• Assume the Data and Head variables reside in different memory modules
• Because the write to Head may be injected into the network before the write to Data has
reached its memory module, another processor can observe the new value of Head and yet
obtain the old value of Data
• In other words, the write operations are reordered (see the sketch below)
Sarita V. Adve, Kourosh Gharachorloo, “Shared Memory Consistency Models: A Tutorial”
(coalesced write)
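The same hazard written as a small C11 sketch of the Data/Head program above (an illustration, with relaxed atomics standing in for plain hardware accesses): nothing orders the two stores, so the consumer may see head == 1 and still read the old data value 0.

/* Data/Head with relaxed atomics: the store to head may become visible
 * before the store to data, so the consumer can read stale data. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int data, head;          /* both initially 0 */

static void *producer(void *arg)
{
    atomic_store_explicit(&data, 2000, memory_order_relaxed);
    atomic_store_explicit(&head, 1, memory_order_relaxed);  /* may be seen first */
    return NULL;
}

static void *consumer(void *arg)
{
    while (atomic_load_explicit(&head, memory_order_relaxed) == 0)
        ;                              /* wait for the flag */
    /* May still print 0 instead of 2000 on a weakly ordered machine. */
    printf("data = %d\n", atomic_load_explicit(&data, memory_order_relaxed));
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}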
7. Hardware Optimizations
- Non-blocking Reads
• If P2 is allowed to issue its read operations in an overlapped
fashion, the read of Data may arrive at its memory module
before the write from P1, while the read of Head reaches its
memory module after the write from P1
• => P2 can observe the new value of Head and yet read the old value of Data
Sarita V. Adve, Kourosh Gharachorloo, “Shared Memory Consistency Models: A Tutorial”
(coalesced read)
9. So What Should We Do? Ideally
• Sequential Consistency (each core's operation order =
the operation order observed across all cores)
– The result of any execution is the same as if the
operations of all the processors were executed in
some sequential order, and the operations of each
individual processor appear in this sequence in
the order specified by its program
• There is no local reordering
• Each write becomes visible to all threads simultaneously (see the C11 sketch below)
Sarita V. Adve, Kourosh Gharachorloo, “Shared Memory Consistency Models: A Tutorial”
Luc Maranget, etc., “A Tutorial Introduction to the ARM and POWER Relaxed Memory Models”
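C11/C++11 expose this ideal directly: the default ordering, memory_order_seq_cst, makes all atomic operations appear in one total order consistent with each thread's program order. A sketch, revisiting the slide-5 flag example with the defaults (globals and thread bodies mirror the earlier sketch):

/* With the seq_cst defaults, the "both threads in the critical section"
 * outcome is no longer allowed: the compiler emits whatever fences the
 * hardware needs (e.g. MFENCE on x86, DMB/stlr+ldar on ARM). */
#include <stdatomic.h>
#include <stddef.h>

static atomic_int flag1, flag2;        /* both initially 0 */
static int in_cs1, in_cs2;

void *p1_sc(void *arg)
{
    atomic_store(&flag1, 1);           /* defaults to memory_order_seq_cst */
    if (atomic_load(&flag2) == 0)
        in_cs1 = 1;
    return NULL;
}

void *p2_sc(void *arg)
{
    atomic_store(&flag2, 1);
    if (atomic_load(&flag1) == 0)
        in_cs2 = 1;
    return NULL;
}
/* Under sequential consistency, in_cs1 == 1 && in_cs2 == 1 is impossible. */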
10. In Reality, SC Is Not Guaranteed
Memory model                       Local ordering preserved   Multiple-copy atomic
Total store ordering (Intel x86)   X                          O
Relaxed memory model (ARM)         X                          X
(O = guaranteed, X = not guaranteed)
Luc Maranget et al., “A Tutorial Introduction to the ARM and POWER Relaxed Memory Models”
Developers must manage the ordering of memory operations in their own code, e.g. with barriers or language-level atomics (see the sketch below)
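For example, a minimal C11 sketch of the Data/Head program with the ordering managed explicitly via release/acquire (the compiler then emits the barriers the hardware needs, e.g. DMB or stlr/ldar on ARM):

/* Release/acquire publication: the release store to head cannot be
 * reordered before the write to data, and the acquire load of head
 * cannot be reordered after the read of data, so a consumer that sees
 * head == 1 is guaranteed to see data == 2000. */
#include <stdatomic.h>

static int data;                       /* plain data, published via head */
static atomic_int head;                /* 0 = not ready, 1 = ready */

void producer(void)
{
    data = 2000;
    atomic_store_explicit(&head, 1, memory_order_release);
}

int consumer(void)
{
    while (atomic_load_explicit(&head, memory_order_acquire) == 0)
        ;                              /* spin until published */
    return data;                       /* guaranteed to read 2000 */
}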
11. With So Many Hardware Optimizations, How Do I Know
the Program's Behavior (Programmer-observable Behavior)?
• Mathematically rigorous architecture
definitions
– Luc Maranget et al., “A Tutorial Introduction to the
ARM and POWER Relaxed Memory Models”
• Hardware semantics
– Shaked Flur et al., “Modelling the ARMv8
Architecture, Operationally: Concurrency and ISA”
• C/C++11 memory model
• …?
12. Mathematically Rigorous Architecture
Definitions – For Example
• Message Passing (MP)
Luc Maranget et al., “A Tutorial Introduction to the ARM and POWER Relaxed Memory Models”
MP (message passing) litmus test, initially x = y = 0:
  Thread 0: x = 1; y = 1
  Thread 1: r1 = y; r2 = x
Allowed final state? r1 = 1 ∧ r2 = 0
  x86-TSO: forbidden
  ARM: allowed (writes propagate to other threads only in a partial order)
22. Read Copy Update (RCU)
• Read-mostly situations
• Typical RCU splits an update into removal and reclamation phases, so readers are never disrupted
– Removing or replacing references to data items can run concurrently with readers
– Removal: remove pointers to a data structure, so that subsequent readers cannot gain a
reference to it
– Reclamation is deferred until all pre-existing readers are done; RCU provides this implicit
low-overhead communication between readers and reclaimers (synchronize_rcu())
(a kernel-style sketch follows the links below)
https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt
https://lwn.net/Articles/262464/
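A kernel-style sketch of that split, following the classic example in whatisRCU.txt (struct foo, gp and gp_lock are illustrative names; error handling omitted):

/* Removal/reclamation split, kernel style. */
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
    int a;
};

static struct foo __rcu *gp;           /* RCU-protected global pointer */
static DEFINE_SPINLOCK(gp_lock);       /* serializes updaters only */

/* Reader: very cheap, never blocks and is never blocked by the updater. */
int reader(void)
{
    struct foo *p;
    int val = -1;

    rcu_read_lock();
    p = rcu_dereference(gp);
    if (p)
        val = p->a;
    rcu_read_unlock();
    return val;
}

/* Updater: publish a new version (removal), then reclaim the old one. */
void update(int new_a)
{
    struct foo *new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
    struct foo *old_fp;

    new_fp->a = new_a;

    spin_lock(&gp_lock);
    old_fp = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));
    rcu_assign_pointer(gp, new_fp);    /* removal: new readers cannot reach old_fp */
    spin_unlock(&gp_lock);

    synchronize_rcu();                 /* wait for all pre-existing readers to finish */
    kfree(old_fp);                     /* reclamation is now safe */
}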
25. Concurrent malloc(3)
• How to avoid false cache sharing (see the padding sketch below)
– Modern multi-processor systems preserve a coherent
view of memory on a per-cache-line basis
• How to reduce lock contention
Jason Evans, “A Scalable Concurrent malloc(3) Implementation for FreeBSD”
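A minimal sketch of false cache sharing and the padding/alignment fix (64 bytes is an assumed line size): each thread only touches its own counter, but if both counters share a cache line, every increment invalidates the other core's copy of that line.

/* False sharing in one picture: same work, very different cache traffic
 * depending on whether the two counters share a 64-byte line. */
#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64                  /* typical x86 line size (assumption) */

struct shared_counters {               /* a and b share a line -> false sharing */
    long a;
    long b;
};

struct padded_counters {               /* each counter gets its own line */
    _Alignas(CACHE_LINE) long a;
    _Alignas(CACHE_LINE) long b;
};

static struct padded_counters counters;

static void *bump_a(void *arg)
{
    for (long i = 0; i < 100000000; i++)
        counters.a++;                  /* thread 1 touches only its own line */
    return NULL;
}

static void *bump_b(void *arg)
{
    for (long i = 0; i < 100000000; i++)
        counters.b++;                  /* thread 2 touches only its own line */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", counters.a, counters.b);
    return 0;
}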
26. jemalloc
• Whereas phkmalloc was specially optimized to minimize the working set of pages, jemalloc
must be more concerned with cache locality
• jemalloc first tries to minimize memory usage, and tries to allocate contiguously
(weaker from a security standpoint)
• One way of fixing false sharing is to pad allocations, but padding is in direct opposition
to the goal of packing objects as tightly as possible and can cause severe internal
fragmentation; jemalloc instead relies on multiple allocation arenas to reduce the problem
• One of the main goals for this allocator was to reduce lock contention for multi-threaded
applications; earlier strategies pushed locks down so that, rather than a single allocator
lock, each free list had its own lock
• jemalloc's solution is to use multiple arenas for allocation and to assign threads to
arenas by hashing their thread identifiers (see the sketch below)
Jason Evans, “A Scalable Concurrent malloc(3) Implementation for FreeBSD”
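Toy code, not jemalloc's source, sketching the thread-to-arena idea: hash the thread identifier to one of several arenas, each with its own lock, so threads mapped to different arenas never contend.

/* Spread lock contention across NARENAS locks instead of one global lock. */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define NARENAS 8                      /* jemalloc sizes this from the CPU count */

struct arena {
    pthread_mutex_t lock;
    /* per-arena chunks / runs / free lists would live here */
};

static struct arena arenas[NARENAS];
static pthread_once_t arenas_once = PTHREAD_ONCE_INIT;

static void arenas_init(void)
{
    for (int i = 0; i < NARENAS; i++)
        pthread_mutex_init(&arenas[i].lock, NULL);
}

static struct arena *arena_for_this_thread(void)
{
    /* Hash the thread identifier to an arena index (pthread_t is treated
     * as an integer here purely for illustration). */
    uintptr_t id = (uintptr_t)pthread_self();
    return &arenas[(id >> 4) % NARENAS];
}

void *toy_malloc(size_t size)
{
    struct arena *a;
    void *p;

    pthread_once(&arenas_once, arenas_init);
    a = arena_for_this_thread();
    pthread_mutex_lock(&a->lock);      /* threads in other arenas do not contend */
    p = malloc(size);                  /* stand-in for real per-arena allocation */
    pthread_mutex_unlock(&a->lock);
    return p;
}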
28. Linux Scalability to Many Cores -
Per-core Mount Caches
Silas Boyd-Wickizer et al., “An Analysis of Linux Scalability to Many Cores”
• Observation: mount table is
rarely modified
• Common case: cores access
per-core tables
• Modify mount table: invalidate
per-core tables
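A rough userspace analogy of the idea (toy code, not the kernel implementation; sched_getcpu() stands in for "which core am I on", and allocation errors plus reclamation of old cache entries are ignored):

/* Read-mostly table with a per-core cache: in the common case a core
 * reads only its own slot, so nothing is shared; the rare writer updates
 * the shared table and invalidates every per-core copy. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_CORES   64
#define TABLE_SIZE  16

struct cached { int key; int value; };            /* immutable once published */

static int shared_table[TABLE_SIZE];              /* the rarely-modified table */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic(struct cached *) percore[MAX_CORES];   /* NULL = invalidated */

int lookup(int key)
{
    unsigned core = (unsigned)sched_getcpu() % MAX_CORES;
    struct cached *c = atomic_load(&percore[core]);

    if (c && c->key == key)                       /* common case: core-local only */
        return c->value;

    /* Miss: consult the shared table and repopulate this core's slot while
     * holding the lock, so repopulation cannot race with invalidation. */
    struct cached *n = malloc(sizeof(*n));        /* old entries are leaked in this toy */
    pthread_mutex_lock(&table_lock);
    n->key = key;
    n->value = shared_table[key % TABLE_SIZE];
    atomic_store(&percore[core], n);
    pthread_mutex_unlock(&table_lock);
    return n->value;
}

void modify(int key, int value)
{
    pthread_mutex_lock(&table_lock);
    shared_table[key % TABLE_SIZE] = value;       /* the rare update */
    for (int i = 0; i < MAX_CORES; i++)           /* invalidate every per-core copy */
        atomic_store(&percore[i], NULL);
    pthread_mutex_unlock(&table_lock);
}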
29. Linux Scalability to Many Cores -
Sloppy Counters
• Because frequently updating a shared reference count is slow (its cache line bounces
between cores), each core keeps a sloppy local count and the exact total is computed
only when it is actually needed (see the sketch below)
Silas Boyd-Wickizer et al., “An Analysis of Linux Scalability to Many Cores”
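A simplified per-core counter in the spirit of the paper's sloppy counters (toy code; the real design additionally keeps a central counter and per-core spare references): updates touch only a core-local cache line, and the exact total is computed on the rare slow path by summing.

/* Per-core counter: increments/decrements never contend across cores. */
#include <stdatomic.h>

#define MAX_CORES  64
#define CACHE_LINE 64                  /* assumed line size */

struct sloppy_counter {
    struct {
        _Alignas(CACHE_LINE) _Atomic long v;      /* one cache line per core */
    } percore[MAX_CORES];
};

void counter_inc(struct sloppy_counter *c, int core)
{
    atomic_fetch_add_explicit(&c->percore[core].v, 1, memory_order_relaxed);
}

void counter_dec(struct sloppy_counter *c, int core)
{
    atomic_fetch_sub_explicit(&c->percore[core].v, 1, memory_order_relaxed);
}

/* Slow path, e.g. "is this object still referenced?" at teardown time. */
long counter_read(const struct sloppy_counter *c)
{
    long sum = 0;
    for (int i = 0; i < MAX_CORES; i++)
        sum += atomic_load_explicit(&c->percore[i].v, memory_order_relaxed);
    return sum;
}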
30. Some Concurrency Security Examples
• Concurrency fuzzer
– Sebastian Burckhardt et al., “A Randomized
Scheduler with Probabilistic Guarantees of Finding
Bugs”
• Timing side channel attack
– Yeongjin Jang et al., “Breaking Kernel Address
Space Layout Randomization with Intel TSX”
31. Concurrency Fuzzer-
Randomized Scheduler
Sebastian Burckhardt et al., “A Randomized Scheduler with Probabilistic
Guarantees of Finding Bugs”
• Randomizes scheduling decisions to find ordering and atomicity violations
• Basically, read/write reordering done by the hardware is not simulated
(a much simplified sketch of schedule randomization follows)
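Not the paper's PCT algorithm (which assigns random thread priorities and priority change points to obtain its probabilistic guarantee), just a much simplified sketch of the general idea of randomizing the schedule at instrumented points:

/* Randomly perturb the schedule at instrumented points.  Call this
 * before/after shared-memory accesses or synchronization operations. */
#include <sched.h>
#include <stdlib.h>
#include <time.h>

/* One PRNG state per thread so the instrumentation itself is race-free. */
static _Thread_local unsigned fuzz_seed = 0x9e3779b9u;

void fuzz_schedule_point(void)
{
    switch (rand_r(&fuzz_seed) % 4) {
    case 0:                            /* let another runnable thread go first */
        sched_yield();
        break;
    case 1: {                          /* short random delay, up to ~100 us */
        struct timespec ts = { 0, (long)(rand_r(&fuzz_seed) % 100) * 1000L };
        nanosleep(&ts, NULL);
        break;
    }
    default:                           /* usually: just run through */
        break;
    }
}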
33. Intel Transactional Synchronization
Extensions
• The assembly instruction xbegin can return various
results that represent the hardware's suggestions for
how to proceed and reasons for failure: success, a
suggestion to retry, a potential cause for the abort
• To effectively use TSX it is imperative to understand its
implementation and limitations. TSX is implemented
using the cache coherence protocol, which x86
machines already implement. When a transaction
begins, the processor starts tracking the read and write
sets of cache lines which have been brought into the L1
cache. If at any point during a logical core's execution
of a transaction another core modifies a cache line in
the read or write set, then the transaction is aborted
(see the RTM sketch below).
Nick Stanley, “Hardware Transactional Memory with Intel’s TSX”
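A minimal RTM sketch using the <immintrin.h> intrinsics (requires a TSX-capable CPU and compiling with -mrtm; the fallback spinlock and counter are illustrative): try the update as a hardware transaction, and fall back to a conventional lock when the returned status says the transaction aborted.

/* Transactional fast path with a spinlock fallback (lock elision pattern). */
#include <immintrin.h>
#include <stdatomic.h>

static atomic_int fallback_lock;       /* 0 = free, 1 = held */
static long shared_counter;

static void lock_fallback(void)
{
    while (atomic_exchange_explicit(&fallback_lock, 1, memory_order_acquire))
        while (atomic_load_explicit(&fallback_lock, memory_order_relaxed))
            _mm_pause();
}

static void unlock_fallback(void)
{
    atomic_store_explicit(&fallback_lock, 0, memory_order_release);
}

void increment(void)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        /* Abort if someone currently holds the fallback lock: reading it
         * puts it in our read set, so a later lock acquisition aborts us. */
        if (atomic_load_explicit(&fallback_lock, memory_order_relaxed))
            _xabort(0xff);
        shared_counter++;              /* tracked in the L1 write set */
        _xend();                       /* commit */
        return;
    }
    /* status carries the hardware's hint: _XABORT_RETRY, _XABORT_CONFLICT,
     * _XABORT_CAPACITY, or _XABORT_EXPLICIT with _XABORT_CODE(status). */
    lock_fallback();
    shared_counter++;
    unlock_fallback();
}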
34. Intel Transactional Synchronization
Extensions - Suppressing exceptions
• A transaction aborts when a hardware exception (e.g., a page fault) occurs during the
execution of the transaction. However, unlike normal situations where the
OS intervenes and handles these exceptions gracefully, TSX instead
invokes a user-specified abort handler, without informing the underlying
OS. More precisely, TSX treats these exceptions in a synchronous
manner—immediately executing an abort handler while suppressing the
exception itself. In other words, the exception inside the transaction will
not be communicated to the underlying OS. This allows us to engage in
abnormal behavior (e.g., attempting to access privileged, i.e., kernel,
memory regions) without worrying about crashing the program. In DrK,
we break KASLR by turning this surprising behavior into a timing channel
that leaks the status (e.g., mapped or unmapped) of all kernel pages.
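The suppression behaviour described above can be demonstrated in a few lines (requires a TSX-capable CPU and -mrtm; the kernel-probing and timing parts of DrK are deliberately not shown): a faulting access inside a transaction does not raise SIGSEGV, the transaction simply aborts and execution continues at the abort path.

/* A fault inside an RTM transaction is suppressed: the transaction aborts
 * instead of the process receiving a signal. */
#include <immintrin.h>
#include <stdio.h>

int probe(const volatile char *addr)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        (void)*addr;                   /* a fault here aborts the transaction */
        _xend();
        return 1;                      /* committed: address was readable */
    }
    return 0;                          /* aborted: the fault was suppressed, no signal
                                        * (spurious aborts may need retries in practice) */
}

int main(void)
{
    char ok = 'x';
    printf("valid address:   %d\n", probe(&ok));
    printf("invalid address: %d\n", probe((const volatile char *)1));  /* no SIGSEGV */
    return 0;
}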
35. Timing Side Channel Attack
• TSX instead invokes a user-
specified abort handler, without
informing the underlying OS
• In other words, from user space we can
learn the randomized kernel address layout,
i.e. break KASLR (!!!)
Yeongjin Jang et al., “Breaking Kernel Address Space Layout
Randomization with Intel TSX”
36. Reference
• Sarita V. Adve, Kourosh Gharachorloo, “Shared Memory Consistency
Models: A Tutorial”
• Luc Maranget et al., “A Tutorial Introduction to the ARM and POWER
Relaxed Memory Models”
• Shaked Flur et al., “Modelling the ARMv8 Architecture, Operationally:
Concurrency and ISA”
• https://www.youtube.com/watch?v=6QU37TwRO4w
• http://www.cl.cam.ac.uk/~sf502/popl16/help.html
• Jade Alglave et al., “The Semantics of Power and ARM Multiprocessor
Machine Code”
• Paul E. McKenney, “Memory Barriers: a Hardware View for Software
Hackers”
37. Reference
C/C++ 11 memory model
• https://www.youtube.com/watch?v=S-x-23lrRnc
• Reinoud Elhorst, “Lowering C11 Atomics for ARM in LLVM”
• Torvald Riegel, “Modern C/C++ concurrency”
• Mark Batty, “Mathematizing C++ Concurrency”
LMAX
• https://github.com/LMAX-Exchange/disruptor
• https://martinfowler.com/articles/lmax.html
• http://mechanitis.blogspot.tw/2011/06/dissecting-disruptor-how-do-i-read-
from.html
RCU
• https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt
• https://lwn.net/Articles/262464/
• https://lwn.net/Articles/253651/
• https://lwn.net/Articles/264090/
38. Reference
Concurrent malloc(3)
• Jason Evans, “A Scalable Concurrent malloc(3) Implementation
for FreeBSD”
Concurrency security
• Sebastian Burckhardt et al., “A Randomized Scheduler with
Probabilistic Guarantees of Finding Bugs”
• Ralf-Philipp Weinmann et al., “Concurrency: A problem and
opportunity in the exploitation of memory corruptions”
• Yeongjin Jang et al., “Breaking Kernel Address Space Layout
Randomization with Intel TSX”
• Nick Stanley, “Hardware Transactional Memory with Intel’s
TSX” (includes suggested patterns for writing concurrent code with Intel TSX)