This document discusses concepts related to lock-free and concurrent programming including parallel computing, memory barriers, volatile variables, atomic operations, and solving problems like false sharing and the ABA problem. It provides explanations and examples of memory ordering models like release-acquire, release-consume, and sequential consistency. Data structures like lock-free stacks and queues are presented along with their algorithms. Benchmark results comparing concurrent queues to lock-free implementations are shown. References for further reading on Java concurrency, C++ concurrency, and multiprocessor programming are provided.
5. Processor-Guaranteed Atomic Operations
• Bus Lock
• https://software.intel.com/en-us/node/544402
• LOCK# signal
• Cache Lock
• Between CPU and Memory
• Cache Coherence
6. Memory Barrier
• Memory Barrier
• https://en.wikipedia.org/wiki/Memory_barrier
• Causes a CPU or compiler to enforce an ordering constraint on
memory operations issued before and after the barrier instruction
• Compile-time Memory Ordering
• atomic_signal_fence(memory_order_acq_rel);
• Forbids the compiler from reordering reads and writes across it
(atomic_thread_fence additionally constrains the CPU at runtime)
7. Memory Ordering
• Memory Ordering
• https://en.wikipedia.org/wiki/Memory_ordering
• The runtime order of accesses to computer memory by a CPU
• Sequential Consistency
• All reads and all writes are in-order
• Relaxed consistency
• Some types of reordering are allowed
• Weak consistency
• Reads and writes are arbitrarily reordered, limited only by explicit
memory barriers
11. Relaxed Ordering
• Atomicity
• Modification order consistency
• Example
• A is sequenced-before B, C is sequenced-before D
• Is it allowed to produce r1 == r2 == 42?
• Reference counters of std::shared_ptr
12. Relaxed Ordering
// thread 1
r1 = y.load(memory_order_relaxed); // A
x.store(r1, memory_order_relaxed); // B
// thread 2
r2 = x.load(memory_order_relaxed); // C
y.store(42, memory_order_relaxed); // D
// possible order
y.store(42, memory_order_relaxed);
r1 = y.load(memory_order_relaxed);
x.store(r1, memory_order_relaxed);
r2 = x.load(memory_order_relaxed);
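The shared_ptr use case the slide mentions can be sketched as a standalone counter. Relaxed ordering suffices here because only the counter's own modification order matters; a real shared_ptr additionally needs release/acquire on the decrement-to-zero path, which this sketch omits.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Counter bumped with relaxed ordering: each fetch_add is atomic and all
// threads agree on the counter's modification order, but no surrounding
// memory accesses are ordered by it.
std::atomic<int> ref_count{0};

int relaxed_count(int num_threads, int per_thread) {
  ref_count.store(0, std::memory_order_relaxed);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([per_thread] {
      for (int i = 0; i < per_thread; ++i)
        ref_count.fetch_add(1, std::memory_order_relaxed);
    });
  }
  for (auto& th : threads) th.join();
  return ref_count.load(std::memory_order_relaxed);
}
```

No increment is lost despite the relaxed ordering, because atomicity of each read-modify-write is unconditional; the memory order only governs visibility of *other* data.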
13. Release-Acquire Ordering
• Between the threads releasing and acquiring the same
atomic variable
• Everything written before the release store becomes visible
after the acquire load that reads the stored value
• Example
• A sequenced-before B sequenced-before C
• C synchronizes-with D
• D sequenced-before E sequenced-before F
14. Release-Acquire Ordering
atomic<string*> ptr;
int data;
void producer() {
string* p = new string("Hello"); // A
data = 42; // B
ptr.store(p, memory_order_release); // C
}
void consumer() {
string* p2;
while (!(p2 = ptr.load(memory_order_acquire))); // D
assert(*p2 == "Hello"); // E
assert(data == 42); // F
}
thread t1(producer);
thread t2(consumer);
15. Release-Consume ordering
• Data-dependency relationship
• Example
• A sequenced-before B sequenced-before C
• C dependency-ordered-before D
• D sequenced-before E sequenced-before F
• A happens-before E: E reads *p2, which carries a data
dependency from the consume load D
• B does not happen-before F: data is not data-dependent
on ptr, so the assert at F may fire
• Discouraged: the C++17 standard temporarily discourages
memory_order_consume; implementations promote it to acquire
16. Release-Consume ordering
atomic<string*> ptr;
int data;
void producer() {
string* p = new string("Hello"); // A
data = 42; // B
ptr.store(p, memory_order_release); // C
}
void consumer() {
string* p2;
while (!(p2 = ptr.load(memory_order_consume))); // D
assert(*p2 == "Hello"); // E
assert(data == 42); // F
}
thread t1(producer);
thread t2(consumer);
17. Sequentially-Consistent Ordering
• Orders memory the same way as release-acquire ordering
• Additionally establishes a single total modification order
over all sequentially-consistent atomic operations
• Example
• Is r1 == r2 == 0 possible? (No: the single total order forbids it)
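The question likely refers to the classic store-buffering litmus test; a minimal sketch (the variable names are assumptions, since the deck's code for this slide is not shown):

```cpp
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

// Store-buffering litmus test. Under seq_cst there is a single total order
// over all four operations; whichever load comes last in that order must
// observe the other thread's store, so r1 == r2 == 0 cannot occur.
std::pair<int, int> run_once() {
  x.store(0);
  y.store(0);
  std::thread t1([] {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
  });
  std::thread t2([] {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
  });
  t1.join();
  t2.join();
  return {r1, r2};
}
```

With both operations downgraded to memory_order_relaxed (or on hardware store buffers without a fence), the (0, 0) outcome becomes legal.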
21. Atomic Compare and Exchange
• compare_exchange_weak
• Allowed to fail spuriously
• May act as if *this != expected even when they are equal
• May require a loop
• compare_exchange_strong
• Distinguishes spurious failure from concurrent access
• Needs extra overhead to retry in the case of failure
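A small sketch of the behavior both variants share: on failure, `expected` is overwritten with the actual value, which is exactly what makes a retry loop work without an extra load.

```cpp
#include <atomic>

// Demonstrates compare_exchange failure semantics: a failed CAS writes the
// atomic's actual value back into `expected`.
int cas_demo() {
  std::atomic<int> v{5};
  int expected = 3;                                  // deliberately wrong guess
  bool ok = v.compare_exchange_strong(expected, 7);  // fails; expected -> 5
  if (ok || expected != 5) return -1;
  ok = v.compare_exchange_strong(expected, 7);       // now succeeds; v -> 7
  return ok ? v.load() : -1;
}
```

The strong variant is used here so the outcome is deterministic; with compare_exchange_weak the same code would need a loop to tolerate spurious failures.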
22. Concurrency Control
• Pessimistic
• Blocking until the possibility of violation disappears
• Optimistic
• Collisions between transactions will rarely occur
• Use resources without acquiring locks
• On conflict, the committing transaction rolls back and restarts
• Compare and Swap
do {
expected = resource;
new_value = some operation on expected;
} while (compare_and_swap(resource, expected, new_value) == false);
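The do/while pattern above, made concrete as an atomic-maximum helper (the function name and use case are illustrative, not from the deck): compute a new value optimistically from a snapshot, commit with CAS, and retry if another thread got there first.

```cpp
#include <atomic>

// Optimistic concurrency via CAS: raise `resource` to `candidate` if it is
// larger, retrying on conflict. A failed compare_exchange_weak reloads
// `expected` with the current value, so the condition is simply re-checked.
void atomic_max(std::atomic<int>& resource, int candidate) {
  int expected = resource.load();
  while (candidate > expected &&
         !resource.compare_exchange_weak(expected, candidate)) {
    // retry with the freshly observed value in `expected`
  }
}
```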
24. Lock-Free Stack
• Treiber (1986) Algorithm
• https://en.wikipedia.org/wiki/Treiber_Stack
• 《Treiber, R.K., 1986. Systems programming: Coping with
parallelism. International Business Machines Incorporated,
Thomas J. Watson Research Center.》
26. Lock-Free Stack
void Push(T* node) {
uint64_t last_top = 0;
uint64_t node_ptr = reinterpret_cast<uint64_t>(node);
do {
// Take out the top node of the stack
last_top = top_.load(memory_order_acquire);
// Add a new node as the top of the stack, and point to the old top
node->next = reinterpret_cast<T*>(last_top);
// If the top node is modified by other threads, discard this operation and retry
} while (!top_.compare_exchange_weak(last_top, node_ptr));
}
28. Lock-Free Stack
T* Pop() {
T* top = nullptr;
uint64_t top_ptr = 0, new_top_ptr = 0;
do {
// Take out the top node of the stack
top_ptr = top_.load(memory_order_acquire);
top = reinterpret_cast<T*>(top_ptr);
// Empty stack
if (!top) {
return nullptr;
}
// Set the next node of the top node as the new top of the stack
new_top_ptr = reinterpret_cast<uint64_t>(top->next);
// If the top node is modified by other threads, discard this operation and retry
} while (!top_.compare_exchange_weak(top_ptr, new_top_ptr));
return top;
}
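The Push and Pop slides can be combined into one compilable sketch. The class skeleton (`top_`, the node type) is an assumption, since the deck shows only the member functions; note that Pop as written remains exposed to the ABA problem covered in slides 39-40.

```cpp
#include <atomic>
#include <cstdint>

// Minimal Treiber stack assembled from the two slides above.
// T must expose a `T* next` member.
template <class T>
class LockfreeStack {
 public:
  void Push(T* node) {
    uint64_t last_top = top_.load(std::memory_order_acquire);
    uint64_t node_ptr = reinterpret_cast<uint64_t>(node);
    do {
      // Point the new node at the current top; a failed CAS reloads last_top.
      node->next = reinterpret_cast<T*>(last_top);
    } while (!top_.compare_exchange_weak(last_top, node_ptr));
  }

  T* Pop() {
    uint64_t top_ptr = top_.load(std::memory_order_acquire);
    T* top;
    do {
      top = reinterpret_cast<T*>(top_ptr);
      if (!top) return nullptr;  // empty stack
      // Try to swing top_ to the second node; retry if top_ changed.
    } while (!top_.compare_exchange_weak(
        top_ptr, reinterpret_cast<uint64_t>(top->next)));
    return top;
  }

 private:
  std::atomic<uint64_t> top_{0};
};

// Illustrative element type satisfying the `next` requirement.
struct IntNode {
  explicit IntNode(int v) : value(v) {}
  int value;
  IntNode* next{nullptr};
};
```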
30. Lock-Free Queue
• Michael & Scott (1996) Algorithm
• Java ConcurrentLinkedQueue
• 《Michael, Maged; Scott, Michael (1996). Simple, Fast, and
Practical Non-Blocking and Blocking Concurrent Queue
Algorithms. Proc. 15th Annual ACM Symp. on Principles of
Distributed Computing (PODC). pp. 267–275.
doi:10.1145/248052.248106. ISBN 0-89791-800-2.》
31. Lock-Free Queue
// Copyright 2016, Xiaojie Chen. All rights reserved.
// https://github.com/vorfeed/naesala
struct IListNode {
IListNode(uint64_t next) : next(next) {}
atomic<uint64_t> next;
};
template <class T>
class LockfreeList {
public:
// Both head and tail point to a dummy if queue is empty
LockfreeList() : dummy_(reinterpret_cast<uint64_t>(new T())),
head_(dummy_), tail_(dummy_) {}
private:
static_assert(is_base_of<IListNode, T>::value, "");
uint64_t dummy_;
atomic<uint64_t> head_, tail_;
};
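A conforming element type for this skeleton might look as follows (`Order` and its fields are illustrative; the `IListNode` base from the slide is restated so the snippet compiles on its own). T must derive from the node base to satisfy the static_assert, and be default-constructible so the constructor can allocate the dummy.

```cpp
#include <atomic>
#include <cstdint>
#include <type_traits>

// Node base from the slide: intrusive next pointer stored as a uint64_t.
struct IListNode {
  explicit IListNode(uint64_t next = 0) : next(next) {}
  std::atomic<uint64_t> next;
};

// Illustrative queue element: derives from IListNode and is
// default-constructible, as LockfreeList<Order> would require.
struct Order : IListNode {
  Order() : IListNode(0) {}
  int id{0};
  double price{0.0};
};

static_assert(std::is_base_of<IListNode, Order>::value,
              "Order satisfies the LockfreeList element requirement");
```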
32. Lock-Free Queue
void Put(T* node) {
while (true) {
// The tail node of the queue
uint64_t tail_ptr = tail_.load(memory_order_acquire);
T* tail = reinterpret_cast<T*>(tail_ptr);
// The next node of the tail node
uint64_t tail_next_ptr = tail->next.load(memory_order_acquire);
T* tail_next = reinterpret_cast<T*>(tail_next_ptr);
// If the next node of tail node is modified by other threads
if (tail_next) {
// Try to help other threads to swing tail to the next node, and then retry
tail_.compare_exchange_strong(tail_ptr, reinterpret_cast<uint64_t>(tail_next));
// Else try to link node at the end of the queue
} else if (tail->next.compare_exchange_weak(tail_next_ptr,
reinterpret_cast<uint64_t>(node))) {
// If successful, try to swing Tail to the inserted node
// Can also be done by other threads
tail_.compare_exchange_strong(tail_ptr, reinterpret_cast<uint64_t>(node));
break;
}
}
}
33. Lock-Free Queue
[Diagram: three states during Put]
Head → Dummy → Node1 → Node2 ← Tail
Head → Dummy → Node1 → Node2 → Node3 (Node3 linked; Tail still at Node2)
Head → Dummy → Node1 → Node2 → Node3 ← Tail (Tail swung to Node3)
34. Lock-Free Queue
T* Take() {
while (true) {
// The head node of the queue
uint64_t head_ptr = head_.load(memory_order_acquire);
T* head = reinterpret_cast<T*>(head_ptr);
// The tail node of the queue
uint64_t tail_ptr = tail_.load(memory_order_acquire);
T* tail = reinterpret_cast<T*>(tail_ptr);
// The next node of the head node
uint64_t head_next_ptr = head->next.load(memory_order_acquire);
T* head_next = reinterpret_cast<T*>(head_next_ptr);
// Empty queue or the tail falling behind
if (head == tail) {
// Empty queue, couldn’t pop
if (!head_next) {
return nullptr;
}
// another thread is pushing and the tail is falling behind, try to advance it
tail_.compare_exchange_strong(tail_ptr, reinterpret_cast<uint64_t>(head_next));
} else {
// Queue is not empty, do pop operation
}
}
return nullptr;
}
35. Lock-Free Queue
// pop operation
// another thread had just taken a node
if (!head_next) {
continue;
}
// copy the next node of the head node to a buffer
T data(*head_next);
// Try to swing head to the next node
if (head_.compare_exchange_weak(head_ptr, reinterpret_cast<uint64_t>(head_next))) {
// If successful, copy the buffer data to the head node
*head = move(data);
// Clear the next node pointer of the head node
head->next.store(0, memory_order_release);
// Return the head node
return head;
}
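Slides 31-35 can be condensed into one compilable sketch. This version differs from the deck in three hedged ways: ordinary `Node*` atomics replace the uint64_t casts, the value is copied out rather than moved into the old dummy, and dequeued nodes are intentionally leaked because safe reclamation (hazard pointers, or the tagged pointers of slides 39-40) is out of scope here.

```cpp
#include <atomic>

// Condensed Michael & Scott queue of ints with a dummy head node.
struct Node {
  std::atomic<Node*> next{nullptr};
  int value{0};
};

class MsQueue {
 public:
  MsQueue() {
    Node* dummy = new Node;
    head_.store(dummy);
    tail_.store(dummy);
  }

  void Put(int v) {
    Node* node = new Node;
    node->value = v;
    while (true) {
      Node* tail = tail_.load(std::memory_order_acquire);
      Node* next = tail->next.load(std::memory_order_acquire);
      if (next) {  // tail lagging: help the other thread advance it, retry
        tail_.compare_exchange_strong(tail, next);
      } else if (tail->next.compare_exchange_weak(next, node)) {
        // Linked; try to swing tail (another thread may do it for us).
        tail_.compare_exchange_strong(tail, node);
        return;
      }
    }
  }

  bool Take(int* out) {
    while (true) {
      Node* head = head_.load(std::memory_order_acquire);
      Node* tail = tail_.load(std::memory_order_acquire);
      Node* next = head->next.load(std::memory_order_acquire);
      if (head == tail) {
        if (!next) return false;                    // empty queue
        tail_.compare_exchange_strong(tail, next);  // help advance lagging tail
      } else {
        if (!next) continue;  // another thread just took the node
        int v = next->value;  // read the value before publishing the new head
        if (head_.compare_exchange_weak(head, next)) {
          *out = v;
          // `head` is retired here; leaked to sidestep safe reclamation.
          return true;
        }
      }
    }
  }

 private:
  std::atomic<Node*> head_, tail_;
};
```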
39. ABA Problem
void Push(T* node) {
uint64_t last_top_combine = 0;
uint64_t node_combine = Combine(node);
do {
last_top_combine = top_.load(memory_order_acquire);
node->next = Pointer<T>(last_top_combine);
// If the top of the stack is still last_top_combine, assume no one has changed the stack
// (That statement is not always true because of the ABA problem)
// Atomically replace top with new node
} while (!top_.compare_exchange_weak(last_top_combine, node_combine));
}
40. ABA Problem
T* Pop() {
T* top = nullptr;
uint64_t top_combine = 0, new_top_combine = 0;
do {
top_combine = top_.load(memory_order_acquire);
top = Pointer<T>(top_combine);
if (!top) {
return nullptr;
}
new_top_combine = Combine(top->next);
// If the top of the stack is still top_combine, assume no one has changed the stack
// (That statement is not always true because of the ABA problem)
// Atomically replace top with next
} while (!top_.compare_exchange_weak(top_combine, new_top_combine));
return top;
}
42. Reference
• 《Java Concurrency in Practice》
• 《The Art of Multiprocessor Programming》
• 《C++ Concurrency In Action》
• http://open-std.org
• java.util.concurrent
• https://github.com/vorfeed/naesala/lockfree