This document discusses concepts related to lock-free and concurrent programming including parallel computing, memory barriers, volatile variables, atomic operations, and solving problems like false sharing and the ABA problem. It provides explanations and examples of memory ordering models like release-acquire, release-consume, and sequential consistency. Data structures like lock-free stacks and queues are presented along with their algorithms. Benchmark results comparing concurrent queues to lock-free implementations are shown. References for further reading on Java concurrency, C++ concurrency, and multiprocessor programming are provided.
5. Processor-Guaranteed Atomic Operations
• Bus Lock
• https://software.intel.com/en-us/node/544402
• LOCK# signal
• Cache Lock
• Between CPU and Memory
• Cache Coherence
6. Memory Barrier
• Memory Barrier
• https://en.wikipedia.org/wiki/Memory_barrier
• Causes a CPU or compiler to enforce an ordering constraint on
memory operations issued before and after the barrier instruction
• Compile-time Memory Ordering
• atomic_signal_fence(memory_order_acq_rel);
• Forbids the compiler from reordering reads and writes across it
(atomic_thread_fence additionally constrains the CPU at runtime)
7. Memory Ordering
• Memory Ordering
• https://en.wikipedia.org/wiki/Memory_ordering
• The runtime order of accesses to computer memory by a CPU
• Sequential Consistency
• All reads and all writes are in-order
• Relaxed consistency
• Some types of reordering are allowed
• Weak consistency
• Reads and writes are arbitrarily reordered, limited only by explicit
memory barriers
11. Relaxed Ordering
• Atomicity
• Modification order consistency
• Example
• A is sequenced-before B, C is sequenced-before D
• Is it allowed to produce r1 == r2 == 42?
• Reference counters of std::shared_ptr
12. Relaxed Ordering
// thread 1
r1 = y.load(memory_order_relaxed); // A
x.store(r1, memory_order_relaxed); // B
// thread 2
r2 = x.load(memory_order_relaxed); // C
y.store(42, memory_order_relaxed); // D
// possible order
y.store(42, memory_order_relaxed);
r1 = y.load(memory_order_relaxed);
x.store(r1, memory_order_relaxed);
r2 = x.load(memory_order_relaxed);
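The shared_ptr use case the slide mentions can be sketched as a standalone counter. Relaxed ordering suffices here because only the counter's own modification order matters; a real shared_ptr additionally needs release/acquire on the decrement-to-zero path, which this sketch omits.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Counter bumped with relaxed ordering: each fetch_add is atomic and all
// threads agree on the counter's modification order, but no surrounding
// memory accesses are ordered by it.
std::atomic<int> ref_count{0};

int relaxed_count(int num_threads, int per_thread) {
  ref_count.store(0, std::memory_order_relaxed);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([per_thread] {
      for (int i = 0; i < per_thread; ++i)
        ref_count.fetch_add(1, std::memory_order_relaxed);
    });
  }
  for (auto& th : threads) th.join();
  return ref_count.load(std::memory_order_relaxed);
}
```

No increment is lost despite the relaxed ordering, because atomicity of each read-modify-write is unconditional; the memory order only governs visibility of *other* data.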
13. Release-Acquire Ordering
• Between the threads releasing and acquiring the same
atomic variable
• Everything written before the release store becomes visible
after the acquire load that reads the stored value
• Example
• A sequenced-before B sequenced-before C
• C synchronizes-with D
• D sequenced-before E sequenced-before F
14. Release-Acquire Ordering
atomic<string*> ptr;
int data;
void producer() {
string* p = new string("Hello"); // A
data = 42; // B
ptr.store(p, memory_order_release); // C
}
void consumer() {
string* p2;
while (!(p2 = ptr.load(memory_order_acquire))); // D
assert(*p2 == "Hello"); // E
assert(data == 42); // F
}
thread t1(producer);
thread t2(consumer);
15. Release-Consume ordering
• Data-dependency relationship
• Example
• A sequenced-before B sequenced-before C
• C dependency-ordered-before D
• D sequenced-before E sequenced-before F
• A happens-before E: E reads *p2, which carries a data
dependency from the consume load D
• B does not happen-before F: data is not data-dependent
on ptr, so the assert at F may fire
• Discouraged: the C++17 standard temporarily discourages
memory_order_consume; implementations promote it to acquire
16. Release-Consume ordering
atomic<string*> ptr;
int data;
void producer() {
string* p = new string("Hello"); // A
data = 42; // B
ptr.store(p, memory_order_release); // C
}
void consumer() {
string* p2;
while (!(p2 = ptr.load(memory_order_consume))); // D
assert(*p2 == "Hello"); // E
assert(data == 42); // F
}
thread t1(producer);
thread t2(consumer);
17. Sequentially-Consistent Ordering
• Orders memory the same way as release-acquire ordering
• Additionally establishes a single total modification order
over all sequentially-consistent atomic operations
• Example
• Is r1 == r2 == 0 possible? (No: the single total order forbids it)
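The question likely refers to the classic store-buffering litmus test; a minimal sketch (the variable names are assumptions, since the deck's code for this slide is not shown):

```cpp
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

// Store-buffering litmus test. Under seq_cst there is a single total order
// over all four operations; whichever load comes last in that order must
// observe the other thread's store, so r1 == r2 == 0 cannot occur.
std::pair<int, int> run_once() {
  x.store(0);
  y.store(0);
  std::thread t1([] {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
  });
  std::thread t2([] {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
  });
  t1.join();
  t2.join();
  return {r1, r2};
}
```

With both operations downgraded to memory_order_relaxed (or on hardware store buffers without a fence), the (0, 0) outcome becomes legal.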
21. Atomic Compare and Exchange
• compare_exchange_weak
• Allowed to fail spuriously
• May act as if *this != expected even when they are equal
• May require a loop
• compare_exchange_strong
• Distinguishes spurious failure from concurrent access
• Needs extra overhead to retry in the case of failure
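A small sketch of the behavior both variants share: on failure, `expected` is overwritten with the actual value, which is exactly what makes a retry loop work without an extra load.

```cpp
#include <atomic>

// Demonstrates compare_exchange failure semantics: a failed CAS writes the
// atomic's actual value back into `expected`.
int cas_demo() {
  std::atomic<int> v{5};
  int expected = 3;                                  // deliberately wrong guess
  bool ok = v.compare_exchange_strong(expected, 7);  // fails; expected -> 5
  if (ok || expected != 5) return -1;
  ok = v.compare_exchange_strong(expected, 7);       // now succeeds; v -> 7
  return ok ? v.load() : -1;
}
```

The strong variant is used here so the outcome is deterministic; with compare_exchange_weak the same code would need a loop to tolerate spurious failures.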
22. Concurrency Control
• Pessimistic
• Blocking until the possibility of violation disappears
• Optimistic
• Collisions between transactions will rarely occur
• Use resources without acquiring locks
• On conflict, the committing transaction rolls back and restarts
• Compare and Swap
do {
expected = resource;
new_value = some operation on expected;
} while (compare_and_swap(resource, expected, new_value) == false);
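The do/while pattern above, made concrete as an atomic-maximum helper (the function name and use case are illustrative, not from the deck): compute a new value optimistically from a snapshot, commit with CAS, and retry if another thread got there first.

```cpp
#include <atomic>

// Optimistic concurrency via CAS: raise `resource` to `candidate` if it is
// larger, retrying on conflict. A failed compare_exchange_weak reloads
// `expected` with the current value, so the condition is simply re-checked.
void atomic_max(std::atomic<int>& resource, int candidate) {
  int expected = resource.load();
  while (candidate > expected &&
         !resource.compare_exchange_weak(expected, candidate)) {
    // retry with the freshly observed value in `expected`
  }
}
```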
24. Lock-Free Stack
• Treiber (1986) Algorithm
• https://en.wikipedia.org/wiki/Treiber_Stack
• 《Treiber, R.K., 1986. Systems programming: Coping with
parallelism. International Business Machines Incorporated,
Thomas J. Watson Research Center.》
26. Lock-Free Stack
void Push(T* node) {
uint64_t last_top = 0;
uint64_t node_ptr = reinterpret_cast<uint64_t>(node);
do {
// Take out the top node of the stack
last_top = top_.load(memory_order_acquire);
// Add a new node as the top of the stack, and point to the old top
node->next = reinterpret_cast<T*>(last_top);
// If the top node is modified by other threads, discard this operation and retry
} while (!top_.compare_exchange_weak(last_top, node_ptr));
}
28. Lock-Free Stack
T* Pop() {
T* top = nullptr;
uint64_t top_ptr = 0, new_top_ptr = 0;
do {
// Take out the top node of the stack
top_ptr = top_.load(memory_order_acquire);
top = reinterpret_cast<T*>(top_ptr);
// Empty stack
if (!top) {
return nullptr;
}
// Set the next node of the top node as the new top of the stack
new_top_ptr = reinterpret_cast<uint64_t>(top->next);
// If the top node is modified by other threads, discard this operation and retry
} while (!top_.compare_exchange_weak(top_ptr, new_top_ptr));
return top;
}
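The Push and Pop slides can be combined into one compilable sketch. The class skeleton (`top_`, the node type) is an assumption, since the deck shows only the member functions; note that Pop as written remains exposed to the ABA problem covered in slides 39-40.

```cpp
#include <atomic>
#include <cstdint>

// Minimal Treiber stack assembled from the two slides above.
// T must expose a `T* next` member.
template <class T>
class LockfreeStack {
 public:
  void Push(T* node) {
    uint64_t last_top = top_.load(std::memory_order_acquire);
    uint64_t node_ptr = reinterpret_cast<uint64_t>(node);
    do {
      // Point the new node at the current top; a failed CAS reloads last_top.
      node->next = reinterpret_cast<T*>(last_top);
    } while (!top_.compare_exchange_weak(last_top, node_ptr));
  }

  T* Pop() {
    uint64_t top_ptr = top_.load(std::memory_order_acquire);
    T* top;
    do {
      top = reinterpret_cast<T*>(top_ptr);
      if (!top) return nullptr;  // empty stack
      // Try to swing top_ to the second node; retry if top_ changed.
    } while (!top_.compare_exchange_weak(
        top_ptr, reinterpret_cast<uint64_t>(top->next)));
    return top;
  }

 private:
  std::atomic<uint64_t> top_{0};
};

// Illustrative element type satisfying the `next` requirement.
struct IntNode {
  explicit IntNode(int v) : value(v) {}
  int value;
  IntNode* next{nullptr};
};
```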
30. Lock-Free Queue
• Michael & Scott (1996) Algorithm
• Java ConcurrentLinkedQueue
• 《Michael, Maged; Scott, Michael (1996). Simple, Fast, and
Practical Non-Blocking and Blocking Concurrent Queue
Algorithms. Proc. 15th Annual ACM Symp. on Principles of
Distributed Computing (PODC). pp. 267–275.
doi:10.1145/248052.248106. ISBN 0-89791-800-2.》
31. Lock-Free Queue
// Copyright 2016, Xiaojie Chen. All rights reserved.
// https://github.com/vorfeed/naesala
struct IListNode {
IListNode(uint64_t next) : next(next) {}
atomic<uint64_t> next;
};
template <class T>
class LockfreeList {
public:
// Both head and tail point to a dummy if queue is empty
LockfreeList() : dummy_(reinterpret_cast<uint64_t>(new T())),
head_(dummy_), tail_(dummy_) {}
private:
static_assert(is_base_of<IListNode, T>::value, "");
uint64_t dummy_;
atomic<uint64_t> head_, tail_;
};
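A conforming element type for this skeleton might look as follows (`Order` and its fields are illustrative; the `IListNode` base from the slide is restated so the snippet compiles on its own). T must derive from the node base to satisfy the static_assert, and be default-constructible so the constructor can allocate the dummy.

```cpp
#include <atomic>
#include <cstdint>
#include <type_traits>

// Node base from the slide: intrusive next pointer stored as a uint64_t.
struct IListNode {
  explicit IListNode(uint64_t next = 0) : next(next) {}
  std::atomic<uint64_t> next;
};

// Illustrative queue element: derives from IListNode and is
// default-constructible, as LockfreeList<Order> would require.
struct Order : IListNode {
  Order() : IListNode(0) {}
  int id{0};
  double price{0.0};
};

static_assert(std::is_base_of<IListNode, Order>::value,
              "Order satisfies the LockfreeList element requirement");
```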
32. Lock-Free Queue
void Put(T* node) {
while (true) {
// The tail node of the queue
uint64_t tail_ptr = tail_.load(memory_order_acquire);
T* tail = reinterpret_cast<T*>(tail_ptr);
// The next node of the tail node
uint64_t tail_next_ptr = tail->next.load(memory_order_acquire);
T* tail_next = reinterpret_cast<T*>(tail_next_ptr);
// If the next node of tail node is modified by other threads
if (tail_next) {
// Try to help other threads to swing tail to the next node, and then retry
tail_.compare_exchange_strong(tail_ptr, reinterpret_cast<uint64_t>(tail_next));
// Else try to link node at the end of the queue
} else if (tail->next.compare_exchange_weak(tail_next_ptr,
reinterpret_cast<uint64_t>(node))) {
// If successful, try to swing Tail to the inserted node
// Can also be done by other threads
tail_.compare_exchange_strong(tail_ptr, reinterpret_cast<uint64_t>(node));
break;
}
}
}
33. Lock-Free Queue
[Diagram: three states during Put]
Head → Dummy → Node1 → Node2 ← Tail
Head → Dummy → Node1 → Node2 → Node3 (Node3 linked; Tail still at Node2)
Head → Dummy → Node1 → Node2 → Node3 ← Tail (Tail swung to Node3)
34. Lock-Free Queue
T* Take() {
while (true) {
// The head node of the queue
uint64_t head_ptr = head_.load(memory_order_acquire);
T* head = reinterpret_cast<T*>(head_ptr);
// The tail node of the queue
uint64_t tail_ptr = tail_.load(memory_order_acquire);
T* tail = reinterpret_cast<T*>(tail_ptr);
// The next node of the head node
uint64_t head_next_ptr = head->next.load(memory_order_acquire);
T* head_next = reinterpret_cast<T*>(head_next_ptr);
// Empty queue or the tail falling behind
if (head == tail) {
// Empty queue, couldn’t pop
if (!head_next) {
return nullptr;
}
// another thread is pushing and the tail is falling behind, try to advance it
tail_.compare_exchange_strong(tail_ptr, reinterpret_cast<uint64_t>(head_next));
} else {
// Queue is not empty, do pop operation
}
}
return nullptr;
}
35. Lock-Free Queue
// pop operation
// another thread had just taken a node
if (!head_next) {
continue;
}
// copy the next node of the head node to a buffer
T data(*head_next);
// Try to swing head to the next node
if (head_.compare_exchange_weak(head_ptr, reinterpret_cast<uint64_t>(head_next))) {
// If successful, copy the buffer data to the head node
*head = move(data);
// Clear the next node pointer of the head node
head->next.store(0, memory_order_release);
// Return the head node
return head;
}
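Slides 31-35 can be condensed into one compilable sketch. This version differs from the deck in three hedged ways: ordinary `Node*` atomics replace the uint64_t casts, the value is copied out rather than moved into the old dummy, and dequeued nodes are intentionally leaked because safe reclamation (hazard pointers, or the tagged pointers of slides 39-40) is out of scope here.

```cpp
#include <atomic>

// Condensed Michael & Scott queue of ints with a dummy head node.
struct Node {
  std::atomic<Node*> next{nullptr};
  int value{0};
};

class MsQueue {
 public:
  MsQueue() {
    Node* dummy = new Node;
    head_.store(dummy);
    tail_.store(dummy);
  }

  void Put(int v) {
    Node* node = new Node;
    node->value = v;
    while (true) {
      Node* tail = tail_.load(std::memory_order_acquire);
      Node* next = tail->next.load(std::memory_order_acquire);
      if (next) {  // tail lagging: help the other thread advance it, retry
        tail_.compare_exchange_strong(tail, next);
      } else if (tail->next.compare_exchange_weak(next, node)) {
        // Linked; try to swing tail (another thread may do it for us).
        tail_.compare_exchange_strong(tail, node);
        return;
      }
    }
  }

  bool Take(int* out) {
    while (true) {
      Node* head = head_.load(std::memory_order_acquire);
      Node* tail = tail_.load(std::memory_order_acquire);
      Node* next = head->next.load(std::memory_order_acquire);
      if (head == tail) {
        if (!next) return false;                    // empty queue
        tail_.compare_exchange_strong(tail, next);  // help advance lagging tail
      } else {
        if (!next) continue;  // another thread just took the node
        int v = next->value;  // read the value before publishing the new head
        if (head_.compare_exchange_weak(head, next)) {
          *out = v;
          // `head` is retired here; leaked to sidestep safe reclamation.
          return true;
        }
      }
    }
  }

 private:
  std::atomic<Node*> head_, tail_;
};
```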
39. ABA Problem
void Push(T* node) {
uint64_t last_top_combine = 0;
uint64_t node_combine = Combine(node);
do {
last_top_combine = top_.load(memory_order_acquire);
node->next = Pointer<T>(last_top_combine);
// If the top of the stack is still last_top_combine, assume no one has changed the stack
// (That statement is not always true because of the ABA problem)
// Atomically replace top with new node
} while (!top_.compare_exchange_weak(last_top_combine, node_combine));
}
40. ABA Problem
T* Pop() {
T* top = nullptr;
uint64_t top_combine = 0, new_top_combine = 0;
do {
top_combine = top_.load(memory_order_acquire);
top = Pointer<T>(top_combine);
if (!top) {
return nullptr;
}
new_top_combine = Combine(top->next);
// If the top of the stack is still top_combine, assume no one has changed the stack
// (That statement is not always true because of the ABA problem)
// Atomically replace top with next
} while (!top_.compare_exchange_weak(top_combine, new_top_combine));
return top;
}
42. Reference
• 《Java Concurrency in Practice》
• 《The Art of Multiprocessor Programming》
• 《C++ Concurrency In Action》
• http://open-std.org
• java.util.concurrent
• https://github.com/vorfeed/naesala/lockfree