Lockless data structures
Sandeep Joshi (DC Engines)
Critical sections...
Stack::pop()
{
    lock.acquire();
    value = top->value;      // read the top value
    top = top->next;         // move top to top->next
    lock.release();
    return value;
}
HashTable::insert(element)
{
    lock.acquire();
    // find the hash bucket and insert
    lock.release();
}
Critical sections
Critical sections are like transactions: they ensure that invariants on data structures
continue to hold.
Critical sections can be protected by
1. Locks (the default approach)
2. Lockless techniques - atomic operations (e.g. the compare-and-swap instruction) and
load/store fences
3. Hardware transactional memory (see Intel's xbegin, xend, xabort)
Like Pune traffic
Not all data structures are easy to make lockless
Lists
● Singly-linked, doubly-linked
● Queue, Stack, Set
Unordered : Hash table (built on the singly-linked list solution)
Ordered : Skip list (built on the singly-linked list), Red-Black tree (requires localized
rebalancing), AVL tree (harder due to wider rebalancing)
Lockfree versus Waitfree
Concurrency levels
1. LockFree : the overall system progresses, but individual threads may see delays.
You see retries (e.g. a while loop that retries if the atomic operation failed).
2. WaitFree : every operation completes in a finite number of steps (e.g. a read
on a multi-versioned data structure).
The same data structure can have some operations which are lockfree, and others
which are waitfree.
Basic weapon (for this talk)
C++ : std::atomic<T> has compare_exchange_strong(T& expectedValue, T desiredValue)
Java : AtomicReference (and other Atomic types) has compareAndSet(V expectedValue, V
desiredValue)
C (GNU builtin) : __sync_val_compare_and_swap(T* ptr, T expectedValue, T desiredValue)

CAS behaves as if this ran atomically:

bool CAS(variable, expectedVal, desiredVal) {
    if (variable == expectedVal) {
        variable = desiredVal
        return true
    } else { return false }
}
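For concreteness, here is the C++ flavor in action; note the C++-specific quirk that on
failure, compare_exchange_strong writes the value it actually saw back into the
expected argument:

#include <atomic>
#include <cassert>

int main() {
    std::atomic<int> counter{5};

    int expected = 5;
    bool ok = counter.compare_exchange_strong(expected, 6);
    assert(ok && counter.load() == 6);      // 5 == 5, so 5 -> 6

    expected = 5;                           // stale expectation
    ok = counter.compare_exchange_strong(expected, 7);
    assert(!ok && expected == 6);           // failed; expected now holds 6
    return 0;
}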
What we will cover
1. Stack
2. Queue
3. RCU
For Lists, see Herb Sutter’s talk on “Lock-free programming” at CppCon 2014.
Stack
class Stack { std::atomic<Node*> top; };

void Stack::push(int key) {
    Node* newNode = new Node(key);
    Node* oldHead;
    do {
        oldHead = top.load();
        newNode->next = oldHead;
    } while (!top.compare_exchange_strong(oldHead, newNode));
}

int Stack::pop() {
    Node* oldHead;
    Node* nextNode;
    do {
        oldHead = top.load();
        nextNode = oldHead->next;    // read next from the snapshot, not from top
    } while (!top.compare_exchange_strong(oldHead, nextNode));
    return oldHead->key;
}
Stack
Problem : Every thread is doing a read-modify-write on the same memory address
(Stack.top), so the corresponding cache line keeps bouncing between CPU cores.
Solution : Find a way to match up simultaneous "push" and "pop" calls, letting the
two threads communicate without touching "Stack.top".
Atomic exchange between 2 threads
Use "compare and swap" to atomically exchange a value between two threads.

Define an Exchanger {
    state = EMPTY, WAITING, or BUSY
    int value
}

State machine:
● EMPTY (value = nil) : T1 comes, sets its value, and waits (state becomes WAITING).
● WAITING (value = T1.val) : T2 arrives, finds the value set, atomically exchanges
T1.val with T2.val, and changes the state to BUSY.
● BUSY (value = T2.val) : T1, who is waiting, reads T2.val, resets the state to
EMPTY, and returns.

Practical implementation in java.util.concurrent.Exchanger.
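A compact C++ sketch of this state machine, assuming exactly two participating threads
per exchange and 32-bit values; the names, packing scheme, and spin bounds are
illustrative (Herlihy and Shavit's book gives the canonical version):

#include <atomic>
#include <cstdint>

// State and value are packed into one 64-bit word so a single CAS updates both.
enum State : uint64_t { EMPTY = 0, WAITING = 1, BUSY = 2 };

class Exchanger {
    std::atomic<uint64_t> slot{0};  // high 32 bits: state, low 32 bits: value

    static uint64_t pack(State s, uint32_t v) { return (uint64_t(s) << 32) | v; }
    static State stateOf(uint64_t w) { return State(w >> 32); }
    static uint32_t valueOf(uint64_t w) { return uint32_t(w); }

public:
    // Try to swap myVal with a partner thread; false if none showed up in time.
    bool exchange(uint32_t myVal, uint32_t& otherVal, int spins = 100000) {
        for (int i = 0; i < spins; ++i) {
            uint64_t w = slot.load();
            if (stateOf(w) == EMPTY) {                      // act as T1
                uint64_t mine = pack(WAITING, myVal);
                if (!slot.compare_exchange_strong(w, mine)) continue;
                while (i++ < spins) {                       // wait for T2
                    uint64_t w2 = slot.load();
                    if (stateOf(w2) == BUSY) {
                        otherVal = valueOf(w2);
                        slot.store(pack(EMPTY, 0));         // reset for reuse
                        return true;
                    }
                }
                if (slot.compare_exchange_strong(mine, pack(EMPTY, 0)))
                    return false;                           // withdrew our offer in time
                otherVal = valueOf(slot.load());            // T2 slipped in after all
                slot.store(pack(EMPTY, 0));
                return true;
            } else if (stateOf(w) == WAITING) {             // act as T2
                otherVal = valueOf(w);
                if (slot.compare_exchange_strong(w, pack(BUSY, myVal)))
                    return true;
            }
            // BUSY: another pair is mid-exchange; retry.
        }
        return false;
    }
};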
Stack + EliminationArray combination
The EliminationArray is an array of Exchanger objects (E1, E2, E3, …, En) that sits
alongside stack.top and its list of nodes.
Every thread (push or pop) first checks the EliminationArray for a complementary
thread: a push and a pop that meet at the same Exchanger cancel out without ever
touching stack.top. After a timeout, the thread falls back to Stack::push or Stack::pop.
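A minimal sketch of the combination, assuming the Exchanger sketched above and a
Treiber-style Stack like the one shown earlier; POP_TAG, the array size, and the random
probe are illustrative choices:

#include <cstdlib>

// POP_TAG is a reserved value a pop offers to the exchange; a push
// that receives it knows its key was consumed directly by a pop.
constexpr uint32_t POP_TAG = 0xFFFFFFFFu;

struct EliminationStack {
    Stack stack;          // the CAS-based stack from the Stack slide
    Exchanger arr[8];     // the EliminationArray

    void push(uint32_t key) {
        Exchanger& e = arr[std::rand() % 8];
        uint32_t other;
        if (e.exchange(key, other) && other == POP_TAG)
            return;              // eliminated: stack.top was never touched
        stack.push(key);         // timeout or push-met-push: fall back
    }

    uint32_t pop() {
        Exchanger& e = arr[std::rand() % 8];
        uint32_t other;
        if (e.exchange(POP_TAG, other) && other != POP_TAG)
            return other;        // got a value straight from a push
        return stack.pop();      // timeout or pop-met-pop: fall back
    }
};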
What we will cover
1. Stack
2. Queue
3. RCU
Queues
Many dimensions to this problem
1. SPSC, SPMC, MPSC, MPMC (SPSC = Single Producer, Single Consumer)
2. Bounded vs unbounded
3. Blocking or nonblocking
4. Priorities, intrusive nodes, ordering, …
http://www.1024cores.net/home/lock-free-algorithms/queues
Queue with sentinel
[Diagram: HEAD and TAIL pointers across three queue states]
Empty queue : HEAD and TAIL both point at the Sentinel.
Enqueue : link the new node after TAIL and advance TAIL.
Dequeue : return the value of the node after the Sentinel, then turn that node into
the new Sentinel (the old one is deleted).
Unbounded SPSC (*incomplete)
SPSC_Queue { atomic<Node*> head, tail; }

head = tail = new Node()    // the first node is always the Sentinel

enqueue(T elem) {
    Node* newNode = new Node(elem)
    tail->next = newNode
    tail.store(newNode)
}

dequeue(T& returnElem) {
    if (head->next == null) { throw Empty; }
    Node* oldHead = head.load()
    returnElem = oldHead->next->value    // dequeue returns the value of the next node
    head.store(oldHead->next)
    delete oldHead                       // the next node becomes the new sentinel
}
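A sketch of what the completed queue might look like in C++, with the memory orderings
the slide elides. Since only the producer touches tail and only the consumer touches
head, only the next links need to be atomic here; that is one design choice, not the
only one:

#include <atomic>

template <typename T>
class SpscQueue {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
    };
    Node* head_;   // consumer-owned: the sentinel
    Node* tail_;   // producer-owned: the last node

public:
    SpscQueue() { head_ = tail_ = new Node(); }   // sentinel

    void enqueue(const T& elem) {                 // producer thread only
        Node* n = new Node();
        n->value = elem;
        tail_->next.store(n, std::memory_order_release);  // publish the node
        tail_ = n;
    }

    bool dequeue(T& out) {                        // consumer thread only
        Node* next = head_->next.load(std::memory_order_acquire);
        if (next == nullptr) return false;        // empty
        out = next->value;                        // value of the next node
        delete head_;                             // retire the old sentinel
        head_ = next;                             // next node is the new sentinel
        return true;
    }
};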
Bounded SPSC
ProducerConsumerQueue
● atomic<int> readIndex
● Item records[size]
● atomic<int> writeIndex
(padding between the fields avoids cache line sharing; see Trick 5)

enqueue(Item newElem) {
    int freeSlot = writeIndex.load()
    if ((freeSlot + 1) % size != readIndex.load()) {    // not full
        records[freeSlot] = newElem
        writeIndex.store((freeSlot + 1) % size)
    }
}

dequeue(Item& returnElem) {
    int curSlot = readIndex.load()
    if (curSlot != writeIndex.load()) {                 // not empty
        returnElem = records[curSlot]
        readIndex.store((curSlot + 1) % size)
    }
}

Based on the Facebook folly library (ProducerConsumerQueue).
What we will cover
1. Stack
2. Queue
3. RCU
Multiple readers, one writer (with locks)
READER
1. Take Read lock
2. Safely read pointer and act
3. Release read lock
WRITER
1. Take write lock
2. Free pointer
3. Release write lock
This is the conventional approach
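In C++ this conventional pattern might look as follows; a sketch, with Config and the
initial value as illustrative stand-ins:

#include <shared_mutex>

struct Config { int value; };          // illustrative shared object
std::shared_mutex mtx;
Config* ptr = new Config{42};

int reader() {
    std::shared_lock<std::shared_mutex> lock(mtx);   // 1. take read lock
    return ptr ? ptr->value : 0;                     // 2. safely read pointer and act
}                                                    // 3. read lock released on scope exit

void writer() {
    std::unique_lock<std::shared_mutex> lock(mtx);   // 1. take write lock
    delete ptr;                                      // 2. free pointer
    ptr = nullptr;
}                                                    // 3. write lock released on scope exit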
Multiple readers, one writer (RCU)
READER
1. Record new reader
2. Safely access the pointer
3. Inform reader finished
WRITER
1. Switch the pointer
2. Ensure all readers gone (Drain the queue in Grace period)
3. Free pointer
Multiple readers, one writer (RCU)
READER
1. Record new reader (rcu_read_lock)
2. Safely access the pointer
3. Inform reader finished (rcu_read_unlock)
WRITER
1. Switch the pointer (rcu_assign_pointer(ptr,val))
2. Ensure all readers gone (synchronize_rcu)
3. Free pointer
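A sketch of the same pattern against the userspace RCU (liburcu) API named later in
this talk, assuming the library is installed and each thread has called
rcu_register_thread(); Config is again an illustrative stand-in:

#include <urcu.h>       // userspace RCU; link with -lurcu

struct Config { int value; };
Config* global_cfg = nullptr;      // shared pointer protected by RCU

void reader() {
    rcu_read_lock();                               // 1. record new reader
    Config* c = rcu_dereference(global_cfg);       // 2. safely access the pointer
    int v = c ? c->value : 0;
    (void)v;
    rcu_read_unlock();                             // 3. inform reader finished
}

void writer(int newValue) {
    Config* newCfg = new Config{newValue};
    Config* old = global_cfg;
    rcu_assign_pointer(global_cfg, newCfg);        // 1. switch the pointer
    synchronize_rcu();                             // 2. wait out the grace period
    delete old;                                    // 3. free old pointer safely
}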
RCU (Read copy update)
On preemptible Linux kernels:
1. Preemption is disabled for the Reader on calling "rcu_read_lock()".
2. The Writer runs on every CPU core when "synchronize_rcu()" is called, to ensure all
readers have completed.
On real-time Linux kernels : introduce two queues (current and next) to record the
Readers that were present before and after the Writer started.
Userspace RCU : the same API is now available for use in userspace
(https://github.com/urcu)
Some tricks used in lockless programming
1. Sentinels
2. Unused bits in 64-bit pointers
3. Lazy delete
4. Two (or more) bottlenecks better than one
5. Padding to avoid false cache line sharing
Trick 1 : Sentinels
The Sentinel node is pre-allocated and never deleted.
Head and tail point to the Sentinel when the List or
Queue is empty.
This helps because when the List/Queue transitions
from empty to non-empty (or vice-versa), you don't
have to update two variables atomically, which
can get tricky.

class Queue {
    Node *head, *tail;
};
head = tail = new Node(sentinel)
Trick 2 : Unused bits in pointer
Addresses on Intel x86-64 and ARM64 are limited to 48 bits. The unused higher
16 bits can be used to store a "marker" with every pointer. This lets you use the
"compare-and-swap" instruction to atomically change "pointer + custom info".
The Facebook Folly C++ library's PackedSyncPtr and DiscriminatedPtr exploit this.
Java has AtomicMarkableReference and AtomicStampedReference.
Caveat : the number of unused bits may shrink with newer processors.
Intel also has "CMPXCHG16B" to manipulate 128-bit values.
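A minimal sketch of the technique, stashing a one-bit marker in the top bit of a
user-space x86-64 pointer; the bit choice and helper names are illustrative, and
production code must also respect the caveat above:

#include <atomic>
#include <cstdint>

// The marker lives in the (currently) unused top bit of a user-space
// x86-64 address; masking it off restores the original pointer.
constexpr uint64_t MARK_BIT = 1ull << 63;

template <typename T>
uint64_t pack(T* p, bool marked) {
    return reinterpret_cast<uint64_t>(p) | (marked ? MARK_BIT : 0);
}
template <typename T>
T* ptrOf(uint64_t w) { return reinterpret_cast<T*>(w & ~MARK_BIT); }
inline bool markOf(uint64_t w) { return (w & MARK_BIT) != 0; }

// One CAS flips the mark while asserting the pointer is unchanged,
// which is exactly the primitive a lazy-delete list needs (see Trick 3).
template <typename T>
bool tryMark(std::atomic<uint64_t>& word, T* expectedPtr) {
    uint64_t expected = pack(expectedPtr, false);
    return word.compare_exchange_strong(expected, pack(expectedPtr, true));
}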
Trick 3 : Lazy delete
The updater sets a marker bit on the Node (deleted = true).
A marked Node is skipped during traversal until it is safe to delete it.
Trick 4 : Two bottlenecks better than one
Cache line bouncing is reduced if threads can spin (i.e. do CAS) on multiple
variables instead of one, as in the Stack + EliminationArray example earlier.
Same with the WaitQueue below: each thread adds its own node to the WaitQueue
and spins on a local variable inside that node until woken up by its predecessor.
[Diagram: Wait Queue → T1's node → T2's node → T3's node]
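A sketch of the per-node spinning idea, in the style of an MCS lock; names are
illustrative, and node reuse (resetting ready/next) is left out for brevity:

#include <atomic>

struct WaitNode {
    std::atomic<bool> ready{false};
    std::atomic<WaitNode*> next{nullptr};
};

struct WaitQueue {
    std::atomic<WaitNode*> tail{nullptr};

    // Append my node; spin only on my own node's flag.
    void wait(WaitNode* me) {
        WaitNode* pred = tail.exchange(me);
        if (pred == nullptr) return;            // queue was empty: proceed at once
        pred->next.store(me);                   // link behind my predecessor
        while (!me->ready.load()) { }           // spin on a local cache line
    }

    // Called by the departing thread to wake whoever queued behind it.
    void wakeSuccessor(WaitNode* me) {
        WaitNode* succ = me->next.load();
        if (succ == nullptr) {
            WaitNode* expected = me;
            if (tail.compare_exchange_strong(expected, nullptr))
                return;                         // nobody behind me
            while ((succ = me->next.load()) == nullptr) { }  // link is mid-flight
        }
        succ->ready.store(true);                // successor stops spinning
    }
};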
Trick 5 : Padding to avoid false cache line sharing
class Queue {
    std::atomic<int> head;
    char cache_line_pad[CACHE_LINE_SIZE];   // e.g. 64 bytes
    std::atomic<int> tail;                  // keeps head and tail on separate cache lines
};
https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
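In modern C++ the same padding is usually expressed with alignas; a sketch, where 64 is
a common x86 cache line size and C++17's std::hardware_destructive_interference_size
can stand in for it:

#include <atomic>
#include <new>      // std::hardware_destructive_interference_size (C++17)

struct PaddedQueue {
    alignas(64) std::atomic<int> head;
    alignas(64) std::atomic<int> tail;   // forced onto its own cache line
};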
Locks vs Lockless
● Locks can increase context switches
● Lockless can increase cache line contention
Which option performs better depends on several factors: contention level, critical
section length, and the number of threads and cores.
Language support
Golang : the philosophy is to "share memory by communicating" instead of
communicating by sharing memory. But see the "sync/atomic" package.
Java : "volatile" variables ensure sequential consistency. See java.util.concurrent
and sun.misc.Unsafe.compareAndSwapObject().
C++ : std::atomic provides multiple levels of consistency:
1. sequential consistency
2. acquire, release, consume (not discussed today)
3. relaxed
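For illustration, the three levels on a single std::atomic:

#include <atomic>

std::atomic<int> x{0};

void examples() {
    x.store(1);                                 // 1. sequential consistency (the default)
    x.store(2, std::memory_order_release);      // 2. release (pairs with acquire loads)
    int a = x.load(std::memory_order_acquire);
    int b = x.load(std::memory_order_relaxed);  // 3. relaxed: atomicity only, no ordering
    (void)a; (void)b;
}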
Who uses lockless ?
1. Early adopters were desktop audio drivers [1]
2. MemSQL : pervasive use of lockfree data structures
3. Couchbase : Nitro storage engine
4. DataDomain (EMC) : lockfree doubly linked list
5. Facebook Folly library
6. java.util.concurrent (Doug Lea)
7. Linux kernel (other mechanisms besides RCU)
[1] http://www.rossbencina.com/code/lockfree
Not covered
1. ABA problem and Hazard pointers
2. Weaker memory models
3. Concurrent Skip List, Hash tables, Trees
4. Underlying Memory allocation also needs to be lockfree (e.g. Streamflow)
References
1. Herlihy, et al. The Art of Multiprocessor Programming
2. McKenney, Paul. Is Parallel Programming Hard, And, If So, What Can You
Do About It?
3. http://1024cores.net
4. http://preshing.com
5. http://www.rdrop.com/~paulmck/
6. http://www.rossbencina.com/code/lockfree
