SlideShare a Scribd company logo
@kavya719
Let’s talk locks!
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
go-locks/
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
kavya
locks.
“locks are slow”
“locks are slow”
lock contention causes ~10x latency
latency(ms)
time
“locks are slow”
…but they’re used everywhere.
from schedulers to databases and web servers.
lock contention causes ~10x latency
latency(ms)
time
“locks are slow”
…but they’re used everywhere.
from schedulers to databases and web servers.
lock contention causes ~10x latency
latency(ms)
time
?
let’s analyze its performance!
performance models for contention
let’s build a lock!
a tour through lock internals
let’s use it, smartly!
a few closing strategies
our case-study
Lock implementations are hardware, ISA, OS and language specific:



We assume an x86_64 SMP machine running a modern Linux.

We’ll look at the lock implementation in Go 1.12.
CPU 0 CPU 1
cache cache
interconnect
memory
simplified SMP system diagram
use as you would threads 

> go handle_request(r)

but user-space threads:

managed entirely by the Go runtime, not the operating system.
The unit of concurrent execution: goroutines.
a brief go primer
use as you would threads 

> go handle_request(r)

but user-space threads:

managed entirely by the Go runtime, not the operating system.
The unit of concurrent execution: goroutines.
a brief go primer
Data shared between goroutines must be synchronized.
One way is to use the blocking, non-recursive lock construct:
> var mu sync.Mutex

mu.Lock()

…
mu.Unlock()
let’s build a lock!
a tour through lock internals.
want: “mutual exclusion”
only one thread has access to shared data at any given time
T1
running on CPU 1
T2
running on CPU 2
func reader() {

// Read a task

t := tasks.get()



// Do something with it.

...
}
func writer() {

// Write to tasks

tasks.put(t)
}
// track whether tasks is
// available (0) or not (1)
// shared ring buffer
var tasks Tasks
want: “mutual exclusion”
only one thread has access to shared data at any given time
func reader() {

// Read a task

t := tasks.get()



// Do something with it.

...
}
func writer() {

// Write to tasks

tasks.put(t)
}
// track whether tasks is
// available (0) or not (1)
// shared ring buffer
var tasks Tasks
want: “mutual exclusion”
…idea! use a flag?
T1
running on CPU 1
T2
running on CPU 2
// track whether tasks can be
// accessed (0) or not (1)
var flag int
var tasks Tasks
// track whether tasks can be
// accessed (0) or not (1)
var flag int
var tasks Tasks
func reader() {

for {
/* If flag is 0,
can access tasks. */

if flag == 0 {

/* Set flag */
flag++
...

/* Unset flag */
flag--
return
}
/* Else, keep looping. */ 

}
}
T1
running on CPU 1
// track whether tasks can be
// accessed (0) or not (1)
var flag int
var tasks Tasks
func reader() {

for {
/* If flag is 0,
can access tasks. */

if flag == 0 {

/* Set flag */
flag++
...

/* Unset flag */
flag--
return
}
/* Else, keep looping. */ 

}
}
func writer() {

for {
/* If flag is 0,
can access tasks. */

if flag == 0 {

/* Set flag */
flag++
...

/* Unset flag */
flag--
return
}
/* Else, keep looping. */ 

}
}
T1
running on CPU 1
T2
running on CPU 2
// track whether tasks can be
// accessed (0) or not (1)
var flag int
var tasks Tasks
func reader() {

for {
/* If flag is 0,
can access tasks. */

if flag == 0 {

/* Set flag */
flag++
...

/* Unset flag */
flag--
return
}
/* Else, keep looping. */ 

}
}
func writer() {

for {
/* If flag is 0,
can access tasks. */

if flag == 0 {

/* Set flag */
flag++
...

/* Unset flag */
flag--
return
}
/* Else, keep looping. */ 

}
}
T1
running on CPU 1
T2
running on CPU 2
flag++
T1
running on CPU 1
flag++
CPU
memory
1. Read (0)
2. Modify
3. Write (1)
T1
running on CPU 1
R
W
flag++
timeline of
memory operations
T1
running on CPU 1
R
R
W
flag++
if flag == 0
timeline of
memory operations
T1
running on CPU 1
T2
running on CPU 2
T2 may observe T1’s RMW half-complete
atomicity
A memory operation is non-atomic if it can be
observed half-complete by another thread.
An operation may be non-atomic because it:

• uses multiple CPU instructions:

operations on a large data structure; 

compiler decisions.

• use a single non-atomic CPU instruction:

RMW instructions; unaligned loads and stores.
> o := Order {
id: 10,
name: “yogi bear”,
order: “pie”,
count: 3,
}
atomicity
A memory operation is non-atomic if it can be
observed half-complete by another thread.
An operation may be non-atomic because it:

• uses multiple CPU instructions:

operations on a large data structure; 

compiler decisions.

• uses a single non-atomic CPU instruction:

RMW instructions; unaligned loads and stores.
> flag++
atomicity
A memory operation is non-atomic if it can be
observed half-complete by another thread.
An operation may be non-atomic because it:

• uses multiple CPU instructions:

operations on a large data structure; 

compiler decisions.

• uses a single non-atomic CPU instruction:

RMW instructions; unaligned loads and stores.
> flag++
An atomic operation is an “indivisible”
memory access.
In x86_64, loads, stores that are 

naturally aligned up to 64b.*
guarantees the data item fits within a cache line;

cache coherency guarantees a consistent view for a
single cache line.
* these are not the only guaranteed atomic operations.
nope; not atomic.
…idea! use a flag?
func reader() {

for {
/* If flag is 0,
can access tasks. */

if flag == 0 {

/* Set flag */
flag = 1
t := tasks.get()

...

/* Unset flag */
flag = 0
return
}
/* Else, keep looping. */ 

}
}
T1
running on CPU 1
the compiler may reorder operations.
// Sets flag to 1 & reads data.
func reader() {
flag = 1
t := tasks.get()
...
flag = 0
the processor may reorder operations.
StoreLoad reordering
load t before store flag = 1
// Sets flag to 1 & reads data.
func reader() {
flag = 1
t := tasks.get()
...
flag = 0
memory access reordering
The compiler, processor can reorder memory operations to optimize execution.
memory access reordering
The compiler, processor can reorder memory operations to optimize execution.
• The only cardinal rule is sequential consistency for single threaded programs.

• Other guarantees about compiler reordering are captured by a 

language’s memory model:

C++, Go guarantee data-race free programs will be sequentially consistent.
• For processor reordering, by the hardware memory model:

x86_64 provides Total Store Ordering (TSO).
memory access reordering
The compiler, processor can reorder memory operations to optimize execution.
• The only cardinal rule is sequential consistency for single threaded programs.

• Other guarantees about compiler reordering are captured by a 

language’s memory model:

C++, Go guarantee data-race free programs will be sequentially consistent.
• For processor reordering, by the hardware memory model:

x86_64 provides Total Store Ordering (TSO).
memory access reordering
The compiler, processor can reorder memory operations to optimize execution.
• The only cardinal rule is sequential consistency for single threaded programs.

• Other guarantees about compiler reordering are captured by a 

language’s memory model:

C++, Go guarantee data-race free programs will be sequentially consistent.
• For processor reordering, by the hardware memory model:

x86_64 provides Total Store Ordering (TSO).
a relaxed consistency model.
most reorderings are invalid but StoreLoad is game;

allows processor to hide the latency of writes.
nope; not atomic and no memory order guarantees.
…idea! use a flag?
nope; not atomic and no memory order guarantees.
…idea! use a flag?
need a construct that provides atomicity and prevents memory reordering.
nope; not atomic and no memory order guarantees.
…idea! use a flag?
need a construct that provides atomicity and prevents memory reordering.
…the hardware provides!
For guaranteed atomicity and to prevent memory reordering.
special hardware instructions
x86 example:
XCHG (exchange)
these instructions are called memory barriers.
they prevent reordering by the compiler too.
x86 example: MFENCE, LFENCE, SFENCE.
special hardware instructions
The x86 LOCK instruction prefix provides both.
Used to prefix memory access instructions:
LOCK ADD
For guaranteed atomicity and to prevent memory reordering.
}
atomic operations in languages like Go:
atomic.Add
atomic.CompareAndSwap
special hardware instructions
The x86 LOCK instruction prefix provides both.
Used to prefix memory access instructions:
LOCK ADD
For guaranteed atomicity and to prevent memory reordering.
}
atomic operations in languages like Go:
atomic.Add
atomic.CompareAndSwapLOCK CMPXCHG
Atomic compare-and-swap (CAS) conditionally updates a variable:

checks if it has the expected value and if so, changes it to the desired value.
the CAS succeeded;
we set flag to 1.
flag was 1 so our CAS failed;
try again.
var flag int
var tasks Tasks
func reader() {
for {
// Try to atomically CAS flag from 0 -> 1
if atomic.CompareAndSwap(&flag, 0, 1) {
...
// Atomically set flag back to 0.
atomic.Store(&flag, 0)
return
}

// CAS failed, try again :)
}
}
baby’s first lock
var flag int
var tasks Tasks
func reader() {
for {
// Try to atomically CAS flag from 0 -> 1
if atomic.CompareAndSwap(&flag, 0, 1) {
...
// Atomically set flag back to 0.
atomic.Store(&flag, 0)
return
}

// CAS failed, try again :)
}
}
baby’s first lock: spinlocks
This is a simplified spinlock.
Spinlocks are used extensively in
the Linux kernel.}
The atomic CAS is the quintessence of any lock implementation.
cost of an atomic operation
Run on a 12-core x86_64 SMP machine.

Atomic store to a C _Atomic int, 10M times in
a tight loop.
Measure average time taken per operation

(from within the program).
With 1 thread: ~13ns (vs. regular operation: ~2ns)
With 12 cpu-pinned threads: ~110ns
threads are effectively serialized
var flag int
var tasks Tasks
func reader() {
for {
// Try to atomically CAS flag from 0 -> 1
if atomic.CompareAndSwap(&flag, 0, 1) {
...
// Atomically set flag back to 0.
atomic.Store(&flag, 0)
return
}

// CAS failed, try again :)
}
}
spinlocks
sweet.
We have a scheme for mutual exclusion that provides atomicity and
memory ordering guarantees.
sweet.
…but
spinning for long durations is wasteful; it takes away CPU time from
other threads.
We have a scheme for mutual exclusion that provides atomicity and
memory ordering guarantees.
sweet.
…but
spinning for long durations is wasteful; it takes away CPU time from
other threads.
We have a scheme for mutual exclusion that provides atomicity and
memory ordering guarantees.
enter the operating system!
Linux’s futex
Interface and mechanism for userspace code to ask the kernel to suspend/ resume threads.
futex syscall kernel-managed queue
flag can be 0: unlocked

1: locked
2: there’s a waiter
var flag int
var tasks Tasks
set flag to 2 (there’s a waiter)
flag can be 0: unlocked

1: locked
2: there’s a waiter
futex syscall to tell the kernel
to suspend us until flag changes.
when we’re resumed, we’ll CAS again.
var flag int
var tasks Tasks
func reader() {
for {
if atomic.CompareAndSwap(&flag, 0, 1) {
...
}

// CAS failed, set flag to sleeping.
v := atomic.Xchg(&flag, 2)
// and go to sleep.
futex(&flag, FUTEX_WAIT, ...)

}
}
T1’s CAS fails

(because T2 has set the flag)
T1
in the kernel:
keyA
(from the userspace address:

&flag)
keyA
T1
futex_q
1. arrange for thread to be resumed in the future:

add an entry for this thread in the kernel queue for the address we care about
in the kernel:
keyA
(from the userspace address:

&flag)
keyA
T1
futex_q
keyother
Tother
futex_q
keyother
hash(keyA)
1. arrange for thread to be resumed in the future:

add an entry for this thread in the kernel queue for the address we care about
in the kernel:
keyA
(from the userspace address:

&flag)
keyA
T1
futex_q
keyother
Tother
futex_q
keyother
hash(keyA)
1. arrange for thread to be resumed in the future:

add an entry for this thread in the kernel queue for the address we care about
2. deschedule the calling thread to suspend it.
T2 is done

(accessing the shared data)
T2
func writer() {
for {
if atomic.CompareAndSwap(&flag, 0, 1) {
... 

// Set flag to unlocked.
v := atomic.Xchg(&flag, 0)
if v == 2 {
// If there was a waiter, issue a wake up.
futex(&flag, FUTEX_WAKE, ...)
}
return
}

v := atomic.Xchg(&flag, 2)
futex(&flag, FUTEX_WAIT, …)
}
}
T2 is done

(accessing the shared data)
T2
func writer() {
for {
if atomic.CompareAndSwap(&flag, 0, 1) {
... 

// Set flag to unlocked.
v := atomic.Xchg(&flag, 0)
if v == 2 {
// If there was a waiter, issue a wake up.
futex(&flag, FUTEX_WAKE, ...)
}
return
}

v := atomic.Xchg(&flag, 2)
futex(&flag, FUTEX_WAIT, …)
}
}
if flag was 2, there’s at least one waiter
futex syscall to tell the kernel to wake
a waiter up.
func writer() {
for {
if atomic.CompareAndSwap(&flag, 0, 1) {
... 

// Set flag to unlocked.
v := atomic.Xchg(&flag, 0)
if v == 2 {
// If there was a waiter, issue a wake up.
futex(&flag, FUTEX_WAKE, ...)
}
return
}

v := atomic.Xchg(&flag, 2)
futex(&flag, FUTEX_WAIT, …)
}
}
if flag was 2, there’s at least one waiter
futex syscall to tell the kernel to wake
a waiter up.
hashes the key
walks the hash bucket’s futex queue
finds the first thread waiting on the address
schedules it to run again!
}
T2 is done

(accessing the shared data)
T2
pretty convenient!
pthread mutexes use futexes.
That was a hella simplified futex.
…but we still have a nice, lightweight primitive to build synchronization constructs.
cost of a futex
Run on a 12-core x86_64 SMP machine.

Lock & unlock a pthread mutex 10M times in loop

(lock, increment an integer, unlock).

Measure average time taken per lock/unlock pair

(from within the program).
uncontended case (1 thread): ~13ns
contended case (12 cpu-pinned threads): ~0.9us
cost of a futex
Run on a 12-core x86_64 SMP machine.

Lock & unlock a pthread mutex 10M times in loop

(lock, increment an integer, unlock).

Measure average time taken per lock/unlock pair

(from within the program).
uncontended case (1 thread): ~13ns
contended case (12 cpu-pinned threads): ~0.9us
cost of the user-space atomic CAS = ~13ns
}
cost of the atomic CAS +
syscall + thread context switch = ~0.9us
}
spinning vs. sleeping
Spinning makes sense for short durations; it keeps the thread on the CPU.
The trade-off is it uses CPU cycles not making progress.
So at some point, it makes sense to pay the cost of the context switch to go to sleep.
There are smart “hybrid” futexes:

CAS-spin a small, fixed number of times —> if that didn’t lock, make the futex syscall.
Example: the Go runtime’s futex implementation.
spinning vs. sleeping
Spinning makes sense for short durations; it keeps the thread on the CPU.
The trade-off is it uses CPU cycles not making progress.
So at some point, it makes sense to pay the cost of the context switch to go to sleep.
There are smart “hybrid” futexes:

CAS-spin a small, fixed number of times —> if that didn’t lock, make the futex syscall.
Examples: the Go runtime’s futex implementation; a variant of the pthread_mutex.
…can we do better for user-space threads?
…can we do better for user-space threads?
goroutines are user-space threads.
The go runtime multiplexes them onto threads.
lighter-weight and cheaper than threads:

goroutine switches = ~tens of ns; 

thread switches = ~a µs. CPU core
g1 g6g2
thread
CPU core } OS scheduler
Go scheduler
}
…can we do better for user-space threads?
goroutines are user-space threads.
The go runtime multiplexes them onto threads.
lighter-weight and cheaper than threads:

goroutine switches = ~tens of ns; 

thread switches = ~a µs. CPU core
g1 g6g2
thread
CPU core } OS scheduler
Go scheduler
}
we can block the goroutine without blocking the underlying thread!
to avoid the thread context switch cost.
This is what the Go runtime’s semaphore does!

The semaphore is conceptually very similar to futexes in Linux*, but it is used to 

sleep/wake goroutines:
a goroutine that blocks on a mutex is descheduled, but not the underlying thread.
the goroutine wait queues are managed by the runtime, in user-space.
* There are, of course, differences in implementation though.
the goroutine wait queues are managed
by the Go runtime, in user-space.
var flag int
var tasks Tasks
func reader() {
for {
// Attempt to CAS flag.
if atomic.CompareAndSwap(&flag, ...) {
...
}

// CAS failed; add G1 as a waiter for flag.
root.queue()
// and to sleep.
futex(&flag, FUTEX_WAIT, ...)
}
}
G1’s CAS fails

(because G2 has set the flag)
G1
&flag
(the userspace address)
&flag
G1 G3
G4
&other
hash(&flag)
}
the top-level waitlist for a hash bucket
is implemented as a treap
}
there’s a second-level wait queue 

for each unique address
the goroutine wait queues
(in user-space, managed by the go runtime)
the goroutine wait queues are managed
by the Go runtime, in user-space.
var flag int
var tasks Tasks
func reader() {
for {
// Attempt to CAS flag.
if atomic.CompareAndSwap(&flag, ...) {
...
}

// CAS failed; add G1 as a waiter for flag.
root.queue()
// and suspend G1.
gopark()
}
}
G1’s CAS fails

(because G2 has set the flag)
G1
the Go runtime deschedules the goroutine;
keeps the thread running!
G2’s done

(accessing the shared data)
G2
func writer() {
for {
if atomic.CompareAndSwap(&flag, 0, 1) {
... 

// Set flag to unlocked.
atomic.Xadd(&flag, ...)



// If there’s a waiter, reschedule it.
waiter := root.dequeue(&flag)
goready(waiter)
return
}

root.queue()
gopark()
}
}
find the first waiter goroutine and reschedule it
]
this is clever.
Avoids the hefty thread context switch cost in the contended case,

up to a point.
this is clever.
Avoids the hefty thread context switch cost in the contended case,

up to a point.
but…
func reader() {
for {
if atomic.CompareAndSwap(&flag, ...) {
...
}

// CAS failed; add G1 as a waiter for flag.
semaroot.queue()
// and suspend G1.
gopark()
}
}
once G1 is resumed, 

it will try to CAS again.
Resumed goroutines have to compete with any other goroutines trying to CAS.



They will likely lose:

there’s a delay between when the flag was set to 0 and this goroutine was rescheduled..
G1
Resumed goroutines have to compete with any other goroutines trying to CAS.



They will likely lose:

there’s a delay between when the flag was set to 0 and this goroutine was rescheduled..
// Set flag to unlocked.
atomic.Xadd(&flag, …)



// If there’s a waiter, reschedule it.
waiter := root.dequeue(&flag)
goready(waiter)
return
Resumed goroutines have to compete with any other goroutines trying to CAS.



They will likely lose:

there’s a delay between when the flag was set to 0 and this goroutine was rescheduled..
So, the semaphore implementation may end up:

• unnecessarily resuming a waiter goroutine

results in a goroutine context switch again.

• cause goroutine starvation

can result in long wait times, high tail latencies.
Resumed goroutines have to compete with any other goroutines trying to CAS.



They will likely lose:

there’s a delay between when the flag was set to 0 and this goroutine was rescheduled..
So, the semaphore implementation may end up:

• unnecessarily resuming a waiter goroutine

results in a goroutine context switch again.

• cause goroutine starvation

can result in long wait times, high tail latencies.
the sync.Mutex implementation adds a layer that fixes these.
go’s sync.Mutex
Is a hybrid lock that uses a semaphore to sleep / wake goroutines.
go’s sync.Mutex
Additionally, it tracks extra state to:
Is a hybrid lock that uses a semaphore to sleep / wake goroutines.
prevent unnecessarily waking up a goroutine

“There’s a goroutine actively trying to CAS”: An unlock in this case does not wake a waiter.

prevent severe goroutine starvation
“a waiter has been waiting”:
If a waiter is resumed but loses the CAS again, it’s queued at the head of the wait queue.

If a waiter fails to lock for 1ms, switch the mutex to “starvation mode”.
prevent unnecessarily waking up a goroutine

“There’s a goroutine actively trying to CAS”: An unlock in this case does not wake a waiter.

prevent severe goroutine starvation
“a waiter has been waiting”:
If a waiter is resumed but loses the CAS again, it’s queued at the head of the wait queue.

If a waiter fails to lock for 1ms, switch the mutex to “starvation mode”.
go’s sync.Mutex
Additionally, it tracks extra state to:
Is a hybrid lock that uses a semaphore to sleep / wake goroutines.
prevent unnecessarily waking up a goroutine

“There’s a goroutine actively trying to CAS”: An unlock in this case does not wake a waiter.

prevent severe goroutine starvation
“a waiter has been waiting”:
If a waiter is resumed but loses the CAS again, it’s queued at the head of the wait queue.

If a waiter fails to lock for 1ms, switch the mutex to “starvation mode”.
other goroutines cannot CAS, they must queue
The unlock hands the mutex off to the first waiter.

i.e. the waiter does not have to compete.
how does it perform?
Run on a 12-core x86_64 SMP machine.

Lock & unlock a Go sync.Mutex 10M times in loop

(lock, increment an integer, unlock).

Measure average time taken per lock/unlock pair

(from within the program).
uncontended case (1 goroutine): ~13ns
contended case (12 goroutines): ~0.8us
how does it perform?
Contended case performance of C vs. Go:

Go initially performs better than C

but they ~converge as concurrency gets high enough.
}
how does it perform?
Contended case performance of C vs. Go:

Go initially performs better than C

but they ~converge as concurrency gets high enough.
}
}
uses a semaphore
sync.Mutex
&flag
G1 G3
G4
&other
the Go runtime semaphore’s
hash table for waiting goroutines:
each hash bucket needs a lock.
…and it’s a futex!
&flag
G1 G3
G4
&other
the Go runtime semaphore’s
hash table for waiting goroutines:
each hash bucket needs a lock.
…it’s a futex!
&flag
G1 G3
G4
&other &flag
G1
the Linux kernel’s futex hash table
for waiting threads:
each hash bucket needs a lock.
…it’s a spin lock!
each hash bucket needs a lock.
…it’s a futex!
the Go runtime semaphore’s
hash table for waiting goroutines:
&flag
G1 G3
G4
&other &flag
G1
each hash bucket needs a lock.
…it’s a spinlock!
each hash bucket needs a lock.
…it’s a futex!
the Go runtime semaphore’s
hash table for waiting goroutines:
the Linux kernel’s futex hash table
for waiting threads:
uses futexes
uses spin-locks
It’s locks all the way down!
uses a semaphore
sync.Mutex
let’s analyze its performance!
performance models for contention.
uncontended case

Cost of the atomic CAS.
contended case
In the worst-case, cost of failed atomic operations + spinning + goroutine context switch + 

thread context switch.
….But really, depends on degree of contention.
how many threads do we need to support a target throughput? 

while keeping response time the same.
how does response time change with the number of threads?
assuming a constant workload.
“How does application performance change with concurrency?”
Amdahl’s Law
Speed-up depends on the fraction of the workload that can be parallelized (p).
speed-up with N threads = 1
(1 — p) + p
N
a simple experiment
Measure time taken to complete a fixed workload.

serial fraction holds a lock (sync.Mutex).
scale parallel fraction (p) from 0.25 to 0.75
measure time taken for number of goroutines (N) = 1 —> 12.
p = 0.75
p = 0.25
Amdahl’s Law
Speed-up depends on the fraction of the workload that can be parallelized (p).
Universal Scalability Law (USL)
• contention penalty

due to serialization for shared resources.

examples: lock contention, database
contention.

• crosstalk penalty

due to coordination for coherence.
examples: servers coordinating to synchronize

mutable state.
αN
Scalability depends on contention and cross-talk.
Universal Scalability Law (USL)
• contention penalty

due to serialization for shared resources.

examples: lock contention, database
contention.

• crosstalk penalty

due to coordination for coherence.
examples: servers coordinating to synchronize

mutable state.
αN
Scalability depends on contention and cross-talk.
βN2
Universal Scalability Law (USL)
N
(αN + βN2 + C)
N
C
N
(αN + C)
contention and crosstalk
linear scaling
contention
throughput
concurrency
throughput of N threads = N
(αN + βN2 + C)
p = 0.75p = 0.25
USL curves
plotted using the R usl package
p = parallel fraction of workload
let’s use it, smartly!
a few closing strategies.
but first, profile!
Go mutex
• Go mutex contention profiler

https://golang.org/doc/diagnostics.html
Linux
• perf-lock:

perf examples by Brendan Gregg

Brendan Gregg article on off-cpu analysis
• eBPF:

example bcc tool to measure user lock contention
• Dtrace, systemtap
• mutrace, Valgrind-drd

pprof mutex contention profile
strategy I: don’t use a lock
• remove the need for synchronization from hot-paths:

typically involves rearchitecting.
• reduce the number of lock operations:

doing more thread local work, buffering, batching, copy-on-write.
• use atomic operations.
• use lock-free data structures

see: http://www.1024cores.net/
strategy II: granular locks
• shard data:

but ensure no false sharing, by padding to cache line size.

examples: 

go runtime semaphore’s hash table buckets;

Linux scheduler’s per-CPU runqueues;

Go scheduler’s per-CPU runqueues;
• use read-write locks
scheduler benchmark
(CreateGoroutineParallel)
modified scheduler: global lock; runqueue
go scheduler: per-CPU core, lock-free runqueues
strategy III: do less serial work
lock contention causes ~10x latency
latency
time time
smaller critical section change
• move computation out of critical section:

typically involves rearchitecting.
bonus strategy:
• contention-aware schedulers
example: Contention-aware scheduling in MySQL 8.0 Innodb
Special thanks to Eben Freeman, Justin Delegard, Austin Duffield for reading drafts of this.
@kavya719
speakerdeck.com/kavya719/lets-talk-locks
References

Jeff Preshing’s excellent blog series

Memory Barriers: A Hardware View for Software Hackers

LWN.net on futexes

The Go source code
The Universal Scalability Law Manifesto, Neil Gunther
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
go-locks/

More Related Content

What's hot

Kernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel Proc Connector and Containers
Kernel Proc Connector and Containers
Kernel TLV
 
How to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBHow to Achieve Scale with MongoDB
How to Achieve Scale with MongoDB
MongoDB
 
Let's Learn to Talk to GC Logs in Java 9
Let's Learn to Talk to GC Logs in Java 9Let's Learn to Talk to GC Logs in Java 9
Let's Learn to Talk to GC Logs in Java 9
Poonam Bajaj Parhar
 
How to analyze and tune sql queries for better performance percona15
How to analyze and tune sql queries for better performance percona15How to analyze and tune sql queries for better performance percona15
How to analyze and tune sql queries for better performance percona15
oysteing
 
Embedded Rust on IoT devices
Embedded Rust on IoT devicesEmbedded Rust on IoT devices
Embedded Rust on IoT devices
Lars Gregori
 
LinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughLinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking Walkthrough
Thomas Graf
 
How to use histograms to get better performance
How to use histograms to get better performanceHow to use histograms to get better performance
How to use histograms to get better performance
MariaDB plc
 
Application Profiling for Memory and Performance
Application Profiling for Memory and PerformanceApplication Profiling for Memory and Performance
Application Profiling for Memory and Performance
pradeepfn
 
Tiered Compilation in Hotspot JVM
Tiered Compilation in Hotspot JVMTiered Compilation in Hotspot JVM
Tiered Compilation in Hotspot JVM
Igor Veresov
 
MyRocks Deep Dive
MyRocks Deep DiveMyRocks Deep Dive
MyRocks Deep Dive
Yoshinori Matsunobu
 
Pentesting like a grandmaster BSides London 2013
Pentesting like a grandmaster BSides London 2013Pentesting like a grandmaster BSides London 2013
Pentesting like a grandmaster BSides London 2013
Abraham Aranguren
 
When is MyRocks good?
When is MyRocks good? When is MyRocks good?
When is MyRocks good?
Alkin Tezuysal
 
Logstash
LogstashLogstash
Logstash
琛琳 饶
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
Brendan Gregg
 
Switchdev - No More SDK
Switchdev - No More SDKSwitchdev - No More SDK
Switchdev - No More SDK
Kernel TLV
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
Brendan Gregg
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
MIJIN AN
 
Binderのはじめの一歩とAndroid
Binderのはじめの一歩とAndroidBinderのはじめの一歩とAndroid
Binderのはじめの一歩とAndroidl_b__
 
オレ流のOpenJDKの開発環境(JJUG CCC 2019 Fall講演資料)
オレ流のOpenJDKの開発環境(JJUG CCC 2019 Fall講演資料)オレ流のOpenJDKの開発環境(JJUG CCC 2019 Fall講演資料)
オレ流のOpenJDKの開発環境(JJUG CCC 2019 Fall講演資料)
NTT DATA Technology & Innovation
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
ScyllaDB
 

What's hot (20)

Kernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel Proc Connector and Containers
Kernel Proc Connector and Containers
 
How to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBHow to Achieve Scale with MongoDB
How to Achieve Scale with MongoDB
 
Let's Learn to Talk to GC Logs in Java 9
Let's Learn to Talk to GC Logs in Java 9Let's Learn to Talk to GC Logs in Java 9
Let's Learn to Talk to GC Logs in Java 9
 
How to analyze and tune sql queries for better performance percona15
How to analyze and tune sql queries for better performance percona15How to analyze and tune sql queries for better performance percona15
How to analyze and tune sql queries for better performance percona15
 
Embedded Rust on IoT devices
Embedded Rust on IoT devicesEmbedded Rust on IoT devices
Embedded Rust on IoT devices
 
LinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughLinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking Walkthrough
 
How to use histograms to get better performance
How to use histograms to get better performanceHow to use histograms to get better performance
How to use histograms to get better performance
 
Application Profiling for Memory and Performance
Application Profiling for Memory and PerformanceApplication Profiling for Memory and Performance
Application Profiling for Memory and Performance
 
Tiered Compilation in Hotspot JVM
Tiered Compilation in Hotspot JVMTiered Compilation in Hotspot JVM
Tiered Compilation in Hotspot JVM
 
MyRocks Deep Dive
MyRocks Deep DiveMyRocks Deep Dive
MyRocks Deep Dive
 
Pentesting like a grandmaster BSides London 2013
Pentesting like a grandmaster BSides London 2013Pentesting like a grandmaster BSides London 2013
Pentesting like a grandmaster BSides London 2013
 
When is MyRocks good?
When is MyRocks good? When is MyRocks good?
When is MyRocks good?
 
Logstash
LogstashLogstash
Logstash
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 
Switchdev - No More SDK
Switchdev - No More SDKSwitchdev - No More SDK
Switchdev - No More SDK
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 
Binderのはじめの一歩とAndroid
Binderのはじめの一歩とAndroidBinderのはじめの一歩とAndroid
Binderのはじめの一歩とAndroid
 
オレ流のOpenJDKの開発環境(JJUG CCC 2019 Fall講演資料)
オレ流のOpenJDKの開発環境(JJUG CCC 2019 Fall講演資料)オレ流のOpenJDKの開発環境(JJUG CCC 2019 Fall講演資料)
オレ流のOpenJDKの開発環境(JJUG CCC 2019 Fall講演資料)
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 

Similar to Let's Talk Locks!

Medical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUsMedical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUs
Daniel Blezek
 
.NET Multithreading/Multitasking
.NET Multithreading/Multitasking.NET Multithreading/Multitasking
.NET Multithreading/Multitasking
Sasha Kravchuk
 
concurrency
concurrencyconcurrency
concurrency
Jonathan Wagoner
 
Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Eu...
Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Eu...Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Eu...
Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Eu...
Jérôme Petazzoni
 
Cgroups, namespaces and beyond: what are containers made from?
Cgroups, namespaces and beyond: what are containers made from?Cgroups, namespaces and beyond: what are containers made from?
Cgroups, namespaces and beyond: what are containers made from?
Docker, Inc.
 
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMPAlgoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Pier Luca Lanzi
 
Introduction to containers
Introduction to containersIntroduction to containers
Introduction to containers
Nitish Jadia
 
Lightweight Virtualization with Linux Containers and Docker | YaC 2013
Lightweight Virtualization with Linux Containers and Docker | YaC 2013Lightweight Virtualization with Linux Containers and Docker | YaC 2013
Lightweight Virtualization with Linux Containers and Docker | YaC 2013
dotCloud
 
Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Lightweight Virtualization with Linux Containers and Docker I YaC 2013Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Docker, Inc.
 
MyShell - English
MyShell - EnglishMyShell - English
MyShell - English
Johnnatan Messias
 
[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis
Moabi.com
 
Asynchronous Python A Gentle Introduction
Asynchronous Python A Gentle IntroductionAsynchronous Python A Gentle Introduction
Asynchronous Python A Gentle Introduction
PyData
 
Here comes the Loom - Ya!vaConf.pdf
Here comes the Loom - Ya!vaConf.pdfHere comes the Loom - Ya!vaConf.pdf
Here comes the Loom - Ya!vaConf.pdf
Krystian Zybała
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12x
rkr10
 
Linux multiplexing
Linux multiplexingLinux multiplexing
Linux multiplexing
Mark Veltzer
 
Lightweight Virtualization: LXC containers & AUFS
Lightweight Virtualization: LXC containers & AUFSLightweight Virtualization: LXC containers & AUFS
Lightweight Virtualization: LXC containers & AUFS
Jérôme Petazzoni
 
Node.js Course 1 of 2 - Introduction and first steps
Node.js Course 1 of 2 - Introduction and first stepsNode.js Course 1 of 2 - Introduction and first steps
Node.js Course 1 of 2 - Introduction and first steps
Manuel Eusebio de Paz Carmona
 
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo..."Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
Yandex
 
The art of concurrent programming
The art of concurrent programmingThe art of concurrent programming
The art of concurrent programming
Iskren Chernev
 
Programming with Threads in Java
Programming with Threads in JavaProgramming with Threads in Java
Programming with Threads in Java
koji lin
 

Similar to Let's Talk Locks! (20)

Medical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUsMedical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUs
 
.NET Multithreading/Multitasking
.NET Multithreading/Multitasking.NET Multithreading/Multitasking
.NET Multithreading/Multitasking
 
concurrency
concurrencyconcurrency
concurrency
 
Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Eu...
Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Eu...Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Eu...
Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Eu...
 
Cgroups, namespaces and beyond: what are containers made from?
Cgroups, namespaces and beyond: what are containers made from?Cgroups, namespaces and beyond: what are containers made from?
Cgroups, namespaces and beyond: what are containers made from?
 
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMPAlgoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP
 
Introduction to containers
Introduction to containersIntroduction to containers
Introduction to containers
 
Lightweight Virtualization with Linux Containers and Docker | YaC 2013
Lightweight Virtualization with Linux Containers and Docker | YaC 2013Lightweight Virtualization with Linux Containers and Docker | YaC 2013
Lightweight Virtualization with Linux Containers and Docker | YaC 2013
 
Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Lightweight Virtualization with Linux Containers and Docker I YaC 2013Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Lightweight Virtualization with Linux Containers and Docker I YaC 2013
 
MyShell - English
MyShell - EnglishMyShell - English
MyShell - English
 
[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis
 
Asynchronous Python A Gentle Introduction
Asynchronous Python A Gentle IntroductionAsynchronous Python A Gentle Introduction
Asynchronous Python A Gentle Introduction
 
Here comes the Loom - Ya!vaConf.pdf
Here comes the Loom - Ya!vaConf.pdfHere comes the Loom - Ya!vaConf.pdf
Here comes the Loom - Ya!vaConf.pdf
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12x
 
Linux multiplexing
Linux multiplexingLinux multiplexing
Linux multiplexing
 
Lightweight Virtualization: LXC containers & AUFS
Lightweight Virtualization: LXC containers & AUFSLightweight Virtualization: LXC containers & AUFS
Lightweight Virtualization: LXC containers & AUFS
 
Node.js Course 1 of 2 - Introduction and first steps
Node.js Course 1 of 2 - Introduction and first stepsNode.js Course 1 of 2 - Introduction and first steps
Node.js Course 1 of 2 - Introduction and first steps
 
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo..."Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
 
The art of concurrent programming
The art of concurrent programmingThe art of concurrent programming
The art of concurrent programming
 
Programming with Threads in Java
Programming with Threads in JavaProgramming with Threads in Java
Programming with Threads in Java
 

More from C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
C4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
C4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
C4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
C4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
C4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
C4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
C4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
C4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
C4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
C4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
C4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
C4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
C4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
C4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
C4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
C4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
C4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
C4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
C4Media
 

More from C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Recently uploaded

AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
saastr
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 

Recently uploaded (20)

AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 

Let's Talk Locks!

  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ go-locks/
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 7. “locks are slow” lock contention causes ~10x latency latency(ms) time
  • 8. “locks are slow” …but they’re used everywhere. from schedulers to databases and web servers. lock contention causes ~10x latency latency(ms) time
  • 9. “locks are slow” …but they’re used everywhere. from schedulers to databases and web servers. lock contention causes ~10x latency latency(ms) time ?
  • 10. let’s analyze its performance! performance models for contention let’s build a lock! a tour through lock internals let’s use it, smartly! a few closing strategies
  • 11. our case-study Lock implementations are hardware, ISA, OS and language specific:
 
 We assume an x86_64 SMP machine running a modern Linux.
 We’ll look at the lock implementation in Go 1.12. CPU 0 CPU 1 cache cache interconnect memory simplified SMP system diagram
  • 12. use as you would threads 
 > go handle_request(r)
 but user-space threads:
 managed entirely by the Go runtime, not the operating system. The unit of concurrent execution: goroutines. a brief go primer
  • 13. use as you would threads 
 > go handle_request(r)
 but user-space threads:
 managed entirely by the Go runtime, not the operating system. The unit of concurrent execution: goroutines. a brief go primer Data shared between goroutines must be synchronized. One way is to use the blocking, non-recursive lock construct: > var mu sync.Mutex
 mu.Lock()
 … mu.Unlock()
  • 14. let’s build a lock! a tour through lock internals.
  • 15. want: “mutual exclusion” only one thread has access to shared data at any given time
  • 16. T1 running on CPU 1 T2 running on CPU 2 func reader() {
 // Read a task
 t := tasks.get()
 
 // Do something with it.
 ... } func writer() {
 // Write to tasks
 tasks.put(t) } // track whether tasks is // available (0) or not (1) // shared ring buffer var tasks Tasks want: “mutual exclusion” only one thread has access to shared data at any given time
  • 17. func reader() {
 // Read a task
 t := tasks.get()
 
 // Do something with it.
 ... } func writer() {
 // Write to tasks
 tasks.put(t) } // track whether tasks is // available (0) or not (1) // shared ring buffer var tasks Tasks want: “mutual exclusion” …idea! use a flag? T1 running on CPU 1 T2 running on CPU 2
  • 18. // track whether tasks can be // accessed (0) or not (1) var flag int var tasks Tasks
  • 19. // track whether tasks can be // accessed (0) or not (1) var flag int var tasks Tasks func reader() {
 for { /* If flag is 0, can access tasks. */
 if flag == 0 {
 /* Set flag */ flag++ ...
 /* Unset flag */ flag-- return } /* Else, keep looping. */ 
 } } T1 running on CPU 1
  • 20. // track whether tasks can be // accessed (0) or not (1) var flag int var tasks Tasks func reader() {
 for { /* If flag is 0, can access tasks. */
 if flag == 0 {
 /* Set flag */ flag++ ...
 /* Unset flag */ flag-- return } /* Else, keep looping. */ 
 } } func writer() {
 for { /* If flag is 0, can access tasks. */
 if flag == 0 {
 /* Set flag */ flag++ ...
 /* Unset flag */ flag-- return } /* Else, keep looping. */ 
 } } T1 running on CPU 1 T2 running on CPU 2
  • 21. // track whether tasks can be // accessed (0) or not (1) var flag int var tasks Tasks func reader() {
 for { /* If flag is 0, can access tasks. */
 if flag == 0 {
 /* Set flag */ flag++ ...
 /* Unset flag */ flag-- return } /* Else, keep looping. */ 
 } } func writer() {
 for { /* If flag is 0, can access tasks. */
 if flag == 0 {
 /* Set flag */ flag++ ...
 /* Unset flag */ flag-- return } /* Else, keep looping. */ 
 } } T1 running on CPU 1 T2 running on CPU 2
  • 23. flag++ CPU memory 1. Read (0) 2. Modify 3. Write (1) T1 running on CPU 1
  • 25. R R W flag++ if flag == 0 timeline of memory operations T1 running on CPU 1 T2 running on CPU 2 T2 may observe T1’s RMW half-complete
  • 26. atomicity A memory operation is non-atomic if it can be observed half-complete by another thread. An operation may be non-atomic because it:
 • uses multiple CPU instructions:
 operations on a large data structure; 
 compiler decisions.
 • use a single non-atomic CPU instruction:
 RMW instructions; unaligned loads and stores. > o := Order { id: 10, name: “yogi bear”, order: “pie”, count: 3, }
  • 27. atomicity A memory operation is non-atomic if it can be observed half-complete by another thread. An operation may be non-atomic because it:
 • uses multiple CPU instructions:
 operations on a large data structure; 
 compiler decisions.
 • uses a single non-atomic CPU instruction:
 RMW instructions; unaligned loads and stores. > flag++
  • 28. atomicity A memory operation is non-atomic if it can be observed half-complete by another thread. An operation may be non-atomic because it:
 • uses multiple CPU instructions:
 operations on a large data structure; 
 compiler decisions.
 • uses a single non-atomic CPU instruction:
 RMW instructions; unaligned loads and stores. > flag++ An atomic operation is an “indivisible” memory access. In x86_64, loads, stores that are 
 naturally aligned up to 64b.* guarantees the data item fits within a cache line;
 cache coherency guarantees a consistent view for a single cache line. * these are not the only guaranteed atomic operations.
  • 30. func reader() {
 for { /* If flag is 0, can access tasks. */
 if flag == 0 {
 /* Set flag */ flag = 1 t := tasks.get()
 ...
 /* Unset flag */ flag = 0 return } /* Else, keep looping. */ 
 } } T1 running on CPU 1
  • 31. the compiler may reorder operations. // Sets flag to 1 & reads data. func reader() { flag = 1 t := tasks.get() ... flag = 0
  • 32. the processor may reorder operations. StoreLoad reordering load t before store flag = 1 // Sets flag to 1 & reads data. func reader() { flag = 1 t := tasks.get() ... flag = 0
  • 33. memory access reordering The compiler, processor can reorder memory operations to optimize execution.
  • 34. memory access reordering The compiler, processor can reorder memory operations to optimize execution. • The only cardinal rule is sequential consistency for single threaded programs.
 • Other guarantees about compiler reordering are captured by a 
 language’s memory model:
 C++, Go guarantee data-race free programs will be sequentially consistent. • For processor reordering, by the hardware memory model:
 x86_64 provides Total Store Ordering (TSO).
  • 35. memory access reordering The compiler, processor can reorder memory operations to optimize execution. • The only cardinal rule is sequential consistency for single threaded programs.
 • Other guarantees about compiler reordering are captured by a 
 language’s memory model:
 C++, Go guarantee data-race free programs will be sequentially consistent. • For processor reordering, by the hardware memory model:
 x86_64 provides Total Store Ordering (TSO).
  • 36. memory access reordering The compiler, processor can reorder memory operations to optimize execution. • The only cardinal rule is sequential consistency for single threaded programs.
 • Other guarantees about compiler reordering are captured by a 
 language’s memory model:
 C++, Go guarantee data-race free programs will be sequentially consistent. • For processor reordering, by the hardware memory model:
 x86_64 provides Total Store Ordering (TSO). a relaxed consistency model. most reorderings are invalid but StoreLoad is game;
 allows processor to hide the latency of writes.
  • 37. nope; not atomic and no memory order guarantees. …idea! use a flag?
  • 38. nope; not atomic and no memory order guarantees. …idea! use a flag? need a construct that provides atomicity and prevents memory reordering.
  • 39. nope; not atomic and no memory order guarantees. …idea! use a flag? need a construct that provides atomicity and prevents memory reordering. …the hardware provides!
  • 40. For guaranteed atomicity and to prevent memory reordering. special hardware instructions x86 example: XCHG (exchange) these instructions are called memory barriers. they prevent reordering by the compiler too. x86 example: MFENCE, LFENCE, SFENCE.
  • 41. special hardware instructions The x86 LOCK instruction prefix provides both. Used to prefix memory access instructions: LOCK ADD For guaranteed atomicity and to prevent memory reordering. } atomic operations in languages like Go: atomic.Add atomic.CompareAndSwap
  • 42. special hardware instructions The x86 LOCK instruction prefix provides both. Used to prefix memory access instructions: LOCK ADD For guaranteed atomicity and to prevent memory reordering. } atomic operations in languages like Go: atomic.Add atomic.CompareAndSwapLOCK CMPXCHG Atomic compare-and-swap (CAS) conditionally updates a variable:
 checks if it has the expected value and if so, changes it to the desired value.
  • 43. the CAS succeeded; we set flag to 1. flag was 1 so our CAS failed; try again. var flag int var tasks Tasks func reader() { for { // Try to atomically CAS flag from 0 -> 1 if atomic.CompareAndSwap(&flag, 0, 1) { ... // Atomically set flag back to 0. atomic.Store(&flag, 0) return }
 // CAS failed, try again :) } } baby’s first lock
  • 44. var flag int var tasks Tasks func reader() { for { // Try to atomically CAS flag from 0 -> 1 if atomic.CompareAndSwap(&flag, 0, 1) { ... // Atomically set flag back to 0. atomic.Store(&flag, 0) return }
 // CAS failed, try again :) } } baby’s first lock: spinlocks This is a simplified spinlock. Spinlocks are used extensively in the Linux kernel.}
  • 45. The atomic CAS is the quintessence of any lock implementation.
  • 46. cost of an atomic operation Run on a 12-core x86_64 SMP machine.
 Atomic store to a C _Atomic int, 10M times in a tight loop. Measure average time taken per operation
 (from within the program). With 1 thread: ~13ns (vs. regular operation: ~2ns) With 12 cpu-pinned threads: ~110ns threads are effectively serialized var flag int var tasks Tasks func reader() { for { // Try to atomically CAS flag from 0 -> 1 if atomic.CompareAndSwap(&flag, 0, 1) { ... // Atomically set flag back to 0. atomic.Store(&flag, 0) return }
 // CAS failed, try again :) } } spinlocks
  • 47. sweet. We have a scheme for mutual exclusion that provides atomicity and memory ordering guarantees.
  • 48. sweet. …but spinning for long durations is wasteful; it takes away CPU time from other threads. We have a scheme for mutual exclusion that provides atomicity and memory ordering guarantees.
  • 49. sweet. …but spinning for long durations is wasteful; it takes away CPU time from other threads. We have a scheme for mutual exclusion that provides atomicity and memory ordering guarantees. enter the operating system!
  • 50. Linux’s futex Interface and mechanism for userspace code to ask the kernel to suspend/ resume threads. futex syscall kernel-managed queue
  • 51. flag can be 0: unlocked
 1: locked 2: there’s a waiter var flag int var tasks Tasks
  • 52. set flag to 2 (there’s a waiter) flag can be 0: unlocked
 1: locked 2: there’s a waiter futex syscall to tell the kernel to suspend us until flag changes. when we’re resumed, we’ll CAS again. var flag int var tasks Tasks func reader() { for { if atomic.CompareAndSwap(&flag, 0, 1) { ... }
 // CAS failed, set flag to sleeping. v := atomic.Xchg(&flag, 2) // and go to sleep. futex(&flag, FUTEX_WAIT, ...)
 } } T1’s CAS fails
 (because T2 has set the flag) T1
  • 53. in the kernel: keyA (from the userspace address:
 &flag) keyA T1 futex_q 1. arrange for thread to be resumed in the future:
 add an entry for this thread in the kernel queue for the address we care about
  • 54. in the kernel: keyA (from the userspace address:
 &flag) keyA T1 futex_q keyother Tother futex_q keyother hash(keyA) 1. arrange for thread to be resumed in the future:
 add an entry for this thread in the kernel queue for the address we care about
  • 55. in the kernel: keyA (from the userspace address:
 &flag) keyA T1 futex_q keyother Tother futex_q keyother hash(keyA) 1. arrange for thread to be resumed in the future:
 add an entry for this thread in the kernel queue for the address we care about 2. deschedule the calling thread to suspend it.
  • 56. T2 is done
 (accessing the shared data) T2 func writer() { for { if atomic.CompareAndSwap(&flag, 0, 1) { ... 
 // Set flag to unlocked. v := atomic.Xchg(&flag, 0) if v == 2 { // If there was a waiter, issue a wake up. futex(&flag, FUTEX_WAKE, ...) } return }
 v := atomic.Xchg(&flag, 2) futex(&flag, FUTEX_WAIT, …) } }
  • 57. T2 is done
 (accessing the shared data) T2 func writer() { for { if atomic.CompareAndSwap(&flag, 0, 1) { ... 
 // Set flag to unlocked. v := atomic.Xchg(&flag, 0) if v == 2 { // If there was a waiter, issue a wake up. futex(&flag, FUTEX_WAKE, ...) } return }
 v := atomic.Xchg(&flag, 2) futex(&flag, FUTEX_WAIT, …) } } if flag was 2, there’s at least one waiter futex syscall to tell the kernel to wake a waiter up.
  • 58. func writer() { for { if atomic.CompareAndSwap(&flag, 0, 1) { ... 
 // Set flag to unlocked. v := atomic.Xchg(&flag, 0) if v == 2 { // If there was a waiter, issue a wake up. futex(&flag, FUTEX_WAKE, ...) } return }
 v := atomic.Xchg(&flag, 2) futex(&flag, FUTEX_WAIT, …) } } if flag was 2, there’s at least one waiter futex syscall to tell the kernel to wake a waiter up. hashes the key walks the hash bucket’s futex queue finds the first thread waiting on the address schedules it to run again! } T2 is done
 (accessing the shared data) T2
  • 59. pretty convenient! pthread mutexes use futexes. That was a hella simplified futex. …but we still have a nice, lightweight primitive to build synchronization constructs.
  • 60. cost of a futex Run on a 12-core x86_64 SMP machine.
 Lock & unlock a pthread mutex 10M times in loop
 (lock, increment an integer, unlock).
 Measure average time taken per lock/unlock pair
 (from within the program). uncontended case (1 thread): ~13ns contended case (12 cpu-pinned threads): ~0.9us
  • 61. cost of a futex Run on a 12-core x86_64 SMP machine.
 Lock & unlock a pthread mutex 10M times in loop
 (lock, increment an integer, unlock).
 Measure average time taken per lock/unlock pair
 (from within the program). uncontended case (1 thread): ~13ns contended case (12 cpu-pinned threads): ~0.9us cost of the user-space atomic CAS = ~13ns } cost of the atomic CAS + syscall + thread context switch = ~0.9us }
  • 62. spinning vs. sleeping Spinning makes sense for short durations; it keeps the thread on the CPU. The trade-off is it uses CPU cycles not making progress. So at some point, it makes sense to pay the cost of the context switch to go to sleep. There are smart “hybrid” futexes:
 CAS-spin a small, fixed number of times —> if that didn’t lock, make the futex syscall. Example: the Go runtime’s futex implementation.
  • 63. spinning vs. sleeping Spinning makes sense for short durations; it keeps the thread on the CPU. The trade-off is it uses CPU cycles not making progress. So at some point, it makes sense to pay the cost of the context switch to go to sleep. There are smart “hybrid” futexes:
 CAS-spin a small, fixed number of times —> if that didn’t lock, make the futex syscall. Examples: the Go runtime’s futex implementation; a variant of the pthread_mutex.
  • 64. …can we do better for user-space threads?
  • 65. …can we do better for user-space threads? goroutines are user-space threads. The go runtime multiplexes them onto threads. lighter-weight and cheaper than threads:
 goroutine switches = ~tens of ns; 
 thread switches = ~a µs. CPU core g1 g6g2 thread CPU core } OS scheduler Go scheduler }
  • 66. …can we do better for user-space threads? goroutines are user-space threads. The go runtime multiplexes them onto threads. lighter-weight and cheaper than threads:
 goroutine switches = ~tens of ns; 
 thread switches = ~a µs. CPU core g1 g6g2 thread CPU core } OS scheduler Go scheduler } we can block the goroutine without blocking the underlying thread! to avoid the thread context switch cost.
  • 67. This is what the Go runtime’s semaphore does!
 The semaphore is conceptually very similar to futexes in Linux*, but it is used to 
 sleep/wake goroutines: a goroutine that blocks on a mutex is descheduled, but not the underlying thread. the goroutine wait queues are managed by the runtime, in user-space. * There are, of course, differences in implementation though.
  • 68. the goroutine wait queues are managed by the Go runtime, in user-space. var flag int var tasks Tasks func reader() { for { // Attempt to CAS flag. if atomic.CompareAndSwap(&flag, ...) { ... }
 // CAS failed; add G1 as a waiter for flag. root.queue() // and to sleep. futex(&flag, FUTEX_WAIT, ...) } } G1’s CAS fails
 (because G2 has set the flag) G1
  • 69. &flag (the userspace address) &flag G1 G3 G4 &other hash(&flag) } the top-level waitlist for a hash bucket is implemented as a treap } there’s a second-level wait queue 
 for each unique address the goroutine wait queues (in user-space, managed by the go runtime)
  • 70. the goroutine wait queues are managed by the Go runtime, in user-space. var flag int var tasks Tasks func reader() { for { // Attempt to CAS flag. if atomic.CompareAndSwap(&flag, ...) { ... }
 // CAS failed; add G1 as a waiter for flag. root.queue() // and suspend G1. gopark() } } G1’s CAS fails
 (because G2 has set the flag) G1 the Go runtime deschedules the goroutine; keeps the thread running!
  • 71. G2’s done
 (accessing the shared data) G2 func writer() { for { if atomic.CompareAndSwap(&flag, 0, 1) { ... 
 // Set flag to unlocked. atomic.Xadd(&flag, ...)
 
 // If there’s a waiter, reschedule it. waiter := root.dequeue(&flag) goready(waiter) return }
 root.queue() gopark() } } find the first waiter goroutine and reschedule it ]
  • 72. this is clever. Avoids the hefty thread context switch cost in the contended case,
 up to a point.
  • 73. this is clever. Avoids the hefty thread context switch cost in the contended case,
 up to a point. but…
  • 74. func reader() { for { if atomic.CompareAndSwap(&flag, ...) { ... }
 // CAS failed; add G1 as a waiter for flag. semaroot.queue() // and suspend G1. gopark() } } once G1 is resumed, 
 it will try to CAS again. Resumed goroutines have to compete with any other goroutines trying to CAS.
 
 They will likely lose:
 there’s a delay between when the flag was set to 0 and this goroutine was rescheduled.. G1
  • 75. Resumed goroutines have to compete with any other goroutines trying to CAS.
 
 They will likely lose:
 there’s a delay between when the flag was set to 0 and this goroutine was rescheduled.. // Set flag to unlocked. atomic.Xadd(&flag, …)
 
 // If there’s a waiter, reschedule it. waiter := root.dequeue(&flag) goready(waiter) return
  • 76. Resumed goroutines have to compete with any other goroutines trying to CAS.
 
 They will likely lose:
 there’s a delay between when the flag was set to 0 and this goroutine was rescheduled.. So, the semaphore implementation may end up:
 • unnecessarily resuming a waiter goroutine
 results in a goroutine context switch again.
 • cause goroutine starvation
 can result in long wait times, high tail latencies.
  • 77. Resumed goroutines have to compete with any other goroutines trying to CAS.
 
 They will likely lose:
 there’s a delay between when the flag was set to 0 and this goroutine was rescheduled.. So, the semaphore implementation may end up:
 • unnecessarily resuming a waiter goroutine
 results in a goroutine context switch again.
 • cause goroutine starvation
 can result in long wait times, high tail latencies. the sync.Mutex implementation adds a layer that fixes these.
  • 78. go’s sync.Mutex Is a hybrid lock that uses a semaphore to sleep / wake goroutines.
  • 79. go’s sync.Mutex Additionally, it tracks extra state to: Is a hybrid lock that uses a semaphore to sleep / wake goroutines. prevent unnecessarily waking up a goroutine
 “There’s a goroutine actively trying to CAS”: An unlock in this case does not wake a waiter.
 prevent severe goroutine starvation “a waiter has been waiting”: If a waiter is resumed but loses the CAS again, it’s queued at the head of the wait queue.
 If a waiter fails to lock for 1ms, switch the mutex to “starvation mode”. prevent unnecessarily waking up a goroutine
 “There’s a goroutine actively trying to CAS”: An unlock in this case does not wake a waiter.
 prevent severe goroutine starvation “a waiter has been waiting”: If a waiter is resumed but loses the CAS again, it’s queued at the head of the wait queue.
 If a waiter fails to lock for 1ms, switch the mutex to “starvation mode”.
  • 80. go’s sync.Mutex Additionally, it tracks extra state to: Is a hybrid lock that uses a semaphore to sleep / wake goroutines. prevent unnecessarily waking up a goroutine
 “There’s a goroutine actively trying to CAS”: An unlock in this case does not wake a waiter.
 prevent severe goroutine starvation “a waiter has been waiting”: If a waiter is resumed but loses the CAS again, it’s queued at the head of the wait queue.
 If a waiter fails to lock for 1ms, switch the mutex to “starvation mode”. other goroutines cannot CAS, they must queue The unlock hands the mutex off to the first waiter.
 i.e. the waiter does not have to compete.
  • 81. how does it perform? Run on a 12-core x86_64 SMP machine.
 Lock & unlock a Go sync.Mutex 10M times in loop
 (lock, increment an integer, unlock).
 Measure average time taken per lock/unlock pair
 (from within the program). uncontended case (1 goroutine): ~13ns contended case (12 goroutines): ~0.8us
  • 82. how does it perform? Contended case performance of C vs. Go:
 Go initially performs better than C
 but they ~converge as concurrency gets high enough. }
  • 83. how does it perform? Contended case performance of C vs. Go:
 Go initially performs better than C
 but they ~converge as concurrency gets high enough. } }
  • 85. &flag G1 G3 G4 &other the Go runtime semaphore’s hash table for waiting goroutines: each hash bucket needs a lock. …and it’s a futex!
  • 86. &flag G1 G3 G4 &other the Go runtime semaphore’s hash table for waiting goroutines: each hash bucket needs a lock. …it’s a futex!
  • 87. &flag G1 G3 G4 &other &flag G1 the Linux kernel’s futex hash table for waiting threads: each hash bucket needs a lock. …it’s a spin lock! each hash bucket needs a lock. …it’s a futex! the Go runtime semaphore’s hash table for waiting goroutines:
  • 88. &flag G1 G3 G4 &other &flag G1 each hash bucket needs a lock. …it’s a spinlock! each hash bucket needs a lock. …it’s a futex! the Go runtime semaphore’s hash table for waiting goroutines: the Linux kernel’s futex hash table for waiting threads:
  • 89. uses futexes uses spin-locks It’s locks all the way down! uses a semaphore sync.Mutex
  • 90. let’s analyze its performance! performance models for contention.
  • 91. uncontended case
 Cost of the atomic CAS. contended case In the worst-case, cost of failed atomic operations + spinning + goroutine context switch + 
 thread context switch. ….But really, depends on degree of contention.
  • 92. how many threads do we need to support a target throughput? 
 while keeping response time the same. how does response time change with the number of threads? assuming a constant workload. “How does application performance change with concurrency?”
  • 93. Amdahl’s Law Speed-up depends on the fraction of the workload that can be parallelized (p). speed-up with N threads = 1 (1 — p) + p N
  • 94. a simple experiment Measure time taken to complete a fixed workload.
 serial fraction holds a lock (sync.Mutex). scale parallel fraction (p) from 0.25 to 0.75 measure time taken for number of goroutines (N) = 1 —> 12.
  • 95. p = 0.75 p = 0.25 Amdahl’s Law Speed-up depends on the fraction of the workload that can be parallelized (p).
  • 96. Universal Scalability Law (USL) • contention penalty
 due to serialization for shared resources.
 examples: lock contention, database contention.
 • crosstalk penalty
 due to coordination for coherence. examples: servers coordinating to synchronize
 mutable state. αN Scalability depends on contention and cross-talk.
  • 97. Universal Scalability Law (USL) • contention penalty
 due to serialization for shared resources.
 examples: lock contention, database contention.
 • crosstalk penalty
 due to coordination for coherence. examples: servers coordinating to synchronize
 mutable state. αN Scalability depends on contention and cross-talk. βN2
  • 98. Universal Scalability Law (USL) N (αN + βN2 + C) N C N (αN + C) contention and crosstalk linear scaling contention throughput concurrency throughput of N threads = N (αN + βN2 + C)
  • 99. p = 0.75p = 0.25 USL curves plotted using the R usl package p = parallel fraction of workload
  • 100. let’s use it, smartly! a few closing strategies.
  • 101. but first, profile! Go mutex • Go mutex contention profiler
 https://golang.org/doc/diagnostics.html Linux • perf-lock:
 perf examples by Brendan Gregg
 Brendan Gregg article on off-cpu analysis • eBPF:
 example bcc tool to measure user lock contention • Dtrace, systemtap • mutrace, Valgrind-drd
 pprof mutex contention profile
  • 102. strategy I: don’t use a lock • remove the need for synchronization from hot-paths:
 typically involves rearchitecting. • reduce the number of lock operations:
 doing more thread local work, buffering, batching, copy-on-write. • use atomic operations. • use lock-free data structures
 see: http://www.1024cores.net/
  • 103. strategy II: granular locks • shard data:
 but ensure no false sharing, by padding to cache line size.
 examples: 
 go runtime semaphore’s hash table buckets;
 Linux scheduler’s per-CPU runqueues;
 Go scheduler’s per-CPU runqueues; • use read-write locks scheduler benchmark (CreateGoroutineParallel) modified scheduler: global lock; runqueue go scheduler: per-CPU core, lock-free runqueues
  • 104. strategy III: do less serial work lock contention causes ~10x latency latency time time smaller critical section change • move computation out of critical section:
 typically involves rearchitecting.
  • 105. bonus strategy: • contention-aware schedulers example: Contention-aware scheduling in MySQL 8.0 Innodb
  • 106. Special thanks to Eben Freeman, Justin Delegard, Austin Duffield for reading drafts of this. @kavya719 speakerdeck.com/kavya719/lets-talk-locks References
 Jeff Preshing’s excellent blog series
 Memory Barriers: A Hardware View for Software Hackers
 LWN.net on futexes
 The Go source code The Universal Scalability Law Manifesto, Neil Gunther
  • 107. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ go-locks/