Concurrency on the JVM, showing some of the nuts and bolts of Akka (I presume .. it's not first-hand knowledge, just informed speculation). The Java Memory Model, thread pools, actors and the like will be covered.
Concurrency on the JVM
.. or some of the nuts and bolts of Akka
23 July 2013
• In general, a (random) selection of (more or less loosely coupled)
points I would like to address
• Low-level concurrency - only once you understand
the complexity will you appreciate the solution :)
• Thread pools, contention issues around them and the
enlightened path to Akka
• What’s missing: A lot. Software-transactional
memory (Clojure), data-flow concurrency, futures
and more theory I wanted to cover
We’ll focus on utilisation
• “The number of idle cores on my machine doubles
every two years” - Sander Mak (DZone interview)
• The distinction between low latency (produce one answer fast) and
high throughput (produce lots of answers fast) is somewhat fuzzy
• Locks are not expensive, lock contention is - don’t
shoot the messenger!
Does contention on junctions arise because of traffic lights or because of bad traffic planning?
• Most locking in Java programs is not only
uncontended, but also unshared
• Rule of thumb: Think about contention first, and
only then worry about your locking.
See also: Brian Goetz, “Threading lightly, Part 1: Synchronization is not the enemy”
Note: When benchmarking your application, don’t deliberately provoke
contention that wouldn’t arise otherwise!
Synchronisation is not the enemy
Synchronised on the JVM
• Optimised for the uncontended case (i.e. the usual one) - can
be handled entirely within the JVM (i.e. no OS calls)
• Lightweight locking based on CAS instructions
Implementation of thin locks on IBM’s
version of the JDK 1.1.2 for AIX (yes, yes, ..
totally outdated, but you get the idea ..)
See also: http://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon98Thin.pdf
Locking benchmarks (1)
Shamelessly taken from: http://www.ibm.com/developerworks/library/j-jtp11234/
Not the code used in the
benchmarks(!) - this is just
to illustrate the idea (and to show
off my uber non-blocking code)
* You are allowed to keep any bugs
you find at your discretion.
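Since the benchmark code isn’t reproduced here, a minimal non-blocking counter in the same spirit might look like the following sketch (my own illustration, not the code from the benchmark; class and method names are made up):

```java
import java.util.concurrent.atomic.AtomicInteger;

// A minimal non-blocking counter: instead of taking a lock, we
// optimistically read the current value and try to CAS in the new one,
// retrying if another thread got in between.
public class NonBlockingCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    public int increment() {
        int current;
        do {
            current = value.get();
            // compareAndSet only succeeds if nobody changed the value
            // between our read and this write
        } while (!value.compareAndSet(current, current + 1));
        return current + 1;
    }

    public int get() {
        return value.get();
    }

    public static void main(String[] args) {
        NonBlockingCounter c = new NonBlockingCounter();
        c.increment();
        c.increment();
        System.out.println(c.get()); // prints 2
    }
}
```

Under the covers this is the same CAS instruction that the JVM’s lightweight locking uses, just applied directly to the data.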
Locking benchmarks (2)
See also: Brian Goetz, “Java Concurrency in Practice”, Chapter 15
“With low to moderate contention, atomics offer better
scalability; with high contention, locks offer better
contention avoidance.” - Brian Goetz
• Think roundabouts vs. traffic lights
• The benchmark is deceptive as it produces an
unusually high amount of contention - under
more realistic loads, atomics scale quite nicely
• Actual lesson learned: Always measure
yourself before you assume anything!
There are no general performance rules!
[Graph: throughput vs. number of threads (2, 4, 8, 16, 32, 64), locks vs. atomics]
Note: Graph is not based on values I measured, it’s from JCIP .. and I didn’t use a ruler
to measure points in the pictures. It’s not accurate and doesn’t aim to be!
• Incohesive classes tend to end up with coarse locks guarding unrelated state ..
• .. at least make your locks cohesive by splitting them
(even better: write cohesive classes to begin with!)
• Lock splitting is only a short-term fix for contention - as
soon as the load doubles, you’re back where you started
See also: Brian Goetz, “Java Concurrency in Practice”, Chapter 11
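In code, lock splitting might look like this sketch (class and field names are hypothetical, loosely following the shape of the examples in JCIP Chapter 11):

```java
// Lock splitting: guard independent parts of the state with independent
// locks, so that threads touching different parts don't contend.
public class ServerStats {
    private final Object userLock  = new Object();
    private final Object queryLock = new Object();

    private int users;    // guarded by userLock
    private int queries;  // guarded by queryLock

    public void addUser()  { synchronized (userLock)  { users++;   } }
    public void addQuery() { synchronized (queryLock) { queries++; } }

    public int users()   { synchronized (userLock)  { return users;   } }
    public int queries() { synchronized (queryLock) { return queries; } }

    // With a single lock (e.g. plain synchronized methods), addUser() and
    // addQuery() would contend even though they touch disjoint state.
}
```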
• Extends the lock splitting idea, but works on
partitions of variable-sized data
• Classic example: ConcurrentHashMap - 16 buckets
with respective locks rather than one “global” lock
• Effectiveness depends on the number of available processors and the
likelihood that they’ll end up locking the same partition
(e.g. with non-uniformly distributed data)
• To some extent also a trade-off between memory
and performance (e.g. do you really need 16 buckets per
ConcurrentHashMap? They’re not that cheap!)
See also: http://ria101.wordpress.com/2011/12/12/concurrenthashmap-avoid-a-common-misuse/
and of course “Java Concurrency in Practice”, Chapter 11
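A hand-rolled sketch of the striping idea (a hypothetical class, not ConcurrentHashMap’s actual implementation):

```java
// Lock striping: partition the state into N stripes, each guarded by its
// own lock; a key is mapped to a stripe via its hash. This is the idea
// behind the original ConcurrentHashMap's 16 segments.
public class StripedCounter {
    private static final int N_STRIPES = 16;
    private final Object[] locks  = new Object[N_STRIPES];
    private final long[]   counts = new long[N_STRIPES];

    public StripedCounter() {
        for (int i = 0; i < N_STRIPES; i++) locks[i] = new Object();
    }

    private int stripeFor(Object key) {
        // mask off the sign bit so the index is always non-negative
        return (key.hashCode() & 0x7fffffff) % N_STRIPES;
    }

    public void increment(Object key) {
        int s = stripeFor(key);
        synchronized (locks[s]) { counts[s]++; }
    }

    public long total() {
        long sum = 0;
        // a consistent total requires *all* stripe locks - exactly why
        // whole-map operations like size() were expensive on the old
        // segmented ConcurrentHashMap
        for (int i = 0; i < N_STRIPES; i++) {
            synchronized (locks[i]) { sum += counts[i]; }
        }
        return sum;
    }
}
```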
Layers of synchronisation
• High-level concurrency abstractions
• Low-level locking (synchronized() blocks and util.concurrent.locks)
• Low-level primitives (volatile variables, util.concurrent.atomic classes)
• Data races: deliberate undersynchronisation (Avoid!)
Shamelessly taken from: Jeremy Manson, “Advanced Topics in Programming Languages: The Java Memory Model”
Let’s take a step back for a moment ..
two distinct issues
• Thread-interference or atomicity
• Visibility, ordering and memory consistency
(i.e. what volatile is about)
Quantum concurrency and
Schrödinger’s memory tricks:
The thread we’ll use to observe the value of
the counter has an effect on the observation!
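The two issues can be shown side by side in one deliberately broken sketch (hypothetical class; both fields are left unsynchronised on purpose):

```java
// Two *distinct* concurrency problems in one class:
public class TwoProblems {
    private int counter = 0;        // problem 1: counter++ is not atomic
    private boolean ready = false;  // problem 2: without volatile, another
                                    // thread may never see this become true

    public void increment() {
        counter++; // read-modify-write: two racing threads can lose updates
    }

    public int getCount() {
        return counter;
    }

    public void finish() {
        ready = true;
    }

    public void awaitReady() {
        while (!ready) {
            // may spin forever: nothing forces the write to become
            // visible, and the JIT may hoist the read out of the loop
        }
    }
}
// The fixes are different, too: synchronized or AtomicInteger for
// atomicity, volatile (or a lock) for visibility and ordering.
```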
Why is this code broken?
• Double-checked locking and concurrent collections,
so what’s the problem then? (don’t argue about whether or not caches
should preload everything up-front - fair point, but that’s not the issue here)
Can you see it now?
• Semantically speaking, this is exactly the same code
• The compiler, the JVM, the operating system & even the CPU conspire behind your back
against you in the Extraordinary League of Ordinary Things That Will Mess You Up!
Most likely, they’re sinister enough to wait until you deploy to production before they show their true colours!
See also: Most/many double-checked locking implementations around Singletons
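The slides’ code isn’t reproduced here, but the classic broken double-checked locking idiom has this shape (sketch):

```java
// The classic *broken* double-checked locking idiom:
public class LazySingleton {
    private static LazySingleton instance; // BROKEN without volatile

    public static LazySingleton getInstance() {
        if (instance == null) {                    // 1st check, no lock
            synchronized (LazySingleton.class) {
                if (instance == null) {            // 2nd check, locked
                    // The write below may be reordered with the writes
                    // inside the constructor: another thread can observe
                    // a non-null reference to a partially constructed
                    // object.
                    instance = new LazySingleton();
                }
            }
        }
        return instance;
    }
}
// Under the Java 5+ memory model, declaring the field as
// `private static volatile LazySingleton instance;` fixes the idiom:
// the volatile write happens before every subsequent volatile read.
```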
• Monitor lock rule. An unlock on a monitor lock happens before every subsequent lock on
that same monitor lock.
• Volatile variable rule. A write to a volatile field happens before every subsequent read of
that same field.
• Transitivity. If A happens before B, and B happens before C, then A happens before C.
Shamelessly taken from: Brian Goetz, “Java theory and practice: Fixing the Java Memory Model, Part 2”
Volatile piggybacking (1)
Shamelessly taken from Viktor Klang’s Github: https://gist.github.com/viktorklang/2362563
• With high-level concurrency frameworks, you may
not have to worry about these issues (note: plain, vanilla thread
pools are not high level enough - very fragile technique anyway)
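A minimal sketch of the piggybacking idea itself (my own example, not Viktor Klang’s gist):

```java
// Volatile piggybacking: a single volatile write/read pair "publishes"
// all plain writes that happened before it, via the volatile rule plus
// transitivity from the happens-before list above.
public class Publisher {
    private int a, b;                // plain fields
    private volatile boolean ready;  // the "carrier" volatile

    public void publish() {
        a = 1;
        b = 2;
        ready = true;  // volatile write: everything above happens-before it
    }

    public int consume() {
        if (ready) {      // volatile read: sees the volatile write above...
            return a + b; // ...and therefore also a = 1 and b = 2
        }
        return -1; // not published yet
    }
}
```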
Volatile piggybacking (2)
• Repetitive exercise, I know, but why can’t we rely on
thread pools for memory consistency? They do have
locks internally! (I promise, you’ll understand concurrent code a lot better if you
think this through!)
Think about happens-before
relationships with regard to
locks and multiple workers
(i.e. what are the release/
acquire pairs for your workers?)
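One way to think it through in code (a sketch; the guarantees cited in the comments are the memory-consistency properties documented for java.util.concurrent):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolVisibility {
    static int shared = 0; // plain field on purpose: no volatile, no lock

    public static int viaFuture() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Guaranteed: actions before submit() happen-before the task
        // runs, and the task's actions happen-before Future.get() returns.
        Future<?> f = pool.submit(() -> shared = 42);
        f.get();
        int seen = shared; // guaranteed to see 42 *in this thread*
        pool.shutdown();
        return seen;
    }

    // What is NOT guaranteed: a happens-before edge between two sibling
    // tasks on different workers. The pool's internal locks are released
    // and acquired for the pool's own bookkeeping - they are not a
    // release/acquire pair between *your* two tasks, so a second task
    // reading `shared` without synchronisation may see a stale value.
}
```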
A word on immutability (1)
• Java Memory Model treats final fields / val fields
specially (value must be assigned before the constructor returns and cannot be re-assigned)
Actors and the Java Memory Model. In most cases messages
are immutable, but if that message is not a properly constructed
immutable object, without a "happens before" rule, it would be
possible for the receiver to see partially initialized data structures
and possibly even values out of thin air (longs/doubles).
A word on immutability (2)
• Its state cannot be modified after construction (i.e. no
getters that return mutable objects, nothing passed to the constructor references mutable objects
held by this one, etc.)
• All fields are declared as final / val *
• It is properly constructed (i.e. the this reference doesn’t escape during construction)
Thus, a precisely defined notion of immutability
* Yes, java.lang.String is not immutable according to that definition.
Hash codes are cached and there actually is a data race in hashCode(), but
it’s a benign one. So for all intents and purposes, java.lang.String can still be
considered an immutable class.
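A class that satisfies the strict definition might look like this (hypothetical example):

```java
// Immutable by the strict definition above:
// - state cannot be modified after construction
// - all fields are final
// - properly constructed: `this` does not escape the constructor
public final class Money {
    private final String currency; // final fields are safely published
    private final long cents;      // by the Java Memory Model

    public Money(String currency, long cents) {
        this.currency = currency;
        this.cents = cents;
        // `this` is not passed anywhere from within the constructor
    }

    public Money plus(Money other) {
        // "mutation" returns a new instance instead of modifying this one
        return new Money(currency, cents + other.cents);
    }

    public String currency() { return currency; }
    public long cents()      { return cents; }
}
```

Instances like these can be sent around as actor messages without any extra synchronisation, which is exactly what the Akka quote above relies on.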
Tasks and thread pools
• Heterogeneous tasks are annoying when you aim for
utilisation (a bit theoretical, though, as it presumably averages out .. but ..)
• Dependent tasks cause even more issues (possibly even
deadlocks with a bounded thread pool)
[Diagram: Task A and Task B, where Task B takes 10x as long as Task A, executed sequentially vs. in parallel]
Result: A whopping 9% speedup! (sequential: 1 + 10 = 11 time units, parallel: 10 time
units .. and we still need to deduct something for concurrency overhead ..)
Configuring thread pools (1)
• newFixedThreadPool: pool size bounded - n, queue unbounded
• newSingleThreadExecutor: pool size bounded - 1, queue unbounded
• newCachedThreadPool: pool size unbounded, queue SynchronousQueue
(no such queue really exists, but we’ll just think that way)
• alternative invocation of ThreadPoolExecutor: pool size bounded - n,
queue bounded - m, m > n
• alternative invocation of ThreadPoolExecutor: pool size bounded - n
The same implementation can exhibit radically different
behaviour depending on how you instantiate it.
Note: SynchronousQueues are not just LinkedBlockingQueues with capacity 1.
They’re more like rendezvous-channels in CSP.
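The rows above map onto constructor arguments roughly like this (a sketch; the factory methods shown are the real java.util.concurrent ones, which are just presets over ThreadPoolExecutor):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolConfigs {
    // "alternative invocation": same ThreadPoolExecutor implementation,
    // radically different behaviour depending on the arguments
    static ThreadPoolExecutor boundedPoolBoundedQueue(int n, int m) {
        return new ThreadPoolExecutor(
                n, n,                          // pool size: bounded - n
                0L, TimeUnit.MILLISECONDS,     // keep-alive (irrelevant here)
                new LinkedBlockingQueue<>(m)); // queue: bounded - m, m > n
    }

    public static void main(String[] args) {
        ExecutorService fixed  = Executors.newFixedThreadPool(4);     // pool bounded - 4, queue unbounded
        ExecutorService single = Executors.newSingleThreadExecutor(); // pool bounded - 1, queue unbounded
        ExecutorService cached = Executors.newCachedThreadPool();     // pool unbounded, SynchronousQueue

        ThreadPoolExecutor custom = boundedPoolBoundedQueue(4, 64);
        System.out.println(custom.getQueue().remainingCapacity()); // prints 64

        fixed.shutdown(); single.shutdown(); cached.shutdown(); custom.shutdown();
    }
}
```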
Configuring thread pools (2)
• Client-run saturation policy means that overload
pushes tasks outward from the thread
pool (no more accepts, TCP might drop connections, etc. .. which ultimately enables
clients to handle degradation as well - e.g. via load balancing)
• For example, asynchronous loggers that don’t break
down when sh** hits the fan!
See also: Brian Goetz, “Java Concurrency in Practice”, Chapter 8.3
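A sketch of the client-runs idea using ThreadPoolExecutor’s built-in CallerRunsPolicy (pool and queue sizes are hypothetical):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ClientRunsExample {
    static final AtomicInteger processed = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(8),
                // when both workers are busy and all 8 queue slots are
                // taken, execute() runs the task in the *calling* thread
                // instead of rejecting it
                new ThreadPoolExecutor.CallerRunsPolicy());

        for (int i = 0; i < 100; i++) {
            pool.execute(processed::incrementAndGet);
            // while the caller is busy running a task itself, it cannot
            // submit new work - overload throttles the producer
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(processed.get()); // prints 100
    }
}
```

This is the shape of the asynchronous logger idea: under overload the logging threads slow down instead of the logger falling over.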
Visualising task queues
• Predefined tasks (nudge nudge, actors) will be used to process
different data (I don’t even need to “nudge nudge” here ..)
[Diagram: a single shared queue of payload/task pairs - Payload 1/Task A, Payload 2/Task B,
Payload 3/Task A, Payload 4/Task C, Payload 5/Task A, Payload 6/Task B, Payload 7/Task A,
Payload 8/Task C - consumed by Thread 1, Thread 2, Thread 3, Thread 4]
Spot the issue in this model!
Hint: Think “contention” ..
What could the solution look like?
Maintain the invariant that we’re only
allowed to process a message once
and only once!
Hint: It’s not non-blocking locking!
Organising task queues
• Make a distinction between tasks and data and do
some sensible partitioning
[Diagram: one queue per task, consumed by Thread 1, Thread 2, Thread 3, Thread 4]
• Tasks now have message queues .. I mean .. mailboxes!
• n tasks with a queue each means 1/n load
per queue (this scales if you add new kinds of
tasks; if you just add more messages, it doesn’t -
but hold on to that thought!)
• Tasks can still be executed in parallel (i.e.
you don’t get away yet without synchronisation)
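A naive sketch of such a per-task queue - essentially a poor man’s actor mailbox (my own illustration, not Akka’s actual implementation):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// One queue per "task", drained on a shared thread pool. The scheduled
// flag enforces the invariant from the previous slide: each message is
// processed once and only once, by at most one thread at a time.
public class Mailbox<M> {
    private final ConcurrentLinkedQueue<M> queue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean(false);
    private final ExecutorService pool;
    private final Consumer<M> handler;

    public Mailbox(ExecutorService pool, Consumer<M> handler) {
        this.pool = pool;
        this.handler = handler;
    }

    public void send(M message) {
        queue.offer(message);
        trySchedule();
    }

    private void trySchedule() {
        // the CAS guarantees that at most one drain task is in flight,
        // so the handler effectively runs single-threaded
        if (scheduled.compareAndSet(false, true)) {
            pool.execute(this::drain);
        }
    }

    private void drain() {
        M m;
        while ((m = queue.poll()) != null) {
            handler.accept(m);
        }
        scheduled.set(false);
        // re-check: a message may have arrived after poll() returned
        // null but before we cleared the flag
        if (!queue.isEmpty()) trySchedule();
    }
}
```

The handler itself needs no locks, because the scheduling invariant (never two drains at once) removes the contention by construction.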
Does it really make a difference? (1)
• Comparison is silly and totally crazy, but it’s a bit like
the difference between these two pieces of code
(obviously neither is recommended ..)
• Apart from reduced contention, you’re also exploiting all
kinds of locality (cache friendliness, GC friendliness - new objects don’t
span multiple threads, and so on and so forth)
In case you haven’t had enough background literature yet: http://gee.cs.oswego.edu/dl/papers/fj.pdf
Does it really make a difference? (2)
• In case you’re still not believing me, here’s a proof by
“Pics or it didn’t happen!”
• ForkJoin Pools organise tasks similarly, hence the comparison
Shamelessly taken from: “Scalability of ForkJoin Pool”
One more thing!
• The missing commandment. Thou shalt not schedule
two tasks at the same time if they both need the same locks!
• How would the scheduler know? Well, here’s an
educated guess: If two tasks are the same task, they
will most likely also need the same locks!
• Executing an actor only once at a time is therefore
also about performance (yes, it does make the actor easier to reason
about as well .. but we wouldn’t want to appear lame ..)
• Conversely, if you write different actors, make sure
that they don’t use the same locks (not sure if this is a best-practice
in Akka, but it’s certainly true in Erlang)
• The devil’s in the detail and unfortunately some
knowledge of these details is required to design
scalable concurrent applications
• In particular, understanding the underlying issues will
hopefully help you with designing scalable Akka
applications (e.g. applying what you’ve heard, what can you do about too many
messages being queued up?)
• Concurrency is hard, yes, but isn’t that the beauty of it?
Not at all, but never mind!