HIGH PERFORMANCE SYSTEMS
WITHOUT TEARS
ZAHARI DICHEV
SCALA DAYS BERLIN 2018
EXCESSIVE OBJECT INSTANTIATION
THINGS TO KEEP IN MIND
▸ Actual object instantiation costs time
▸ Creates a lot more work for the Garbage Collector
▸ May introduce systemic pauses in your application
▸ Introduces non-determinism
THINGS TO KEEP IN MIND
▸ A fair number of systems implement GC-free data
structures living entirely off-heap
▸ Some commercial solutions provide pause-free
GC implementations (Azul Zing)
▸ There are even proposals to introduce a no-op
GC in Java (Epsilon GC, JEP 318)
EXTRACTOR OBJECTS
EXTRACTORS IN A NUTSHELL
▸ An object with an unapply method
▸ Takes an object and tries to give back its
components
▸ Useful in pattern matching, partial functions, etc
▸ A great tool for achieving brevity and
expressiveness
EXTRACTING BLACK ROOKS
case class Rook(x: Int,
                y: Int,
                isBlack: Boolean)
▸ We simply need to go through a board of rooks
▸ Match on all rooks that are black
▸ Extract the x and y coordinates
EXTRACTING BLACK ROOKS
object BlackRook {
  def unapply(rook: Rook): Option[(Int, Int)] =
    if (rook.isBlack) Some((rook.x, rook.y)) else None
}
EXTRACTING BLACK ROOKS
public unapply(Lextractors/Rook;)Lscala/Option;
L0
LINENUMBER 5 L0
ALOAD 1
INVOKEVIRTUAL extractors/Rook.isBlack ()Z
IFEQ L1
NEW scala/Some
DUP
NEW scala/Tuple2$mcII$sp
DUP
. . .
object BlackRook {
  def unapply(rook: Rook): Option[(Int, Int)] =
    if (rook.isBlack) Some((rook.x, rook.y)) else None
}
ALLOCATION-FREE EXTRACTORS
▸ Name-based extractors, introduced in Scala 2.11
▸ Returning Option is no longer needed
▸ We need to return an object defining two methods
def isEmpty: Boolean = . . .
def get: T = . . .
ALLOCATION-FREE EXAMPLE
object BlackRookNameBased {
  class Extractor[T <: AnyRef](val extraction: T) extends AnyVal {
    def isEmpty: Boolean = extraction eq null
    def get: T = extraction
  }

  def unapply(rook: Rook): Extractor[(Int, Int)] =
    if (rook.isBlack)
      new Extractor((rook.x, rook.y))
    else
      new Extractor(null)
}
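To see the name-based extractor in action, here is a small self-contained sketch (repeating the slide's `Rook` and extractor so it compiles on its own): the match site looks identical to an Option-based extractor, but no `Some`/`Option` is allocated on the match path — only the tuple itself.

```scala
// Self-contained sketch combining the slide's Rook and a name-based extractor.
case class Rook(x: Int, y: Int, isBlack: Boolean)

object BlackRook {
  // Value class exposing isEmpty/get, which the pattern matcher looks for.
  class Extractor[T <: AnyRef](val extraction: T) extends AnyVal {
    def isEmpty: Boolean = extraction eq null
    def get: T = extraction
  }

  def unapply(rook: Rook): Extractor[(Int, Int)] =
    if (rook.isBlack) new Extractor((rook.x, rook.y))
    else new Extractor(null)
}

val board = List(Rook(0, 0, isBlack = true), Rook(3, 5, isBlack = false))

// Used exactly like an Option-based extractor:
val blackCoords = board.collect { case BlackRook(x, y) => (x, y) }
// blackCoords == List((0, 0))
```

Note the tuple `(rook.x, rook.y)` is still allocated (as the bytecode on the next slide shows); only the `Option`/`Some` wrapper disappears.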
ALLOCATION-FREE EXAMPLE (BYTECODE)
public unapply(Lextractors/Rook;)Lscala/Tuple2;
L0
ALOAD 1
INVOKEVIRTUAL extractors/Rook.isBlack ()Z
IFEQ L1
L2
NEW scala/Tuple2$mcII$sp
DUP
ALOAD 1
INVOKEVIRTUAL extractors/Rook.x ()I
ALOAD 1
INVOKEVIRTUAL extractors/Rook.y ()I
INVOKESPECIAL scala/Tuple2$mcII$sp.<init> (II)V
GOTO L3
L1
. . .
EXECUTION TIME COMPARISON
[Bar chart: execution time in ms (0-400) against board size (512, 1024, 2048, 4096) for the name-based and default extractors; the name-based extractor is faster at every size.]
GC STATS
▸ -XX:+PrintGCApplicationStoppedTime

Enables printing of how long the application was stopped
(for example, during a GC pause).
▸ -XX:+PrintGCApplicationConcurrentTime

Enables printing of the time elapsed since the last pause
(for example, a GC pause).
▸ -XX:+PrintGCTimeStamps

Enables printing of a timestamp at every GC.
HEAP STATS (DEFAULT)
num #instances #bytes class name
----------------------------------------------
1: 262144 376704 scala.Tuple2$mcII$sp
2: 262144 6291456 extractors.Rook
3: 197796 3164736 java.lang.Integer
4: 11931 286344 java.lang.String
5: 17592 281472 scala.Some
HEAP STATS (NAME BASED)
num #instances #bytes class name
----------------------------------------------
1: 262144 376704 scala.Tuple2$mcII$sp
2: 262144 6291456 extractors.Rook
3: 197796 3164736 java.lang.Integer
4: 11949 286776 java.lang.String
5: 3108 174048 jdk.internal.org.objectweb.asm.Item
ALL MEMORY IS NOT CREATED EQUAL
HOW LATENCY IS MASKED
▸ Modern CPUs have a multitude of caches
▸ These caches vary in size and latency
▸ Main purpose is to mask latency and ensure our
CPU’s progress is not severely hindered by main
memory latency
▸ A cache miss is one of the most prominent
performance killers
HIERARCHY OF CACHES
▸ L1 cache - core-local cache split into separate 32K
data and 32K instruction caches (~1.5 ns)
▸ L2 - core-local cache, 256K in size; contains both
data and instructions (~5 ns)
▸ L3 - typically 6 MB, shared between cores (16-25 ns)
▸ RAM - large in size (~60 ns)
MEMORY HIERARCHY EXAMPLE
[Diagram: two cores, each with private L1 and L2 caches, sharing an L3 cache and main memory]
MATRIX TRANSPOSITION
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
MATRIX TRANSPOSITION
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16
TRANSPOSED
MATRIX TRANSPOSITION
def transpose: Matrix = {
  val newMatrix = Matrix.empty(rows)
  for (i <- 0 until rows)
    for (j <- 0 until cols)
      newMatrix.data.update(j + i * rows, this.data(i + j * cols))
  newMatrix
}
MATRIX TRANSPOSITION
▸ We have cache and memory
▸ Latency to memory is significantly higher
▸ We load data in cache lines of size 32 bytes (4 longs)
▸ Cache size is one line
def transpose: Matrix = {
  val newMatrix = Matrix.empty(rows)
  for (i <- 0 until rows)
    for (j <- 0 until cols)
      newMatrix.data.update(j + i * rows, this.data(i + j * cols))
  newMatrix
}
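The naive loop above streams one side of the copy against the cache. A common fix is a blocked ("tiled") transpose — the following is an illustrative hand-rolled sketch (names and the flat-array representation are assumptions, not taken from the talk's code):

```scala
// Blocked transpose of an n x n matrix stored in a flat row-major array.
// Each blockSize x blockSize tile is read and written while it is still hot
// in cache, so neither side of the copy keeps evicting the other.
def transposeBlocked(src: Array[Long], dst: Array[Long],
                     n: Int, blockSize: Int = 32): Unit = {
  var ib = 0
  while (ib < n) {
    var jb = 0
    while (jb < n) {
      val iMax = math.min(ib + blockSize, n)
      val jMax = math.min(jb + blockSize, n)
      var i = ib
      while (i < iMax) {
        var j = jb
        while (j < jMax) {
          dst(j * n + i) = src(i * n + j) // element (i, j) -> (j, i)
          j += 1
        }
        i += 1
      }
      jb += blockSize
    }
    ib += blockSize
  }
}
```

With 8-byte elements and 64-byte cache lines, a 32x32 tile of longs keeps both the source rows and destination columns resident while the tile is processed.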
DATA ACCESSES
1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16
TRANSPOSED
BETTER ACCESS PATTERN
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16
TRANSPOSED
1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16
EXECUTION TIME COMPARISON
[Bar chart: execution time in ms (0-7000) against matrix side (2048, 4096, 8192, 16384) for the cache-friendly and naive transpose implementations.]
TOOLS TO MONITOR CACHE STATS
▸ Intel VTune

https://software.intel.com/en-us/intel-vtune-amplifier-xe
▸ Likwid

https://github.com/RRZE-HPC/likwid
▸ Perf Stat

https://perf.wiki.kernel.org
▸ Intel's PCM

https://github.com/opcm/pcm
INTEL OPEN PCM
Core (SKT) | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | TEMP
0 0 8184 K 1993 K 0.30 0.43 0.01 0.01 35
1 0 1448 K 1982 K 0.27 0.24 0.01 0.01 35
2 0 4550 K 7100 K 0.36 0.46 0.00 0.00 33
3 0 1191 K 1658 K 0.28 0.34 0.00 0.01 33
4 0 9316 K 12 M 0.24 0.12 0.01 0.01 30
5 0 1042 K 1451 K 0.28 0.26 0.01 0.01 30
6 0 6700 K 9294 K 0.28 0.37 0.00 0.01 25
7 0 1013 K 1527 K 0.34 0.27 0.01 0.01 25
——————————————————————————————————————————————————————————————————————————————————————
L2 CACHE MISSES
[Bar chart: L2 cache misses in millions (0-600) against matrix side (2048, 4096, 8192, 16384) for the cache-friendly and naive transpose implementations.]
CACHE OBLIVIOUS ALGORITHMS
▸ A bit of a misleading name…
▸ An algorithm designed to take advantage of the
underlying memory hierarchy
▸ No need to know details of the cache (size, length
of the cache lines, etc.)
▸ Recursively reduces the problem size until it
fits in cache
SOME KNOWN APPROACHES
▸ Matrix multiplication - Strassen algorithm
▸ Matrix transposition - Frigo’s transpose
▸ Tree traversal - van Emde Boas layout
▸ Hashing - blocked probing
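A Frigo-style transpose from the list above can be sketched as follows (an illustrative recursive implementation, not code from the talk): keep halving the longer dimension until the submatrix is tiny, at which point the tile is cache-resident without the algorithm ever knowing the cache size.

```scala
// Cache-oblivious transpose sketch: src is a flat row-major n x n matrix,
// dst receives the transpose of the submatrix rows [r0, r1) x cols [c0, c1).
// No cache parameters appear anywhere; small base-case tiles simply
// end up fitting in whatever caches the machine has.
def transposeRec(src: Array[Int], dst: Array[Int], n: Int,
                 r0: Int, r1: Int, c0: Int, c1: Int): Unit =
  if (r1 - r0 <= 8 && c1 - c0 <= 8) {
    // Base case: copy a small tile directly.
    var i = r0
    while (i < r1) {
      var j = c0
      while (j < c1) { dst(j * n + i) = src(i * n + j); j += 1 }
      i += 1
    }
  } else if (r1 - r0 >= c1 - c0) {
    val mid = (r0 + r1) / 2 // split along the row dimension
    transposeRec(src, dst, n, r0, mid, c0, c1)
    transposeRec(src, dst, n, mid, r1, c0, c1)
  } else {
    val mid = (c0 + c1) / 2 // split along the column dimension
    transposeRec(src, dst, n, r0, r1, c0, mid)
    transposeRec(src, dst, n, r0, r1, mid, c1)
  }
```

Calling `transposeRec(src, dst, n, 0, n, 0, n)` transposes the whole matrix.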
SYNCHRONISATION HAS ITS PRICE
EXAMPLE FROM A DAY IN THE LIFE
▸ An event log
▸ Writer - writes events to the log in a linear fashion
▸ Transformer - concurrently tails the log and
transforms the events given some predefined
function
▸ Transformer is never ahead of the writer
A SIMPLE EVENT LOG
trait EventLog[T] {
  def writeNext(ev: T): Boolean
  def transformNext(f: T => T): Boolean
}
A SIMPLE EVENT LOG
var writerPos = 0L
var transfPos = 0L
val log = new Array[Int](logSize)
A SIMPLE EVENT LOG
def writeNext(ev: Int): Boolean = synchronized {
  if (writerPos < transfPos) {
    false
  } else {
    log(writerPos.toInt) = ev
    writerPos += 1
    true
  }
}
A SIMPLE EVENT LOG
def transformNext(f: Int => Int): Boolean = synchronized {
  if (transfPos >= writerPos) {
    false
  } else {
    val currentEvent = log(transfPos.toInt)
    log(transfPos.toInt) = f(currentEvent)
    transfPos += 1
    true
  }
}
DELETING ALL SENSITIVE DATA
log.transformNext(_ => 0)
READY FOR GDPR
SYNCHRONIZED STATS
Log Type       L2 Miss (M)   L3 Miss (M)   IPC    Ops/s (M)
Synchronized   1084          137           0.29   13.2
Lock-free      357           156           0.42   17.2
Lazy set       304           73            0.8    42.6
Padded         211           50            1.4    76.5
LOCK-FREE IMPLEMENTATION
@volatile var writerPos = 0L
@volatile var transfPos = 0L
ENSURING MEMORY VISIBILITY
▸ Introduces a happens-before relationship
▸ All changes prior to that have happened and are
visible to other threads
▸ Does not mean that values are read from main
memory
▸ Writes are applied to the L1 cache and flow
through the cache subsystem
LOCK-FREE STATS
Log Type       L2 Miss (M)   L3 Miss (M)   IPC    Ops/s (M)
Synchronized   1084          137           0.29   13.2
Lock-free      357           156           0.42   17.2
Lazy set       304           73            0.8    42.6
Padded         211           50            1.4    76.5
GOING ATOMIC
val writerPos = new AtomicLong(0L)
val transfPos = new AtomicLong(0L)
val log = new Array[Int](logSize)

def writeNext(ev: Int): Boolean = {
  val currentWriterPos = writerPos.get
  if (currentWriterPos < transfPos.get) {
    false
  } else {
    log(currentWriterPos.toInt) = ev
    writerPos.lazySet(currentWriterPos + 1)
    true
  }
}
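The slide only shows the writer side. For completeness, here is a self-contained sketch of the whole lazy-set log, including a matching consumer (the class and method shapes are assumptions in the spirit of the slide, not the talk's actual code): reads use `get` (a volatile read), and the position advances use `lazySet`, an ordered store that avoids the full fence of a volatile write.

```scala
import java.util.concurrent.atomic.AtomicLong

// Minimal single-writer / single-transformer event log using lazySet
// on both positions.
class LazyEventLog(logSize: Int) {
  private val writerPos = new AtomicLong(0L)
  private val transfPos = new AtomicLong(0L)
  private val log = new Array[Int](logSize)

  def writeNext(ev: Int): Boolean = {
    val currentWriterPos = writerPos.get
    if (currentWriterPos < transfPos.get) false
    else {
      log(currentWriterPos.toInt) = ev
      writerPos.lazySet(currentWriterPos + 1) // ordered store, no full fence
      true
    }
  }

  def transformNext(f: Int => Int): Boolean = {
    val current = transfPos.get
    if (current >= writerPos.get) false // never run ahead of the writer
    else {
      log(current.toInt) = f(log(current.toInt))
      transfPos.lazySet(current + 1)
      true
    }
  }

  def read(i: Int): Int = log(i)
}
```

Because the writer only advances `writerPos` and the transformer only advances `transfPos`, each position has a single mutator, which is what makes the ordered (rather than fully fenced) stores safe here.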
LAZY SET STATS
Log Type       L2 Miss (M)   L3 Miss (M)   IPC    Ops/s (M)
Synchronized   1084          137           0.29   13.2
Lock-free      357           156           0.42   17.2
Lazy set       304           73            0.8    42.6
Padded         211           50            1.4    76.5
FALSE SHARING
▸ Two threads modifying independent variables that
share the same cache line
▸ Often depends on the layout of your objects
▸ Causes invalidation of cache lines and increased
coherency-protocol traffic
▸ The cache line ping-pongs through L3, which has
significant latency implications
▸ Can be even worse when the threads are on
different sockets (crossing interconnects)
ENSURING COHERENCE (MESI)
[State diagram: the MODIFIED, EXCLUSIVE, SHARED, and INVALID states, with transitions driven by processor reads/writes and observed bus reads/writes]
PR - processor read
PW - processor write
BR - observed bus read
BW - observed bus write
S - shared
~S - not shared
FALSE SHARING
@volatile var writerPos = 0L
@volatile var transfPos = 0L
[Diagram: writerPos and transfPos landing on the same 64-byte cache line (Cache Line 1 of N)]
▸ Inspecting Java Object Layout

https://github.com/ktoso/sbt-jol
PADDING TO AVOID FALSE SHARING
val writerPos = AtomicLong.withPadding(0, LeftRight128)
val transfPos = AtomicLong.withPadding(0, LeftRight128)
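`AtomicLong.withPadding` above comes from a library with padded atomic variants. A hand-rolled alternative pads with dummy fields — but note this is only an illustrative sketch (all names here are invented): the JVM is free to reorder and pack fields, so the resulting layout must be verified with a tool such as sbt-jol.

```scala
// Hand-rolled padding sketch (layout NOT guaranteed by the JVM):
// the dummy longs aim to push writerPos and transfPos onto different
// 64-byte cache lines, so the writer and transformer threads stop
// invalidating each other's line.
class PaddedPositions {
  var p00, p01, p02, p03, p04, p05, p06 = 0L // padding before writerPos
  @volatile var writerPos = 0L
  var p10, p11, p12, p13, p14, p15, p16 = 0L // padding between the two
  @volatile var transfPos = 0L
  var p20, p21, p22, p23, p24, p25, p26 = 0L // padding after transfPos
}
```

Seven longs on each side cover a 64-byte line regardless of where the hot field lands within it; checking the actual offsets with sbt-jol is still essential.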
PADDED STATS
Log Type       L2 Miss (M)   L3 Miss (M)   IPC    Ops/s (M)
Synchronized   1084          137           0.29   13.2
Lock-free      357           156           0.42   17.2
Lazy set       304           73            0.8    42.6
Padded         211           50            1.4    76.5
USE CASE FROM REAL LIFE
AKKA MESSAGE LIFECYCLE
[Diagram: the sender sends a message through an ActorRef; the dispatcher enqueues it in the actor's mailbox and schedules the mailbox to run on the executor service's threads (T1 … Tn); the actor then processes the message]
TYPES OF AKKA DISPATCHERS
▸ Default Dispatcher - used if no other is specified
▸ Pinned Dispatcher - one thread per actor
▸ Calling Thread Dispatcher - for tests only
TYPES OF EXECUTORS
▸ ForkJoinPool - relies on lock-free work-stealing
queues
▸ ThreadPoolExecutor - uses a linked blocking queue
to distribute tasks
THREAD POOL EXECUTOR
[Diagram: an external component submits tasks (R1 … Rn) to a single shared queue; threads T1 … Tn dequeue and execute them]
LIMITATION: NO ACTOR-TO-THREAD AFFINITY
▸ Potentially causes CPU cache invalidation
▸ Lacks parameters for achieving fine-grained
control
AFFINITY POOL
[Diagram: an external component submits a task; a queue selector picks which per-thread queue (one per thread T1 … Tn) to submit it to; each thread dequeues tasks only from its own queue]
FAIR DISTRIBUTION QUEUE SELECTOR
▸ Adaptive work assignment strategy
▸ Few actors - explicit mapping (fairer)
▸ More actors - consistent hashing (cheaper)
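The "more actors" branch can be illustrated with a simple hash-modulo selection (an illustrative sketch, not Akka's actual implementation): hashing the actor's path keeps the mapping stable, so a given actor's messages always land on the same thread's queue.

```scala
import scala.util.hashing.MurmurHash3

// Stable actor-to-queue mapping: the same path always hashes to the
// same queue index, preserving thread affinity with no per-actor state.
def selectQueue(actorPath: String, numQueues: Int): Int =
  Math.floorMod(MurmurHash3.stringHash(actorPath), numQueues)
```

Because the mapping depends only on the path and the queue count, no bookkeeping is needed; the tradeoff is that load balancing across queues is only statistical.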
ADVANTAGES
▸ Fewer cache misses thanks to temporal locality
▸ Decreases contention
▸ Customisable queue selection
CLIENT ACTOR
class UserQueryActor(latch: CountDownLatch,
                     numQueries: Int,
                     numUsersInDB: Int) extends Actor {
  private var left = numQueries
  private val receivedUsers: mutable.Map[Int, User] = mutable.Map()
  private val randGenerator = new Random()

  override def receive: Receive = {
    case u: User =>
      receivedUsers.put(u.userId, u)
      if (left == 0) {
        latch.countDown()
        context stop self
      } else {
        sender() ! Request(randGenerator.nextInt(numUsersInDB))
      }
      left -= 1
  }
}
SERVICE ACTOR
class UserServiceActor(userDb: Map[Int, User],
                       latch: CountDownLatch,
                       numQueries: Int) extends Actor {
  private var left = numQueries

  def receive = {
    case Request(id) =>
      userDb.get(id) match {
        case Some(u) => sender() ! u
        case None    =>
      }
      if (left == 0) {
        latch.countDown()
        context stop self
      }
      left -= 1
  }
}
BENCHMARK RESULTS
[Bar chart: throughput in millions of msg/s (0-6) against the dispatcher.throughput setting (1, 5, 50) for the Affinity, Fork Join, and Fixed Size executors; measured values range from 1.4 to 5.4 M msg/s]
SO… IF YOU ARE ON A HOT CODEPATH
▸ Make sure you really are
▸ Measure everything (sbt-jmh, sbt-jol, perfstat …)
▸ Watch out for language features that can
introduce unintended allocations (e.g. pattern
matching)
▸ Use algorithms and data structures that are
cache-friendly
▸ Use efficient concurrency tools but try to not roll
your own: JCTools, Akka, Vert.x …
RESOURCES
▸ Name based extractors

https://hseeberger.wordpress.com/2013/10/04/name-based-extractors-in-scala-2-11/
▸ Cache oblivious algorithms (MIT OCW)

https://www.youtube.com/watch?v=CSqbjfCCLrU
▸ Lazy Set in detail

http://psy-lob-saw.blogspot.bg/2012/12/atomiclazyset-is-performance-win-for.html
▸ Processor Counter Monitor

https://github.com/opcm/pcm
▸ Memory Access Patterns

https://mechanical-sympathy.blogspot.bg/2012/08/memory-access-patterns-are-important.html
▸ False Sharing

https://mechanical-sympathy.blogspot.bg/2011/07/false-sharing.html

https://mechanical-sympathy.blogspot.bg/2013/02/cpu-cache-flushing-fallacy.html
▸ Slides and code

https://github.com/zaharidichev/scala-days-2018-berlin
