HIGH PERFORMANCE SYSTEMS
WITHOUT TEARS
ZAHARI DICHEV
SCALA DAYS BERLIN 2018
EXCESSIVE OBJECT INSTANTIATION
THINGS TO KEEP IN MIND
▸ Actual object instantiation costs time
▸ Creates a lot more work for the Garbage Collector
▸ May introduce systemic pauses in your application
▸ Introduces non-determinism
THINGS TO KEEP IN MIND
▸ A fair number of systems implement GC-free data
structures living entirely off-heap
▸ Some commercial solutions provide pause-free
GC implementations (Azul Zing)
▸ There are even proposals to introduce a no-op
GC in Java (Epsilon GC, JEP 318)
EXTRACTOR OBJECTS
EXTRACTORS IN A NUTSHELL
▸ An object with an unapply method
▸ Takes an object and tries to give back its
components
▸ Useful in pattern matching, partial functions, etc
▸ A great tool for achieving brevity and
expressiveness
EXTRACTING BLACK ROOKS
case class Rook(x: Int,
                y: Int,
                isBlack: Boolean)
▸ We simply need to go through a board of rooks
▸ Match on all rooks that are black
▸ Extract the x and y coordinates
EXTRACTING BLACK ROOKS
object BlackRook {
  def unapply(rook: Rook): Option[(Int, Int)] =
    if (rook.isBlack) Some((rook.x, rook.y)) else None
}
EXTRACTING BLACK ROOKS
public unapply(Lextractors/Rook;)Lscala/Option;
L0
LINENUMBER 5 L0
ALOAD 1
INVOKEVIRTUAL extractors/Rook.isBlack ()Z
IFEQ L1
NEW scala/Some
DUP
NEW scala/Tuple2$mcII$sp
DUP
. . .
object BlackRook {
  def unapply(rook: Rook): Option[(Int, Int)] =
    if (rook.isBlack) Some((rook.x, rook.y)) else None
}
ALLOCATION-FREE EXTRACTORS
▸ Name-based extractors, introduced in Scala 2.11
▸ Returning Option is no longer needed
▸ We need to return an object defining two methods
def isEmpty: Boolean = . . .
def get: T = . . .
ALLOCATION-FREE EXAMPLE
object BlackRookNameBased {
  class Extractor[T <: AnyRef](val extraction: T) extends AnyVal {
    def isEmpty: Boolean = extraction eq null
    def get: T = extraction
  }

  def unapply(rook: Rook): Extractor[(Int, Int)] =
    if (rook.isBlack)
      new Extractor((rook.x, rook.y))
    else
      new Extractor(null)
}
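To see the name-based extractor in action, here is a small self-contained sketch (repeating the slide's `Rook` and extractor so it compiles on its own): the match site looks identical to an Option-based extractor, but no `Some`/`Option` is allocated on the match path — only the tuple itself.

```scala
// Self-contained sketch combining the slide's Rook and a name-based extractor.
case class Rook(x: Int, y: Int, isBlack: Boolean)

object BlackRook {
  // Value class exposing isEmpty/get, which the pattern matcher looks for.
  class Extractor[T <: AnyRef](val extraction: T) extends AnyVal {
    def isEmpty: Boolean = extraction eq null
    def get: T = extraction
  }

  def unapply(rook: Rook): Extractor[(Int, Int)] =
    if (rook.isBlack) new Extractor((rook.x, rook.y))
    else new Extractor(null)
}

val board = List(Rook(0, 0, isBlack = true), Rook(3, 5, isBlack = false))

// Used exactly like an Option-based extractor:
val blackCoords = board.collect { case BlackRook(x, y) => (x, y) }
// blackCoords == List((0, 0))
```

Note the tuple `(rook.x, rook.y)` is still allocated (as the bytecode on the next slide shows); only the `Option`/`Some` wrapper disappears.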
ALLOCATION-FREE EXAMPLE (BYTECODE)
public unapply(Lextractors/Rook;)Lscala/Tuple2;
L0
ALOAD 1
INVOKEVIRTUAL extractors/Rook.isBlack ()Z
IFEQ L1
L2
NEW scala/Tuple2$mcII$sp
DUP
ALOAD 1
INVOKEVIRTUAL extractors/Rook.x ()I
ALOAD 1
INVOKEVIRTUAL extractors/Rook.y ()I
INVOKESPECIAL scala/Tuple2$mcII$sp.<init> (II)V
GOTO L3
L1
. . .
EXECUTION TIME COMPARISON
[Bar chart: execution time in ms (0-400) against board size (512, 1024, 2048, 4096) for the name-based and default extractors; the name-based extractor is faster at every size.]
GC STATS
▸ -XX:+PrintGCApplicationStoppedTime

Enables printing of how long the application was stopped
(for example, during a GC pause).
▸ -XX:+PrintGCApplicationConcurrentTime

Enables printing of the time elapsed since the last pause
(for example, a GC pause).
▸ -XX:+PrintGCTimeStamps

Enables printing of a timestamp at every GC.
HEAP STATS (DEFAULT)
num #instances #bytes class name
----------------------------------------------
1: 262144 376704 scala.Tuple2$mcII$sp
2: 262144 6291456 extractors.Rook
3: 197796 3164736 java.lang.Integer
4: 11931 286344 java.lang.String
5: 17592 281472 scala.Some
HEAP STATS (NAME BASED)
num #instances #bytes class name
----------------------------------------------
1: 262144 376704 scala.Tuple2$mcII$sp
2: 262144 6291456 extractors.Rook
3: 197796 3164736 java.lang.Integer
4: 11949 286776 java.lang.String
5: 3108 174048 jdk.internal.org.objectweb.asm.Item
ALL MEMORY IS NOT CREATED EQUAL
HOW LATENCY IS MASKED
▸ Modern CPUs have a multitude of caches
▸ These caches vary in size and latency
▸ Main purpose is to mask latency and ensure our
CPU’s progress is not severely hindered by main
memory latency
▸ A cache miss is one of the most prominent
performance killers
HIERARCHY OF CACHES
▸ L1 cache - core-local cache split into separate 32K
data and 32K instruction caches (~1.5 ns)
▸ L2 - core-local cache, 256K in size; contains both
data and instructions (~5 ns)
▸ L3 - typically 6 MB, shared between cores (16-25 ns)
▸ RAM - large in size (~60 ns)
MEMORY HIERARCHY EXAMPLE
[Diagram: two cores, each with private L1 and L2 caches, sharing an L3 cache and main memory]
MATRIX TRANSPOSITION
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
MATRIX TRANSPOSITION
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16
TRANSPOSED
MATRIX TRANSPOSITION
def transpose: Matrix = {
  val newMatrix = Matrix.empty(rows)
  for (i <- 0 until rows)
    for (j <- 0 until cols)
      newMatrix.data.update(j + i * rows, this.data(i + j * cols))
  newMatrix
}
MATRIX TRANSPOSITION
▸ We have cache and memory
▸ Latency to memory is significantly higher
▸ We load data in cache lines of size 32 bytes (4 longs)
▸ Cache size is one line
def transpose: Matrix = {
  val newMatrix = Matrix.empty(rows)
  for (i <- 0 until rows)
    for (j <- 0 until cols)
      newMatrix.data.update(j + i * rows, this.data(i + j * cols))
  newMatrix
}
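The naive loop above streams one side of the copy against the cache. A common fix is a blocked ("tiled") transpose — the following is an illustrative hand-rolled sketch (names and the flat-array representation are assumptions, not taken from the talk's code):

```scala
// Blocked transpose of an n x n matrix stored in a flat row-major array.
// Each blockSize x blockSize tile is read and written while it is still hot
// in cache, so neither side of the copy keeps evicting the other.
def transposeBlocked(src: Array[Long], dst: Array[Long],
                     n: Int, blockSize: Int = 32): Unit = {
  var ib = 0
  while (ib < n) {
    var jb = 0
    while (jb < n) {
      val iMax = math.min(ib + blockSize, n)
      val jMax = math.min(jb + blockSize, n)
      var i = ib
      while (i < iMax) {
        var j = jb
        while (j < jMax) {
          dst(j * n + i) = src(i * n + j) // element (i, j) -> (j, i)
          j += 1
        }
        i += 1
      }
      jb += blockSize
    }
    ib += blockSize
  }
}
```

With 8-byte elements and 64-byte cache lines, a 32x32 tile of longs keeps both the source rows and destination columns resident while the tile is processed.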
DATA ACCESSES
1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16
TRANSPOSED
BETTER ACCESS PATTERN
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16
TRANSPOSED
1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16
EXECUTION TIME COMPARISON
[Bar chart: execution time in ms (0-7000) against matrix side (2048, 4096, 8192, 16384) for the cache-friendly and naive transpose implementations.]
TOOLS TO MONITOR CACHE STATS
▸ Intel VTune

https://software.intel.com/en-us/intel-vtune-amplifier-xe
▸ Likwid

https://github.com/RRZE-HPC/likwid
▸ Perf Stat

https://perf.wiki.kernel.org
▸ Intel's PCM

https://github.com/opcm/pcm
INTEL OPEN PCM
Core (SKT) | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | TEMP
0 0 8184 K 1993 K 0.30 0.43 0.01 0.01 35
1 0 1448 K 1982 K 0.27 0.24 0.01 0.01 35
2 0 4550 K 7100 K 0.36 0.46 0.00 0.00 33
3 0 1191 K 1658 K 0.28 0.34 0.00 0.01 33
4 0 9316 K 12 M 0.24 0.12 0.01 0.01 30
5 0 1042 K 1451 K 0.28 0.26 0.01 0.01 30
6 0 6700 K 9294 K 0.28 0.37 0.00 0.01 25
7 0 1013 K 1527 K 0.34 0.27 0.01 0.01 25
——————————————————————————————————————————————————————————————————————————————————————
L2 CACHE MISSES
[Bar chart: L2 cache misses in millions (0-600) against matrix side (2048, 4096, 8192, 16384) for the cache-friendly and naive transpose implementations.]
CACHE OBLIVIOUS ALGORITHMS
▸ A bit of a misleading name…
▸ An algorithm designed to take advantage of the
underlying memory hierarchy
▸ No need to know details of the cache (size, length
of the cache lines, etc.)
▸ Recursively reduces the problem size until it
fits in cache
SOME KNOWN APPROACHES
▸ Matrix multiplication - Strassen algorithm
▸ Matrix transposition - Frigo’s transpose
▸ Tree traversal - van Emde Boas layout
▸ Hashing - blocked probing
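A Frigo-style transpose from the list above can be sketched as follows (an illustrative recursive implementation, not code from the talk): keep halving the longer dimension until the submatrix is tiny, at which point the tile is cache-resident without the algorithm ever knowing the cache size.

```scala
// Cache-oblivious transpose sketch: src is a flat row-major n x n matrix,
// dst receives the transpose of the submatrix rows [r0, r1) x cols [c0, c1).
// No cache parameters appear anywhere; small base-case tiles simply
// end up fitting in whatever caches the machine has.
def transposeRec(src: Array[Int], dst: Array[Int], n: Int,
                 r0: Int, r1: Int, c0: Int, c1: Int): Unit =
  if (r1 - r0 <= 8 && c1 - c0 <= 8) {
    // Base case: copy a small tile directly.
    var i = r0
    while (i < r1) {
      var j = c0
      while (j < c1) { dst(j * n + i) = src(i * n + j); j += 1 }
      i += 1
    }
  } else if (r1 - r0 >= c1 - c0) {
    val mid = (r0 + r1) / 2 // split along the row dimension
    transposeRec(src, dst, n, r0, mid, c0, c1)
    transposeRec(src, dst, n, mid, r1, c0, c1)
  } else {
    val mid = (c0 + c1) / 2 // split along the column dimension
    transposeRec(src, dst, n, r0, r1, c0, mid)
    transposeRec(src, dst, n, r0, r1, mid, c1)
  }
```

Calling `transposeRec(src, dst, n, 0, n, 0, n)` transposes the whole matrix.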
SYNCHRONISATION HAS ITS PRICE
EXAMPLE FROM A DAY IN THE LIFE
▸ An event log
▸ Writer - writes events to the log in a linear fashion
▸ Transformer - concurrently tails the log and
transforms the events given some predefined
function
▸ Transformer is never ahead of the writer
A SIMPLE EVENT LOG
trait EventLog[T] {
  def writeNext(ev: T): Boolean
  def transformNext(f: T => T): Boolean
}
A SIMPLE EVENT LOG
var writerPos = 0L
var transfPos = 0L
val log = new Array[Int](logSize)
A SIMPLE EVENT LOG
def writeNext(ev: Int): Boolean = synchronized {
  if (writerPos < transfPos) {
    false
  } else {
    log(writerPos.toInt) = ev
    writerPos += 1
    true
  }
}
A SIMPLE EVENT LOG
def transformNext(f: Int => Int): Boolean = synchronized {
  if (transfPos >= writerPos) {
    false
  } else {
    val currentEvent = log(transfPos.toInt)
    log(transfPos.toInt) = f(currentEvent)
    transfPos += 1
    true
  }
}
DELETING ALL SENSITIVE DATA
log.transformNext(_ => 0)
READY FOR GDPR
SYNCHRONIZED STATS
Log Type       L2 Miss (M)   L3 Miss (M)   IPC    Ops/s (M)
Synchronized   1084          137           0.29   13.2
Lock-free      357           156           0.42   17.2
Lazy set       304           73            0.8    42.6
Padded         211           50            1.4    76.5
LOCK-FREE IMPLEMENTATION
@volatile var writerPos = 0L
@volatile var transfPos = 0L
ENSURING MEMORY VISIBILITY
▸ Introduces a happens-before relationship
▸ All changes prior to that have happened and are
visible to other threads
▸ Does not mean that values are read from main
memory
▸ Writes are applied to the L1 cache and flow
through the cache subsystem
LOCK-FREE STATS
Log Type       L2 Miss (M)   L3 Miss (M)   IPC    Ops/s (M)
Synchronized   1084          137           0.29   13.2
Lock-free      357           156           0.42   17.2
Lazy set       304           73            0.8    42.6
Padded         211           50            1.4    76.5
GOING ATOMIC
val writerPos = new AtomicLong(0L)
val transfPos = new AtomicLong(0L)
val log = new Array[Int](logSize)

def writeNext(ev: Int): Boolean = {
  val currentWriterPos = writerPos.get
  if (currentWriterPos < transfPos.get) {
    false
  } else {
    log(currentWriterPos.toInt) = ev
    writerPos.lazySet(currentWriterPos + 1)
    true
  }
}
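The slide only shows the writer side. For completeness, here is a self-contained sketch of the whole lazy-set log, including a matching consumer (the class and method shapes are assumptions in the spirit of the slide, not the talk's actual code): reads use `get` (a volatile read), and the position advances use `lazySet`, an ordered store that avoids the full fence of a volatile write.

```scala
import java.util.concurrent.atomic.AtomicLong

// Minimal single-writer / single-transformer event log using lazySet
// on both positions.
class LazyEventLog(logSize: Int) {
  private val writerPos = new AtomicLong(0L)
  private val transfPos = new AtomicLong(0L)
  private val log = new Array[Int](logSize)

  def writeNext(ev: Int): Boolean = {
    val currentWriterPos = writerPos.get
    if (currentWriterPos < transfPos.get) false
    else {
      log(currentWriterPos.toInt) = ev
      writerPos.lazySet(currentWriterPos + 1) // ordered store, no full fence
      true
    }
  }

  def transformNext(f: Int => Int): Boolean = {
    val current = transfPos.get
    if (current >= writerPos.get) false // never run ahead of the writer
    else {
      log(current.toInt) = f(log(current.toInt))
      transfPos.lazySet(current + 1)
      true
    }
  }

  def read(i: Int): Int = log(i)
}
```

Because the writer only advances `writerPos` and the transformer only advances `transfPos`, each position has a single mutator, which is what makes the ordered (rather than fully fenced) stores safe here.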
LAZY SET STATS
Log Type       L2 Miss (M)   L3 Miss (M)   IPC    Ops/s (M)
Synchronized   1084          137           0.29   13.2
Lock-free      357           156           0.42   17.2
Lazy set       304           73            0.8    42.6
Padded         211           50            1.4    76.5
FALSE SHARING
▸ Two threads modifying independent variables that
share the same cache line
▸ Often depends on the layout of your objects
▸ Causes invalidation of cache lines and increased
coherency-protocol traffic
▸ The cache line ping-pongs through L3, which has
significant latency implications
▸ Can be even worse when the threads are on
different sockets (crossing interconnects)
ENSURING COHERENCE (MESI)
[State diagram: the MODIFIED, EXCLUSIVE, SHARED, and INVALID states, with transitions driven by processor reads/writes and observed bus reads/writes]
PR - processor read
PW - processor write
BR - observed bus read
BW - observed bus write
S - shared
~S - not shared
FALSE SHARING
@volatile var writerPos = 0L
@volatile var transfPos = 0L
[Diagram: writerPos and transfPos landing on the same 64-byte cache line (Cache Line 1 of N)]
▸ Inspecting Java Object Layout

https://github.com/ktoso/sbt-jol
PADDING TO AVOID FALSE SHARING
val writerPos = AtomicLong.withPadding(0, LeftRight128)
val transfPos = AtomicLong.withPadding(0, LeftRight128)
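`AtomicLong.withPadding` above comes from a library with padded atomic variants. A hand-rolled alternative pads with dummy fields — but note this is only an illustrative sketch (all names here are invented): the JVM is free to reorder and pack fields, so the resulting layout must be verified with a tool such as sbt-jol.

```scala
// Hand-rolled padding sketch (layout NOT guaranteed by the JVM):
// the dummy longs aim to push writerPos and transfPos onto different
// 64-byte cache lines, so the writer and transformer threads stop
// invalidating each other's line.
class PaddedPositions {
  var p00, p01, p02, p03, p04, p05, p06 = 0L // padding before writerPos
  @volatile var writerPos = 0L
  var p10, p11, p12, p13, p14, p15, p16 = 0L // padding between the two
  @volatile var transfPos = 0L
  var p20, p21, p22, p23, p24, p25, p26 = 0L // padding after transfPos
}
```

Seven longs on each side cover a 64-byte line regardless of where the hot field lands within it; checking the actual offsets with sbt-jol is still essential.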
PADDED STATS
Log Type       L2 Miss (M)   L3 Miss (M)   IPC    Ops/s (M)
Synchronized   1084          137           0.29   13.2
Lock-free      357           156           0.42   17.2
Lazy set       304           73            0.8    42.6
Padded         211           50            1.4    76.5
USE CASE FROM REAL LIFE
AKKA MESSAGE LIFECYCLE
[Diagram: the sender sends a message through an ActorRef; the dispatcher enqueues it in the actor's mailbox and schedules the mailbox to run on the executor service's threads (T1 … Tn); the actor then processes the message]
TYPES OF AKKA DISPATCHERS
▸ Default Dispatcher - used if no other is specified
▸ Pinned Dispatcher - one thread per actor
▸ Calling Thread Dispatcher - for tests only
TYPES OF EXECUTORS
▸ ForkJoinPool - relies on lock-free work-stealing
queues
▸ ThreadPoolExecutor - uses a linked blocking queue
to distribute tasks
THREAD POOL EXECUTOR
[Diagram: an external component submits tasks (R1 … Rn) to a single shared queue; threads T1 … Tn dequeue and execute them]
LIMITATION: NO ACTOR-TO-THREAD AFFINITY
▸ Potentially causes CPU cache invalidation
▸ Lacks parameters for achieving fine-grained
control
AFFINITY POOL
[Diagram: an external component submits a task; a queue selector picks which per-thread queue (one per thread T1 … Tn) to submit it to; each thread dequeues tasks only from its own queue]
FAIR DISTRIBUTION QUEUE SELECTOR
▸ Adaptive work assignment strategy
▸ Few actors - explicit mapping (fairer)
▸ More actors - consistent hashing (cheaper)
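The "more actors" branch can be illustrated with a simple hash-modulo selection (an illustrative sketch, not Akka's actual implementation): hashing the actor's path keeps the mapping stable, so a given actor's messages always land on the same thread's queue.

```scala
import scala.util.hashing.MurmurHash3

// Stable actor-to-queue mapping: the same path always hashes to the
// same queue index, preserving thread affinity with no per-actor state.
def selectQueue(actorPath: String, numQueues: Int): Int =
  Math.floorMod(MurmurHash3.stringHash(actorPath), numQueues)
```

Because the mapping depends only on the path and the queue count, no bookkeeping is needed; the tradeoff is that load balancing across queues is only statistical.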
ADVANTAGES
▸ Fewer cache misses thanks to temporal locality
▸ Decreases contention
▸ Customisable queue selection
CLIENT ACTOR
class UserQueryActor(latch: CountDownLatch,
                     numQueries: Int,
                     numUsersInDB: Int) extends Actor {
  private var left = numQueries
  private val receivedUsers: mutable.Map[Int, User] = mutable.Map()
  private val randGenerator = new Random()

  override def receive: Receive = {
    case u: User =>
      receivedUsers.put(u.userId, u)
      if (left == 0) {
        latch.countDown()
        context stop self
      } else {
        sender() ! Request(randGenerator.nextInt(numUsersInDB))
      }
      left -= 1
  }
}
SERVICE ACTOR
class UserServiceActor(userDb: Map[Int, User],
                       latch: CountDownLatch,
                       numQueries: Int) extends Actor {
  private var left = numQueries

  def receive = {
    case Request(id) =>
      userDb.get(id) match {
        case Some(u) => sender() ! u
        case None    =>
      }
      if (left == 0) {
        latch.countDown()
        context stop self
      }
      left -= 1
  }
}
BENCHMARK RESULTS
[Bar chart: throughput in millions of msg/s (0-6) against the dispatcher.throughput setting (1, 5, 50) for the Affinity, Fork Join, and Fixed Size executors; measured values range from 1.4 to 5.4 M msg/s]
SO… IF YOU ARE ON A HOT CODEPATH
▸ Make sure you really are
▸ Measure everything (sbt-jmh, sbt-jol, perfstat …)
▸ Watch out for language features that can
introduce unintended allocations (e.g. pattern
matching)
▸ Use algorithms and data structures that are
cache-friendly
▸ Use efficient concurrency tools but try to not roll
your own: JCTools, Akka, Vert.x …
RESOURCES
▸ Name based extractors

https://hseeberger.wordpress.com/2013/10/04/name-based-extractors-in-scala-2-11/
▸ Cache oblivious algorithms (MIT OCW)

https://www.youtube.com/watch?v=CSqbjfCCLrU
▸ Lazy Set in detail

http://psy-lob-saw.blogspot.bg/2012/12/atomiclazyset-is-performance-win-for.html
▸ Processor Counter Monitor

https://github.com/opcm/pcm
▸ Memory Access Patterns

https://mechanical-sympathy.blogspot.bg/2012/08/memory-access-patterns-are-important.html
▸ False Sharing

https://mechanical-sympathy.blogspot.bg/2011/07/false-sharing.html

https://mechanical-sympathy.blogspot.bg/2013/02/cpu-cache-flushing-fallacy.html
▸ Slides and code

https://github.com/zaharidichev/scala-days-2018-berlin
