Garbage Collection Pause Times - Angelika Langer

Angelika Langer
Trainer/Consultant
http://www.AngelikaLanger.com/
Java Performance
Garbage Collection
Pauses

© Copyright 1995-2015 by Angelika Langer & Klaus Kreft. All Rights Reserved.
last update: 18/03/2014 08:06
objective
• what causes long GC pauses?
• what does GC do during a STW pause?
• how can I reduce the pause time?
• explore HotSpot JVM's GC algorithms
• point out reasons for long pauses
• discuss tuning options
• glance at alternative GC in other JVMs

speaker's relationship to topic
• independent trainer / consultant / author
– teaching Java for ~20 years
– curriculum of some challenging seminars
– JCP observer and Java champion since 2005
– co-author of "Effective Java" column
– author of Java Generics FAQ and Lambda Tutorial & Reference

garbage collection
• purpose
– make memory occupied by unreachable objects available
for subsequent memory allocation
– in order to allow for 24/7 service with finite memory resources
• involves several activities
– garbage detection
– garbage elimination

cost of garbage collection
• stop-the-world (STW) phases stop all application threads
– problem for applications with time constraints
e.g. user interaction, SLA applications, ...
• concurrent GC phases steal CPU cycles from application
– problem for application with performance / throughput constraints
e.g. transactions per second, ...

trade-off
[Figure: triangle of competing goals: memory footprint, throughput, pause time]

what causes long pause times?
• what does GC do during a STW pause?
• how can I reduce the pause time?
• agenda for this talk
– explore the HotSpot JVM's GC algorithms
– point out obvious (and not so obvious) reasons for long pauses
– discuss tuning options
– glance at GC in other JVMs

agenda
• classic HotSpot GC algorithms
– parallel GC
– concurrent GC
– G1
• reasons for long pauses
• pause time tuning
• alternative garbage collectors
– Shenandoah
– Azul & JRockit

parallel GC
• a generational collector
– based on the assumption that most objects die young

parallel GC (cont.)
• organizes heap into different areas (generations)
• uses different algorithms per generation
[Figure: heap areas along the object lifetime axis: young generation (eden, survivor spaces) and old (tenured) generation]

minor GC
• copy algorithm
– scans all references into the young generation
– copies all reachable objects (survivors) into survivor space
– also handles promotion to old gen
– updates references to relocated objects
– frees the entire young generation en bloc
• performed frequently
– by multiple GC threads in parallel

minor GC
• upside
– eden is an empty block of memory afterwards
very efficient subsequent memory allocation
– no fragmentation
survivor space is compact
• downside
– stop-the-world pause
proportional to number of survivors
– higher footprint
needs free space as destination for copying

full GC
• mark-and-compact algorithm
– follows all references into the heap
– marks all reachable objects
– sweeps all dead objects (i.e. marks their memory as "free")
– compacts the old generation
– updates references to relocated objects
• performed rarely
– by multiple GC threads in parallel

full GC
• upside
– no fragmentation
• downside
– stop-the-world pause proportional to
number of survivors (in marking phase) and
size of heap (in compaction phase)

inter-generational references
• generational GC has a downside
– works on a subset of the heap (young gen in minor GC)
• all references into young must be scanned
– root references (from outside the heap into the young gen)
stack variables, static fields, references from JIT compiled code, etc.
– inter-generational references (from old gen into young gen)
require write barriers

old-to-young references
[Figure: young and old generation; the roots and the old-to-young references (tracked via the card table) are the starting points for young generation marking]

write barriers
• a barrier is additional code
– executed when a reference is modified
– must catch when application creates an old-to-young reference
– sets a dirty bit in a card table
• card table is later processed in GC pause
– to find the actual inter-generational references
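The interplay of write barrier and card table can be sketched as a toy model in Java. Everything here is an assumption made for illustration: the heap is a range of integer addresses, the card size of 512 bytes and all class and method names are hypothetical, and the barrier of a real JVM is emitted by the JIT compiler and differs in detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a card table; all names and sizes are illustrative assumptions. */
public class CardTableSketch {
    public static final int CARD_SIZE = 512;   // heap bytes covered by one card (assumed)
    private final boolean[] dirty;             // one dirty flag per card
    private final int oldGenStart;             // addresses >= oldGenStart belong to old gen

    public CardTableSketch(int heapSize, int oldGenStart) {
        this.dirty = new boolean[(heapSize + CARD_SIZE - 1) / CARD_SIZE];
        this.oldGenStart = oldGenStart;
    }

    /** Write barrier: runs on every reference store "field at fieldAddr = targetAddr". */
    public void onReferenceStore(int fieldAddr, int targetAddr) {
        boolean fieldInOld = fieldAddr >= oldGenStart;
        boolean targetInYoung = targetAddr < oldGenStart;
        if (fieldInOld && targetInYoung) {
            dirty[fieldAddr / CARD_SIZE] = true;   // remember only the card, not the exact field
        }
    }

    /** In the GC pause: collect the dirty cards that must be scanned and clean them. */
    public List<Integer> drainDirtyCards() {
        List<Integer> cards = new ArrayList<>();
        for (int i = 0; i < dirty.length; i++) {
            if (dirty[i]) {
                cards.add(i);
                dirty[i] = false;
            }
        }
        return cards;
    }

    public static void main(String[] args) {
        CardTableSketch table = new CardTableSketch(8192, 4096);
        table.onReferenceStore(5000, 100);   // old -> young: marks card 9
        table.onReferenceStore(5010, 200);   // same card, no additional entry
        table.onReferenceStore(100, 5000);   // young -> old: ignored
        System.out.println(table.drainDirtyCards());   // prints [9]
    }
}
```

The sketch shows both costs named on the slides: the barrier runs on every reference store, and the pause has to walk the table and then scan every dirty card for the actual references.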

cost of generational GC
• extra effort for inter-generational references
– slows down application (due to write barrier)
– increases pauses (due to card table processing)
• floating garbage
– dead (not yet collected) objects in old gen
might prevent collection of dead objects in young gen

parallel GC
[Figure: parallel GC timeline: minor GC pauses (copy) followed by a full GC pause (marking, summary, compaction)]

parallel GC - tuning
• pause time proportional to number of survivors and size of heap
– large heap => long pauses
=> not many tuning options
• some tuning ideas
– increase parallelism
increase number of parallel GC threads
provided that there are idle CPUs available
– let objects die in young gen
reason: GC in old gen is more expensive than in young gen
increase young gen size, survivor size, tenuring threshold, ...

some useful VM flags
-XX:+UseParallelGC -XX:+UseParallelOldGC
– select parallel GC on young and old gen
-XX:ParallelGCThreads=<number>
– specify number of GC threads
-Xmn<value> or -XX:NewRatio=<ratio>
– specify size of young gen
-XX:SurvivorRatio=<ratio>
– specify size of survivor spaces
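Since these flags only take effect at JVM startup, it can help to check from inside the application which collectors are actually active. A minimal sketch using the standard java.lang.management API follows; the class name ActiveCollectors is an assumption, and the reported collector names depend on the JVM version and the selected flags.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class ActiveCollectors {
    /** Names of the collectors the running JVM reports (typically one per generation). */
    public static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // collection count and accumulated collection time since JVM start
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", accumulated time=" + gc.getCollectionTime() + " ms");
        }
    }
}
```

Running the program once with and once without a flag such as -XX:+UseParallelGC shows whether the selection had the intended effect.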

agenda
– parallel GC
– concurrent GC
– G1
– Shenandoah
– Azul & JRockit

concurrent GC
• alternative algorithm on old generation
– STW copy algorithm on young generation
– concurrent mark-and-sweep (CMS) algorithm on old generation
• CMS has several phases
– initial marking phase (STW)
– marking phase (concurrent)
– final remarking phase (STW)
– sweep phase (concurrent)

concurrent marking
• marking must identify reachable objects
– while application is running and modifies the reference graph
• uses tricolor algorithm
– requires write barriers
• concurrent marking is not exact
– snapshot-at-the-beginning (SATB) marking
i.e. objects stay alive if they were reachable at the beginning of the marking cycle
– very conservative; creates a lot of floating garbage
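The tricolor scheme itself can be sketched in a few lines of Java. This is a single-threaded toy model with hypothetical names (TricolorSketch, Node), so it shows only the colouring, not the concurrency or the write barrier.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/** Toy tricolor marking; single-threaded, so it leaves out the concurrent part. */
public class TricolorSketch {
    public enum Color { WHITE, GREY, BLACK }

    public static final class Node {
        public final String name;
        public final List<Node> refs = new ArrayList<>();
        public Color color = Color.WHITE;          // white = not (yet) visited
        public Node(String name) { this.name = name; }
    }

    /** Marks everything reachable from the roots; nodes that stay WHITE are garbage. */
    public static void mark(List<Node> roots) {
        Deque<Node> greyNodes = new ArrayDeque<>();
        for (Node root : roots) {
            if (root.color == Color.WHITE) {
                root.color = Color.GREY;           // grey = visited, references not yet scanned
                greyNodes.push(root);
            }
        }
        while (!greyNodes.isEmpty()) {
            Node node = greyNodes.pop();
            for (Node child : node.refs) {
                if (child.color == Color.WHITE) {
                    child.color = Color.GREY;
                    greyNodes.push(child);
                }
            }
            node.color = Color.BLACK;              // black = visited, all references scanned
        }
    }

    public static void main(String[] args) {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c"), d = new Node("d");
        a.refs.add(b);
        b.refs.add(a);   // cycle between a and b
        c.refs.add(d);   // c and d are unreachable from the root
        mark(Collections.singletonList(a));
        for (Node n : new Node[] { a, b, c, d }) {
            System.out.println(n.name + " " + n.color);   // a BLACK, b BLACK, c WHITE, d WHITE
        }
    }
}
```

When the application runs concurrently, it can store a reference to a white object into an already black one; catching such stores is the job of the write barrier.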

concurrent sweep
• sweeping adds free memory cells to free lists
– allocation in old gen requires lookup in free lists
=> more expensive allocation
– increases cost of minor GC
because promotion is more expensive
• sweeping leads to fragmentation
– higher risk of promotion failure
– if large objects are promoted
• fallback to full GC
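Why free lists make allocation more expensive and promotion less reliable can be illustrated with a toy first-fit allocator. The class FreeListSketch and its methods are hypothetical, and the model deliberately leaves out coalescing of adjacent chunks and size-segregated lists.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy first-fit free list; illustrates fragmentation, not CMS's actual data structures. */
public class FreeListSketch {
    /** A free chunk of old-gen memory: [address, address + size). */
    private static final class Chunk {
        final int address;
        final int size;
        Chunk(int address, int size) { this.address = address; this.size = size; }
    }

    private final List<Chunk> freeList = new ArrayList<>();

    /** Called by the sweeper for every dead object it finds. */
    public void free(int address, int size) {
        freeList.add(new Chunk(address, size));
    }

    /** First-fit allocation: returns the address, or -1 if no single chunk is large enough. */
    public int allocate(int size) {
        for (int i = 0; i < freeList.size(); i++) {      // lookup cost grows with the list
            Chunk c = freeList.get(i);
            if (c.size >= size) {
                freeList.remove(i);
                if (c.size > size) {
                    freeList.add(i, new Chunk(c.address + size, c.size - size));
                }
                return c.address;
            }
        }
        return -1;   // corresponds to a promotion failure in this toy model
    }

    public int totalFree() {
        int sum = 0;
        for (Chunk c : freeList) sum += c.size;
        return sum;
    }

    public static void main(String[] args) {
        FreeListSketch oldGen = new FreeListSketch();
        oldGen.free(0, 32);
        oldGen.free(100, 32);
        oldGen.free(200, 32);
        System.out.println(oldGen.totalFree());    // prints 96
        System.out.println(oldGen.allocate(64));   // prints -1: enough memory, but fragmented
        System.out.println(oldGen.allocate(16));   // prints 0
    }
}
```

The failed request for 64 units despite 96 free units is the situation in which CMS has to fall back to a compacting full GC.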

cost of CMS
• increased minor GC pause time
– due to more expensive allocation in old gen via free lists
• substantially more floating garbage
– due to concurrent SATB marking
• extra effort for tricolor algorithm
– slows down application via write barriers
• long turn-around times
– complex algorithm takes longer until actual memory reclaim
• unreliable
– fallback to full GC in case of fragmentation

cost of CMS
• reduced pause time (on average)
• at the expense of
– higher memory consumption
– lower throughput

concurrent GC
[Figure: concurrent GC timeline: minor GC pauses interleaved with a CMS cycle of initial marking (STW), concurrent marking, final remarking (STW), and concurrent sweep]

CMS - tuning
• pause time depends on
– amount of work left for remarking phase
i.e. number of grey cells created by application's activities
– degree of fragmentation
i.e. fallback to full GC
• tuning ideas
– get more done concurrently
increase number of threads in concurrent phases
– reduce CMS's workload
let objects die in young gen instead of old gen
increase young gen size, survivor size, tenuring threshold, ...
– start marking cycles earlier
to avoid fallback to full GC
lower the occupancy threshold that initiates the cycle

-XX:+UseConcMarkSweepGC
– select CMS on old gen (automatically uses parallel young GC)
-XX:CMSInitiatingOccupancyFraction=<percent>
-XX:+UseCMSInitiatingOccupancyOnly
– lower threshold that starts CMS cycle
-XX:ConcGCThreads=<n>
– specify number of threads in concurrent phases

agenda
– parallel GC
– concurrent GC
– G1
– Shenandoah
– Azul & JRockit

"garbage first" (G1) GC
• a generational garbage collector
– organizes the heap into regions of identical size
– copy algorithm => no fragmentation
[Figure: heap regions labelled young, survivor, and old; collection in young mode]

mixed mode collections
• builds collection set dynamically
– collection set = regions to be included into next GC
• two modes
– young: all young regions are collected
– mixed: old regions with a lot of garbage are included
[Figure: heap regions labelled young, survivor, and old; collection set in mixed mode]

remembered sets
• partial GC (on subset of heap) requires
– maintenance of references into the collection set
– all regions have a remembered set
• remembered set (RS)
– list of references from outside the region into the region

remembered set (cont.)
• RS maintenance requires write barriers
– must catch creation of inter-regional references
– RS update tasks are put into a work queue
processed concurrently by background threads, or
when STW GC pause starts
• simplification (in order to reduce RS overhead)
– references originating from young regions are not recorded in RS
– instead: all young regions are included into each GC

concurrent marking
• G1 performs a concurrent SATB marking
– similar to CMS's marking
– initial marking phase piggybacked on young GC
– no sweep phase
– instead a concurrent cleanup phase
reclaims entirely empty old regions en bloc
• marking information used for internal statistics
– GC efficiency calculation, i.e. amount of garbage per old region
– liveness info, i.e. are origins of RS entries still alive

G1 pauses
• only two main GC parameters:
-XX:GCPauseIntervalMillis=500
-XX:MaxGCPauseMillis=200
[Figure: timeline of application running and GC pauses, annotated with the GCPauseIntervalMillis and MaxGCPauseMillis goals]

G1 pauses
• evacuation pause in young mode
– proportional to number of young regions and number of survivors
therein
– also depends on cost of pending RS updates
– self-adjusting
G1 tries to create only as many young regions as can be collected within
pause time goal
does not always work out
• evacuation pause in mixed mode
– depends on pause time goal
– self-adjusting
G1 includes only old regions with lots of garbage and only as many as fit
into the pause time goal
does not always work out
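The "garbage first" selection for a mixed collection can be sketched as a greedy choice under a time budget. This is a simplification under stated assumptions: the class CollectionSetSketch, the per-region cost prediction evacMillis, and the efficiency measure (garbage per predicted millisecond) are hypothetical and stand in for G1's internal statistics.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of composing a mixed collection set; the cost model is an assumption. */
public class CollectionSetSketch {
    public static final class Region {
        public final String name;
        public final int garbageKb;       // reclaimable memory in the region
        public final double evacMillis;   // predicted cost of copying its survivors

        public Region(String name, int garbageKb, double evacMillis) {
            this.name = name;
            this.garbageKb = garbageKb;
            this.evacMillis = evacMillis;
        }

        double efficiency() { return garbageKb / evacMillis; }
    }

    /** Picks the most efficient old regions first, as long as they fit into the budget. */
    public static List<Region> choose(List<Region> oldRegions, double budgetMillis) {
        List<Region> candidates = new ArrayList<>(oldRegions);
        Comparator<Region> byEfficiency = Comparator.comparingDouble(Region::efficiency);
        candidates.sort(byEfficiency.reversed());
        List<Region> chosen = new ArrayList<>();
        double used = 0;
        for (Region r : candidates) {
            if (used + r.evacMillis > budgetMillis) break;   // pause time goal would be exceeded
            chosen.add(r);
            used += r.evacMillis;
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Region> old = new ArrayList<>();
        old.add(new Region("A", 900, 2.0));
        old.add(new Region("B", 500, 5.0));
        old.add(new Region("C", 800, 4.0));
        old.add(new Region("D", 100, 2.0));
        for (Region r : choose(old, 7.0)) {
            System.out.println(r.name);   // prints A, then C
        }
    }
}
```

The selection is only as good as the cost prediction, which is one way to read the remark that the self-adjustment does not always work out.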

G1 pauses (cont.)
• full evacuation pause
– includes all regions into collection set
– if heap is almost full and cannot be expanded any further
– proportional to number of regions and survivors
• remarking
– amount of work left for remarking phase
• cleanup
– number of empty old regions

cost of G1
• extra effort for remembered sets
– slows down application via write barriers
– background GC threads for concurrent RS update
– increased pause time for RS update in evacuation pause
• long turn-around times
– complex algorithm takes longer until actual memory reclaim
• some amount of floating garbage
– due to concurrent SATB marking
• unreliable
– pause time goal not guaranteed

cost of G1
• upside
– reduced pause time (compared to parallel GC)
– no fragmentation (compared to CMS)
– fully self-adapting
• downside
– higher memory consumption
– lower throughput

G1 GC
[Figure: G1 timeline: young GC pauses, a marking cycle (initial marking, final remarking, concurrent cleanup), and mixed GC pauses (evacuation)]

G1 - tuning
– avoid over-tuning
set realistic pause time goals
– start marking cycles earlier
to avoid full GC
lower the occupancy threshold that initiates the cycle
– use more GC threads
increase number of threads in concurrent and STW phases

-XX:+UseG1GC
– select G1
-XX:GCPauseIntervalMillis=<ms> -XX:MaxGCPauseMillis=<ms>
– specify pause time and interval goals
-XX:InitiatingHeapOccupancyPercent=<percent>
– lower threshold that starts marking cycle
-XX:ConcGCThreads=<n> -XX:ParallelGCThreads=<n>
– specify number of threads in concurrent and STW phases

agenda
• less obvious reasons for long pauses
– humongous objects
– soft/weak references
– trace output
– Shenandoah
– Oracle JRockit
– Azul "C4"

Shenandoah
• an alternative GC algorithm for OpenJDK
– (project name: Shenandoah)
– submitted by Red Hat
• goal
– manage 100GB+ heaps with < 10ms pause times
– pause times proportional to size of root set, not size of heap

Shenandoah
• very similar to G1
– organized into regions
– concurrent SATB marking
– dynamically composed collection set based on GC efficiency
– ...
• key difference
– no notion of generations
age does not matter, only GC efficiency does
– concurrent evacuation
no STW pause for copying survivors
no remembered sets
survivors are copied individually on write access

Shenandoah phases
• concurrent marking
– visit all reachable objects (starting with root references)
– needs a STW initial & remark pause (just like CMS and G1 do)
– afterwards there is liveness info for all regions
• concurrent evacuation
– select "garbage first" regions for evacuation ("from" space)
– select free target regions for evacuation ("to" space)
– scan reachable objects in selected "from" regions
– add a forwarding pointer to each object that must be relocated
– but do not copy it (yet)

forwarding pointer
[Figure: forwarding pointer from a live survivor in the "from" region to its new location in the "to" region; dead objects are not forwarded]

Shenandoah phases (cont.)
• concurrent evacuation (cont.)
– first write access to survivor in "from" creates copy in "to"
– subsequent read access is redirected to copy
– references to relocated object are updated in next marking phase
– afterwards all "from" regions are reclaimed
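The forwarding pointer and the copy on first write, as described on these slides, can be mimicked with a toy model in Java. All names (ForwardingSketch, Obj, read, write) are hypothetical; the sketch illustrates the mechanism as presented here and is not the collector's implementation.

```java
/** Toy model of forwarding pointers with copy-on-write relocation. */
public class ForwardingSketch {
    /** Heap object with a forwarding pointer; it points to itself until the object is relocated. */
    public static final class Obj {
        public int value;
        public final boolean inFromRegion;
        public Obj forward = this;

        public Obj(int value, boolean inFromRegion) {
            this.value = value;
            this.inFromRegion = inFromRegion;
        }
    }

    /** Read access: always goes through the forwarding pointer. */
    public static int read(Obj ref) {
        return ref.forward.value;
    }

    /** Write barrier: the first write to a survivor in a "from" region creates the copy. */
    public static void write(Obj ref, int newValue) {
        Obj target = ref.forward;
        if (target == ref && ref.inFromRegion) {
            target = new Obj(ref.value, false);   // copy into a "to" region
            ref.forward = target;                 // later accesses are redirected to the copy
        }
        target.value = newValue;
    }

    public static void main(String[] args) {
        Obj survivor = new Obj(1, true);
        Obj staleReference = survivor;              // another reference to the old location
        write(survivor, 42);                        // triggers the copy
        System.out.println(read(staleReference));   // prints 42, via the forwarding pointer
        System.out.println(staleReference.value);   // prints 1, the old copy is left untouched
    }
}
```

Note that the copy is created by the writing application thread, which is why the write barrier is comparatively expensive.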

[Figure: copy on write (COW): a write to a live survivor in the "from" region creates its copy at the new location in the "to" region]

after next marking cycle
[Figure: after the next marking cycle the copy in the "to" region is live; the "from" region holds only the stale survivor and dead objects]

Shenandoah
[Figure: Shenandoah timeline: initial marking and final remarking pauses of consecutive cycles, with concurrent evacuation (copying) in between and memory reclaim after the following marking]

evaluation
• upside
– few short STW pauses
– no remembered set overhead
• downside
– increased object size (due to forwarding pointer)
– more expensive write barriers
trigger object copying
note: application threads (not GC threads) create the copies
– temporarily more expensive read access
due to indirection via forwarding pointer
– floating garbage
– long turn-around time

agenda
– trace output
– Shenandoah
– Oracle JRockit
– Azul "C4"

JRockit GC
• GC functionality different from HotSpot
– more modular
• ‘mix & match’
– heap partitioning
– GC algorithm
– compaction
• can switch GC algorithms and strategies at runtime
– at least to a certain extent

last update: 18/03/2014 08:06
gc pauses (57)
heap partitioning
• un-partitioned
– single heap
• generational
– two areas: nursery (~ young generation) + old generation
– nursery contains keep area
▪ most recently allocated objects
▪ not copied to old gen during young GC
▪ avoids premature promotion of short-lived objects

young gen collector
• collects nursery (if present)
• scavenger GC
– copies all live objects from nursery to old generation
▪ does not touch keep area
– stop-the-world
– uses all available CPU cores
– resembles HotSpot's parallel young GC (w/o survivor space)

old gen collector
• collect old generation (gen) or entire heap (single)
• algorithm split into mark-and-sweep GC + compaction
strategy
– select GC algorithm and compaction strategy independently

two mark-and-sweep GC algorithms
• parallel mark-and-sweep GC
– stop-the-world
– uses all available CPU cores
– resembles HotSpot's parallel old GC (w/ compaction)
• concurrent mark-and-sweep GC
– "mostly concurrent"
– short stop-the-world-pauses during marking and sweeping
– resembles HotSpot's CMS
HotSpot sweeps concurrently (w/o STW pauses)

compaction
• partial compaction (on only a part of the heap)
– one or two windows traveling the heap
– window size is adjustable
– external or internal compaction
• compaction runs as a stop-the-world pause
– during sweep phase
[Figure: heap from top to bottom with external and internal compaction windows]

-XgcPrio:deterministic - JRockit Real Time
• GC split up into work packets
– e.g. a compaction job for part of the heap
• if it takes too long, throw away the work packet
– re-try later, re-using partial results if possible

agenda
– trace output
– Shenandoah
– Oracle JRockit
– Azul "C4"

Azul "C4"
• commercial JVM with a no-pause collector named "C4"
– "C4" = Continuously Concurrent Compacting Collector
• special purpose JVM (the so-called Zing platform)
– runs virtualized on top of the actual OS and includes its own
operating environment
• C4 algorithm makes massive and rapid changes to
virtual memory mappings
– regular Linux has technical remapping limitations
– Zing has its own virtual memory subsystem that supports
memory remaps, unmaps, etc. as needed for "C4"

Azul "C4" - how does it differ ?
• same core mechanism used for both generations
– concurrent mark-compact
– old and young generation collectors run simultaneously
and concurrently with the application threads
old gen mark-compact
young gen mark-compact
• algorithm has 3 phases
– mark
– relocate
– remap

"C4" phases
• mark phase
– trace generation’s live set by starting from roots
mark all encountered objects as live
mark all encountered object references as marked through
• relocate phase
– compact memory by relocating live objects into contiguously populated
target pages
free sparse pages based on liveness totals collected during previous mark phase
each “from” page is protected, its objects are relocated to new “to” pages
– forwarding information is stored outside the “from” page
– “from” page’s physical memory is immediately recycled

"C4" phases (cont.)
• remap phase
– remapping occurs when mutator threads encounter stale
references to relocated objects
stale references are corrected to point to current object address
remap phase is combined with next GC cycle’s mark phase
– at the end of remap phase, no stale references will exist
virtual addresses associated with relocated “from” can be safely recycled
– “no hurry” to finish remap phase
there are no physical resources being held

combined mark-remap phase
[Figure: consecutive C4 cycles of mark, relocate, remap; the remap phase of each cycle is combined with the mark phase of the next cycle]

GC comparison
JVM             collector   young                    old                                                                    fallback
Oracle HotSpot  ParallelGC  STW copy                 STW mark-compact
Oracle HotSpot  CMS         STW copy                 mostly concurrent mark-sweep                                           STW mark-compact
Oracle HotSpot  G1          STW copy                 mostly concurrent mark, STW incremental compact                        STW mark-compact
OpenJDK         Shenandoah  N/A                      mostly concurrent mark, concurrent compact                             ???
Oracle JRockit  RealTime    N/A                      mostly concurrent or STW parallel mark-sweep, STW incremental compact  STW mark-compact
Azul Zing       C4          concurrent mark-compact  concurrent mark-compact

garbage collection pauses
Q & A
Angelika Langer
http://www.AngelikaLanger.com
twitter: @AngelikaLanger

Angelika Langer & Klaus Kreft
Java 8
Stream
Performance

last update: 10/13/2015,10:06
objective
• how do streams perform?
– explore whether / when parallel streams outperform seq. streams
– compare performance of streams to performance of regular loops
• what determines stream performance?
– take a glance at some stream internal mechanisms

speaker's relationship to topic
• independent trainer / consultant / author
– teaching C++ and Java for ~20 years
– curriculum of half a dozen challenging Java seminars
– JCP observer and Java champion since 2005
– co-author of "Effective Java" column
– author of Java Generics FAQ
– author of Lambda Tutorial & Reference

agenda
• introduction
• loop vs. sequential stream
• sequential vs. parallel stream

what is a stream?
• equivalent of sequence from functional programming languages
– object-oriented view: internal iterator pattern
▪ see GoF book for more details
• idea
myStream.forEach(s -> System.out.print(s));
– forEach(): stream operation
– s -> System.out.print(s): user-defined functionality applied to each element

fluent programming
myStream.filter(s -> s.length() > 3)
        .mapToInt(s -> s.length())
        .forEach(System.out::print);
– filter(), mapToInt(), forEach(): stream operations
– lambdas and method reference: user-defined functionality applied to each element
– filter() and mapToInt() are intermediate operations
– forEach() is the terminal operation

obtain a stream
• collection: myCollection.stream(). ...
• array: Arrays.stream(myArray). ...
• resulting stream
– does not store any elements
– just a view of the underlying stream source
• more stream factories, but not in this talk

parallel streams
• collection: myCollection.parallelStream(). ...
• array: Arrays.stream(myArray).parallel(). ...
• performs stream operations in parallel
– i.e. with multiple worker threads from fork-join common pool
myParallelStream.forEach(s -> System.out.print(s));

stream functionality rivals loops
• Java 8 streams:
myStream.filter(s -> s.length() > 3)
        .mapToInt(s -> s.length())
        .forEach(System.out::print);
– or, without mapToInt():
myStream.filter(s -> s.length() > 3)
        .forEach(s -> System.out.print(s.length()));
• since Java 5:
for (String s : myCol)
    if (s.length() > 3)
        System.out.print(s.length());
• pre-Java 5:
Iterator iter = myCol.iterator();
while (iter.hasNext()) {
    String s = (String) iter.next();   // raw Iterator returns Object
    if (s.length() > 3)
        System.out.print(s.length());
}

obvious question …
… how does the performance compare ?
• loop vs. sequential stream vs. parallel stream

benchmarks …
… done on an older desktop system with:
– Intel E8500,
▪ 2 x 3.17 GHz
▪ 4 GB RAM
– Win 7
– JDK 1.8.0_05
• disclaimer: your mileage may vary
– i.e. parallel performance heavily depends on number of CPU cores

agenda
• introduction

how do sequential streams work?
• example
String[] txt = { "State", "of", "the", "Lambda",
                 "Libraries", "Edition" };
int sum = Arrays.stream(txt).filter(s -> s.length() > 3)
                            .mapToInt(s -> s.length())
                            .reduce(0, (l1, l2) -> l1 + l2);
• filter() and mapToInt() return streams
– intermediate operations
• reduce() returns int
– terminal operation
– that produces a single result from all elements of the stream

pipelined processing
[Figure: pipelined processing of the example. filter passes "State", "Lambda", "Libraries", "Edition"; mapToInt yields 5, 6, 9, 7; reduce accumulates 0, 5, 11, 20, 27. The code looks like three separate passes over the data, but in the real execution each element travels through the whole pipeline before the next one is processed.]
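The element-by-element execution can be made visible with peek(). The sketch below is an illustration only: the class name PipelineOrder and the log format are assumptions, and the pipeline is the filter / mapToInt / reduce chain over the six strings used in this talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PipelineOrder {
    /** Runs the example pipeline and records the order in which the stages see the elements. */
    public static List<String> trace() {
        List<String> log = new ArrayList<>();
        String[] txt = { "State", "of", "the", "Lambda", "Libraries", "Edition" };
        int sum = Arrays.stream(txt)
                .peek(s -> log.add("filter " + s))   // element arrives at filter
                .filter(s -> s.length() > 3)
                .peek(s -> log.add("map " + s))      // element passed the filter
                .mapToInt(String::length)
                .reduce(0, (l1, l2) -> l1 + l2);
        log.add("sum " + sum);
        return log;
    }

    public static void main(String[] args) {
        // starts with: filter State, map State, filter of, filter the, filter Lambda, map Lambda
        trace().forEach(System.out::println);
    }
}
```

"State" reaches mapToInt before "of" is even filtered, so there is a single pass over the source and no intermediate collection between the stages.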

benchmark with int-array
• int[500_000], find largest element
– for-loop:
int[] a = ints;
int e = ints.length;
int m = Integer.MIN_VALUE;
for (int i = 0; i < e; i++)
    if (a[i] > m) m = a[i];
– sequential stream:
int m = Arrays.stream(ints)
              .reduce(Integer.MIN_VALUE, Math::max);
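For readers who want to repeat the comparison of a for-loop and a sequential stream over an int array, a deliberately naive timing harness might look as follows. It is not the harness used for the numbers in this talk; the class name, the 50 warm-up and 50 measured runs, and the use of System.nanoTime() are assumptions, and a dedicated benchmark framework such as JMH gives more trustworthy results.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

/** Naive micro-benchmark sketch; the numbers it prints are rough indications only. */
public class MaxBenchmarkSketch {
    public static int maxLoop(int[] ints) {
        int m = Integer.MIN_VALUE;
        for (int i = 0; i < ints.length; i++)
            if (ints[i] > m) m = ints[i];
        return m;
    }

    public static int maxStream(int[] ints) {
        return Arrays.stream(ints).reduce(Integer.MIN_VALUE, Math::max);
    }

    public static void main(String[] args) {
        int[] ints = ThreadLocalRandom.current().ints(500_000).toArray();
        int runs = 50;
        int sink = 0;                                  // keeps the JIT from discarding the work
        for (int i = 0; i < runs; i++) {               // warm-up, so that both variants get compiled
            sink ^= maxLoop(ints) ^ maxStream(ints);
        }
        long t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) sink ^= maxLoop(ints);
        long t1 = System.nanoTime();
        for (int i = 0; i < runs; i++) sink ^= maxStream(ints);
        long t2 = System.nanoTime();
        System.out.printf("for-loop: %.2f ms, seq. stream: %.2f ms (sink=%d)%n",
                (t1 - t0) / runs / 1e6, (t2 - t1) / runs / 1e6, sink);
    }
}
```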

results
for-loop: 0.36 ms
seq. stream: 5.35 ms
• for-loop is ~15x faster
• are seq. streams always much slower than loops?
– no, this is the most extreme example
– let's see the same benchmark with an ArrayList<Integer>
▪ underlying data structure is also an array
▪ this time filled with Integer values, i.e. the boxed equivalent of int

benchmark with ArrayList<Integer>
• find largest element in an ArrayList with 500_000 elements
– for-loop:
int m = Integer.MIN_VALUE;
for (int i : myList)
    if (i > m) m = i;
– sequential stream:
int m = myList.stream()

results
ArrayList, for-loop: 6.55 ms
ArrayList, seq. stream: 8.33 ms
• for-loop still faster, but only 1.27x
• iteration for ArrayList is more expensive
– boxed elements require an additional memory access (indirection)
– which does not work well with the CPU’s memory cache
• bottom-line:
– iteration cost dominates the benchmark result
– performance advantage of the for-loop is insignificant

some thoughts
• previous situation:
– costs of iteration are relatively high, but
– costs of functionality applied to each element are relatively low
▪ after JIT-compilation:
more or less the cost of a compare-assembler-instruction
• what if we apply a more expensive functionality
to each element ?
– how will this affect the benchmark results ?

expensive functionality
• slowSin()
from Apache Commons Mathematics Library
– calculates a Taylor approximation of the sine function value
for the parameter passed to this method
– (normally) not in the public interface of the library
▪ used to calculate values for an internal table,
▪ which is used for interpolation by FastMath.sin()

benchmark with slowSin()
• int array / ArrayList with 10_000 elements
– for-loop:
int[] a = ints;
int e = a.length;
double m = Double.MIN_VALUE;
for (int i = 0; i < e; i++) {
    double d = Sine.slowSin(a[i]);
    if (d > m) m = d;
}
– sequential stream:
Arrays.stream(ints)
      .mapToDouble(Sine::slowSin)
      .reduce(Double.MIN_VALUE, Math::max);
– code for ArrayList changed accordingly

results
int[], for-loop: 11.72 ms
int[], seq. stream: 11.85 ms
ArrayList, for-loop: 11.84 ms
ArrayList, seq. stream: 11.85 ms
• for-loop is not really faster
• reason:
– applied functionality costs dominate the benchmark result
– performance advantage of the for-loop has evaporated

other aspect (without benchmark)
• today, compilers (javac + JIT) can optimize
loops better than stream code
• reasons:
– linear code (loop) vs. injected functionality (stream)
– lambdas + method references are new to Java
– loop optimization is a very mature technology
– …

for-loop vs. seq. stream / re-cap
• sequential stream can be slower or as fast as for-loop
• depends on
– costs of the iteration
– costs of the functionality applied to each element
• the higher the cost (iteration + functionality)
the closer stream performance is
to for-loop performance

agenda
• introduction
– introduction
– stateless functionality
– stateful functionality

parallel streams
• library side parallelism
– important feature
▪ you need not know anything about threads, etc.
▪ very little implementation effort, just: parallel
• performance aspect
– outperform loops, which are inherently sequential

how do parallel streams work?
• example
final int SIZE = 64;
int[] ints = new int[SIZE];
ThreadLocalRandom rand = ThreadLocalRandom.current();
for (int i = 0; i < SIZE; i++) ints[i] = rand.nextInt();
Arrays.stream(ints)
      .parallel()
      .reduce(Math::max)
      .ifPresent(m -> System.out.println("max is: " + m));
• parallel()’s functionality is based on
the fork-join framework

fork join tasks
• original task is divided into two sub-tasks
by splitting the stream source into two parts
– original task’s result is based on sub-tasks’ results
– sub-tasks are divided again … fork phase
• at a certain depth partitioning stops
– tasks at this level (leaf tasks) are executed
– execution phase
• completed sub-task results
are ‘combined’ to super-task results
– join phase
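The three phases map directly onto the fork-join API. The following hand-written RecursiveTask finds the largest element of an int array; it is a sketch of the principle, not the code that the stream library runs internally, and the class name MaxTask and the leaf size of 16 are assumptions.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Finds the largest element of data[from, to) with explicit fork, execution, and join phases. */
public class MaxTask extends RecursiveTask<Integer> {
    private static final int THRESHOLD = 16;   // leaf size, chosen arbitrarily for the sketch
    private final int[] data;
    private final int from;
    private final int to;

    public MaxTask(int[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Integer compute() {
        if (to - from <= THRESHOLD) {            // execution phase: leaf task iterates its part
            int m = Integer.MIN_VALUE;
            for (int i = from; i < to; i++) m = Math.max(m, data[i]);
            return m;
        }
        int mid = (from + to) >>> 1;             // fork phase: split the source into two parts
        MaxTask left = new MaxTask(data, from, mid);
        MaxTask right = new MaxTask(data, mid, to);
        left.fork();                             // left half may be picked up by another worker
        int rightMax = right.compute();          // right half is processed by the current thread
        int leftMax = left.join();               // join phase: combine the sub-task results
        return Math.max(leftMax, rightMax);
    }

    public static int max(int[] data) {
        return ForkJoinPool.commonPool().invoke(new MaxTask(data, 0, data.length));
    }

    public static void main(String[] args) {
        int[] ints = new int[64];
        for (int i = 0; i < ints.length; i++) ints[i] = (i * 37) % 101;
        System.out.println("max is: " + max(ints));
    }
}
```

With 64 elements and a leaf size of 16 the task tree has two split levels and four leaf tasks.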

find largest element with parallel stream
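[Figure: fork-join task tree for reduce((i,j) -> Math.max(i,j)) on 64 elements. Fork phase: task T (0_63) splits into T1 (0_31) and T2 (32_63), which split into T11 (0_15), T12 (16_31), T21 (32_47), T22 (48_63). Execution: the leaf tasks compute m0_15, m16_31, m32_47, m48_63. Join phase: T1 computes m0_31 = max(m0_15,m16_31), T2 computes m32_63 = max(m32_47,m48_63), T computes m0_63 = max(m0_31,m32_63).]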

split level
• deeper split level than shown !!!
– execution/leaf tasks: ~ 4*numberOfCores
▪ 8 tasks for a dual core CPU (only 4 in the previous diagram)
– i.e. one additional split (only 2 in the previous graphic)
• key abstractions
– java.util.Spliterator
– java.util.concurrent.ForkJoinPool.commonPool()

what is a Spliterator ?
• spliterator = splitter + iterator
• each type of stream source has its own spliterator type
– knows how to split the stream source
▪ e.g. ArrayList.ArrayListSpliterator
– knows how to iterate the stream source
▪ in execution phase
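Splitting can be observed directly on a spliterator. The small sketch below (class and method names are assumptions) splits the spliterator of a list once and reports the estimated sizes of both halves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;

public class SplitDemo {
    /** Splits the list's spliterator once; returns the estimated sizes { left part, right part }. */
    public static long[] splitSizes(List<Integer> list) {
        Spliterator<Integer> right = list.spliterator();
        Spliterator<Integer> left = right.trySplit();   // null if the source cannot be split
        long leftSize = (left == null) ? 0 : left.estimateSize();
        return new long[] { leftSize, right.estimateSize() };
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 64; i++) list.add(i);
        long[] sizes = splitSizes(list);
        System.out.println(sizes[0] + " / " + sizes[1]);   // prints 32 / 32 for an ArrayList
    }
}
```

After trySplit() the returned spliterator covers the first part of the source and the original spliterator covers the remainder; for an ArrayList this is a matter of index arithmetic.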

what is the CommonPool ?
• common pool is a singleton fork-join pool instance
– introduced with Java 8
– all parallel stream operations use the common pool
▪ so does other parallel JDK functionality (e.g. CompletableFuture)
• default: parallel execution of stream tasks uses
– (current) thread that invoked terminal operation, and
– (number of cores – 1) many threads from common pool
▪ if (number of cores) > 1
• this default configuration used for all benchmarks
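Which threads actually execute a parallel stream can be checked with a few lines of Java. The class name CommonPoolDemo and the element count are assumptions; the set of thread names differs from run to run and from machine to machine.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class CommonPoolDemo {
    /** Names of all threads that took part in executing one parallel stream. */
    public static Set<String> workerNames() {
        Set<String> names = ConcurrentHashMap.newKeySet();   // thread-safe set
        IntStream.range(0, 10_000)
                 .parallel()
                 .forEach(i -> names.add(Thread.currentThread().getName()));
        return names;
    }

    public static void main(String[] args) {
        System.out.println("cores: " + Runtime.getRuntime().availableProcessors());
        System.out.println("common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());
        System.out.println("threads used: " + workerNames());
    }
}
```

On a multi-core machine the output typically lists the invoking thread next to a number of common pool worker threads.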

parallel streams + intermediate operations
• what if the stream contains
upstream intermediate operations
when/where are these applied to the stream ?
... .parallelStream().filter(...)
.mapToInt(...)
.reduce((i,j) -> Math.max(i,j));

find largest element in parallel
[Figure: the same task tree (T; T1, T2; T11, T12, T21, T22) for filter(...).mapToInt(...).reduce((i,j) -> Math.max(i,j)); filter, mapToInt, and reduce are all applied to the elements in the execution phase of the leaf tasks]

parallel overhead …
… compared to sequential stream algorithm
• algorithm is more complicated / resource intensive
– create fork-join-task objects
▪ splitting
▪ creation of fork-join-task objects
– thread pool scheduling
– …
• plus additional GC costs
– fork-join-task objects have to be reclaimed

agenda
• introduction
– introduction

back to the first example / benchmark parallel
• find largest element, array / collection, 500_000 elements
– array, sequential / parallel stream:
int m = Arrays.stream(ints) ...
int m = Arrays.stream(ints).parallel() ...
– collection, sequential / parallel stream:
int m = myCollection.stream() ...
int m = myCollection.parallelStream() ...
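The terminal operation is cut off in the lines above. The following sketch assumes a reduce() with Math.max, as in the pipeline shown earlier, and hypothetical random test data. The timing with System.nanoTime() is only a crude illustration; a serious measurement needs warm-up and a benchmark harness.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Random;

class FindMaxSketch {

    static int maxSeq(int[] ints) {
        return Arrays.stream(ints).reduce(Integer.MIN_VALUE, (i, j) -> Math.max(i, j));
    }

    static int maxPar(int[] ints) {
        return Arrays.stream(ints).parallel().reduce(Integer.MIN_VALUE, (i, j) -> Math.max(i, j));
    }

    static int maxSeq(Collection<Integer> c) {
        return c.stream().reduce(Integer.MIN_VALUE, (i, j) -> Math.max(i, j));
    }

    static int maxPar(Collection<Integer> c) {
        return c.parallelStream().reduce(Integer.MIN_VALUE, (i, j) -> Math.max(i, j));
    }

    public static void main(String[] args) {
        int[] ints = new Random(42).ints(500_000).toArray();   // assumed test data
        List<Integer> list = new ArrayList<>();
        for (int i : ints) {
            list.add(i);
        }

        long t0 = System.nanoTime();
        int seq = maxSeq(ints);
        long t1 = System.nanoTime();
        int par = maxPar(ints);
        long t2 = System.nanoTime();

        System.out.printf("array: seq %d us, par %d us, same result: %b%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000, seq == par);
        System.out.println("list, same result: " + (maxSeq(list) == maxPar(list)));
    }
}
```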

results
source      seq.       par.       seq./par.
int-Array    5.35 ms    3.35 ms   1.60
ArrayList    8.33 ms    6.33 ms   1.32
LinkedList  12.74 ms   19.57 ms   0.65
HashSet     20.76 ms   16.01 ms   1.30
TreeSet     19.79 ms   15.49 ms   1.28

result discussion
• why is parallel LinkedList performance so bad ?
– hard to split
– needs 250_000 iterator’s next() invocations for the first split
 with ArrayList: just some index computation
• performance of the other collections is also not so great
– functionality applied to each element is not very CPU-expensive
 after JIT-compilation: cost of a compare-assembler-instruction
– iteration (element access) is relatively expensive (indirection !)
 but not CPU-expensive
– but additional CPU power is all that parallel streams provide

result discussion (cont.)
• why is parallel int-array performance relatively good ?
– iteration (element access) is not so expensive (no indirection !)

CPU-expensive functionality
• back to slowSin()
– calculates a Taylor approximation of the sine function value
for the parameter passed to this method
– CPU-bound functionality
 needs only the initial parameter from memory
 calculation based on its own (intermediate) results
– ideal to be sped up by parallel streams with multiple cores

benchmark parallel with slowSin()
• array / collection with 10_000 elements
– array:
Arrays.stream(ints) // .parallel()
.reduce(Double.MIN_VALUE, (i, j) -> Math.max(i, j));
– collection:
myCollection.stream() // .parallelStream()
.reduce(Double.MIN_VALUE, (i, j) -> Math.max(i, j));
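The mapping step is not visible in the lines above. The sketch below fills it with assumptions: slowSin() is a hypothetical stand-in (a plain Taylor series with a fixed number of terms), and the argument scaling i * 0.001 is borrowed from the distinct() benchmark. Since Double.MIN_VALUE is the smallest positive double, the sketch uses Double.NEGATIVE_INFINITY as the identity for max.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class SlowSinSketch {

    // Hypothetical stand-in for slowSin(): Taylor series of sin(x) around 0,
    // sum over k of (-1)^k * x^(2k+1) / (2k+1)!, with a fixed number of terms.
    static double slowSin(double x) {
        double term = x;   // term for k = 0
        double sum = x;
        for (int k = 1; k <= 40; k++) {
            term *= -x * x / ((2.0 * k) * (2.0 * k + 1.0));
            sum += term;
        }
        return sum;
    }

    static double maxSin(int[] ints, boolean parallel) {
        IntStream s = Arrays.stream(ints);
        if (parallel) {
            s = s.parallel();
        }
        return s.mapToDouble(i -> slowSin(i * 0.001))
                .reduce(Double.NEGATIVE_INFINITY, (i, j) -> Math.max(i, j));
    }

    static double maxSin(Collection<Integer> c, boolean parallel) {
        Stream<Integer> s = parallel ? c.parallelStream() : c.stream();
        return s.mapToDouble(i -> slowSin(i * 0.001))
                .reduce(Double.NEGATIVE_INFINITY, (i, j) -> Math.max(i, j));
    }

    public static void main(String[] args) {
        int[] ints = IntStream.range(0, 10_000).toArray();
        List<Integer> list = IntStream.range(0, 10_000).boxed().collect(Collectors.toList());
        System.out.println("array: " + maxSin(ints, false) + " / " + maxSin(ints, true));
        System.out.println("list:  " + maxSin(list, false) + " / " + maxSin(list, true));
    }
}
```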

results
source      seq.       par.       seq./par.
int-Array   10.81 ms    6.03 ms   1.79
ArrayList   10.97 ms    6.10 ms   1.80
LinkedList  11.15 ms    6.25 ms   1.78
HashSet     11.15 ms    6.15 ms   1.81
TreeSet     11.14 ms    6.30 ms   1.77

result discussion
• performance improvements for all stream sources
– by a factor of ~ 1.8
 even for LinkedList
• the ~1.8 is the maximum improvement on our platform
– the remaining 0.2 (up to the ideal factor of 2.0) are due to
 overhead of the parallel algorithm
 sequential bottlenecks (Amdahl’s law)
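Amdahl's law can be used for a rough back-of-the-envelope check. The sketch assumes two cores (inferred from 1.8 + 0.2 = 2.0) and attributes the whole shortfall to the sequential fraction, which is a simplification because part of it is parallel overhead. Class and method names are illustrative.

```java
class AmdahlSketch {

    // Amdahl's law: speedup on n cores if fraction p of the work can run in parallel.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    // Inverse: parallel fraction implied by a measured speedup on n cores (n > 1).
    static double parallelFraction(double speedup, int n) {
        return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n);
    }

    public static void main(String[] args) {
        double p = parallelFraction(1.8, 2);
        System.out.printf("implied parallel fraction: %.3f%n", p);
        System.out.printf("same fraction on 4 cores:  %.2f%n", speedup(p, 4));
    }
}
```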

sufficient size (without benchmark)
• stream source must have a sufficient size,
so that it benefits from parallel processing
• overhead increases with growing number of cores
– number of tasks ~ 4*number of cores
– (in most cases) not with the size of the stream source
• Doug Lea mentioned 10_000 for CPU-inexpensive functionality
– http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html
• 500_000 and 10_000, respectively, in our examples
– size can be smaller for CPU-expensive functionality

dynamic overclocking (without benchmark)
• a modern multi-core CPU typically increases its
clock frequency when not all of its cores are active
– Intel calls this feature Turbo Boost
• benchmark sequential versus parallel stream
– seq. test might run with a dynamically overclocked CPU
– will this also happen in the real environment or only in the test?
• no issue with our test system
– too old
– no dynamic overclocking supported

stateful functionality …
… with parallel streams / multiple threads boils down to
shared mutable state
• handling it costs performance
– e.g. lock-free CAS, requires retries in case of collision
• traditionally not supported with sequences
– functional programming languages don’t have mutable types, and
– often no parallel sequences either
• new solutions/approaches in Java 8 streams
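The CAS retry mentioned above can be sketched as follows. This is an illustration of shared mutable state with a lock-free update, not a recommended way to compute a maximum (reduce() needs no shared state at all); the class CasMaxSketch and the retry counter are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

class CasMaxSketch {

    static final AtomicInteger retries = new AtomicInteger();  // collisions observed

    // Lock-free update of a shared maximum: read, compare, compareAndSet;
    // if another thread changed the value in between, try again.
    static void updateMax(AtomicInteger max, int candidate) {
        while (true) {
            int current = max.get();
            if (candidate <= current) {
                return;                                 // nothing to update
            }
            if (max.compareAndSet(current, candidate)) {
                return;                                 // update succeeded
            }
            retries.incrementAndGet();                  // collision: retry
        }
    }

    static int maxOf(int[] values) {
        AtomicInteger max = new AtomicInteger(Integer.MIN_VALUE);
        IntStream.of(values).parallel().forEach(v -> updateMax(max, v));
        return max.get();
    }

    public static void main(String[] args) {
        int[] values = IntStream.range(0, 500_000).toArray();
        System.out.println("max = " + maxOf(values) + ", retries = " + retries.get());
    }
}
```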

stateful functionality with Java 8 streams
• intermediate stateful operations, e.g. distinct()
– see javadoc: This is a stateful intermediate operation.
– shared mutable state handled by stream implementation (JDK)
• (terminal) operations that allow stateful functional
parameters, e.g.
forEach(Consumer<? super T> action)
– see javadoc: If the action accesses shared state, it is responsible
for providing the required synchronization.
– shared mutable state handled by user/client code
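A short sketch of the forEach() case, assuming a hypothetical task (collect all even numbers below a bound into a shared list). The synchronization is supplied by the client code, here through Collections.synchronizedList.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;

class ForEachSharedStateSketch {

    // The action passed to forEach() mutates a shared list, so the client code
    // must make the access thread-safe itself.
    static List<Integer> evenNumbers(int upperBound) {
        List<Integer> result = Collections.synchronizedList(new ArrayList<>());
        IntStream.range(0, upperBound)
                .parallel()
                .filter(i -> i % 2 == 0)
                .forEach(i -> result.add(i));   // order of insertion is not defined
        return result;
    }

    public static void main(String[] args) {
        List<Integer> evens = evenNumbers(1_000);
        System.out.println("collected " + evens.size() + " elements");
    }
}
```

With a plain ArrayList instead of the synchronized wrapper, the same code may lose elements or throw an exception when run in parallel.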

stateful functionality with Java 8 streams (cont.)
• stream’s overloaded method: collect()
– shared mutable state handled by stream implementation, and
– collector functionality
 standard collectors from Collectors (JDK)
 user-defined collector functionality (JDK + user/client code)
• don’t have time to discuss all situations
– only discuss distinct()
– shared mutable state handled by stream implementation (JDK)
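For comparison, a sketch of both collect() overloads on a parallel stream, with hypothetical sample data (squares of 0 … n-1). In both cases the client code supplies only the container logic; combining the partial results of the parallel tasks is left to the stream implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class CollectSketch {

    // collect(Collector): standard collector from Collectors
    static List<Integer> squaresWithCollector(int n) {
        return IntStream.range(0, n)
                .parallel()
                .mapToObj(i -> i * i)
                .collect(Collectors.toList());
    }

    // collect(supplier, accumulator, combiner): user-supplied container functions
    static List<Integer> squaresWithFunctions(int n) {
        List<Integer> result = IntStream.range(0, n)
                .parallel()
                .mapToObj(i -> i * i)
                .collect(ArrayList::new, ArrayList::add, ArrayList::addAll);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(squaresWithCollector(8));
        System.out.println(squaresWithFunctions(8));
    }
}
```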

distinct()
• element goes to the result stream,
if it hasn’t already appeared before
– appeared before, in terms of equals()
– shared mutable state: elements already in the result stream
 have to compare the current element to each element of the output stream
• parallel introduces a barrier (algorithmic overhead)
.parallelStream().statelessOps().distinct().statelessOps().terminal();
– two alternative algorithms for the distinct() step

two algorithms for parallel distinct()
• ordering + distinct()
– normally elements go to the next stage, in the same order in which
they appear for the first time in the current stage
• javadoc from distinct()
– Removing the ordering constraint with unordered() may result in
significantly more efficient execution for distinct() in parallel
pipelines, if the semantics of your situation permit.
• two different algorithms for parallel distinct()
– one for ordered streams + one for unordered streams

benchmark with distinct()
• Integer[100_000], filled with 50_000 distinct values
• results:
seq.       par. ordered   par. unordered
6.39 ms    34.09 ms       9.1 ms
// sequential
Arrays.stream(integers).distinct().count();
// parallel ordered
Arrays.stream(integers).parallel().distinct().count();
// parallel unordered
Arrays.stream(integers).parallel().unordered().distinct().count();
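A self-contained version of the three variants for own experiments. How the array was filled in the benchmark is not shown; the pattern i % distinctValues in sampleData() is an assumption, as are the class and method names.

```java
import java.util.Arrays;

class DistinctSketch {

    // Assumed test data: 'size' elements containing 'distinctValues' different values.
    static Integer[] sampleData(int size, int distinctValues) {
        Integer[] a = new Integer[size];
        for (int i = 0; i < size; i++) {
            a[i] = i % distinctValues;
        }
        return a;
    }

    static long countSequential(Integer[] integers) {
        return Arrays.stream(integers).distinct().count();
    }

    static long countParallelOrdered(Integer[] integers) {
        return Arrays.stream(integers).parallel().distinct().count();
    }

    static long countParallelUnordered(Integer[] integers) {
        return Arrays.stream(integers).parallel().unordered().distinct().count();
    }

    public static void main(String[] args) {
        Integer[] integers = sampleData(100_000, 50_000);
        System.out.println("sequential:         " + countSequential(integers));
        System.out.println("parallel ordered:   " + countParallelOrdered(integers));
        System.out.println("parallel unordered: " + countParallelUnordered(integers));
    }
}
```

All three variants return the same count; only the order in which elements would reach a following stage, and the cost of getting there, differ.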

benchmark with distinct() + slowSin()
• Integer[10_000], filled with numbers 0 … 9999
– after the mapping 5004 distinct values
• results:
seq.       par. ordered   par. unordered
11.59 ms   6.83 ms        6.81 ms
Arrays.stream(newIntegers) //.parallel().unordered()
.map(i -> new Double(2200* Sine.slowSin(i * 0.001)).intValue())
.distinct()
.count();

sequential vs. parallel stream / re-cap
to benefit from parallel stream usage …
• … stream source …
– must have sufficient size
– should be easy to split
• … operations …
– should be CPU-expensive
– should not be stateful

advice
• benchmark on target platform !
• previous benchmark:
– find largest element, LinkedList, 500_000 elements
seq.       par.       seq./par.
12.74 ms   19.57 ms   0.65
• what if we use a quad-core-CPU (Intel i5-4590) ?
– will the parallel result be worse, better, … better than seq. … ?
seq.       par.       seq./par.
5.24 ms    4.84 ms    1.08

authors
Angelika Langer
Klaus Kreft
http://www.AngelikaLanger.com

stream performance
Q & A
