Low latency & Mechanical Sympathy:
Issues and solutions
Jean-Philippe BEMPEL @jpbempel
Performance Architect http://jpbempel.blogspot.com
© ULLINK 2016
Low latency order router
2
• pure Java SE application
• FIX protocol / Direct Market Access connectivity (TCP)
• transforms orders into an intermediate form (UL Message)
Performance target
3
process & route orders in
< 100 µs
Low latency
4
• Processing in under 1 ms
• No (disk) I/O involved
• Memory is the new disk
• Hardware architecture has a major impact on performance
• Mechanical Sympathy
Agenda
5
• GC pauses
• Power Management
• NUMA
• Cache pollution
GC pauses
6
Generations
7
GC pauses
8
• Minor GC & Full GC are Stop the World (STW)
  => all application threads are stopped
• Minor GC pauses: 30-100 ms
• Full GC pauses: a couple of seconds to minutes
Full GC
9
• None allowed during trading hours
• Outside of trading hours, a maintenance window is allowed
• Java heap sized accordingly to avoid Full GC
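As an illustration (the sizes here are hypothetical, not taken from the deck), the heap can be fixed and sized so that the old generation absorbs a full day of promotions, pushing any Full GC into the maintenance window:

java -Xms8g -Xmx8g ...

Setting -Xms equal to -Xmx also avoids pauses caused by heap resizing.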
Minor GC
10
Some tricks to reduce both:
• frequency
• pause time
Minor GC frequency
11
• Increase the young generation (512 MB)
• But increasing the young gen can also increase pause time
• A larger young gen increases the likelihood that objects die before being copied
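The 512 MB young generation above maps to standard HotSpot flags; an illustrative command line (not the deck's exact one):

java -Xmn512m ...

-Xmn512m is shorthand for -XX:NewSize=512m -XX:MaxNewSize=512m.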
Minor GC pause time
12
Pause time factors:
• refs traversing
• card scanning
• object copy (to survivor or tenured space)
Card Scanning
13
[Diagram: GC roots point into the young gen and the old gen; the old gen is divided into 512-byte cards tracked by a card table (C = clean, D = dirty), so during a minor GC only dirty cards need scanning for old-to-young references.]
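As a tuning suggestion not found in the deck: on ParNew/CMS the card-scanning work can be split into larger strides per GC thread with a diagnostic HotSpot flag, which is often reported to reduce minor pause time on large heaps:

java -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768 ...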
Object Copy
14
• Need to keep orders during app lifetime
• Orders are relatively large objects
• Orders will be promoted to the old generation
  => copied to the survivor spaces and then to the old gen
• This significantly increases minor GC pause time
Object Copy
15
• Access pattern: 90% of the time an order is accessed immediately after creation
• Weak references: avoid copying order objects (see the sketch below)
• A persistence cache allows reloading orders without too much impact
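A minimal sketch of the weak-reference idea (the Order and OrderStore types and all names here are hypothetical, not ULLINK's code): the cache holds orders only weakly, so the GC is free to reclaim them instead of copying them through the survivor spaces, and a cache miss falls back to the persistence cache:

import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;

interface Order { String id(); }                     // hypothetical order type
interface OrderStore { Order load(String id); }      // hypothetical persistence cache

class OrderCache {
    private final ConcurrentHashMap<String, WeakReference<Order>> cache =
            new ConcurrentHashMap<>();
    private final OrderStore store;

    OrderCache(OrderStore store) { this.store = store; }

    void put(Order order) {
        // the cache never keeps an order strongly reachable: once the app
        // drops its own reference, the GC can reclaim the object
        cache.put(order.id(), new WeakReference<>(order));
    }

    Order get(String id) {
        WeakReference<Order> ref = cache.get(id);
        Order order = (ref != null) ? ref.get() : null;
        if (order == null) {
            order = store.load(id);                  // reload from persistence
            cache.put(id, new WeakReference<>(order));
        }
        return order;
    }
}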
Allocation
16
• Allocation should be controlled carefully
• This reduces the frequency of minor GCs
• The JDK API is not always friendly with GC/allocation
Allocation
17
Some parts need to be rewritten, e.g. the int -> char[] conversion:

int intToString(int i, char[] buffer, int offset)

avoids the intermediate/temporary char[] (see the sketch below)
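A sketch of what such a rewrite can look like (illustrative only, not ULLINK's actual implementation): digits are written directly into a caller-supplied buffer, so no temporary char[] or String is allocated:

static int intToString(int i, char[] buffer, int offset) {
    int pos = offset;
    long v = i;                            // widen to long: handles Integer.MIN_VALUE
    if (v < 0) { buffer[pos++] = '-'; v = -v; }
    int digitsStart = pos;
    do {                                   // emit digits, least significant first
        buffer[pos++] = (char) ('0' + (v % 10));
        v /= 10;
    } while (v > 0);
    for (int a = digitsStart, b = pos - 1; a < b; a++, b--) {
        char t = buffer[a]; buffer[a] = buffer[b]; buffer[b] = t;   // reverse digits
    }
    return pos - offset;                   // number of chars written
}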
Allocation
18
Usage of empty collections & pre-sizing:

Collection<String> process() {
    // Collections.emptyList() returns a shared immutable singleton,
    // so nothing is allocated when there is nothing to return
    Collection<String> results = Collections.emptyList();
    for (String s : list) {
        // allocate lazily, pre-sized, on the first element only
        if (results == Collections.<String>emptyList())
            results = new ArrayList<>(list.size());
        results.add(s);
    }
    return results;
}
CMS
19
CMS
20
• Promotion allocation uses free lists (like malloc)
• No compaction of the old gen
• Increases minor GC pause time
[Diagram: old gen fragmentation]
CMS
21
Full GC fallback is unpredictable (fragmentation, promotion failure, …): a promotion failure forces a fallback to a serial, compacting full collection, here a 56-second pause:

176710.366: [GC 176710.366: [ParNew (promotion failed): 471872K->455002K(471872K), 2.4424250 secs]176712.809: [CMS: 6998032K->5328833K(7864320K), 53.8558640 secs] 7457624K->5328833K(8336192K), [CMS Perm : 199377K->198212K(524288K)], 56.2987730 secs] [Times: user=56.60 sys=0.78, real=56.29 secs]

Same problem for G1
Zing JVM from Azul?
22
• We are partners
• Best GC algorithm in the world
• Our tests show a maximum pause time of 2 ms
• Some overhead compared to HotSpot (8%)
• “False issue” syndrome
Power Management
23
Why power management matters?
24
• A few orders per day; the system is idle most of the time
Power management technologies
25
• CPUs embed technologies to reduce power usage
• Frequency scaling during execution (P-states)
  P0 = nominal frequency
• Deep sleep modes during idle phases (C-states)
  C0 = running, C1 = idle
• Latency to wake up:
  cat /sys/devices/system/cpu/cpu0/cpuidle/state3/latency
  200 (us)
BIOS Settings
26
BIOS settings impact
27
BIOS settings impact
28
BIOS settings impact
29
BIOS settings impact
30
Cores reduction / fixed Turbo Boost
31
• Turbo Boost is dynamic frequency scaling
• Depends on the thermal envelope
• Reducing the number of active cores reduces the thermal envelope => higher frequency
• Some vendors can fix this higher frequency
• Example: Intel E5 v3, 14 cores => 2 cores: 2.6 GHz => 3.6 GHz
Power management at OS level
32
• Some OS drivers are also very aggressive:
  # cat /sys/devices/system/cpu/cpuidle/current_driver
  intel_idle
• Disable it with a kernel boot parameter:
  intel_idle.max_cstate=0
With intel_idle driver, 1 msg/s
33
Without intel_idle driver, 1 msg/s
34
With intel_idle driver, 100 msg/s
35
Without intel_idle driver, 100 msg/s
36
NUMA
37
Uniform Memory Access?
38
[Diagram: a single CPU directly attached to its DRAM]
Uniform Memory Access?
39
[Diagram: two CPUs sharing the same DRAM: access cost is still uniform]
NUMA
40
• Non-Uniform Memory Access
• Starts from 2 CPUs (sockets)
• 1 memory controller per socket (node)
• Local memory vs remote memory
NUMA architecture
41
[Diagram: Socket 0 and Socket 1, each with its own memory controller (MC) and DRAM, connected by QPI links; a local DRAM access costs ~65 ns, and the QPI hop adds ~40 ns for remote access.]
How to deal with NUMA?
42
• Avoid remote memory access
• Avoid scheduling and allocating on different nodes
• Bind the process to one CPU (node) only (numactl)
  Restrictive but effective; binding on node 0:
  numactl --cpunodebind=0 --membind=0 java …
• BIOS: node interleaving disabled
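A related HotSpot option (a side note, not from the deck): -XX:+UseNUMA enables a NUMA-aware young generation, mainly with the throughput (Parallel) collector. With a strict single-node binding as above it is largely redundant, but it can help when the JVM does span nodes:

java -XX:+UseNUMA ...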
Cache pollution
43
Cache hierarchy architecture
44
[Diagram: Socket 0 with Core0 and Core1, each with private L1 and L2 caches, sharing a single L3 cache in front of DRAM.]
Cache pollution on L3
45
• The CPU's L3 cache is shared across cores
• Non-critical threads can “pollute” the cache:
  • eviction of data required by critical threads
  • => adds latency to re-access that data
• Need to isolate critical threads from non-critical ones
Thread affinity
46
• OS scheduling alone cannot provide this isolation
• Threads need to be pinned to cores manually
• Thread affinity allows us to do this (see the sketch below)
• Identify the critical and non-critical threads in the application
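A minimal sketch using the Java Thread Affinity library listed in the references (package name as in recent OpenHFT releases, so the exact API may vary by version; runCriticalLoop() is a hypothetical stand-in for the latency-critical loop):

import net.openhft.affinity.AffinityLock;

// pin the current (critical) thread to a dedicated CPU for its lifetime
AffinityLock lock = AffinityLock.acquireLock();
try {
    runCriticalLoop();   // hypothetical: the latency-critical processing loop
} finally {
    lock.release();      // give the CPU back when the thread is done
}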
No thread affinity
47
With thread affinity
48
System processes & cache pollution
49
• Other processes can also pollute the L3 cache
• Need to isolate all other processes away from critical threads
• A whole CPU (socket) needs to be dedicated
Isolating cores
50
• No other process should be scheduled on the same socket (isolcpus); see the example below
• But with more than one thread per core, cpuset is needed
[Diagram: two sockets (Socket 0: Core0/2/4/6, Socket 1: Core1/3/5/7); the scheduler is restricted to one socket, leaving the other to the critical process.]
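An illustrative setup (core numbers are hypothetical): remove the critical cores from the default scheduler at boot, then start the JVM on them explicitly:

# kernel boot parameter: the scheduler leaves cores 1,3,5,7 alone
isolcpus=1,3,5,7

# start the JVM on the isolated cores
taskset -c 1,3,5,7 java ...

Note that isolcpus only keeps the scheduler away; the JVM's threads still need thread affinity (previous section) to be laid out across those cores.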
Recommendations
51
• Make GC pauses predictable
• Tune the BIOS/OS for maximum performance
• Control NUMA effects by binding to one node
• Avoid cache pollution with thread affinity and CPU isolation
Conclusion
52
• You can optimize a process without changing your code
• Know your hardware and your OS
• This is Mechanical Sympathy
References
53
• Mechanical Sympathy blog: http://mechanical-sympathy.blogspot.com/
• Mechanical Sympathy forum: https://groups.google.com/forum/#!forum/mechanical-sympathy
• Java Thread Affinity library: https://github.com/peter-lawrey/Java-Thread-Affinity
Q&A
Jean-Philippe BEMPEL @jpbempel
Performance Architect http://jpbempel.blogspot.com
