BAXTER phase 1b

PERFORMANCE
IMPROVEMENTS REPORT

Goal was:
Deliver short term tactical performance
improvements
● Fix common performance bottlenecks
● Introduce incremental architectural improvements
based on Proof-of-Concept
● => No business logic or radical architecture changes
● => Update software to incorporate modern
programming and architectural standards

Where did we start?
Version 1.9.5 running 8 PG and 8 L
This version cannot run 20PG and 15L
Mean latency: 70+ms (with long tail)
% Messages over 100ms: 49%
20 Currency pairs meant 40 processes running
JVM overload
Inefficient processor utilization – only 40%
Message latency characteristics
● Sub 10ms: 9.64%
● Sub 20ms: 31.69%
● Sub 40ms: 64.17%
● Sub 80ms: 85.34%
Long tail on distribution

Where are we now?
Version PERF running 20 PG and 15 L
Mean latency: 16ms
% Messages over 100ms: 0.029%
20 Currency pairs means 20 processes running
JVM not taxed
All processors utilized: 80%
Message latency characteristics
● Sub 10ms: 15.75%
● Sub 20ms: 77.53%
● Sub 40ms: 96.48%
● Sub 80ms: 99.89%
Compare to http://www.lmax.com/execution-performance (can’t
guarantee latency 100%)

How did we test?
Instrumented code with FIX Protocol Inter-Party Latency
LMPs, and recorded timing info
Run simulated price feed with constant and live-like rates
19 currency pairs
20 price groups (PG) and 15 layers (L) per pair for PERF
branch
8 PG and 8 L for 1.9.5.56 branch
100 spot updates/sec
20 fwd updates/sec
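
The "Sub N ms" and "% over 100ms" figures on the earlier slides can be derived from recorded per-message timings. A minimal sketch, assuming the latencies are already available as an array of milliseconds; the LatencyStats class and the sample values are hypothetical and are not taken from the test harness:

```java
import java.util.Arrays;

// Hypothetical helper: summarizes recorded message latencies (milliseconds).
final class LatencyStats {

    // Percentage of samples strictly below thresholdMs ("Sub N ms").
    static double percentBelow(long[] latenciesMs, long thresholdMs) {
        if (latenciesMs.length == 0) return 0.0;
        long count = Arrays.stream(latenciesMs).filter(l -> l < thresholdMs).count();
        return 100.0 * count / latenciesMs.length;
    }

    // Percentage of samples strictly above thresholdMs ("% over 100ms").
    static double percentOver(long[] latenciesMs, long thresholdMs) {
        if (latenciesMs.length == 0) return 0.0;
        long count = Arrays.stream(latenciesMs).filter(l -> l > thresholdMs).count();
        return 100.0 * count / latenciesMs.length;
    }

    // Arithmetic mean of the samples, or 0.0 when there are none.
    static double mean(long[] latenciesMs) {
        return Arrays.stream(latenciesMs).average().orElse(0.0);
    }

    public static void main(String[] args) {
        long[] samples = {4, 9, 12, 18, 25, 37, 60, 120}; // made-up data
        for (long t : new long[] {10, 20, 40, 80}) {
            System.out.printf("Sub %dms: %.2f%%%n", t, percentBelow(samples, t));
        }
        System.out.printf("Over 100ms: %.2f%%%n", percentOver(samples, 100));
        System.out.printf("Mean: %.2fms%n", mean(samples));
    }
}
```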

Distribution of throughput (msgs/s)

Performance Improvements
Common Improvements
● Eliminate sources of latency common to many
applications
● While some may have seemed trivial, they had
significant impact
Improvements based on the PoC
● Apply PoC architecture principles in key areas where
latency was measured
● Only tactical changes, not strategic
● Required careful measurements: bottlenecks turned out
to be in different places than previously thought

Common Performance Bottlenecks:
Price Object Marshalling
Replaced object marshalling
● Significant source of latency, large message sizes, and
garbage (object) creation
● Serialize-Deserialize cycle was performed at least three
times for every price
● Previously based on JDK serialization, replaced with
custom code
● Removed one cycle (more on that later)
Optimized the Price Object data structure
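
To illustrate the kind of change described above, here is a minimal sketch of hand-written marshalling into a fixed binary layout. The Price fields, the 6-character pair code, and the PriceCodec name are assumptions for illustration; the actual Price Object structure is not shown in this report.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical fixed-layout codec; field names and sizes are assumptions.
final class PriceCodec {
    static final int PAIR_LENGTH = 6;                         // e.g. "EURUSD"
    static final int ENCODED_SIZE = PAIR_LENGTH + 8 + 8 + 8;  // 30 bytes per price

    static final class Price {
        String pair;
        double bid;
        double ask;
        long timestampNanos;
    }

    // Writes the price at the buffer's current position; no class metadata is emitted.
    static void encode(Price p, ByteBuffer out) {
        byte[] pair = p.pair.getBytes(StandardCharsets.US_ASCII);
        if (pair.length != PAIR_LENGTH) {
            throw new IllegalArgumentException("pair must be " + PAIR_LENGTH + " ASCII chars");
        }
        out.put(pair).putDouble(p.bid).putDouble(p.ask).putLong(p.timestampNanos);
    }

    // Reads into a caller-supplied Price so the Price itself is not reallocated.
    // The pair String is still allocated here; a real codec might intern or index it.
    static Price decode(ByteBuffer in, Price reuse) {
        byte[] pair = new byte[PAIR_LENGTH];
        in.get(pair);
        reuse.pair = new String(pair, StandardCharsets.US_ASCII);
        reuse.bid = in.getDouble();
        reuse.ask = in.getDouble();
        reuse.timestampNanos = in.getLong();
        return reuse;
    }
}
```

A fixed layout like this keeps the message size constant and small, whereas JDK serialization also writes class descriptors into the stream.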

Common Performance Bottlenecks:
Logging
Price Engine logging levels were insane
● INFO level logging was performed over 1,500 times
● Most INFO level logging was redundant
Significant performance bottlenecks
● Disk writes, thread contention, object creation (GC)
● Logs could grow to GB size in minutes
Removed all but necessary logging
● Logs will need further work short term...
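
The report says redundant logging was removed outright. Where a per-price message is still wanted for diagnostics, one common alternative is to demote it below INFO and build the message lazily, so that it costs almost nothing when disabled. A minimal sketch using java.util.logging; the logger name, the QuietLogging class, and the messagesBuilt counter are hypothetical:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical example of lazy, level-guarded logging on a hot path.
final class QuietLogging {
    static final Logger LOG = Logger.getLogger("priceengine.sketch"); // assumed name
    static final AtomicInteger messagesBuilt = new AtomicInteger();   // for illustration only

    // Building the message allocates Strings; count how often it really happens.
    static String describe(String pair, double bid, double ask) {
        messagesBuilt.incrementAndGet();
        return pair + " " + bid + "/" + ask;
    }

    // The Supplier is only invoked if FINE is enabled for this logger.
    static void onPrice(String pair, double bid, double ask) {
        LOG.fine(() -> describe(pair, bid, ask));
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // typical production level
        for (int i = 0; i < 1_000; i++) {
            onPrice("EURUSD", 1.0842, 1.0844);
        }
        System.out.println("messages built: " + messagesBuilt.get());
    }
}
```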

Code Review and Optimization
All PE code was reviewed for efficiency
● Re-work (tactical) but not re-write (strategic)
Improvements
● Timer scheduling replaced with a more efficient
approach
● Replace synchronization locks with CAS operations when
possible to reduce contention
● Replace inefficient cache access
● Numerous code tweaks
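
As an illustration of replacing a synchronization lock with a CAS operation, the sketch below tracks a running maximum without a synchronized block. The MaxTracker class is a hypothetical example and is not taken from the PE code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example: lock-free "remember the largest value seen".
final class MaxTracker {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Retry the compare-and-set until it succeeds, or until another thread
    // has already stored a value at least as large as ours.
    void record(long value) {
        long current = max.get();
        while (value > current && !max.compareAndSet(current, value)) {
            current = max.get();
        }
    }

    long get() {
        return max.get();
    }
}
```

Under contention a thread that loses the race simply retries instead of blocking, which avoids parking threads and the context switches that come with it.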

PoC Architectural Principles
Only distribute components when absolutely
necessary
● Challenge the myth that distributed components improve
throughput and latency
Parallelism (threads) may dramatically slow a
system down
● Contrary to old conventional wisdom
● Mechanical Sympathy has challenged this assumption
● Data contention, context switching often leads to data
duplication and GC
● A lot can be done in a single thread

Reduced Parallelism
Significant contention was eliminated in the Broadcast module
Excessive use of “in memory” producer/consumer queues
● Price objects put on queues for margining, forward calculation, and plugin delivery
● Multiple worker consumer threads pull from those queues and process prices
Queues written using synchronization primitives
● Very inefficient
● Contention between producers and consumers (put and take operations)
● A large number of worker threads led to context switching
Queues were replaced with a highly efficient lock-free buffer
● Uses CAS operations instead of synchronization to dramatically reduce contention
● Only one consumer thread to reduce context switching
We attempted to eliminate buffers and queues altogether
● Make processing synchronous (and therefore remove contention)
● Turned out to be higher latency than using the lock-free buffers
– Likely because business logic is not optimised
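
The general shape of a lock-free buffer with a single consumer can be sketched as a ring buffer indexed by sequence counters. This is a simplified single-producer, single-consumer variant that needs only volatile reads and writes. The buffer described above uses CAS, which is what multiple producers would need in order to claim slots safely. The SpscRingBuffer class is hypothetical and is not the implementation used in Broadcast.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical single-producer / single-consumer ring buffer (no locks).
final class SpscRingBuffer<T> {
    private final Object[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read; written only by the consumer
    private final AtomicLong tail = new AtomicLong(); // next slot to write; written only by the producer

    SpscRingBuffer(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        slots = new Object[capacity];
        mask = capacity - 1;
    }

    // Producer side. Returns false instead of blocking when the buffer is full.
    boolean offer(T item) {
        long t = tail.get();
        if (t - head.get() == slots.length) {
            return false;
        }
        slots[(int) (t & mask)] = item;
        tail.set(t + 1); // volatile write publishes the slot to the consumer
        return true;
    }

    // Consumer side. Returns null when the buffer is empty.
    @SuppressWarnings("unchecked")
    T poll() {
        long h = head.get();
        if (h == tail.get()) {
            return null;
        }
        int index = (int) (h & mask);
        T item = (T) slots[index];
        slots[index] = null; // release the reference for GC
        head.set(h + 1);     // volatile write frees the slot for the producer
        return item;
    }
}
```

Because each counter has exactly one writer, put and take never contend on a shared lock, and the single consumer thread avoids the context switching caused by a pool of workers.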

Reduced Distribution
Everyone thought the bottleneck was Broadcast
● It turned out that bottlenecks existed in Broadcast, but there were
other equally significant sources of latency...
...Validator and TW
● One Validator process and one TW process per currency pair --> 20
currency pairs = 40 processes!
● Context switching, JMS latency, serialization overhead
Combined Validator and TW into a single process
● Halved the number of processes
● Removed one serialization cycle
● Greatly simplified system management
