BAXTER phase 1b

Performance improvement drive, tactical phase.

1. PERFORMANCE IMPROVEMENTS REPORT
2. Goal was: Deliver short-term tactical performance improvements
   ● Fix common performance bottlenecks
   ● Introduce incremental architectural improvements based on the Proof-of-Concept
   ● => No business logic or radical architecture changes
   ● => Update software to incorporate modern programming and architectural standards
3. Where did we start?
   Version 1.9.5 running 8 PG and 8 L
   This version cannot run 20 PG and 15 L
   Mean latency: 70+ ms (with long tail)
   % Messages over 100 ms: 49%
   20 currency pairs meant 40 processes running
   JVM overload
   Inefficient processor utilization – only 40%
   Message latency characteristics
   ● Sub 10 ms: 9.64%
   ● Sub 20 ms: 31.69%
   ● Sub 40 ms: 64.17%
   ● Sub 80 ms: 85.34%
   Long tail on distribution
4. Where are we now?
   Version PERF running 20 PG and 15 L
   Mean latency: 16 ms
   % Messages over 100 ms: 0.029%
   20 currency pairs means 20 processes running
   JVM not taxed
   All processors utilized: 80%
   Message latency characteristics
   ● Sub 10 ms: 15.75%
   ● Sub 20 ms: 77.53%
   ● Sub 40 ms: 96.48%
   ● Sub 80 ms: 99.89%
   Compare to http://www.lmax.com/execution-performance (can't guarantee latency 100%)
5. How did we test?
   Instrumented code with FIX Protocol Inter Party Latency LMPs, and recorded timing info (see the sketch below)
   Ran a simulated price feed with constant and live-like rates
   19 currency pairs
   20 price groups (PG) and 15 layers (L) per pair for the PERF branch
   8 PG and 8 L for the 1.9.5.56 branch
   100 spot updates/sec
   20 fwd updates/sec
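The deck doesn't show the instrumentation code itself. As a rough sketch only, assuming timestamps captured at the latency measurement points, the cumulative distributions reported on slides 3 and 4 (sub-10 ms, sub-20 ms, ...) could be produced with bucketing along these lines; all names here are hypothetical:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical sketch: bucket one-way message latencies the way the
 *  deck reports them (sub-10 ms, sub-20 ms, sub-40 ms, sub-80 ms, 100 ms+). */
final class LatencyHistogram {
    private static final long[] BOUNDS_MS = {10, 20, 40, 80, 100};
    // One bucket per bound, plus a final bucket for >= 100 ms.
    private final AtomicLongArray counts = new AtomicLongArray(BOUNDS_MS.length + 1);
    private final AtomicLong total = new AtomicLong();

    /** Call once per message with the send/receive timestamps recorded at the LMPs. */
    void record(long sentNanos, long receivedNanos) {
        long ms = (receivedNanos - sentNanos) / 1_000_000;
        int i = 0;
        while (i < BOUNDS_MS.length && ms >= BOUNDS_MS[i]) i++; // find the bucket
        counts.incrementAndGet(i);
        total.incrementAndGet();
    }

    /** Cumulative percentage of messages under the given bound, e.g. pctUnder(10). */
    double pctUnder(long boundMs) {
        long sum = 0;
        for (int i = 0; i < BOUNDS_MS.length && BOUNDS_MS[i] <= boundMs; i++) {
            sum += counts.get(i);
        }
        return 100.0 * sum / Math.max(1, total.get());
    }
}
```

The "% messages over 100 ms" figure would then simply be 100 minus pctUnder(100).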
6. Latency Distribution (ms) [chart]
7. Distribution of throughput (msgs/s) [chart]
8. Performance Improvements
   Common Improvements
   ● Eliminate sources of latency common to many applications
   ● While some may have seemed trivial, they had significant impact
   Improvements based on the PoC
   ● Apply PoC architecture principles in key areas where latency was measured
   ● Only tactical changes, not strategic
   ● Required careful measurements: bottlenecks turned out to be in different places than previously thought
9. Common Performance Bottlenecks: Price Object Marshalling
   Replaced object marshalling
   ● Significant source of latency, large message sizes, and garbage (object) creation
   ● Serialize-deserialize cycle was performed at least three times for every price
   ● Previously based on JDK serialization, replaced with custom code (sketched below)
   ● Removed one cycle (more on that later)
   Optimized the Price Object data structure
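The deck doesn't include the replacement marshalling code. As a minimal illustration of the technique, here is a sketch of writing fields directly into a ByteBuffer instead of going through JDK ObjectOutputStream; the class and field names (Price, instrumentId, bidMicros, ...) are hypothetical:

```java
import java.nio.ByteBuffer;

/** Illustrative sketch only: the real Price object and wire format are not
 *  shown in the deck. Writing fixed-width fields straight into a buffer
 *  avoids reflection, per-message class-descriptor headers, and the
 *  intermediate objects JDK serialization allocates. */
final class Price {
    long instrumentId;
    long bidMicros;      // fixed-point representation avoids boxing
    long askMicros;
    long timestampNanos;

    void writeTo(ByteBuffer buf) {
        buf.putLong(instrumentId)
           .putLong(bidMicros)
           .putLong(askMicros)
           .putLong(timestampNanos);
    }

    static Price readFrom(ByteBuffer buf) {
        Price p = new Price();
        p.instrumentId = buf.getLong();
        p.bidMicros = buf.getLong();
        p.askMicros = buf.getLong();
        p.timestampNanos = buf.getLong();
        return p;
    }
}
```

Because the layout is fixed, the message size is 32 bytes flat, versus the much larger and variable output of JDK serialization for an equivalent object graph.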
10. Common Performance Bottlenecks: Logging
   Price Engine logging levels were insane
   ● INFO level logging was performed over 1,500 times
   ● Most INFO level logging was redundant
   Significant performance bottlenecks
   ● Disk writes, thread contention, object creation (GC)
   ● Logs could grow to GB size in minutes
   Removed all but necessary logging (see the pattern sketched below)
   ● Logs will need further work short term...
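The deck doesn't show the engine's logging API. Assuming an SLF4J-style logger (an assumption, not the actual PE code), the hot-path pattern looks like this: demote chatter below INFO and make sure messages are never built unless the level is enabled, so logging stops allocating strings on every price:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Sketch of the guarded-logging pattern; PricePublisher is hypothetical. */
final class PricePublisher {
    private static final Logger log = LoggerFactory.getLogger(PricePublisher.class);

    void publish(Price price) {
        // Parameterized form: the message string is only built if DEBUG is on.
        log.debug("publishing price for instrument {}", price.instrumentId);

        // Explicit guard for anything expensive to compute.
        if (log.isTraceEnabled()) {
            log.trace("full price state: {}", dumpState(price));
        }
        // ... actual publish path, kept free of INFO-level chatter
    }

    private String dumpState(Price p) { return p.toString(); }
}
```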
11. Code Review and Optimization
   All PE code was reviewed for efficiency
   ● Re-work (tactical) but not re-write (strategic)
   Improvements
   ● Timer scheduling replaced with a more efficient approach
   ● Synchronization locks replaced with CAS operations where possible, to reduce contention (sketched below)
   ● Inefficient cache access replaced
   ● Numerous code tweaks
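The deck doesn't show the PE code in question. As a minimal, hypothetical illustration of the lock-to-CAS swap, here is a counter built on java.util.concurrent.atomic, with the compare-and-set retry loop spelled out (AtomicLong.incrementAndGet would do the same thing internally):

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical counter, not actual PE code. A synchronized block forces
 *  threads to queue and park; a CAS retry loop lets the losing thread
 *  simply try again with the fresh value, never blocking. */
final class SequenceCounter {
    private final AtomicLong value = new AtomicLong(0);

    /** Lock-free increment-and-get via compare-and-set. */
    long next() {
        for (;;) {
            long current = value.get();
            long updated = current + 1;
            if (value.compareAndSet(current, updated)) {
                return updated;  // CAS won; no lock was ever taken
            }
            // CAS lost to another thread: loop and retry
        }
    }
}
```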
12. PoC Architectural Principles
   Only distribute components when absolutely necessary
   ● Challenge the myth that distributed components improve throughput and latency
   Parallelism (threads) may dramatically slow a system down
   ● Contrary to old conventional wisdom
   ● Mechanical Sympathy has challenged this assumption
   ● Data contention and context switching often lead to data duplication and GC
   ● A lot can be done in a single thread
13. Reduced Parallelism
   Significant contention was eliminated in the Broadcast module
   Excessive use of "in memory" producer/consumer queues
   ● Price objects were put on queues for margining, forward calculation, and plugin delivery
   ● Multiple worker consumer threads pulled from those queues and processed prices
   Queues were written using synchronization primitives
   ● Very inefficient
   ● Contention between producers and consumers (put and take operations)
   ● Large number of worker threads led to context switching
   Queues were replaced with a highly efficient lock-free buffer (sketched below)
   ● Uses CAS operations instead of synchronization to dramatically reduce contention
   ● Only one consumer thread, to reduce context switching
   We attempted to eliminate buffers and queues altogether
   ● Make processing synchronous (and therefore remove contention)
   ● Turned out to have higher latency than using the lock-free buffers
   – Likely because the business logic is not optimised
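The slide names the technique but not the implementation (the description closely matches the ring-buffer pattern behind the LMAX link on slide 4). A minimal sketch of a multi-producer, single-consumer lock-free buffer, with all names hypothetical:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

/** Hypothetical sketch, not the Broadcast module code. Producers claim
 *  slots with CAS instead of taking a lock; a single consumer drains the
 *  buffer, so the consumer side needs no CAS at all. */
final class LockFreeBuffer<T> {
    private final AtomicReferenceArray<T> slots;
    private final int mask;                            // capacity must be a power of two
    private final AtomicLong tail = new AtomicLong(0); // next slot producers claim
    private final AtomicLong head = new AtomicLong(0); // next slot the consumer reads

    LockFreeBuffer(int capacityPowerOfTwo) {
        if (Integer.bitCount(capacityPowerOfTwo) != 1)
            throw new IllegalArgumentException("capacity must be a power of two");
        slots = new AtomicReferenceArray<>(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
    }

    /** Producer side: claim a slot with CAS; never blocks, returns false when full. */
    boolean offer(T value) {
        for (;;) {
            long t = tail.get();
            if (t - head.get() >= slots.length()) return false; // buffer full
            if (tail.compareAndSet(t, t + 1)) {
                slots.set((int) (t & mask), value); // publish after winning the CAS
                return true;
            }
            // lost the CAS race: retry with the fresh tail
        }
    }

    /** Single consumer thread: plain ordered reads/writes, no contention. */
    T poll() {
        long h = head.get();
        int idx = (int) (h & mask);
        T v = slots.get(idx);
        if (v == null) return null; // empty, or producer hasn't published yet
        slots.set(idx, null);       // free the slot for reuse
        head.lazySet(h + 1);        // make the freed slot visible to producers
        return v;
    }
}
```

The put/take contention of a synchronized queue disappears: producers contend only on a single CAS, and the lone consumer thread never contends with anyone.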
14. Reduced Distribution
   Everyone thought the bottleneck was Broadcast
   ● It turned out that bottlenecks existed in Broadcast, but there were other equally significant sources of latency...
   ...Validator and TW
   ● One Validator process and one TW process per currency pair --> 20 currency pairs = 40 processes!
   ● Context switching, JMS latency, serialization overhead
   Combined Validator and TW into a single process
   ● Halved the number of processes
   ● Removed one serialization cycle
   ● Greatly simplified system management
