Millions of Quotes per Second in Pure Java

  1. Millions Quotes Per Second. A story of a pure Java market data vendor. © 2013, Roman Elizarov, Devexperts
  2. Market Data Rates [chart: messages per second, scale 0 to 10,000,000; series: US Equities, Indexes and Futures; OPRA]
  3. Market Data Vendor
     • Process data coming from exchange data feeds
       - Parse
       - Normalize
     • Distribute data to customers
       - Gather into a single feed
       - Store and retrieve (for on-demand historical requests)
       - Serialize and transfer
       - Scatter to multiple consumers based on actual subscription
  4. dxFeed High-Level Picture [diagram: Chicago ticker plant (CME, CBOT, NYMEX, COMEX, ICE Futures U.S., CBOE, TSX, TSXV, MX) and New York ticker plant (NYSE, AMEX, NASDAQ, ISE, OPRA, FINRA, PinkSheets), joined by a 10Gbit resilient redundant connectivity infrastructure; customer connection points via direct cross-connect, SFTI, TNS, SAVVIS, BT Radianz, and the Internet]
  5. A Bit of History
     • Devexperts was founded in 2002
       - as an upscale financial IT company
     • The QDS project was born in 2003
       - to address the market data distribution problem
       - in a high-performance way (the initial design goal was 1M mps)
     • The dxFeed service was launched in 2008
       - to provide our customers with live market data directly from exchanges, using QDS for distribution
     • The dxFeed API was created on top of QDS in 2009
       - to provide an easier customer-facing API and enable 3rd-party developers to integrate their code with dxFeed
  6. [word cloud: Threads, Portability, Community, Developers, Garbage Collection, Libraries and frameworks, Backwards compatibility, Refactoring, Type Safety, Open source, Memory model, Reflection, Productivity, Tools, Readability, HotSpot JIT, Byte-code manipulation, Simplicity, The most popular language]
  7. * Applies to any language
  8. Java Object Layout [diagram: a String[] filled with strings; the array object (header, size, element references) points to separate String objects (header, value, hash), each of which in turn points to its own char[] holding 'T', 'E', 'S', 'T' — the data is scattered across many heap objects]
  9. Memory Layout Solution
     • Prefer array-based data structures to linked ones
       - Most Java programs get an immediate performance boost by replacing LinkedList with ArrayList
     • Use Java arrays or ByteBuffer classes where it matters
       - They are guaranteed to be contiguous in memory
       - Lay out your data into arrays manually
     • That's how the QDS core is designed
       - All its critical data structures are rolled onto int[] and Object[]
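The manual-layout idea above can be sketched as parallel primitive arrays. This is an illustration of the technique, not actual QDS code; the class and field names are invented for the example.

```java
// Sketch (not actual QDS code): storing quotes in parallel int[] arrays
// instead of an array of Quote objects. Each field lives in one contiguous
// block of memory, which is cache-friendly and cheap for the GC to scan.
public class QuoteTable {
    private final int[] bidPrice;   // prices in fixed-point cents
    private final int[] askPrice;
    private final int[] bidSize;
    private final int[] askSize;

    public QuoteTable(int capacity) {
        bidPrice = new int[capacity];
        askPrice = new int[capacity];
        bidSize = new int[capacity];
        askSize = new int[capacity];
    }

    public void set(int index, int bid, int ask, int bSize, int aSize) {
        bidPrice[index] = bid;
        askPrice[index] = ask;
        bidSize[index] = bSize;
        askSize[index] = aSize;
    }

    public int spread(int index) {  // ask - bid, no object dereferences
        return askPrice[index] - bidPrice[index];
    }

    public static void main(String[] args) {
        QuoteTable t = new QuoteTable(1000);
        t.set(0, 10050, 10055, 100, 200); // $100.50 bid / $100.55 ask
        System.out.println("spread = " + t.spread(0)); // prints: spread = 5
    }
}
```

Compared to a Quote[] of small objects, this layout avoids one heap object and one pointer indirection per quote.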
  10. byte[] vs ByteBuffer
     • byte[] is always heap-based
       - Faster for byte-oriented access
     • ByteBuffer can be both "heap" and "direct"
       - Be especially careful with direct ByteBuffers
         - If you don't pool them, you may run out of native memory before the Java GC has a chance to run
       - Can be faster for short-, int-, or long-oriented access via get/putXXX methods
         • But make sure you use the native byte order (BIG_ENDIAN is the default)
       - Direct ByteBuffers don't need an extra buffer copy when doing input/output with NIO
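A minimal sketch of the points above: a direct ByteBuffer set to the native byte order and reused (rather than allocated per message) for int-oriented access.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: int-oriented access through a direct ByteBuffer. Direct buffers
// live outside the Java heap, so reuse (or pool) them instead of allocating
// one per message. Setting the native byte order avoids byte swapping on
// x86, since ByteBuffer defaults to BIG_ENDIAN regardless of the platform.
public class DirectBufferDemo {
    // One reusable buffer; in real code this would come from a pool.
    private static final ByteBuffer BUF =
            ByteBuffer.allocateDirect(4096).order(ByteOrder.nativeOrder());

    public static int roundTrip(int value) {
        BUF.clear();
        BUF.putInt(value);      // writes 4 bytes in one operation
        BUF.flip();
        return BUF.getInt();    // reads them back
    }

    public static void main(String[] args) {
        System.out.println(BUF.order());          // LITTLE_ENDIAN on x86
        System.out.println(roundTrip(123456789)); // prints: 123456789
    }
}
```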
  11. Measure, measure, measure
  12. The cost of later change is too high
  13. Garbage Collection
     • Makes your code much easier
       - to design
       - to debug
       - to maintain
     • GC performs really well when
       - Objects are very short-lived
         • They are not promoted to the old generation
         • They are reclaimed by the high-throughput scavenge GC
       - Objects are very long-lived and are not modified, or contain only primitives
         • The scavenge GC does not waste time scanning them
  14. Object Allocation
     • Allocation of small objects is fast
       - new String() is ~20 bytes on a 64-bit VM with compressed oops
         • not counting the char[] object inside of it
       - ~4.5 ns per allocation (on a 2.6 GHz i5)
     • But it becomes slower when you include the amortized GC cost
     • And it can become much slower if you
       - have a big static memory footprint
       - have "medium-lived" objects
       - have lots of threads (and thus a lot of GC roots and coordination)
       - use references (java.lang.ref) a lot
       - mutate your memory a lot, especially references (GC card marking)
  15. Manual Memory Management
     • When you would consider manual memory management in native code (custom object pools), consider doing the same in Java
     • General advice
       - Pool large objects
         • They are expensive to allocate and to collect by GC
       - Avoid small objects
         • Especially "medium-lived" ones
         • Lay them out into arrays if you need to store them
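A minimal sketch of pooling large objects, here 64 KB byte[] buffers; the class name and sizes are illustrative, not QDS internals. Small objects should not be pooled this way, per the advice above.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of a simple pool for large, expensive-to-allocate-and-collect
// objects (64 KB byte[] buffers). Released buffers are reused instead of
// becoming garbage, keeping them out of the GC's way.
public class BufferPool {
    private static final int BUFFER_SIZE = 64 * 1024;
    private final ConcurrentLinkedQueue<byte[]> free = new ConcurrentLinkedQueue<>();

    public byte[] acquire() {
        byte[] buf = free.poll();
        return buf != null ? buf : new byte[BUFFER_SIZE]; // allocate only on a pool miss
    }

    public void release(byte[] buf) {
        if (buf.length == BUFFER_SIZE)
            free.offer(buf); // return to the pool for reuse
    }

    public static void main(String[] args) {
        BufferPool pool = new BufferPool();
        byte[] a = pool.acquire();
        pool.release(a);
        byte[] b = pool.acquire();
        System.out.println(a == b); // prints: true -- the buffer was reused
    }
}
```

A production pool would also bound its size so that a burst of releases does not pin memory forever.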
  16. Object Allocation Action Plan (1)
     • Watch the percentage of time your system spends doing GC
       - -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
       - The "jconsole" and "jvisualvm" tools show this information
       - It is available programmatically via GarbageCollectorMXBean
         • At Devexperts we collect it and report (push) it in real time via MARS (Monitoring and Reporting System) using a dedicated JVMSelfMonitoring plugin
         • Our support team has alerts configured on high GC % in our systems
     • Act when it becomes too big
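Reading GC time programmatically, as mentioned above, takes only a few lines with the standard GarbageCollectorMXBean; this is a generic sketch, not the MARS plugin itself.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: reading accumulated GC counts and times programmatically --
// the same numbers a monitoring system can poll periodically and push
// out as metrics for alerting on high GC overhead.
public class GcStats {
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0)
                total += t;
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
        System.out.println("total GC time: " + totalGcTimeMillis() + " ms");
    }
}
```

Sampling this value at two points in time and dividing by wall-clock time gives the GC percentage to alert on.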
  17. Object Allocation Action Plan (2)
     • Tune GC to reduce overhead without code changes
     • Identify the places where most allocations take place and optimize them
       - Use off-the-shelf Java profilers
       - Use Devexperts Aprof for a full allocation picture at production speed
  18. Object Reuse and Sharing
     • Pooling small objects is often a bad idea
       - Unless you are trying to quickly speed up code that heavily relies on lots of small objects
       - It's better to get rid of small objects altogether
         • See boxing in performance-critical code? Get rid of it
     • But reusing / sharing small objects is great
       - Strings are typical candidates for data-processing code
     • Common pitfalls (don't use them unless you fully understand them)
       - String.intern
       - WeakReference
  19. Actually, by their char arrays
  20. String I/O
     • Strings are often duplicated in memory
     • Reading any string-denoted data from a database, a file, or the network produces new strings
     • Where performance matters, reuse strings
       - For example, see the StringCache class
       - The key method is get(char[])
         • You can reuse the char[] where data is read
         • And get an instance of String from the cache if it is there
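To make the idea concrete, here is a deliberately simplified cache, not the actual StringCache implementation: look up by the reusable char[] the data was read into, and hand back a shared String instance on a hit.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the string-reuse idea (not the real StringCache): records are
// read into one reusable char[] buffer, and the cache returns a single
// shared String instance for each distinct content.
public class SimpleStringCache {
    private final Map<String, String> cache = new HashMap<>();

    public String get(char[] chars, int offset, int length) {
        String candidate = new String(chars, offset, length);
        String cached = cache.get(candidate);
        if (cached != null)
            return cached;              // reuse the shared instance
        cache.put(candidate, candidate);
        return candidate;
    }

    public static void main(String[] args) {
        SimpleStringCache cache = new SimpleStringCache();
        char[] readBuffer = {'I', 'B', 'M'};  // reused for every record read
        String a = cache.get(readBuffer, 0, 3);
        String b = cache.get(readBuffer, 0, 3);
        System.out.println(a == b); // prints: true -- same instance both times
    }
}
```

Note that this naive version still allocates a temporary String per lookup; a serious cache hashes the char[] content directly so that a hit allocates nothing at all.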
  21. Radical Object / Reference Elimination
     • Unroll complex objects into arrays
       - For example, a collection of strings can be represented in a single byte[]
     • Renumber shared object instances
       - Represent a string reference as an int
       - That's what the QDS core does for efficient String manipulation
         • Faster to compare
         • Faster to hash
         • Avoids slower "modify reference" operations (which mark GC cards)
       - But this requires hand-crafted memory management
         • QDS does reference counting, but a custom GC is also feasible
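The renumbering scheme can be sketched as follows; the class is illustrative (and omits the reference counting that QDS adds on top).

```java
import java.util.ArrayList;
import java.util.HashMap;

// Sketch of renumbering shared instances: each distinct symbol string gets
// a small int id, and the rest of the system stores, hashes, and compares
// plain ints instead of object references.
public class SymbolRegistry {
    private final HashMap<String, Integer> idBySymbol = new HashMap<>();
    private final ArrayList<String> symbolById = new ArrayList<>();

    public synchronized int id(String symbol) {
        Integer id = idBySymbol.get(symbol);
        if (id != null)
            return id;
        int newId = symbolById.size();      // ids are dense: 0, 1, 2, ...
        symbolById.add(symbol);
        idBySymbol.put(symbol, newId);
        return newId;
    }

    public synchronized String symbol(int id) {
        return symbolById.get(id);
    }

    public static void main(String[] args) {
        SymbolRegistry reg = new SymbolRegistry();
        int ibm = reg.id("IBM");
        int ge = reg.id("GE");
        // ids compare with a single int comparison and never mark GC cards
        System.out.println(ibm == reg.id("IBM")); // prints: true
        System.out.println(reg.symbol(ge));       // prints: GE
    }
}
```

Because the ids are dense ints, they also index directly into the parallel arrays that hold per-symbol data.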
  22. Hardcore Optimization
     • Use sun.misc.Unsafe when everything else fails
       - It gives you full native speed
       - But no range checks and no type safety
         • You are on your own!
       - A good fit for integration with native data structures when needed
     • The QDS core uses it in a few places
       - Mainly to provide wait-free execution guarantees with appropriate synchronization for array-based data structures
       - But there is fallback code for cases when sun.misc.Unsafe is not available
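For illustration only, here is unchecked int[] access through sun.misc.Unsafe; this is a generic sketch of the technique, not QDS's actual usage, and it carries all the caveats above.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch: raw int[] element access via sun.misc.Unsafe, bypassing the
// bounds check. This is a last resort: a wrong offset silently corrupts
// memory or crashes the VM, and the class is JDK-internal and may be
// unavailable, so a plain-array fallback path is essential.
public class UnsafeArrayDemo {
    private static final Unsafe UNSAFE = loadUnsafe();
    private static final long BASE = UNSAFE.arrayBaseOffset(int[].class);
    private static final long SCALE = UNSAFE.arrayIndexScale(int[].class);

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe unavailable", e);
        }
    }

    public static int getInt(int[] array, int index) {
        // No range check: the caller must guarantee 0 <= index < array.length
        return UNSAFE.getInt(array, BASE + SCALE * index);
    }

    public static void main(String[] args) {
        int[] data = {10, 20, 30};
        System.out.println(getInt(data, 2)); // prints: 30
    }
}
```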
  23. Even More Hardcore: Hand-Written SMT
     • If you have to use linked data structures
       - Consider traversing multiple linked lists simultaneously in the same thread
       - Akin to hardware SMT, but in software
       - The code becomes much more complicated
       - But performance can increase considerably
     * Not a Java-specific optimization, but fun to mention here
  24. Threads and Scalability
     • Share data across threads to further reduce memory footprint
       - But carefully design and implement this sharing
     • Learn and love the Java Memory Model
       - It makes your correctly-synchronized multi-threaded code fully portable across CPU architectures
     • The QDS core is a thread-safe data structure with a mix of lock-free, fine-grained, and coarse-grained locking approaches, which makes it vertically scalable
  25. Be Careful with Threads and Locks
     • Thread switches introduce considerable latency (~20 µs)
     • Lock contention forces even more thread switches
     • It is not a Java-specific concern, but it is a common Java-specific problem, since Java makes threads easy for programmers to use (and many do use them)
     [diagram: 1. Enter lock → 2. Context switch → 3. Try to lock → 4. Context switch → 5. Exit lock → 6. Context switch and enter lock]
  26. Data Flow for Horizontal Scalability [diagram: a Multiplexor subscribes to IBM, GE, QQQQ, MSFT, INTC, SPX and distributes ticks to two QDTicker instances; one subscribes to IBM, GE, QQQQ, MSFT and the other to GE, INTC, SPX; each receives only the ticks matching its own subscription (e.g. IBM and GE ticks to the first, GE ticks to the second)]
  27. HotSpot Server VM
     • Run "java -server" (it is the default on server-class machines)
     • It does
       - Very deep code inlining
       - Loop unrolling
       - Optimization of virtual and interface calls based on the collected profile
       - Escape analysis for synchronization and allocation elimination
     • Embrace it!
       - Don't fear writing your code in a nice object-oriented way
         • In most cases, that is
         • Do still avoid too much "object orientation" in the most performance-sensitive places
  28. HotSpot Challenges
     • It is harder to profile, stress-test, and tune code
       - You need to "warm up" the code to get meaningful results
       - Small changes in code can lead to big differences that are hard to explain
       - Compilation of less busy code can trigger at any time and cause unexpected latency spikes
     • Don't do micro-tests
       - Test the whole system together instead
     • Do micro-tests
       - To learn which code patterns are better across the board
       - Small savings add up
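The warm-up requirement can be sketched with a minimal hand-rolled micro-test; a real harness (for example JMH) additionally defeats dead-code elimination and runs multiple forks and trials.

```java
// Sketch of a minimal micro-test that warms up the code before timing it,
// so the measured loop runs JIT-compiled rather than interpreted code.
// The 'sink' accumulator is printed so HotSpot cannot prove the work
// unused and eliminate it.
public class WarmupBench {
    static long work(int n) {              // the code pattern under test
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += i * 31L;
        return sum;
    }

    public static void main(String[] args) {
        long sink = 0;
        for (int i = 0; i < 20_000; i++)   // warm-up: let HotSpot compile work()
            sink += work(1000);
        long start = System.nanoTime();
        for (int i = 0; i < 20_000; i++)   // measured runs
            sink += work(1000);
        long elapsed = System.nanoTime() - start;
        System.out.println("ns/op ~ " + elapsed / 20_000 + " (sink=" + sink + ")");
    }
}
```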
  29. Looking at Generated Assembly Code
     • -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*<class-name>.<method-name> -XX:PrintAssemblyOptions=intel
     • You will need the "hsdis" library added to your JRE/JDK with the actual disassembler code
       - But you have to build it yourself
  30. Use Native Profilers
     • Java profilers are great tools, but they don't use processor performance counters and lack the ability to recognize problems like memory pressure
       - And they don't always produce a clear picture
       - All "CPU time" is reported at the nearest "safe point", not at the actual code line that consumed the CPU
     • Use native profilers to figure it out
       - Sun Studio Performance Analyzer
       - Intel VTune Amplifier
       - AMD CodeAnalyst
  31. General (1)
     • Classic data structures and algorithms
       - Use CPU- and memory-efficient data structures and algorithms
       - Know and love hash tables
         • They are the most useful data structure in a typical business application
     • Lock-free data structures will help you scale vertically
     • Every byte counts. Remember about bytes.
       - The QDS core compactly represents data as 4-byte integers while working with it in memory
       - QDS uses compact byte-level compression on the wire
       - Even more compact bit-level compression is used in the long-term store
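To illustrate the "every byte counts" point, here is a generic byte-level variable-length int encoding: 7 data bits per byte with the high bit marking continuation, so small values take one byte. This shows the general idea behind compact wire formats, not QDS's actual protocol.

```java
// Sketch of byte-level variable-length encoding (a common compact wire
// technique, not QDS's actual format): each byte carries 7 data bits,
// and the high bit is set on every byte except the last.
public class VarInt {
    public static int write(byte[] buf, int pos, int value) {
        while ((value & ~0x7F) != 0) {           // more than 7 bits remain
            buf[pos++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        buf[pos++] = (byte) value;               // final byte, high bit clear
        return pos;                              // new write position
    }

    public static int read(byte[] buf, int pos) {
        int value = 0, shift = 0;
        byte b;
        do {
            b = buf[pos++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while (b < 0);                         // high bit set -> continue
        return value;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[8];
        int end = write(buf, 0, 300);
        System.out.println(end);          // prints: 2 -- 300 fits in two bytes
        System.out.println(read(buf, 0)); // prints: 300
    }
}
```

Values under 128 cost a single byte, which matters when millions of small sizes and deltas cross the wire every second.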
  32. General (2)
     • Burst handling
       - Process data in batches to amortize batch overhead across messages
       - QDS increases batch size under load to decrease overhead
     • Architecture
       - Use layers
       - Lower layers of the architecture should generally be used in more places and be more optimized
       - The outer layer, the dxFeed API, is the easiest one to use and understand and the most object-oriented, but less optimized
  33. Architecture Layers [diagram: JS API and dxFeed API, on top of Tools and Gateways, on top of the QDS Core, on top of the Transport Protocol (ZLIB, SSL), on top of Sockets, NIO, Files, etc.]
  34. QDS API (1) [code slide: print quote bid/ask on the screen]
  35. QDS API (2) [code slide, continued]
  36. QDS API Summary
     • Pros
       - High-performance design
       - Flexible (can be used in various ways)
         • The QDS Multiplexor is an application on top of the QDS API
         • As are all other command-line QDS tools
       - Extensible, with a clear separation of interfaces and implementation
     • Cons
       - Verbose; lots of code to do simple things
       - Error-prone (easy to get wrong and to introduce subtle bugs)
     • Everybody needs Quote, Trade, etc. with an easy-to-use API
       - Hence, the dxFeed API was born
  37. dxFeed API [code slide: print quote bid/ask on the screen]
  38. Contact me by email: elizarov at