Microseconds matter in High Frequency
Trading
High performance trading systems in C++
Ravi Parikh TWO ROADS TRADING PVT LTD
(http://tworoads-trading.co.in/)
2
Introduction
About me :
●
HFT infra developer for TWO ROADS TRADING since 2011
●
Overall close to 9 years of experience in software development
Today’s talk :
●
General Software Development vs HFT Software Development
●
Overview of HFT and why speed matters so much
●
Importance of correctness / robustness of HFT systems
●
A few C++ optimization techniques for ultra-low-latency software development
●
Noisy neighbors
●
Measurements of performance
3
General Software Development
Source : RFE Electronics
4
HFT Software Development
Source : RFE Electronics
5
HFT Trading
➔
Trading in general is about buying something and selling it; the result is a profit or a loss
depending on the prices at which it was bought and sold.
➔
HFT is largely about market making: there is no directional intention to buy or
sell. HFT firms are not speculators; they are there to provide liquidity to the market.
➔
HFT makes money from very small profitable trades executed at very high frequency
( the holding period for any open trade is very short ).
➔
The other main objective is to avoid bad trades, which can result in larger losses.
➔
So what is the role of an ultra-low-latency system in HFT ? It is about spotting the
opportunity for those quick, small, profitable trades and grabbing them, and at the same time
about pulling out in time to avoid larger losses. ( After all, there is always
stiff competition for the same trades, given that markets are becoming more
efficient with each passing day. )
6
Role Of Latency In Grabbing The
Opportunity
- Against all odds, only the fastest few will be able to book tickets successfully !!
Source : Internet
7
Role of Latency In Pulling Out ! ( Avoid
bad trades )
- It is equally important to pull out of a bad trade before someone hits you with the fill.
( These are the trades where you were slow to change your price and were forced to
take a trade you knew was a bad one. Speed matters even when you want to
avoid making a loss. )
Source : Two Players Org
8
So How Fast Is Fast Enough ?
- It doesn’t matter whether you are faster by 1 second or 1 nanosecond, as long as you are ahead of everyone
else. ( Unfortunately, in the HFT domain there are in most cases no silver or bronze
medals: it is gold or nothing, or even worse, a loss. )
Source : Photo By Alvin Loke
Source : Two Players Org
9
HFT System Overview ( T2T : tick-to-trade latency )
Software Solutions : 1-10 microseconds
Hardware Solutions : 0.5-2 microseconds
10
Robustness ??
●
There is always a trade-off between adding an extra if check and saving
a few CPU cycles, so robustness and optimization don’t always go well together.
●
Even though it is not put forward ahead of speed as the most critical feature of the
system, robustness can never be compromised in HFT.
●
An opportunity to make a 1 Rs profit on every trade in a system with 5 microsecond
latency does not guarantee profit on all trades, because we will not be able to
capture every opportunity. However, a BUG in the system ( however trivial, say
buy / sell flipped ! ) does guarantee a loss: one losing 1 Rs trade every
5 microseconds is 60 s / 5 µs = 12 million trades, i.e. a loss of 12
Million INR in ONE MINUTE.
●
So in HFT infra development safety always comes first. One has to be 120% sure
that there are no bugs in the system that will run in Production, because all it
may take is a few seconds or minutes of a buggy run to make headlines
the next day.
●
Keeping in mind that nothing can be done at the expense of robustness, making
an application run faster becomes even more challenging and interesting.
11
Optimizations ( Prerequisites )
●
Hardware selection ( CPU / RAM / CACHE )
●
Network selection ( Switches / Network Adapters )
●
Understanding of OS/Platform ( OS version, OS / kernel features, OS memory
management, Interrupts Management etc )
●
Programming Language Selection – Why C++ ?
●
Compiler / Linker ( Compiler features / compiling options / type of compiler etc )
●
External libraries ( Dependencies / Features )
●
Various Tools For Debugging / Profiling ( GDB, valgrind, cachegrind, gprof etc )
It is simply not possible to improve T2T in HFT, even with the most
logically optimized C++ code, unless one understands the environment
in which that C++ code is eventually going to run.
12
External Optimizations ( Hardware )
●
CPU Processor
●
RAM
●
Different Types of Cache and Cache Sizes
●
How do you pick the correct combination ?
●
Network Adapter ( Kernel Bypassing )
●
OS Tuning ( Context Switches ? Interrupts Binding ? )
13
Fine Tuned System Performance
Source : CPPCON ( Carl Cook )
14
C++ Coding Optimizations
A Few Techniques That We’ll Talk About :
●
Where do we start ? What is the hot path ?
●
Logging is essential isn’t it ? What do we do then ?
●
Dynamic Memory Management ( New / Delete )
●
Data Binding
●
Strings
●
Inline ( always_inline / noinline )
●
Branching ( What are the issues ? )
Disclaimer : I’ve not covered all the typical C++ optimizations; these are just a few quick techniques
which can make a significant difference to performance.
15
Where to start ? Hot Path
●
The “hotpath” is the full path that execution follows to perform
the actual end transaction; in HFT it’s the T2T path
●
The “hotpath” is only exercised 0.01% of the time: the rest of the time the
system is idle, doing administrative work or waiting for events
●
OS, Networks and Hardware are focused on throughput and fairness
●
Jitter is totally unacceptable: it is the major source of bad trades and
forces one to move to a total hardware solution, even though the median latency
might actually get worse
16
Removing Jitters From Hotpath
Source : CPPCON ( Carl Cook )
17
HOTPATH in HFT System
Source : CPPCON ( Carl Cook )
18
Solution ?
Source : CPPCON ( Carl Cook )
19
Logging
●
Almost all production systems need to log some important data
●
Disk I/O is among the slowest of all hardware operations
●
If your C++ code logs too much, it spends most of its time doing disk I/O and
consuming CPU on unproductive work. First try to minimize logging as far as
possible: move it out of the hotpath, use compressed forms of data, etc.
What are the other options ?
20
Offload Logging
●
Move logging to custom handles rather than std::cout / std::cerr / printf, and introduce
buffering on your handle ( i.e. create a buffer of, say, 1024 bytes and flush it only when
required )
●
Standard streams are also buffered unless we flush them ( std::cerr is the exception and is
unbuffered by default ), but with custom handles we can better control when to flush and
can design the handle around the type of logging we have
●
Get logging out of the production process completely to eliminate jitter. One can
write the required information in some format to, say, a message queue ( MQ ) or shared
memory ( SHM ), from where a completely separate process writes it to files. This improves
the latency of the production system significantly.
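The custom buffered handle described above can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: `BufferedLog` is a hypothetical name, the 1024-byte size is taken from the slide, and a `std::string` stands in for the real sink ( a file, MQ or SHM segment ).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical buffered log handle: appends go into a fixed in-memory buffer,
// and bytes reach the sink only when flush() is called or the buffer fills up.
class BufferedLog {
public:
    // The sink is a std::string here purely for illustration.
    explicit BufferedLog(std::string& sink) : sink_(sink) {}

    // Copies msg into the buffer, flushing first if it would not fit.
    void append(const std::string& msg) {
        if (msg.size() > buf_.size()) {  // oversized message: bypass the buffer
            flush();
            sink_.append(msg);
            return;
        }
        if (used_ + msg.size() > buf_.size()) flush();
        std::memcpy(buf_.data() + used_, msg.data(), msg.size());
        used_ += msg.size();
    }

    // Hands the buffered bytes to the sink; meant to be called off the hotpath.
    void flush() {
        assert(used_ <= buf_.size());
        sink_.append(buf_.data(), used_);
        used_ = 0;
    }

    std::size_t buffered() const { return used_; }

private:
    std::array<char, 1024> buf_{};
    std::size_t used_ = 0;
    std::string& sink_;
};
```

The point of the design is that the caller, not the stream library, decides when the expensive write happens, so the flush can be scheduled for an idle moment.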
21
Dynamic Memory Management
●
There will always be cases where a Production system uses heap memory and
creates objects on the fly ( with new and delete )
●
If your C++ code uses new / delete / malloc etc., what are the issues in
terms of latency ?
What are the alternatives ?
22
Memory Pool
●
New / Delete are not system calls themselves, but they go through the library allocator
( malloc / free ), which may take locks and may call into the kernel ( brk / mmap ) for more memory
●
The free() code in glibc is roughly 400 lines of bookkeeping, which eats up a lot of CPU
cycles
●
The solution is to develop your own C++ class that takes care of memory
management for the duration of your program. We allocate a pool of objects
up front and, instead of using new / delete, use this class to acquire / release
objects. This avoids allocator and kernel-space execution on the hotpath and improves both
latency and jitter
●
A bonus advantage is that we keep reusing recently used objects,
which improves cache performance.
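A pool of this kind might look like the sketch below. It is one possible design under stated assumptions, not the speaker's code: `ObjectPool`, `acquire` and `release` are hypothetical names, the capacity is fixed at construction, and the pool is assumed to be used from a single thread.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Hypothetical fixed-capacity object pool: all storage is allocated once at
// construction, so acquire() and release() never call into the allocator.
template <typename T>
class ObjectPool {
public:
    explicit ObjectPool(std::size_t capacity) : slots_(capacity) {
        free_.reserve(capacity);
        for (std::size_t i = 0; i < capacity; ++i) free_.push_back(&slots_[i]);
    }
    ObjectPool(const ObjectPool&) = delete;
    ObjectPool& operator=(const ObjectPool&) = delete;

    // Constructs a T in a free slot; returns nullptr when the pool is exhausted.
    template <typename... Args>
    T* acquire(Args&&... args) {
        if (free_.empty()) return nullptr;
        Slot* s = free_.back();
        free_.pop_back();
        return new (s->bytes) T(std::forward<Args>(args)...);
    }

    // Destroys the object and returns its slot to the free list. The most
    // recently released slot is handed out first, so it tends to be cache-warm.
    void release(T* obj) {
        assert(obj != nullptr);
        obj->~T();
        free_.push_back(reinterpret_cast<Slot*>(obj));
    }

    std::size_t available() const { return free_.size(); }

private:
    struct Slot {
        alignas(T) unsigned char bytes[sizeof(T)];
    };
    std::vector<Slot> slots_;   // backing storage, allocated once
    std::vector<Slot*> free_;   // LIFO free list, pre-reserved to capacity
};
```

Returning `nullptr` on exhaustion keeps the failure explicit; a production pool would need a policy for that case ( size the pool generously, or fall back to the heap off the hotpath ).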
23
Data Binding
●
How many bytes are read when some_function is called ?
●
What is the problem with data access here ?
How do we fix the issue here ?
24
Cache Binding / Cache Line Usage
●
Packing related data closely together improves cache access
●
In this case, once one field is loaded you get access to the other fields of the
arguments on the same cache line at almost zero cost
●
You can design your data layout to optimize the usage of cache lines
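As an illustration of laying data out around cache lines, here is a small sketch. The struct, its field names and the 64-byte line size are assumptions made for the example ( 64 bytes is typical of current x86-64 parts ); the original slide's code was an image and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Hypothetical quote record. Fields that are read together on the hotpath are
// declared next to each other, so one cache-line fill brings them all in.
struct alignas(kCacheLine) HotQuote {
    std::int64_t  bid_price;    // hot
    std::int64_t  ask_price;    // hot
    std::int32_t  bid_size;     // hot
    std::int32_t  ask_size;     // hot
    std::uint64_t last_update;  // hot
    char          symbol[16];   // rarely read, still fits in the same line
};

static_assert(sizeof(HotQuote) == kCacheLine,
              "HotQuote should occupy exactly one cache line");
static_assert(alignof(HotQuote) == kCacheLine,
              "HotQuote should start on a cache-line boundary");

// Touches only hot fields; both come from the single line that holds q.
inline std::int64_t spread(const HotQuote& q) {
    assert(q.ask_price >= q.bid_price);
    return q.ask_price - q.bid_price;
}
```

The `static_assert`s turn the layout assumption into a compile-time check, so adding a field that spills the struct onto a second line is caught at build time rather than in a latency report.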
25
Strings
●
We like C++ strings and use them extensively, but you may be surprised to see
how slowly they execute under performance stress testing.
●
Many comparisons of char arrays vs strings have been done, and in
general strings come out slower than char arrays, by around 23% !!
●
The CPU works best when it deals with plain machine words.
When you ask it to do a string or char array comparison, it does the
comparison in a generic way ( i.e. it compares character by character and stops at the end of the
string or at a mismatch ). This is a problem for latency because it is a linear scan:
even if it takes only 50-60 cycles in isolation for, say, a 16 char comparison, using strings at
20 places in the code costs up to 1200 cycles ( ~0.4 micros at 3GHz !! )
Solution ?
26
Avoid String Operations When Possible
●
When we know the length of the string ( fixed in most cases, though it can also vary ), we can
implement a simpler solution by type casting; the latency of the comparison drops by 38% at
least
Length 8 char array comparison,
*((uint64_t*)(arr_a)) == *((uint64_t*)(arr_b))
Length 16 char array comparison can be done as below,
*((uint64_t*)(arr_a)) == *((uint64_t*)(arr_b)) &&
*((uint64_t*)(arr_a+8)) == *((uint64_t*)(arr_b+8))
( Note the dereference: comparing (uint64_t)(arr_a) with (uint64_t)(arr_b) would compare the
addresses, not the contents. The casts also assume the arrays are suitably aligned. )
This executes faster because the processor only has to match all the bits,
and on a 64 bit system that is a single-word comparison.
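The same word-sized comparison can be written without pointer casts. The sketch below is a variant, not the slide's code: it uses `memcpy` to load the bytes, which sidesteps the alignment and strict-aliasing concerns of casting `char*` to `uint64_t*`. The helper names `load8`, `equal8` and `equal16` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Loads 8 bytes as one integer. Compilers typically lower this fixed-size
// memcpy to a single load instruction.
inline std::uint64_t load8(const char* p) {
    assert(p != nullptr);
    std::uint64_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Both arguments must point to at least 8 readable bytes.
inline bool equal8(const char* a, const char* b) {
    return load8(a) == load8(b);
}

// Both arguments must point to at least 16 readable bytes.
inline bool equal16(const char* a, const char* b) {
    // Bitwise & rather than && so both halves are compared without a branch.
    return (load8(a) == load8(b)) & (load8(a + 8) == load8(b + 8));
}
```

This only tests equality of fixed-width fields; it says nothing about ordering, and shorter strings have to be padded to the fixed width for the comparison to be meaningful.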
27
Inline
●
What is the inline keyword ?
●
What happens when execution reaches a function call ?
●
Why not inline everything ?
●
Why doesn’t the compiler expand everything ?
28
always_inline and noinline
●
The inline keyword is often misunderstood: it mainly means that multiple definitions are
permitted ( i.e. a common header with the definition is included into 2 cpp files )
●
always_inline and noinline are stronger hints to the compiler, but one has to measure the
latency impact when using them.
●
Why doesn’t the compiler expand everything in place ?
- DLL
- Virtual functions
- Recursive function call
- Bigger executable means more disk space and load time, and also puts pressure on the cache
In general you can hint the compiler not to inline small functions which are not doing
anything productive or which should be out of the hotpath.
__attribute__((noinline))
void some_function () { /* Not doing anything useful */ }
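A slightly fuller sketch of the two hints working together is given below. It assumes GCC or Clang attribute syntax; the macro names, `handle_reject`, `price_ok`, `send_order` and the reject counter are all hypothetical and exist only to make the example observable.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: GCC / Clang. On other compilers the macros degrade to plain hints.
#if defined(__GNUC__) || defined(__clang__)
#define HOT_INLINE inline __attribute__((always_inline))
#define COLD_NOINLINE __attribute__((noinline, cold))
#else
#define HOT_INLINE inline
#define COLD_NOINLINE
#endif

// Hypothetical counter so the cold path has a visible effect.
static int g_rejects = 0;

// Rare error handling is kept out of line so its instructions do not sit in
// the instruction cache next to the hotpath.
COLD_NOINLINE void handle_reject(std::int64_t /*price*/) { ++g_rejects; }

// Small hot-path check that is forced inline.
HOT_INLINE bool price_ok(std::int64_t price, std::int64_t limit) {
    return price <= limit;
}

inline bool send_order(std::int64_t price, std::int64_t limit) {
    assert(limit > 0);
    if (!price_ok(price, limit)) {
        handle_reject(price);
        return false;
    }
    return true;  // the order would be sent here
}
```

As the slide says, these are hints whose effect has to be measured; forcing inlining of a large function can just as easily hurt by growing the hotpath.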
29
Branching
●
Why is branching bad ?
e.g. consider that I can buy / sell something, and at multiple places in my execution code
I have checks like,
if( BUY == activity_type ) {
}else if( SELL == activity_type ) {
}else { /* ERROR */ }
●
What are the options we have ?
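One commonly used option for the buy / sell checks above is to make the side a template parameter, so the decision is taken once and the code behind it carries no runtime check. The sketch below illustrates that general C++ technique; the names `Side`, `signed_qty`, `apply_fill` and `on_fill` are hypothetical and it is not claimed to be the approach shown on the slides.

```cpp
#include <cassert>
#include <cstdint>

enum class Side { Buy, Sell };

// Side is a template parameter, so each instantiation contains only the code
// for its own side and no runtime comparison on the activity type.
template <Side S>
inline std::int64_t signed_qty(std::int64_t qty) {
    assert(qty >= 0);
    return S == Side::Buy ? qty : -qty;  // resolved at compile time
}

template <Side S>
inline std::int64_t apply_fill(std::int64_t position, std::int64_t qty) {
    return position + signed_qty<S>(qty);
}

// The single runtime branch lives at the edge of the system, where the side is
// first decoded; everything called from here is branch-free on side.
inline std::int64_t on_fill(Side side, std::int64_t position, std::int64_t qty) {
    return side == Side::Buy ? apply_fill<Side::Buy>(position, qty)
                             : apply_fill<Side::Sell>(position, qty);
}
```

The cost is code size: every templated function is compiled once per side, which has to be weighed against instruction-cache pressure.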
30
Branching Effects
Source : Image by Mecanismo ( CC-By-SA 3.0)
31
Branch Prediction
●
Consider the if statement shown above : at the processor level it is actually a branch
instruction ! ( Assume : data[c] holds values between 0 and 255, and c is a counter looping over
the array )
●
Processors are smart enough to prefetch a set of instructions to speed up execution
●
When the processor sees a branch it does not yet know which way it will go. Without prediction
it would have to halt and wait until the previous instructions complete; with prediction it guesses
a path and keeps going, but a wrong guess means discarding that work and restarting on the
correct path
●
Modern processors are quite complicated and have long pipelines, so they take a long time
to “warm up” and “slow down”
●
What are the alternatives ? Write code that is friendly to branch prediction
( e.g. if possible sort the array, which makes the branch predictable )
●
Apply some smart hacks under assumptions which are valid ( no branching with the
replacement below, so your train never has to stop here )
int t = ( data[c] - 128 ) >> 31 ;
sum += ~t & data[c] ;
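To make the replacement concrete, the sketch below puts a branchy loop next to the branch-free one. The branchy condition ( `data[c] >= 128` ) is inferred from the replacement itself, and the function names are hypothetical. It assumes values in [0, 255] and an arithmetic right shift of negative integers ( guaranteed from C++20, and the behaviour of mainstream compilers before that ).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Branchy version: one conditional jump per element, which is hard to predict
// when the data is random.
inline std::int64_t sum_branchy(const std::int32_t* data, std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < n; ++c)
        if (data[c] >= 128) sum += data[c];
    return sum;
}

// Branch-free version of the same loop.
inline std::int64_t sum_branchless(const std::int32_t* data, std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < n; ++c) {
        assert(data[c] >= 0 && data[c] <= 255);
        std::int32_t t = (data[c] - 128) >> 31;  // all ones if data[c] < 128, else 0
        sum += ~t & data[c];                     // keeps data[c] only when >= 128
    }
    return sum;
}
```

An optimizing compiler may already turn the branchy loop into conditional moves or vector code, so whether the hand-written version helps is something to measure rather than assume.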
32
Further Branching Improvements
Source : CPPCON ( Carl Cook )
33
Noisy Neighbours
34
Noisy Neighbours Solution
●
You have to be very careful in choosing which processes run on the system.
●
Which processes are actually sharing the L2 cache ?
●
Identify whether any process is polluting the L3 cache and thereby hurting the
performance of the production application
●
One can disable cores which are not being used, to more or less reserve the cache, and
disable hyper-threading to ensure better use of the L2 cache
●
There are various hacks available to make some kernel modules not cache data and
use RAM directly instead
35
Performance Measurements
Challenges :
●
How do you measure very small blocks of code, where the measurement itself may
take more time than the code ? gettimeofday on Linux with the tsc clock source itself takes
~120-150 CPU cycles.
●
Measurements in an offline setup will be far away from those observed in the
Production system
●
How do you find the slow-performing units of the code ?
●
Do you actually look at the assembly code, and how useful is that in
practical scenarios ?
●
How useful are tools like cachegrind, gprof and the PAPI library with hardware counters ?
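The first challenge, the cost of the measurement itself, can at least be quantified. The sketch below uses the portable `std::chrono::steady_clock` rather than gettimeofday or raw TSC reads ( an assumption made for portability ); the function names are hypothetical. It estimates the timer's own overhead and subtracts it from a measured median.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

using Clock = std::chrono::steady_clock;

// Median of a sample set, computed in place.
inline std::int64_t median_of(std::vector<std::int64_t>& d) {
    assert(!d.empty());
    const std::size_t mid = d.size() / 2;
    std::nth_element(d.begin(), d.begin() + mid, d.end());
    return d[mid];
}

// Median cost, in nanoseconds, of two back-to-back clock reads. Intervals
// close to this value are mostly measurement overhead.
inline std::int64_t timer_overhead_ns(std::size_t samples = 10001) {
    std::vector<std::int64_t> d(samples);
    for (std::size_t i = 0; i < samples; ++i) {
        const auto t0 = Clock::now();
        const auto t1 = Clock::now();
        d[i] = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    }
    return median_of(d);
}

// Median per-call latency of fn with the timer overhead subtracted
// ( clamped at zero ).
template <typename Fn>
std::int64_t median_latency_ns(Fn&& fn, std::size_t samples = 10001) {
    const std::int64_t overhead = timer_overhead_ns(samples);
    std::vector<std::int64_t> d(samples);
    for (std::size_t i = 0; i < samples; ++i) {
        const auto t0 = Clock::now();
        fn();
        const auto t1 = Clock::now();
        d[i] = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    }
    return std::max<std::int64_t>(0, median_of(d) - overhead);
}
```

A median hides exactly the jitter this talk cares about, so in practice the same samples would also be used to report high percentiles and the maximum.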
36
Measurements of HFT system
performance
Source : CPPCON ( Carl Cook )
37
Talk of the town – FPGA
●
This is the current area of focus for most HFT firms nowadays.
●
A pure end-to-end FPGA solution is quite complex and requires a lot of time and effort
●
Not everything can be optimized in FPGA, since FPGA fabric is clocked far below a CPU
( typically a few hundred MHz, against around 3 GHz for a CPU ).
●
A lot of firms like us are trying to develop a hybrid end-to-end solution in which we
can retain the best of software and hardware.
●
The primary motivation to move to FPGA is to remove jitter from the system; no software
solution can offer latency as stable as hardware can. The major concern is that
no one wants to be slow even 1% of the time the application is
trading. You can be the fastest and make money 99% of the time, but jitter can wipe it all away
!!
Questions ??
38
THANK YOU !
Contact :
ravi.parikh@tworoads-trading.co.in

More Related Content

What's hot

ARM
ARMARM
ARM Micro-controller
ARM Micro-controllerARM Micro-controller
ARM Micro-controller
Ravikumar Tiwari
 
Overview Study on PIC32MX3XX / 4XX 32-Bit Controller
Overview Study on PIC32MX3XX / 4XX 32-Bit ControllerOverview Study on PIC32MX3XX / 4XX 32-Bit Controller
Overview Study on PIC32MX3XX / 4XX 32-Bit Controller
Premier Farnell
 
Microcontroller part 2
Microcontroller part 2Microcontroller part 2
Microcontroller part 2
Keroles karam khalil
 
ARM architcture
ARM architcture ARM architcture
ARM architcture
Hossam Adel
 
Arm architecture chapter2_steve_furber
Arm architecture chapter2_steve_furberArm architecture chapter2_steve_furber
Arm architecture chapter2_steve_furber
asodariyabhavesh
 
ARM7-ARCHITECTURE
ARM7-ARCHITECTURE ARM7-ARCHITECTURE
ARM7-ARCHITECTURE
Dr.YNM
 
ARM7TDM
ARM7TDMARM7TDM
ARM7TDM
Ramasubbu .P
 
Verifying offchain computations using TrueBit. Sami Makela
Verifying offchain computations using TrueBit. Sami MakelaVerifying offchain computations using TrueBit. Sami Makela
Verifying offchain computations using TrueBit. Sami Makela
Cyber Fund
 
Introduction to ARM
Introduction to ARMIntroduction to ARM
Introduction to ARM
Puja Pramudya
 
Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overviewSunil Thorat
 
Ppt
PptPpt
Ppt
Bala Ji
 
Lec07
Lec07Lec07
Pic16f84
Pic16f84Pic16f84
Arm architecture
Arm architectureArm architecture
Arm architecture
MinYeop Na
 
Introduction to microcontrollers
Introduction to microcontrollersIntroduction to microcontrollers
Introduction to microcontrollersCorrado Santoro
 
Memory map selection of real time sdram controller using verilog full project...
Memory map selection of real time sdram controller using verilog full project...Memory map selection of real time sdram controller using verilog full project...
Memory map selection of real time sdram controller using verilog full project...
rahul kumar verma
 
Introduction to Microcontrollers
Introduction to MicrocontrollersIntroduction to Microcontrollers
Introduction to Microcontrollers
mike parks
 

What's hot (20)

ARM
ARMARM
ARM
 
ARM Micro-controller
ARM Micro-controllerARM Micro-controller
ARM Micro-controller
 
Overview Study on PIC32MX3XX / 4XX 32-Bit Controller
Overview Study on PIC32MX3XX / 4XX 32-Bit ControllerOverview Study on PIC32MX3XX / 4XX 32-Bit Controller
Overview Study on PIC32MX3XX / 4XX 32-Bit Controller
 
Microcontroller part 2
Microcontroller part 2Microcontroller part 2
Microcontroller part 2
 
ARM architcture
ARM architcture ARM architcture
ARM architcture
 
arm-cortex-a8
arm-cortex-a8arm-cortex-a8
arm-cortex-a8
 
Arm architecture chapter2_steve_furber
Arm architecture chapter2_steve_furberArm architecture chapter2_steve_furber
Arm architecture chapter2_steve_furber
 
ARM7-ARCHITECTURE
ARM7-ARCHITECTURE ARM7-ARCHITECTURE
ARM7-ARCHITECTURE
 
ARM7TDM
ARM7TDMARM7TDM
ARM7TDM
 
Arm architechture
Arm architechtureArm architechture
Arm architechture
 
Verifying offchain computations using TrueBit. Sami Makela
Verifying offchain computations using TrueBit. Sami MakelaVerifying offchain computations using TrueBit. Sami Makela
Verifying offchain computations using TrueBit. Sami Makela
 
Introduction to ARM
Introduction to ARMIntroduction to ARM
Introduction to ARM
 
Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overview
 
Ppt
PptPpt
Ppt
 
Lec07
Lec07Lec07
Lec07
 
Pic16f84
Pic16f84Pic16f84
Pic16f84
 
Arm architecture
Arm architectureArm architecture
Arm architecture
 
Introduction to microcontrollers
Introduction to microcontrollersIntroduction to microcontrollers
Introduction to microcontrollers
 
Memory map selection of real time sdram controller using verilog full project...
Memory map selection of real time sdram controller using verilog full project...Memory map selection of real time sdram controller using verilog full project...
Memory map selection of real time sdram controller using verilog full project...
 
Introduction to Microcontrollers
Introduction to MicrocontrollersIntroduction to Microcontrollers
Introduction to Microcontrollers
 

Similar to Presentation

Arm developement
Arm developementArm developement
Arm developement
hirokiht
 
Phytium 64 core cpu preview
Phytium 64 core cpu previewPhytium 64 core cpu preview
Phytium 64 core cpu preview
inside-BigData.com
 
Java under the hood
Java under the hoodJava under the hood
Java under the hood
Vachagan Balayan
 
Technical Implementation: Hardware
Technical Implementation: HardwareTechnical Implementation: Hardware
Technical Implementation: Hardware
Forrester High School
 
Optimizing Python
Optimizing PythonOptimizing Python
Optimizing Python
AdimianBE
 
AVR_Course_Day4 introduction to microcontroller
AVR_Course_Day4 introduction to microcontrollerAVR_Course_Day4 introduction to microcontroller
AVR_Course_Day4 introduction to microcontroller
Mohamed Ali
 
Micro-controllers (PIC) based Application Development
Micro-controllers (PIC) based Application DevelopmentMicro-controllers (PIC) based Application Development
Micro-controllers (PIC) based Application Development
Emertxe Information Technologies Pvt Ltd
 
Assembly programming
Assembly programmingAssembly programming
Assembly programming
Omar Sanchez
 
Basic 8051 question
Basic 8051 questionBasic 8051 question
Basic 8051 question
Sourabh Bhattacharya
 
TMS320C5x
TMS320C5xTMS320C5x
Let’s Fix Logging Once and for All
Let’s Fix Logging Once and for AllLet’s Fix Logging Once and for All
Let’s Fix Logging Once and for All
ScyllaDB
 
Refactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesRefactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid Architectures
Jeff Larkin
 
Embedded System Programming on ARM Cortex M3 and M4 Course
Embedded System Programming on ARM Cortex M3 and M4 CourseEmbedded System Programming on ARM Cortex M3 and M4 Course
Embedded System Programming on ARM Cortex M3 and M4 Course
FastBit Embedded Brain Academy
 
Ppt on embedded system
Ppt on embedded systemPpt on embedded system
Ppt on embedded system
Pankaj joshi
 
Ds03 part i algorithms by jyoti lakhani
Ds03 part i algorithms   by jyoti lakhaniDs03 part i algorithms   by jyoti lakhani
Ds03 part i algorithms by jyoti lakhani
jyoti_lakhani
 
A Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm BasebandsA Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm Basebands
Priyanka Aash
 
Micro controller selection
Micro controller selectionMicro controller selection
Micro controller selectionVijay kumar
 
Introduction to Embedded Systems
Introduction to Embedded SystemsIntroduction to Embedded Systems
Introduction to Embedded Systems
Sudhanshu Janwadkar
 

Similar to Presentation (20)

Optimizing Linux Servers
Optimizing Linux ServersOptimizing Linux Servers
Optimizing Linux Servers
 
Arm developement
Arm developementArm developement
Arm developement
 
Phytium 64 core cpu preview
Phytium 64 core cpu previewPhytium 64 core cpu preview
Phytium 64 core cpu preview
 
Java under the hood
Java under the hoodJava under the hood
Java under the hood
 
Introduction to Blackfin BF532 DSP
Introduction to Blackfin BF532 DSPIntroduction to Blackfin BF532 DSP
Introduction to Blackfin BF532 DSP
 
Technical Implementation: Hardware
Technical Implementation: HardwareTechnical Implementation: Hardware
Technical Implementation: Hardware
 
Optimizing Python
Optimizing PythonOptimizing Python
Optimizing Python
 
AVR_Course_Day4 introduction to microcontroller
AVR_Course_Day4 introduction to microcontrollerAVR_Course_Day4 introduction to microcontroller
AVR_Course_Day4 introduction to microcontroller
 
Micro-controllers (PIC) based Application Development
Micro-controllers (PIC) based Application DevelopmentMicro-controllers (PIC) based Application Development
Micro-controllers (PIC) based Application Development
 
Assembly programming
Assembly programmingAssembly programming
Assembly programming
 
Basic 8051 question
Basic 8051 questionBasic 8051 question
Basic 8051 question
 
TMS320C5x
TMS320C5xTMS320C5x
TMS320C5x
 
Let’s Fix Logging Once and for All
Let’s Fix Logging Once and for AllLet’s Fix Logging Once and for All
Let’s Fix Logging Once and for All
 
Refactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesRefactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid Architectures
 
Embedded System Programming on ARM Cortex M3 and M4 Course
Embedded System Programming on ARM Cortex M3 and M4 CourseEmbedded System Programming on ARM Cortex M3 and M4 Course
Embedded System Programming on ARM Cortex M3 and M4 Course
 
Ppt on embedded system
Ppt on embedded systemPpt on embedded system
Ppt on embedded system
 
Ds03 part i algorithms by jyoti lakhani
Ds03 part i algorithms   by jyoti lakhaniDs03 part i algorithms   by jyoti lakhani
Ds03 part i algorithms by jyoti lakhani
 
A Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm BasebandsA Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm Basebands
 
Micro controller selection
Micro controller selectionMicro controller selection
Micro controller selection
 
Introduction to Embedded Systems
Introduction to Embedded SystemsIntroduction to Embedded Systems
Introduction to Embedded Systems
 

Recently uploaded

Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
Pipe Restoration Solutions
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
FluxPrime1
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
seandesed
 
Vaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdfVaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdf
Kamal Acharya
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 
LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
ssuser9bd3ba
 
Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
MuhammadTufail242431
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 

Recently uploaded (20)

Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 

Presentation

  • 1. Microseconds matter in High Frequency Trading: High performance trading systems in C++. Ravi Parikh, TWO ROADS TRADING PVT LTD (http://tworoads-trading.co.in/)
  • 2. Introduction
      About me:
      ● HFT infra developer for TWO ROADS TRADING since 2011
      ● Close to 9 years of overall experience in software development
      Today’s talk:
      ● General software development vs HFT software development
      ● Overview of HFT trading and why speed matters so much
      ● Importance of correctness / robustness of HFT systems
      ● A few C++ optimization techniques for ultra low latency software development
      ● Noisy neighbors
      ● Measurements of performance
  • 5. HFT Trading
      ➔ Trading in general is about buying something and selling it; it can result in a profit or a loss depending on the prices at which it was bought and sold.
      ➔ HFT is about market making: there is no genuine intention of buying or selling. HFT firms aren’t speculators; they are there to provide liquidity to the market.
      ➔ HFT makes money from very small profitable trades executed at very high frequency (the holding period for any open trade is very small).
      ➔ The other main objective is to avoid taking bad trades, which can result in larger losses.
      ➔ So what is the role of an ultra low latency system in HFT? It’s about spotting the opportunity for those quick, small, profitable trades and grabbing them, and at the same time about pulling out in time to avoid larger losses. (After all, there will always be stiff competition fighting for the same trades, given that markets are becoming more efficient with each passing day.)
  • 6. Role of Latency in Grabbing the Opportunity
      - Against all odds, only the fastest few will be able to book tickets successfully!
      Source: Internet
  • 7. Role of Latency in Pulling Out! (Avoid bad trades)
      - It’s equally important to pull out of a bad trade before someone hits you with the fill. (These are the trades where you were slow to change the price and were forced to take a trade you knew was a bad one, so speed matters even when you want to avoid making a loss.)
      Source: Two Players Org
  • 8. So How Fast Is Fast Enough?
      - It doesn’t matter whether you’re faster by 1 second or 1 nanosecond, as long as you’re ahead of everyone else. (Unfortunately, in the HFT domain there are in most cases no silver and bronze rewards: it’s gold or nothing, or even worse, a loss.)
      Sources: Photo by Alvin Loke; Two Players Org
  • 9. HFT System Overview (T2T, tick-to-trade latency)
      Software solutions: 1-10 microseconds
      Hardware solutions: 0.5-2 microseconds
  • 10. Robustness?
      ● There is always a trade-off between adding extra if checks and saving a few CPU cycles, so robustness and optimization don’t always go well together.
      ● Even though it is not put forward as the most critical feature of the system ahead of speed, robustness can never be compromised in HFT.
      ● An opportunity to make Rs 1 of profit from buying / selling a stock, arriving at uniform intervals in a 5 microsecond latency system, does not guarantee a profit on every trade, because we will not be able to capture every opportunity. However, a BUG in the system (however trivial, say buy / sell flipped!) will guarantee a loss: one minute holds 60 s / 5 µs = 12 million such opportunities, so at Rs 1 each that is a loss of 12 million INR in ONE MINUTE.
      ● So in HFT infra development, safety always comes first. One has to be 120% sure that there are no bugs in the system that will run in production, because a buggy run of just a few seconds or minutes can make headlines the next day.
      ● Given that nothing can be done at the expense of robustness, making an application faster becomes even more challenging and interesting.
  • 11. Optimizations (Prerequisites)
      ● Hardware selection (CPU / RAM / cache)
      ● Network selection (switches / network adapters)
      ● Understanding of the OS / platform (OS version, OS / kernel features, OS memory management, interrupt management, etc.)
      ● Programming language selection: why C++?
      ● Compiler / linker (compiler features, compile options, type of compiler, etc.)
      ● External libraries (dependencies / features)
      ● Various tools for debugging / profiling (GDB, valgrind, cachegrind, gprof, etc.)
      It is simply not possible to improve T2T in HFT, even with the most logically optimized C++ code, unless one understands the environment in which that C++ code will eventually run and interact.
  • 12. External Optimizations (Hardware)
      ● CPU
      ● RAM
      ● Different types of cache and cache sizes
      ● How do you pick the correct combination?
      ● Network adapter (kernel bypass)
      ● OS tuning (context switches? interrupt binding?)
  • 13. Fine Tuned System Performance
      Source: CPPCON (Carl Cook)
  • 14. C++ Coding Optimizations
      A few techniques that we’ll talk about:
      ● Where do we start? What is the hot path?
      ● Logging is essential, isn’t it? What do we do then?
      ● Dynamic memory management (new / delete)
      ● Data binding
      ● Strings
      ● Inline (always_inline / noinline)
      ● Branching (what are the issues?)
      Disclaimer: I have not covered all the typical C++ optimizations; these are just a few quick techniques that can make a significant difference to performance.
  • 15. Where to Start? Hot Path
      ● The “hotpath” is the full path the execution flows through to perform the actual end transaction; in HFT it is the T2T path.
      ● The “hotpath” is exercised only 0.01% of the time. The rest of the time the system is idle, doing administrative work, or waiting for events.
      ● OS, networks and hardware are designed for throughput and fairness rather than for the latency of one rarely executed path.
      ● Jitter is totally unacceptable. It is the major source of bad trades and forces one to move to a pure hardware solution, even though the median latency might actually get worse.
  • 16. Removing Jitters From Hotpath
      Source: CPPCON (Carl Cook)
  • 17. HOTPATH in HFT System
      Source: CPPCON (Carl Cook)
  • 18. Solution?
      Source: CPPCON (Carl Cook)
  • 19. Logging
      ● Almost all production systems need to log some important data.
      ● Disk I/O is the worst of all hardware operations in terms of performance.
      ● If your C++ code logs too much, it spends most of its time doing disk I/O and consuming CPU on unproductive work. First minimize logging as far as possible, move it out of the hotpath, use compressed forms of data, etc.
      What are the other options?
  • 20. Offload Logging
      ● Move logging to custom handles rather than std::cout / std::cerr / printf, and introduce buffering on your handle (i.e. create a buffer of 1024 bytes and flush it only when required).
      ● Standard streams are also buffered unless we flush them (std::cerr is the exception: it flushes after every output operation), but with custom handles we have better control over when to flush and can design the handle around the type of logging we do.
      ● Get rid of logging in your production process entirely to eliminate jitter. The required information can be written in some format to, say, a message queue (MQ) or shared memory (SHM), and a completely separate process can then write it to log files. This improves the latency of the production system significantly.
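The first bullet of the slide above can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the class name `BufferedLog`, its methods, and the choice of a `FILE*` sink are all assumptions made for the example. The 1024-byte capacity is the figure mentioned on the slide.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical buffered log handle (names are illustrative only).
// Messages are collected in a fixed 1024-byte buffer and written to the
// underlying FILE* only when the buffer would overflow or flush() is called.
class BufferedLog {
public:
    explicit BufferedLog(std::FILE* out) : out_(out) {}
    ~BufferedLog() { flush(); }

    BufferedLog(const BufferedLog&) = delete;
    BufferedLog& operator=(const BufferedLog&) = delete;

    void write(const char* msg, std::size_t len) {
        if (len > kCapacity) {
            // Oversized message: drain what we have, then bypass the buffer.
            flush();
            std::fwrite(msg, 1, len, out_);
            return;
        }
        if (used_ + len > kCapacity) {
            flush();  // make room before appending
        }
        std::memcpy(buf_ + used_, msg, len);
        used_ += len;
    }

    // The caller decides when the (slow) write to the sink happens,
    // e.g. outside the hotpath.
    void flush() {
        if (used_ > 0) {
            std::fwrite(buf_, 1, used_, out_);
            used_ = 0;
        }
    }

    std::size_t pending() const { return used_; }

private:
    static constexpr std::size_t kCapacity = 1024;
    std::FILE* out_;
    char buf_[kCapacity];
    std::size_t used_ = 0;
};
```

The point of the design is that `write` on the hotpath is only a bounds check and a `memcpy`; the expensive I/O is deferred to a moment the application chooses. Offloading to a separate process over MQ / SHM, as the last bullet suggests, goes one step further and is not shown here.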
  • 21. Dynamic Memory Management
      ● There will always be cases where the production system uses heap memory and creates objects on the fly (with new and delete).
      ● If your C++ code uses new / delete / malloc etc., what are the issues in terms of latency? What are the alternatives?
  • 22. Memory Pool
      ● new / delete go through the allocator in the C library, which may in turn call into the kernel (for example to grow the heap), so control can flow through library code and kernel space.
      ● The free path in glibc is roughly 400 lines of bookkeeping code, which eats up a lot of CPU cycles.
      ● The solution is to develop your own C++ class that takes care of memory management for the duration of your program. We allocate a pool of objects up front and, instead of using new / delete, use this class to assign and release objects. This avoids kernel space execution and improves both latency and jitter.
      ● A bonus is that we will very often be handed recently used objects, which improves cache performance.
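A fixed-size pool along the lines described above might look like the sketch below. The `Order` struct and the `ObjectPool` interface are hypothetical names invented for illustration; the slides do not show the actual class.

```cpp
#include <array>
#include <cstddef>

// Hypothetical order type used only for illustration.
struct Order {
    long id = 0;
    double price = 0.0;
    int quantity = 0;
};

// Fixed-size pool: all N objects are created up front. acquire() and
// release() only pop and push slot indices on a free list, so no
// new / delete (and no allocator or kernel code) runs on the hotpath.
template <typename T, std::size_t N>
class ObjectPool {
public:
    ObjectPool() {
        for (std::size_t i = 0; i < N; ++i) free_[i] = i;
    }

    // Returns nullptr when the pool is exhausted.
    T* acquire() {
        if (free_count_ == 0) return nullptr;
        return &slots_[free_[--free_count_]];
    }

    // ptr must have been returned by acquire() on this pool.
    void release(T* ptr) {
        *ptr = T{};  // reset to a default state before reuse
        free_[free_count_++] = static_cast<std::size_t>(ptr - slots_.data());
    }

    std::size_t available() const { return free_count_; }

private:
    std::array<T, N> slots_{};
    std::array<std::size_t, N> free_{};
    std::size_t free_count_ = N;
};
```

Because the free list is last-in, first-out, the slot released most recently is the next one handed out, which is the "recently used objects" cache benefit the slide mentions. The sketch is single-threaded and does no double-release checking.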
  • 23. Data Binding
      ● How many bytes are read when some_function is called?
      ● What is the problem with data access here? How do we fix it?
  • 24. Cache Binding / Cache Line Usage
      ● Packing related data closely together improves cache access.
      ● In this case, you get access to the other member variables of the argument at effectively zero cost, because they arrive in the same cache line.
      ● You can design your code to optimize the usage of cache lines.
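The code shown on these two slides is not part of the transcript, so the following is only a sketch of the general idea under stated assumptions: a 64-byte cache line (typical on x86, but an assumption here) and a hypothetical `HotQuote` struct whose fields are all read together on the hotpath.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on most current x86 CPUs.
constexpr std::size_t kCacheLine = 64;

// Hypothetical layout: every field the hotpath reads together is placed
// in one struct that starts on a cache-line boundary, so touching one
// field brings the others into cache with the same line fill.
struct alignas(kCacheLine) HotQuote {
    std::int64_t bid_price;
    std::int64_t ask_price;
    std::int32_t bid_size;
    std::int32_t ask_size;
    std::uint64_t sequence;
};

static_assert(sizeof(HotQuote) == kCacheLine, "HotQuote should fill exactly one line");
static_assert(alignof(HotQuote) == kCacheLine, "HotQuote should be line-aligned");

// Reads two fields; both come from the same cache line.
inline std::int64_t spread(const HotQuote& q) {
    return q.ask_price - q.bid_price;
}
```

The `static_assert`s turn the layout intent into a compile-time check, so a later field addition that spills into a second cache line fails the build instead of silently adding a cache miss.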
  • 25. Strings
      ● We like C++ strings and use them extensively, but you may be surprised at how much slower they are under performance stress testing.
      ● Many studies have compared char arrays with strings, and in general strings are slower than char arrays by around 23%!
      ● Ultimately the CPU / OS works best when it deals with plain 1s and 0s. When you ask for a string comparison or a char array comparison, it compares in a generic way (i.e. it compares character by character and stops at the end of the string or at a mismatch). This is a problem for latency because it is a linear scan: even if it takes only 50-60 cycles in isolation for, say, a 16 char comparison, using strings at 20 places in the code costs about 1200 cycles (~0.4 microseconds at 3 GHz!).
      Solution?
  • 26. Avoid String Operations When Possible
      ● When we know the length of the string is fixed in most cases (or it varies but fits in a fixed-size, padded array), we can implement a simpler comparison by type casting; the latency of the comparison drops by at least 38%.
      A length 8 char array comparison: *((uint64_t*)(arr_a)) == *((uint64_t*)(arr_b))
      A length 16 char array comparison: *((uint64_t*)(arr_a)) == *((uint64_t*)(arr_b)) && *((uint64_t*)(arr_a+8)) == *((uint64_t*)(arr_b+8))
      Note that the pointers must be dereferenced; comparing (uint64_t)(arr_a) with (uint64_t)(arr_b) would compare the two addresses, not the contents.
      This executes faster because the processor only has to match all the bits, and on a 64-bit system that is a single word comparison.
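The pointer-cast form on the slide can run into alignment and strict-aliasing problems in standard C++. A common alternative, shown here as a sketch rather than as the speaker's code, copies the bytes into a `uint64_t` with `memcpy`; optimizing compilers typically reduce this to a single 8-byte load. The helper names `equal8` and `equal16` are made up for this example.

```cpp
#include <cstdint>
#include <cstring>

// Compares two 8-byte fields as one 64-bit word.
// Both pointers must reference at least 8 readable bytes.
inline bool equal8(const char* a, const char* b) {
    std::uint64_t x;
    std::uint64_t y;
    std::memcpy(&x, a, sizeof x);  // well-defined for any alignment
    std::memcpy(&y, b, sizeof y);
    return x == y;
}

// Compares two 16-byte fields as two 64-bit words.
// Both pointers must reference at least 16 readable bytes.
inline bool equal16(const char* a, const char* b) {
    return equal8(a, b) && equal8(a + 8, b + 8);
}
```

This only works when both sides are stored in fixed-size arrays padded the same way (for example with zero bytes), because all 8 or 16 bytes take part in the comparison, including anything after the terminator.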
  • 27. Inline
      ● What is the inline keyword?
      ● What happens when the execution reaches a function call?
      ● Why not inline everything?
      ● Why doesn’t the compiler expand everything?
  • 28. always_inline and noinline
      ● The inline keyword has been slightly misunderstood. It mainly means that multiple definitions are permitted (i.e. a common header with the definition can be included into 2 cpp files).
      ● always_inline and noinline are stronger hints to the compiler, but one has to measure the latency impact when using them.
      ● Why doesn’t the compiler expand everything in place?
        - DLLs
        - Virtual functions
        - Recursive function calls
        - A bigger executable means more disk space and load time, and also puts pressure on the cache
      In general you can hint the compiler not to inline small functions that are not doing anything productive or that should be out of the hotpath:
      __attribute__((noinline)) void some_function () { // Not doing anything useful }
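As a rough illustration of how the two attributes might be combined, the sketch below keeps a cheap check inlined and pushes rare error handling out of line. It uses the GCC / Clang `__attribute__` spelling from the slide (other compilers spell these differently), and every function and variable name is hypothetical.

```cpp
#include <cstdio>

// Hypothetical counter, only here so the effect is observable.
static int g_reject_count = 0;

// Rare, slow error handling: kept out of line so its code does not
// bloat the callers on the hotpath.
__attribute__((noinline)) void handle_reject(int price) {
    ++g_reject_count;
    std::fprintf(stderr, "order rejected, price=%d\n", price);
}

// Tiny check that sits on the hotpath: ask the compiler to always expand it.
__attribute__((always_inline)) inline bool is_valid_price(int price) {
    return price > 0;
}

// Returns 0 on success, -1 if the order was rejected.
inline int check_order(int price) {
    if (!is_valid_price(price)) {
        handle_reject(price);
        return -1;
    }
    return 0;
}
```

Whether either attribute helps depends on the compiler, the flags and the surrounding code, so, as the slide says, the latency impact has to be measured rather than assumed.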
  • 29. Branching
      ● Why is branching bad? For example, say I can buy / sell something, and at multiple places in my execution code I have checks like:
        if( BUY == activity_type ) { } else if( SELL == activity_type ) { } else { //ERROR }
      ● What are the options we have?
  • 30. Branching Effects
      Source: Image by Mecanismo (CC-BY-SA 3.0)
  • 31. Branch Prediction
      ● Consider an if statement inside a loop, such as if (data[c] >= 128) sum += data[c]; (the condition that the replacement at the end of this slide computes). At the processor level it is actually a branch instruction! (Assume data[c] holds values between 0 and 255, and c is a counter looping over the array.)
      ● Processors are smart enough to prefetch a set of instructions to speed up execution.
      ● When your processor sees a branch, it has no idea which way it will go. Without prediction it would have to halt execution, wait until the previous instructions complete, and only then pick the correct path.
      ● Modern processors are quite complicated and have long pipelines, so they take forever to “warm up” and “slow down”.
      ● What are the alternatives? Write code that is friendly to branch prediction (e.g. if possible sort the array, which improves branch prediction).
      ● Apply smart hacks based on assumptions that are valid (with the replacement below there is no branching, and your train never has to stop here):
        int t = ( data[c] - 128 ) >> 31 ; sum += ~t & data[c] ;
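The branchless replacement on slide 31 can be put side by side with a branching loop to check that both compute the same sum. This is a sketch under the slide's stated assumption (values between 0 and 255) plus two more that are labeled in the comments: a 32-bit `int` and an arithmetic right shift for negative values (guaranteed since C++20, and the behavior of mainstream compilers before that). The function names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>

// Branching version: adds only the elements that are >= 128.
inline std::int64_t sum_branching(const int* data, std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < n; ++c) {
        if (data[c] >= 128) {
            sum += data[c];
        }
    }
    return sum;
}

// Branchless version from the slide.
// Assumes 32-bit int, values in [0, 255], and arithmetic right shift.
inline std::int64_t sum_branchless(const int* data, std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < n; ++c) {
        // t is -1 (all bits set) when data[c] < 128, otherwise 0.
        const int t = (data[c] - 128) >> 31;
        // ~t is 0 when data[c] < 128 (adds nothing), all ones otherwise.
        sum += ~t & data[c];
    }
    return sum;
}
```

Modern compilers may already turn the branching loop into conditional moves or vector code at higher optimization levels, so the generated assembly and a measurement are worth checking before adopting the manual version.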
  • 32. Further Branching Improvements
      Source: CPPCON (Carl Cook)
  • 34. Noisy Neighbours Solution
      ● You have to be very careful in choosing which processes run on the system.
      ● Find out which processes are actually sharing the L2 cache.
      ● Identify whether any process is polluting the L3 cache and in turn hurting the performance of the production application.
      ● One can disable cores that are not in use to more or less lock the cache, and disable hyper-threading to ensure better use of the L2 cache.
      ● There are various hacks for making some kernel modules use RAM directly instead of caching their data.
  • 35. Performance Measurements
      Challenges:
      ● How do you measure very small blocks of code when the measurement itself may take longer than the code? gettimeofday on Linux with the TSC clock source itself takes ~120-150 CPU cycles.
      ● Measurements in an offline setup will be far away from those observed in the production system.
      ● How do you find out which units of the code are slow?
      ● Do you actually look at the assembly code, and how useful is that in practical scenarios?
      ● How useful are tools like cachegrind, gprof, and the PAPI libraries with hardware counters?
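The first challenge above, that the timer itself has a cost, can be estimated with a small sketch like the one below. Note the substitution: it uses the portable `std::chrono::steady_clock` rather than `gettimeofday` or the raw TSC mentioned on the slide, and the function name `clock_overhead_ns` is made up for the example.

```cpp
#include <chrono>

// Estimates the average cost of one clock read, in nanoseconds, by timing
// many back-to-back reads. The result is a floor for what this clock can
// measure: code blocks shorter than this are dominated by timer overhead.
inline double clock_overhead_ns(int iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    auto last = start;
    for (int i = 0; i < iterations; ++i) {
        last = clock::now();
    }
    const auto total =
        std::chrono::duration_cast<std::chrono::nanoseconds>(last - start).count();
    return static_cast<double>(total) / iterations;
}
```

A call such as `clock_overhead_ns(1000000)` gives a machine-specific number; as the slide points out, figures taken on an offline box may differ considerably from production.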
  • 36. Measurements of HFT System Performance
      Source: CPPCON (Carl Cook)
  • 37. Talk of the Town: FPGA
      ● This is the current area of focus for most HFT firms nowadays.
      ● A pure end-to-end FPGA solution is quite complex and requires a lot of time and effort.
      ● Not everything can be optimized in FPGA, since at present most FPGA boards operate at around 2/2.5 GHz.
      ● A lot of firms like us are trying to develop a hybrid end-to-end solution in FPGA that retains the best of software and hardware.
      ● The primary motivation for moving to FPGA is to remove jitter from the system; no software solution can offer latency as stable as hardware can. The major concern is that no one wants to be slow even during 1% of the time the application is trading. You can be the fastest and make money 99% of the time, but jitter can wipe it all away!
      Questions?
  • 38. THANK YOU! Contact: ravi.parikh@tworoads-trading.co.in