Mateusz Pusz
September 18, 2017
STRIVING FOR ULTIMATE
LOW LATENCY
INTRODUCTION TO DEVELOPMENT OF LOW LATENCY SYSTEMS
LATENCY VS THROUGHPUT

Latency is the time required to perform some action or to produce
some result. It is measured in units of time: hours, minutes, seconds,
nanoseconds, or clock periods.

Throughput is the number of such actions executed or results
produced per unit of time. It is measured in units of whatever is being
produced per unit of time.
WHAT DO WE MEAN BY LOW LATENCY?

Low latency allows human-unnoticeable delays between an input
being processed and the corresponding output, providing real-time
characteristics. It is especially important for internet services such as
trading, online gaming, and VoIP.
WHY DO WE STRIVE FOR LOW LATENCY?

• In VoIP, substantial delays between inputs from conversation participants may impair their
  communication
• In online gaming, a player with a high-latency internet connection may show slow responses in spite of
  superior tactics or appropriate reaction times
• Within capital markets, the proliferation of algorithmic trading requires firms to react to market events
  faster than the competition to increase the profitability of trades
HIGH-FREQUENCY TRADING (HFT)

"A program trading platform that uses powerful computers to
transact a large number of orders at very fast speeds"
-- Investopedia

• Uses complex algorithms to analyze multiple markets and execute orders based on market conditions
• Buys and sells securities many times over a period of time (often hundreds of times an hour)
• Done to profit from time-sensitive opportunities that arise during trading hours
• Implies high turnover of capital (i.e. one's entire capital or more in a single day)
• Typically, the traders with the fastest execution speeds are the most profitable
MARKET DATA PROCESSING

HOW FAST DO WE DO?

• All software approach: 1-10 us
• All hardware approach: 100-1000 ns

• An average human eye blink takes 350 000 us (1/3 s)
• Millions of orders can be traded in that time
WHAT IF SOMETHING GOES WRONG?

KNIGHT CAPITAL

• In 2012 it was the largest trader in U.S. equities
• Market share
  – 17.3% on NYSE
  – 16.9% on NASDAQ
• Had approximately $365 million in cash and equivalents
• Average daily trading volume
  – 3.3 billion trades
  – trading over 21 billion dollars
• Pre-tax loss of $440 million in 45 minutes
-- LinkedIn
C++ OFTEN NOT THE MOST IMPORTANT PART OF THE SYSTEM

• Low-latency network
• Modern hardware
• BIOS profiling
• Kernel profiling
• OS profiling
SPIN, PIN, AND DROP-IN

SPIN
• Don't sleep
• Don't context switch
• Prefer single-threaded scheduling
• Disable locking and thread support
• Disable power management
• Disable C-states
• Disable interrupt coalescing

PIN
• Assign CPU affinity
• Assign interrupt affinity
• Assign memory to NUMA nodes
• Consider the physical location of NICs
• Isolate cores from general OS use
• Use a system with a single physical CPU

DROP-IN
• Choose NIC vendors based on performance and availability of drop-in kernel bypass libraries
• Use the kernel bypass library
LET'S SCOPE ON THE SOFTWARE

CHARACTERISTICS OF LOW LATENCY SOFTWARE

• Typically only a small part of the code is really important (the fast path)
• That code is not executed often
• When it is executed it has to
  – start and finish as soon as possible
  – have predictable and reproducible performance (low jitter)
• Multithreading increases latency
  – it is about low latency, not throughput
  – concurrency (even on different cores) thrashes CPU caches above L1, shares the memory bus,
    shares IO, shares the network
• Mistakes are really costly
  – good error checking and recovery is mandatory
  – one second is 4 billion CPU instructions (a lot can happen in that time)
HOW TO DEVELOP SOFTWARE THAT HAS PREDICTABLE PERFORMANCE?

It turns out that the more important question here is...

HOW NOT TO DEVELOP SOFTWARE THAT HAS PREDICTABLE PERFORMANCE?
• In a low latency system we care a lot about WCET (Worst Case Execution Time)
• In order to limit WCET we should limit the usage of specific C++ language features
• This is a task not only for developers but also for code architects
THINGS TO AVOID ON THE FAST PATH

1. C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>)
2. Throwing exceptions on a likely code path
3. Dynamic polymorphism
4. Multiple inheritance
5. RTTI
6. Dynamic memory allocations
std::shared_ptr<T>

template<class T>
class shared_ptr;

• Smart pointer that retains shared ownership of an object through a pointer
• Several shared_ptr objects may own the same object
• The shared object is destroyed and its memory deallocated when the last remaining shared_ptr
  owning it is either destroyed or assigned another pointer via operator= or reset()
• Supports a user-provided deleter

Too often overused by C++ programmers
QUESTION: WHAT IS THE DIFFERENCE HERE?

void foo()
{
  std::unique_ptr<int> ptr{new int{1}};
  // some code using 'ptr'
}

void foo()
{
  std::shared_ptr<int> ptr{new int{1}};
  // some code using 'ptr'
}
KEY std::shared_ptr<T> ISSUES

• Shared state
  – performance + memory footprint
• Mandatory synchronization
  – performance
• Type erasure
  – performance
• std::weak_ptr<T> support
  – memory footprint
• Aliasing constructor
  – memory footprint
MORE INFO ON CODE::DIVE 2016
C++ EXCEPTIONS

• Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++
  exceptions...
• ...if they are not thrown
• Throwing an exception can take significant and non-deterministic time
• Advantages of using C++ exceptions
  – (if not thrown) they can actually improve application performance
  – they cannot be ignored!
  – they simplify interfaces
  – they make the source code of the likely path easier to reason about

Not using C++ exceptions is not an excuse to write exception-unsafe code!
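A rough illustration of the "easier to reason about the likely path" advantage, using hypothetical parsing helpers (names are mine, not from the talk): with an error code every caller branches on the result even on the likely path, while with an exception the likely path stays straight-line and all the cost sits on the unlikely throw:

```cpp
#include <stdexcept>

// Exception style: the likely path is a straight-line computation; the
// unlikely path throws, and the cost is paid only when it actually happens.
int parse_or_throw(int raw)
{
    if (raw < 0) throw std::invalid_argument{"negative input"};  // unlikely path
    return raw * 2;                                              // likely path
}

// Error-code style: every caller must test the return value on the hot path,
// and the failure can be silently ignored.
bool parse_or_code(int raw, int& out)
{
    if (raw < 0) return false;
    out = raw * 2;
    return true;
}
```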
POLYMORPHISM
DYNAMIC

class base {
  virtual void setup() = 0;
  virtual void run() = 0;
  virtual void cleanup() = 0;
public:
  virtual ~base() = default;
  void process()
  {
    setup();
    run();
    cleanup();
  }
};

class derived : public base {
  void setup() override { /* ... */ }
  void run() override { /* ... */ }
  void cleanup() override { /* ... */ }
};

• Additional pointer stored in every object
• Extra indirection (pointer dereference)
• Often not possible to devirtualize
• Not inlined
• Instruction cache misses
POLYMORPHISM
STATIC

template<class Derived>
class base {
public:
  void process()
  {
    static_cast<Derived*>(this)->setup();
    static_cast<Derived*>(this)->run();
    static_cast<Derived*>(this)->cleanup();
  }
};

class derived : public base<derived> {
  friend class base<derived>;
  void setup() { /* ... */ }
  void run() { /* ... */ }
  void cleanup() { /* ... */ }
};
MULTIPLE INHERITANCE

• this pointer adjustments are needed to call a member function (for non-empty base classes)

DIAMOND OF DREAD

• Virtual inheritance as an answer
• virtual in C++ means "determined at runtime"
• Extra indirection to access data members

Always consider composition before inheritance!
RUNTIME TYPE IDENTIFICATION (RTTI)

class base {
public:
  virtual ~base() = default;
  virtual void foo() = 0;
};

class derived : public base {
public:
  void foo() override;
  void boo();
};

void foo(base& b)
{
  derived* d = dynamic_cast<derived*>(&b);
  if(d) {
    d->boo();
  }
}

Often the sign of a smelly design

dynamic_cast costs:
• Traversing an inheritance tree
• Comparisons

A cheaper alternative when the exact type is known:

void foo(base& b)
{
  if(typeid(b) == typeid(derived)) {
    derived* d = static_cast<derived*>(&b);
    d->boo();
  }
}

• Only one comparison of std::type_info
• Often only one runtime pointer compare
DYNAMIC MEMORY ALLOCATIONS

• General purpose operation
• Non-deterministic execution performance
• Causes memory fragmentation
• Memory leaks possible if not properly handled
• May fail (error handling is needed)
CUSTOM ALLOCATORS TO THE RESCUE

• Address specific needs (functionality and hardware constraints)
• Typically a low number of dynamic memory allocations
• Data structures needed to manage big chunks of memory

template<typename T> struct pool_allocator {
  using value_type = T;
  T* allocate(std::size_t n);
  void deallocate(T* p, std::size_t n);
};

using pool_string = std::basic_string<char, std::char_traits<char>, pool_allocator<char>>;

Preallocation makes the allocator jitter more stable, helps keep related
data together, and avoids long-term fragmentation.
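One common way to realize the preallocation idea above is a monotonic ("bump") allocator over a fixed arena. This is a minimal illustrative sketch under my own assumptions (arena size, names); it trades per-object deallocation for O(1), syscall-free allocation:

```cpp
#include <cstddef>
#include <vector>

// A fixed arena preallocated up front; the size is an assumption made for
// this sketch, not taken from the slides.
alignas(std::max_align_t) std::byte arena[64 * 1024];
std::size_t arena_used = 0;

template<typename T>
struct arena_allocator {
    using value_type = T;
    arena_allocator() = default;
    template<typename U> arena_allocator(const arena_allocator<U>&) {}

    T* allocate(std::size_t n)
    {
        // Bump-pointer allocation: align, hand out, advance. No locking,
        // no syscalls, O(1). Overflow checking omitted for brevity.
        arena_used = (arena_used + alignof(T) - 1) & ~(alignof(T) - 1);
        T* p = reinterpret_cast<T*>(arena + arena_used);
        arena_used += n * sizeof(T);
        return p;
    }
    void deallocate(T*, std::size_t) {}  // monotonic: memory reclaimed all at once
};

template<typename T, typename U>
bool operator==(const arena_allocator<T>&, const arena_allocator<U>&) { return true; }
template<typename T, typename U>
bool operator!=(const arena_allocator<T>&, const arena_allocator<U>&) { return false; }

// usage: std::vector<int, arena_allocator<int>> v; v.push_back(42);
```

Because all objects live in one contiguous arena, this also helps the data-locality goal mentioned later in the deck.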
SMALL OBJECT OPTIMIZATION (SOO / SSO / SBO)

Prevent dynamic memory allocation for the (common) case of
dealing with small objects

class sso_string {
  char* data_ = u_.sso_;
  size_t size_ = 0;
  union {
    char sso_[16] = "";
    size_t capacity_;
  } u_;
public:
  size_t capacity() const { return data_ == u_.sso_ ? sizeof(u_.sso_) - 1 : u_.capacity_; }
  // ...
};
NO DYNAMIC ALLOCATION

template<std::size_t MaxSize>
class inplace_string {
  using value_type = char;
  std::array<value_type, MaxSize + 1> chars_;
public:
  // string-like interface
};

struct db_contact {
  inplace_string<7> symbol;
  inplace_string<15> name;
  inplace_string<15> surname;
  inplace_string<23> company;
};

No dynamic memory allocations or pointer indirections are guaranteed, at the
cost of possibly bigger memory usage.
HOW TO DEVELOP A SYSTEM WITH LOW-LATENCY CONSTRAINTS

• Keep the number of threads close to (less than or equal to) the number of available physical CPU cores
• Separate IO threads from business logic threads (unless the business logic is extremely lightweight)
• Use fixed-size lock-free queues / busy spins to pass data between threads
• Use optimal algorithms/data structures and the data locality principle
• Precompute; use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
• Do not try to be smarter than the compiler
• Know the language, tools, and libraries
• Know your hardware!
• Bypass the kernel (100% user-space code)
• Measure performance... ALWAYS
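The "fixed-size lock-free queues" bullet above can be sketched as a single-producer/single-consumer ring buffer with two atomic indices. This is a minimal illustrative version, not production code — real implementations would also pad the indices onto separate cache lines to avoid false sharing:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Single-producer/single-consumer ring buffer: one thread calls push(),
// another calls pop(); the two atomic indices are the only shared state.
template<typename T, std::size_t N>   // N: fixed capacity
class spsc_queue {
    std::array<T, N> buf_;
    std::atomic<std::size_t> head_{0};  // advanced by the consumer
    std::atomic<std::size_t> tail_{0};  // advanced by the producer
public:
    bool push(const T& v)               // producer side, never blocks
    {
        auto t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
        buf_[t % N] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    std::optional<T> pop()              // consumer side, never blocks
    {
        auto h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;  // empty
        T v = buf_[h % N];
        head_.store(h + 1, std::memory_order_release);
        return v;
    }
};
```

When the queue is empty or full, the caller busy-spins and retries rather than blocking — exactly the SPIN advice from earlier in the deck.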
THE MOST IMPORTANT RECOMMENDATION

Always measure your performance!
HOW TO MEASURE THE PERFORMANCE OF YOUR PROGRAMS

• Always measure the Release version
  cmake -DCMAKE_BUILD_TYPE=Release
  cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo
• Prefer hardware-based black-box performance measurements
• In case that is not possible, or you want to debug a specific performance issue, use a profiler
• To gather meaningful stack traces preserve the frame pointer
  set(CMAKE_CXX_FLAGS_RELWITHDEBINFO
      "${CMAKE_CXX_FLAGS_RELWITHDEBINFO} -fno-omit-frame-pointer")
• Familiarize yourself with Linux perf tools (xperf on Windows) and flame graphs
• Use tools like Intel VTune
• Verify the output assembly code
FLAMEGRAPH

[Flame graph image: stacked call-stack profile of a bash process, with frames
such as execute_command_internal, expand_word_list_internal, and sys_write
visible; the unreadable text residue of the rendered graph has been removed]
Low level java programming
 
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...
 

More from Mateusz Pusz

Free Lunch is Over: Why is C++ so Important in the Modern World?
Free Lunch is Over: Why is C++ so Important in the Modern World?Free Lunch is Over: Why is C++ so Important in the Modern World?
Free Lunch is Over: Why is C++ so Important in the Modern World?
Mateusz Pusz
 
A Physical Units Library for the Next C++
A Physical Units Library for the Next C++A Physical Units Library for the Next C++
A Physical Units Library for the Next C++
Mateusz Pusz
 
Rethinking the Way We do Templates in C++
Rethinking the Way We do Templates in C++Rethinking the Way We do Templates in C++
Rethinking the Way We do Templates in C++
Mateusz Pusz
 
C++11 Was Only the Beginning
C++11 Was Only the BeginningC++11 Was Only the Beginning
C++11 Was Only the Beginning
Mateusz Pusz
 
Implementing a Physical Units Library for C++
Implementing a Physical Units Library for C++Implementing a Physical Units Library for C++
Implementing a Physical Units Library for C++
Mateusz Pusz
 
Effective replacement of dynamic polymorphism with std::variant
Effective replacement of dynamic polymorphism with std::variantEffective replacement of dynamic polymorphism with std::variant
Effective replacement of dynamic polymorphism with std::variant
Mateusz Pusz
 
Implementing Physical Units Library for C++
Implementing Physical Units Library for C++Implementing Physical Units Library for C++
Implementing Physical Units Library for C++
Mateusz Pusz
 
C++ Concepts and Ranges - How to use them?
C++ Concepts and Ranges - How to use them?C++ Concepts and Ranges - How to use them?
C++ Concepts and Ranges - How to use them?
Mateusz Pusz
 
Effective replacement of dynamic polymorphism with std::variant
Effective replacement of dynamic polymorphism with std::variantEffective replacement of dynamic polymorphism with std::variant
Effective replacement of dynamic polymorphism with std::variant
Mateusz Pusz
 
Git, CMake, Conan - How to ship and reuse our C++ projects?
Git, CMake, Conan - How to ship and reuse our C++ projects?Git, CMake, Conan - How to ship and reuse our C++ projects?
Git, CMake, Conan - How to ship and reuse our C++ projects?
Mateusz Pusz
 
Beyond C++17
Beyond C++17Beyond C++17
Beyond C++17
Mateusz Pusz
 
Pointless Pointers - How to make our interfaces efficient?
Pointless Pointers - How to make our interfaces efficient?Pointless Pointers - How to make our interfaces efficient?
Pointless Pointers - How to make our interfaces efficient?
Mateusz Pusz
 
Small Lie in Big O
Small Lie in Big OSmall Lie in Big O
Small Lie in Big O
Mateusz Pusz
 
std::shared_ptr<T> - (not so) Smart hammer for every pointy nail
std::shared_ptr<T> - (not so) Smart hammer for every pointy nailstd::shared_ptr<T> - (not so) Smart hammer for every pointy nail
std::shared_ptr<T> - (not so) Smart hammer for every pointy nail
Mateusz Pusz
 

More from Mateusz Pusz (14)

Free Lunch is Over: Why is C++ so Important in the Modern World?
Free Lunch is Over: Why is C++ so Important in the Modern World?Free Lunch is Over: Why is C++ so Important in the Modern World?
Free Lunch is Over: Why is C++ so Important in the Modern World?
 
A Physical Units Library for the Next C++
A Physical Units Library for the Next C++A Physical Units Library for the Next C++
A Physical Units Library for the Next C++
 
Rethinking the Way We do Templates in C++
Rethinking the Way We do Templates in C++Rethinking the Way We do Templates in C++
Rethinking the Way We do Templates in C++
 
C++11 Was Only the Beginning
C++11 Was Only the BeginningC++11 Was Only the Beginning
C++11 Was Only the Beginning
 
Implementing a Physical Units Library for C++
Implementing a Physical Units Library for C++Implementing a Physical Units Library for C++
Implementing a Physical Units Library for C++
 
Effective replacement of dynamic polymorphism with std::variant
Effective replacement of dynamic polymorphism with std::variantEffective replacement of dynamic polymorphism with std::variant
Effective replacement of dynamic polymorphism with std::variant
 
Implementing Physical Units Library for C++
Implementing Physical Units Library for C++Implementing Physical Units Library for C++
Implementing Physical Units Library for C++
 
C++ Concepts and Ranges - How to use them?
C++ Concepts and Ranges - How to use them?C++ Concepts and Ranges - How to use them?
C++ Concepts and Ranges - How to use them?
 
Effective replacement of dynamic polymorphism with std::variant
Effective replacement of dynamic polymorphism with std::variantEffective replacement of dynamic polymorphism with std::variant
Effective replacement of dynamic polymorphism with std::variant
 
Git, CMake, Conan - How to ship and reuse our C++ projects?
Git, CMake, Conan - How to ship and reuse our C++ projects?Git, CMake, Conan - How to ship and reuse our C++ projects?
Git, CMake, Conan - How to ship and reuse our C++ projects?
 
Beyond C++17
Beyond C++17Beyond C++17
Beyond C++17
 
Pointless Pointers - How to make our interfaces efficient?
Pointless Pointers - How to make our interfaces efficient?Pointless Pointers - How to make our interfaces efficient?
Pointless Pointers - How to make our interfaces efficient?
 
Small Lie in Big O
Small Lie in Big OSmall Lie in Big O
Small Lie in Big O
 
std::shared_ptr<T> - (not so) Smart hammer for every pointy nail
std::shared_ptr<T> - (not so) Smart hammer for every pointy nailstd::shared_ptr<T> - (not so) Smart hammer for every pointy nail
std::shared_ptr<T> - (not so) Smart hammer for every pointy nail
 

Recently uploaded

To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 

Recently uploaded (20)

To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 

Striving for ultimate Low Latency

  • 1. Mateusz Pusz September 18, 2017 STRIVING FOR ULTIMATE LOW LATENCY INTRODUCTION TO DEVELOPMENT OF LOW LATENCY SYSTEMS
  • 4. Latency is the time required to perform some action or to produce some result. Measured in units of time like hours, minutes, seconds, nanoseconds or clock periods. Throughput is the number of such actions executed or results produced per unit of time. Measured in units of whatever is being produced per unit of time. | Striving for ultimate Low Latency LATENCY VS THROUGHPUT 2
  • 7. Low Latency allows human-unnoticeable delays between an input being processed and the corresponding output providing real time characteristics. | Striving for ultimate Low Latency WHAT DO WE MEAN BY LOW LATENCY? Especially important for internet connections utilizing services such as trading, online gaming and VoIP. 3
  • 11. • In VoIP substantial delays between input from conversation participants may impair their communication • In online gaming a player with a high latency internet connection may show slow responses in spite of superior tactics or the appropriate reaction time • Within capital markets the proliferation of algorithmic trading requires firms to react to market events faster than the competition to increase profitability of trades | Striving for ultimate Low Latency WHY DO WE STRIVE FOR LOW LATENCY? 4
  • 13. A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds -- Investopedia • Using complex algorithms to analyze multiple markets and execute orders based on market conditions • Buying and selling of securities many times over a period of time (often hundreds of times an hour) • Done to profit from time-sensitive opportunities that arise during trading hours • Implies high turnover of capital (i.e. one's entire capital or more in a single day) • Typically, the traders with the fastest execution speeds are more profitable | Striving for ultimate Low Latency HIGH-FREQUENCY TRADING (HFT) 5
  • 14. | Striving for ultimate Low Latency MARKET DATA PROCESSING 6
  • 17. 1-10us 100-1000ns • Average human eye blink takes 350 000us (1/3s) • Millions of orders can be traded in that time | Striving for ultimate Low Latency HOW FAST DO WE DO? ALL SOFTWARE APPROACH ALL HARDWARE APPROACH 7
  • 20. • In 2012 was the largest trader in U.S. equities • Market share – 17.3% on NYSE – 16.9% on NASDAQ • Had approximately $365 million in cash and equivalents • Average daily trading volume – 3.3 billion trades – trading over 21 billion dollars • pre-tax loss of $440 million in 45 minutes -- LinkedIn | Striving for ultimate Low Latency WHAT IF SOMETHING GOES WRONG? KNIGHT CAPITAL 9
  • 21. • Low Latency network • Modern hardware • BIOS profiling • Kernel profiling • OS profiling | Striving for ultimate Low Latency C++ OFTEN NOT THE MOST IMPORTANT PART OF THE SYSTEM 10
  • 24. • Don't sleep • Don't context switch • Prefer single-threaded scheduling • Disable locking and thread support • Disable power management • Disable C-states • Disable interrupt coalescing • Assign CPU affinity • Assign interrupt affinity • Assign memory to NUMA nodes • Consider the physical location of NICs • Isolate cores from general OS use • Use a system with a single physical CPU • Choose NIC vendors based on performance and availability of drop-in kernel bypass libraries • Use the kernel bypass library | Striving for ultimate Low Latency SPIN, PIN, AND DROP-IN SPIN PIN DROP-IN 11
  • 25. LET'S FOCUS ON THE SOFTWARE
  • 29. • Typically only a small part of code is really important (fast path) • That code is not executed often • When it is executed it has to – start and finish as soon as possible – have predictable and reproducible performance (low jitter) • Multithreading increases latency – it is about low latency and not throughput – concurrency (even on different cores) trashes CPU caches above L1, shares the memory bus, shares IO, shares the network • Mistakes are really costly – good error checking and recovery is mandatory – one second is 4 billion CPU instructions (a lot can happen in that time) | Striving for ultimate Low Latency CHARACTERISTICS OF LOW LATENCY SOFTWARE 13
  • 31. It turns out that the more important question here is... | Striving for ultimate Low Latency HOW TO DEVELOP SOFTWARE THAT HAS PREDICTABLE PERFORMANCE? 14
  • 32. | Striving for ultimate Low Latency HOW NOT TO DEVELOP SOFTWARE THAT HAS PREDICTABLE PERFORMANCE? 15
  • 33. • In a Low Latency system we care a lot about WCET (Worst Case Execution Time) • In order to limit WCET we should limit the usage of specific C++ language features • This is not only a task for developers but also for code architects | Striving for ultimate Low Latency HOW NOT TO DEVELOP SOFTWARE THAT HAS PREDICTABLE PERFORMANCE? 16
  • 34. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>) 2 Throwing exceptions on likely code path 3 Dynamic polymorphism 4 Multiple inheritance 5 RTTI 6 Dynamic memory allocations | Striving for ultimate Low Latency THINGS TO AVOID ON THE FAST PATH 17
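Item 6 on the list, dynamic memory allocation, is usually avoided by reserving all memory up front. A minimal sketch of that idea (the `object_pool` class and its interface are hypothetical, invented here for illustration):

```cpp
#include <array>
#include <cstddef>
#include <new>
#include <utility>

// Fixed-capacity pool: all storage is reserved at construction, so the
// fast path never calls the general-purpose (non-deterministic) allocator.
template<class T, std::size_t N>
class object_pool {
    union slot { slot* next; alignas(T) std::byte storage[sizeof(T)]; };
    std::array<slot, N> slots_;
    slot* free_ = nullptr;  // intrusive free list threaded through the slots
public:
    object_pool()
    {
        for (auto& s : slots_) { s.next = free_; free_ = &s; }
    }
    template<class... Args>
    T* alloc(Args&&... args)
    {
        if (!free_) return nullptr;  // pool exhausted - no heap fallback
        slot* s = free_;
        free_ = s->next;
        return ::new (static_cast<void*>(s->storage)) T(std::forward<Args>(args)...);
    }
    void free(T* p)
    {
        p->~T();
        slot* s = reinterpret_cast<slot*>(p);  // storage is at slot offset 0
        s->next = free_;
        free_ = s;
    }
};
```

Both `alloc` and `free` are a handful of pointer operations with constant worst-case time, which is exactly the WCET property the slide asks for.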
  • 36. template<class T> class shared_ptr; • Smart pointer that retains shared ownership of an object through a pointer • Several shared_ptr objects may own the same object • The shared object is destroyed and its memory deallocated when the last remaining shared_ptr owning that object is either destroyed or assigned another pointer via operator= or reset() • Supports a user-provided deleter | Striving for ultimate Low Latency std::shared_ptr<T> Too often overused by C++ programmers 18
  • 37. void foo() { std::unique_ptr<int> ptr{new int{1}}; // some code using 'ptr' } void foo() { std::shared_ptr<int> ptr{new int{1}}; // some code using 'ptr' } | Striving for ultimate Low Latency QUESTION: WHAT IS THE DIFFERENCE HERE? 19
  • 38. • Shared state – performance + memory footprint • Mandatory synchronization – performance • Type Erasure – performance • std::weak_ptr<T> support – memory footprint • Aliasing constructor – memory footprint | Striving for ultimate Low Latency KEY std::shared_ptr<T> ISSUES 20
  • 39. | Striving for ultimate Low Latency MORE INFO ON CODE::DIVE 2016 21
  • 45. • Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++ exceptions • ... if they are not thrown • Throwing an exception can take significant and non-deterministic time • Advantages of C++ exceptions usage – (if not thrown) actually can improve application performance – cannot be ignored! – simplify interfaces – make the source code of the likely path easier to reason about | Striving for ultimate Low Latency C++ EXCEPTIONS Not using C++ exceptions is not an excuse to write non-exception-safe code! 22
  • 46. class base { virtual void setup() = 0; virtual void run() = 0; virtual void cleanup() = 0; public: virtual ~base() = default; void process() { setup(); run(); cleanup(); } }; class derived : public base { void setup() override { /* ... */ } void run() override { /* ... */ } void cleanup() override { /* ... */ } }; | Striving for ultimate Low Latency POLYMORPHISM DYNAMIC 23
  • 47. class base {
    virtual void setup() = 0;
    virtual void run() = 0;
    virtual void cleanup() = 0;
  public:
    virtual ~base() = default;
    void process() { setup(); run(); cleanup(); }
  };
  class derived : public base {
    void setup() override { /* ... */ }
    void run() override { /* ... */ }
    void cleanup() override { /* ... */ }
  };
  • Additional pointer stored in an object
  • Extra indirection (pointer dereference)
  • Often not possible to devirtualize
  • Not inlined
  • Instruction cache miss
  | Striving for ultimate Low Latency POLYMORPHISM DYNAMIC 23
  • 48. DYNAMIC:
  class base {
    virtual void setup() = 0;
    virtual void run() = 0;
    virtual void cleanup() = 0;
  public:
    virtual ~base() = default;
    void process() { setup(); run(); cleanup(); }
  };
  class derived : public base {
    void setup() override { /* ... */ }
    void run() override { /* ... */ }
    void cleanup() override { /* ... */ }
  };
  STATIC:
  template<class Derived>
  class base {
  public:
    void process() {
      static_cast<Derived*>(this)->setup();
      static_cast<Derived*>(this)->run();
      static_cast<Derived*>(this)->cleanup();
    }
  };
  class derived : public base<derived> {
    friend class base<derived>;
    void setup() { /* ... */ }
    void run() { /* ... */ }
    void cleanup() { /* ... */ }
  };
  | Striving for ultimate Low Latency POLYMORPHISM DYNAMIC STATIC 24
  • 51. • this pointer adjustments needed to call a member function (for non-empty base classes)
  • Virtual inheritance as an answer to the diamond of dread
  • virtual in C++ means "determined at runtime"
  • Extra indirection to access data members
  Always consider composition before inheritance!
  | Striving for ultimate Low Latency MULTIPLE INHERITANCE DIAMOND OF DREAD 26
  • 54. class base {
  public:
    virtual ~base() = default;
    virtual void foo() = 0;
  };
  class derived : public base {
  public:
    void foo() override;
    void boo();
  };
  void foo(base& b) {
    derived* d = dynamic_cast<derived*>(&b);
    if(d) { d->boo(); }
  }
  Often the sign of a smelly design
  | Striving for ultimate Low Latency RUNTIME TYPE IDENTIFICATION (RTTI) 27
  • 56. void foo(base& b) {
    derived* d = dynamic_cast<derived*>(&b);
    if(d) { d->boo(); }
  }
  • Traversing an inheritance tree
  • Comparisons
  void foo(base& b) {
    if(typeid(b) == typeid(derived)) {
      derived* d = static_cast<derived*>(&b);
      d->boo();
    }
  }
  • Only one comparison of std::type_info
  • Often only one runtime pointer compare
  | Striving for ultimate Low Latency RUNTIME TYPE IDENTIFICATION (RTTI) 28
  • 57. • General-purpose operation
  • Nondeterministic execution performance
  • Causes memory fragmentation
  • Memory leaks possible if not properly handled
  • May fail (error handling is needed)
  | Striving for ultimate Low Latency DYNAMIC MEMORY ALLOCATIONS 29
  • 60. • Address specific needs (functionality and hardware constraints)
  • Typically a low number of dynamic memory allocations
  • Data structures needed to manage big chunks of memory
  template<typename T>
  struct pool_allocator {
    T* allocate(std::size_t n);
    void deallocate(T* p, std::size_t n);
  };
  using pool_string = std::basic_string<char, std::char_traits<char>, pool_allocator<char>>;
  Preallocation makes the allocator's jitter more stable, helps keep related data together, and avoids long-term fragmentation.
  | Striving for ultimate Low Latency CUSTOM ALLOCATORS TO THE RESCUE 30
  • 62. Prevent dynamic memory allocation for the (common) case of dealing with small objects
  class sso_string {
    char* data_ = u_.sso_;
    size_t size_ = 0;
    union {
      char sso_[16] = "";
      size_t capacity_;
    } u_;
  public:
    size_t capacity() const { return data_ == u_.sso_ ? sizeof(u_.sso_) - 1 : u_.capacity_; }
    // ...
  };
  | Striving for ultimate Low Latency SMALL OBJECT OPTIMIZATION (SOO / SSO / SBO) 31
  • 65. template<std::size_t MaxSize>
  class inplace_string {
    std::array<value_type, MaxSize + 1> chars_;
  public:
    // string-like interface
  };
  struct db_contact {
    inplace_string<7> symbol;
    inplace_string<15> name;
    inplace_string<15> surname;
    inplace_string<23> company;
  };
  No dynamic memory allocations or pointer indirections guaranteed, at the cost of possibly larger memory usage
  | Striving for ultimate Low Latency NO DYNAMIC ALLOCATION 32
  • 77. • Keep the number of threads close to (less than or equal to) the number of available physical CPU cores
  • Separate IO threads from business logic threads (unless the business logic is extremely lightweight)
  • Use fixed-size lock-free queues / busy spins to pass data between threads
  • Use optimal algorithms/data structures and the data locality principle
  • Precompute; use compile time instead of runtime whenever possible
  • The simpler the code, the faster it is likely to be
  • Do not try to be smarter than the compiler
  • Know the language, tools, and libraries
  • Know your hardware!
  • Bypass the kernel (100% user-space code)
  • Measure performance… ALWAYS
  | Striving for ultimate Low Latency HOW TO DEVELOP A SYSTEM WITH LOW-LATENCY CONSTRAINTS 33
  • 79. Always measure your performance! | Striving for ultimate Low Latency THE MOST IMPORTANT RECOMMENDATION 34
  • 83. • Always measure the Release version
    cmake -DCMAKE_BUILD_TYPE=Release
    cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo
  • Prefer hardware-based black-box performance measurements
  • If that is not possible, or you want to debug a specific performance issue, use a profiler
  • To gather meaningful stack traces, preserve the frame pointer
    set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO} -fno-omit-frame-pointer")
  • Familiarize yourself with Linux perf tools (xperf on Windows) and flame graphs
  • Use tools like Intel VTune
  • Verify the output assembly code
  | Striving for ultimate Low Latency HOW TO MEASURE THE PERFORMANCE OF YOUR PROGRAMS 35
  • 84. [Flame graph screenshot: a bash profile with stacks such as execute_command_internal → expand_word_list_internal → expand_word_internal rendered as a searchable flame graph] | Striving for ultimate Low Latency FLAMEGRAPH 36