Mateusz Pusz
November 30, 2019
Striving for ultimate Low Latency
INTRODUCTION TO DEVELOPMENT OF LOW LATENCY SYSTEMS
Special thanks to Carl Cook and Chris Kohlhoff
ISO C++ Committee (WG21) Study Group 14 (SG14)
Latency vs Throughput
Latency is the time required to perform some action or to produce some result. Measured in units of time like hours, minutes, seconds, nanoseconds, or clock periods.
Throughput is the number of such actions executed or results produced per unit of time. Measured in units of whatever is being produced per unit of time.
What do we mean by Low Latency?
Low Latency allows human-unnoticeable delays between an input being processed and the corresponding output, providing real-time characteristics.
Especially important for internet connections utilizing services such as trading, online gaming, and VoIP.
Why do we strive for Low Latency?
• In VoIP, substantial delays between inputs from conversation participants may impair their communication
• In online gaming, a player with a high-latency internet connection may show slow responses in spite of superior tactics or appropriate reaction time
• Within capital markets, the proliferation of algorithmic trading requires firms to react to market events faster than the competition to increase the profitability of trades
High-Frequency Trading (HFT)
A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds
-- Investopedia
• Using complex algorithms to analyze multiple markets and execute orders based on market conditions
• Buying and selling of securities many times over a period of time (often hundreds of times an hour)
• Done to profit from time-sensitive opportunities that arise during trading hours
• Implies high turnover of capital (i.e. one's entire capital or more in a single day)
• Typically, the traders with the fastest execution speeds are more profitable
Market data processing
How fast do we do?
ALL SOFTWARE APPROACH: 1-10us
ALL HARDWARE APPROACH: 100-1000ns
• An average human eye blink takes 350 000us (1/3 s)
• Millions of orders can be traded in that time
What if something goes wrong?
KNIGHT CAPITAL
• In 2012 was the largest trader in U.S. equities
• Market share
– 17.3% on NYSE
– 16.9% on NASDAQ
• Had approximately $365 million in cash and equivalents
• Average daily trading volume
– 3.3 billion trades
– trading over 21 billion dollars
• Pre-tax loss of $440 million in 45 minutes
-- LinkedIn
C++ often not the most important part of the system
1 Low Latency network
2 Modern hardware
3 BIOS profiling
4 Kernel profiling
5 OS profiling
Spin, pin, and drop-in
SPIN
• Don't sleep
• Don't context switch
• Prefer single-threaded scheduling
• Disable locking and thread support
• Disable power management
• Disable C-states
• Disable interrupt coalescing
PIN
• Assign CPU affinity
• Assign interrupt affinity
• Assign memory to NUMA nodes
• Consider the physical location of NICs
• Isolate cores from general OS use
• Use a system with a single physical CPU
DROP-IN
• Choose NIC vendors based on performance and availability of drop-in kernel bypass libraries
• Use the kernel bypass library
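A minimal sketch of the SPIN and PIN ideas on Linux (not from the slides; poll_once() is a hypothetical hot-path function): the thread is pinned to one isolated core with pthread_setaffinity_np() and then busy-waits instead of sleeping or blocking.

#include <pthread.h>
#include <sched.h>
#include <atomic>

bool poll_once();                 // hypothetical: handles one message if available

void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // pin the calling thread to a single (ideally isolated) core
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

std::atomic<bool> stop{false};

void run_fast_path(int core)
{
    pin_to_core(core);
    while(!stop.load(std::memory_order_relaxed))
        poll_once();              // spin: never sleep, never yield, never context switch
}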
Let's focus on the software
Characteristics of Low Latency software
• Typically only a small part of the code is really important (a fast/hot path)
• This code is not executed often
• When it is executed, it has to
– start and finish as soon as possible
– have predictable and reproducible performance (low jitter)
• Multithreading increases latency
– we need low latency, not throughput
– concurrency (even on different cores) thrashes CPU caches above L1, shares the memory bus, shares IO, shares the network
• Mistakes are really expensive
– good error checking and recovery is mandatory
– one second is 4 billion CPU instructions (a lot can happen in that time)
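A sketch of measuring one pass of the hot path with std::chrono (handle_message() is a hypothetical placeholder); for low-latency work the collected samples matter as a distribution, because the tail, not the average, defines the jitter.

#include <chrono>
#include <cstdint>

void handle_message();            // hypothetical hot-path work

std::int64_t timed_pass_ns()
{
    const auto start = std::chrono::steady_clock::now();
    handle_message();
    const auto stop = std::chrono::steady_clock::now();
    // feed these samples into a histogram and watch the worst case, not the mean
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}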
How to develop software that has predictable performance?
It turns out that the more important question here is...
How NOT to develop software that has predictable performance?
• In a Low Latency system we care a lot about WCET (Worst-Case Execution Time)
• In order to limit WCET we should limit the usage of specific C++ language features
• A task not only for developers but also for code architects
The list of things to avoid on the fast path
1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>)
2 Throwing exceptions on a likely code path
3 Dynamic polymorphism
4 Multiple inheritance
5 RTTI
6 Dynamic memory allocations
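To illustrate point 1, a sketch (not from the slides) of avoiding the type erasure behind std::function<> on the hot path by taking the callable as a template parameter, so the compiler sees its concrete type and can inline the call:

#include <functional>

// cold path: flexible, but every call goes through type erasure and an
// indirect call; storing the handler may also allocate
void on_message_cold(const std::function<void(int)>& handler, int payload)
{
    handler(payload);
}

// hot path: the callable's concrete type is known at compile time,
// so the call can be inlined and nothing is allocated
template<typename Handler>
void on_message_hot(Handler&& handler, int payload)
{
    handler(payload);
}

int main()
{
    auto handler = [](int p) { (void)p; /* process payload */ };
    on_message_cold(handler, 42);
    on_message_hot(handler, 42);
}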
std::shared_ptr<T>
template<class T>
class shared_ptr;
• Smart pointer that retains shared ownership of an object through a pointer
• Several shared_ptr objects may own the same object
• The shared object is destroyed and its memory deallocated when the last remaining shared_ptr owning that object is either destroyed or assigned another pointer via operator= or reset()
• Supports a user-provided deleter
Too often overused by C++ programmers
Question: What is the difference here?
void foo()
{
std::unique_ptr<int> ptr{new int{1}};
// some code using 'ptr'
}
void foo()
{
std::shared_ptr<int> ptr{new int{1}};
// some code using 'ptr'
}
Key std::shared_ptr<T> issues
• Shared state
– performance + memory footprint
• Mandatory synchronization
– performance
• Type Erasure
– performance
• std::weak_ptr<T> support
– memory footprint
• Aliasing constructor
– memory footprint
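A small sketch (not from the slides) of where those issues show up in practice: a shared_ptr is typically two pointers wide because of the separate control block, and every copy is an atomic reference-count update.

#include <cstdio>
#include <memory>

int main()
{
    // on typical implementations: unique_ptr<int> is one pointer wide,
    // shared_ptr<int> is two (object pointer + control-block pointer)
    std::printf("raw: %zu, unique_ptr: %zu, shared_ptr: %zu\n",
                sizeof(int*), sizeof(std::unique_ptr<int>), sizeof(std::shared_ptr<int>));

    auto sp = std::make_shared<int>(1);
    auto sp2 = sp;   // copy: atomic increment of the shared count (synchronization on the hot path)
}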
More info on code::dive 2016
C++ exceptions
• Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++ exceptions
• ... if they are not thrown
• Throwing an exception can take significant and non-deterministic time
• Advantages of C++ exceptions usage
– cannot be ignored!
– simplify interfaces
– (if not thrown) actually can improve application performance
– make source code of the likely path easier to reason about
Not using C++ exceptions is not an excuse to write exception-unsafe code!
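A sketch (assumptions: a hypothetical parse_price() and report_bad_price() helper) of keeping the throw on the unlikely branch so that the likely path stays straight-line code; [[unlikely]] is C++20 and [[gnu::cold]] is a GCC/Clang extension.

#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>

[[noreturn]] [[gnu::cold]] void report_bad_price(std::string_view text)
{
    // cold path: throwing here keeps error handling out of the hot instruction stream
    throw std::runtime_error("bad price: " + std::string{text});
}

std::int64_t parse_price(std::string_view text)
{
    std::int64_t value = 0;
    for(char c : text) {
        if(c < '0' || c > '9') [[unlikely]]
            report_bad_price(text);          // exceptional case only
        value = value * 10 + (c - '0');
    }
    return value;                            // likely path: branch-predictable integer code
}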
Polymorphism
DYNAMIC
class base : noncopyable {
virtual void setup() = 0;
virtual void run() = 0;
virtual void cleanup() = 0;
public:
virtual ~base() = default;
void process()
{
setup();
run();
cleanup();
}
};
class derived : public base {
void setup() override { /* ... */ }
void run() override { /* ... */ }
void cleanup() override { /* ... */ }
};
• Additional pointer stored in an object
• Extra indirection (pointer dereference)
• Often not possible to devirtualize
• Not inlined
• Instruction cache miss
STATIC
template<class Derived>
class base {
public:
void process()
{
static_cast<Derived*>(this)->setup();
static_cast<Derived*>(this)->run();
static_cast<Derived*>(this)->cleanup();
}
};
class derived : public base<derived> {
friend class base<derived>;
void setup() { /* ... */ }
void run() { /* ... */ }
void cleanup() { /* ... */ }
};
Multiple inheritance
MULTIPLE INHERITANCE
• this pointer adjustments are needed to call a member function (for non-empty base classes)
DIAMOND OF DREAD
• Virtual inheritance as an answer
• virtual in C++ means "determined at runtime"
• Extra indirection to access data members
Always prefer composition over inheritance!
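A small sketch of the composition alternative (hypothetical logger/connection collaborators, not from the slides): the handler has-a logger and has-a connection instead of inheriting from both, so there are no vtables, no this-pointer adjustments, and no diamond.

#include <string_view>

// hypothetical collaborators
struct logger     { void info(std::string_view) {} };
struct connection { void send(std::string_view) {} };

class order_handler {
    logger log_;        // has-a instead of is-a
    connection conn_;
public:
    void on_order(std::string_view order_text)
    {
        log_.info(order_text);
        conn_.send(order_text);
    }
};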
RunTime Type Identification (RTTI)
Often the sign of a smelly design
class base : noncopyable {
public:
virtual ~base() = default;
virtual void foo() = 0;
};
class derived : public base {
public:
void foo() override;
void boo();
};
void foo(base& b)
{
derived* d = dynamic_cast<derived*>(&b);
if(d) {
d->boo();
}
}
• Traversing an inheritance tree
• Comparisons
void foo(base& b)
{
if(typeid(b) == typeid(derived)) {
derived* d = static_cast<derived*>(&b);
d->boo();
}
}
• Only one comparison of std::type_info
• Often only one runtime pointer compare
Dynamic memory allocations
• General purpose operation
• Nondeterministic execution performance
• Causes memory fragmentation
• May fail (error handling is needed)
• Memory leaks possible if not properly handled
Custom allocators to the rescue
• Address specific needs (functionality and hardware constraints)
• Typically a low number of dynamic memory allocations
• Data structures needed to manage big chunks of memory
template<typename T> struct pool_allocator {
using value_type = T;
T* allocate(std::size_t n);
void deallocate(T* p, std::size_t n);
};
using pool_string = std::basic_string<char, std::char_traits<char>, pool_allocator<char>>;
Preallocation makes the allocator jitter more stable, helps in keeping related data together and avoids long-term fragmentation.
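One standard way to get this behaviour since C++17, shown as a minimal sketch rather than the talk's own allocator: a std::pmr::monotonic_buffer_resource bump-allocates from a preallocated buffer and never touches the global heap on the hot path.

#include <array>
#include <cstddef>
#include <memory_resource>
#include <string>

int main()
{
    // preallocated buffer (could equally live in a hugepage-backed arena)
    std::array<std::byte, 4096> buffer;

    // monotonic resource: pointer-bump allocation, everything released at once;
    // null_memory_resource() as upstream means it never falls back to the heap
    std::pmr::monotonic_buffer_resource arena{buffer.data(), buffer.size(),
                                              std::pmr::null_memory_resource()};

    // strings allocated from the arena - no global heap calls on the hot path
    std::pmr::string symbol{"AAPL", &arena};
    std::pmr::string venue{"XNAS", &arena};
}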
Small Object Optimization (SOO / SSO / SBO)
• Prevent dynamic memory allocation for the (common) case of dealing with small objects
class sso_string {
char* data_ = u_.sso_;
size_t size_ = 0;
union {
char sso_[16] = "";
size_t capacity_;
} u_;
public:
[[nodiscard]] size_t capacity() const
{
return data_ == u_.sso_ ? sizeof(u_.sso_) - 1 : u_.capacity_;
}
// ...
};
No dynamic allocation
template<std::size_t MaxSize>
class inplace_string {
using value_type = char;
std::array<value_type, MaxSize + 1> chars_;
public:
// string-like interface
};
struct db_contact {
inplace_string<7> symbol;
inplace_string<15> name;
inplace_string<15> surname;
inplace_string<23> company;
};
No dynamic memory allocations or pointer indirections guaranteed, at the cost of possibly bigger memory usage
The list of things to do on the fast path
1 Use tools that improve efficiency without sacrificing performance
2 Use compile time wherever possible
3 Know your hardware
4 Clearly isolate cold code from the fast path
Examples of safe to use C++ features
• static_assert()
• Automatic type deduction
• Type aliases
• Move semantics
• noexcept
• constexpr
• Lambda expressions
• type_traits
• std::string_view
• Variadic templates
• and many more...
Do you agree?
The fastest programs are those that do nothing
Compile-time calculations (C++98)
int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
foo(factorial(n)); // not compile-time
foo(factorial(4)); // not compile-time
template<int N>
struct Factorial {
static const int value =
N * Factorial<N - 1>::value;
};
template<>
struct Factorial<0> {
static const int value = 1;
};
int array[Factorial<4>::value];
const int precalc_values[] = {
Factorial<0>::value,
Factorial<1>::value,
Factorial<2>::value
};
foo(Factorial<4>::value); // compile-time
// foo(Factorial<n>::value); // compile-time error
Just too hard!
• 2 separate implementations
– hard to keep in sync
• Template metaprogramming is hard!
• Easier to precalculate in Excel and hardcode results in the code ;-)
const int precalc_values[] = {
1,
1,
2,
6,
24,
120,
720,
// ...
};
constexpr specifier
Declares that it is possible to evaluate the value of the function or variable at compile time
• constexpr specifier used in an object declaration implies const
• constexpr specifier used in a function or static member variable declaration implies inline
• Such variables and functions can be used in compile-time constant expressions (provided that appropriate function arguments are given)
constexpr int factorial(int n);
std::array<int, factorial(4)> array;
Compile-time calculations (C++11)
• A constexpr function must not be virtual
• The function body may contain only
– static_assert declarations
– alias declarations
– using declarations and directives
– exactly one return statement with an
expression that does not mutate any of the
objects
constexpr int factorial(int n)
{
return n <= 1 ? 1 : (n * factorial(n - 1));
}
std::array<int, factorial(4)> array;
constexpr std::array<int, 3> precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24, "Error");
foo(factorial(4)); // maybe compile-time
foo(factorial(n)); // not compile-time
Compile-time calculations (C++14)
• A constexpr function must not be virtual
• The function body must not contain
– a goto statement and labels other than case
and default
– a try block
– an asm declaration
– variable of non-literal type
– variable of static or thread storage duration
– uninitialized variable
– evaluation of a lambda expression
– reinterpret_cast and dynamic_cast
– changing an active member of a union
– a new or delete expression
constexpr int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
std::array<int, factorial(4)> array;
constexpr std::array<int, 3> precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24, "Error");
foo(factorial(4)); // maybe compile-time
foo(factorial(n)); // not compile-time
Compile-time calculations (C++17)
• A constexpr function must not be virtual
• The function body must not contain
– a goto statement and labels other than case
and default
– a try block
– an asm declaration
– variable of non-literal type
– variable of static or thread storage duration
– uninitialized variable
– reinterpret_cast and dynamic_cast
– changing an active member of a union
– a new or delete expression
constexpr int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
std::array<int, factorial(4)> array;
constexpr std::array precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24);
foo(factorial(4)); // maybe compile-time
foo(factorial(n)); // not compile-time
Compile-time calculations (C++20)
• The function body must not contain
– a goto statement and labels other than case
and default
– variable of non-literal type
– variable of static or thread storage duration
– uninitialized variable
– reinterpret_cast
constexpr int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
std::array<int, factorial(4)> array;
constexpr std::array precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24);
foo(factorial(4)); // maybe compile-time
foo(factorial(n)); // not compile-time
Compile-time calculations (C++20): consteval
• The consteval specifier declares a function to
be an immediate function
• Every call must produce a compile time
constant expression
• An immediate function is a constexpr function
consteval int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
std::array<int, factorial(4)> array;
constexpr std::array precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24);
foo(factorial(4)); // compile-time
// foo(factorial(n)); // compile-time error
Compile-time dispatch (C++14)
template<typename T>
struct is_array : std::false_type {};
template<typename T>
struct is_array<T[]> : std::true_type {};
template<typename T>
constexpr bool is_array_v = is_array<T>::value;
static_assert(is_array_v<int> == false);
static_assert(is_array_v<int[]>);
template<typename T>
class smart_ptr {
void destroy(std::true_type) noexcept
{ delete[] ptr_; }
void destroy(std::false_type) noexcept
{ delete ptr_; }
void destroy() noexcept
{ destroy(is_array<T>()); }
// ...
};
Tag dispatch provides the possibility to select the proper function overload at compile time based on properties of a type
Compile-time dispatch (C++17)
template<typename T>
inline constexpr bool is_array = false;
template<typename T>
inline constexpr bool is_array<T[]> = true;
static_assert(is_array<int> == false);
static_assert(is_array<int[]>);
template<typename T>
class smart_ptr {
void destroy() noexcept
{
if constexpr(is_array<T>)
delete[] ptr_;
else
delete ptr_;
}
// ...
};
Usage of variable templates and if constexpr decreases compilation time
Type traits: A negative overhead abstraction
struct X {
int a, b, c;
int id;
};
void foo(const std::vector<X>& a1, std::vector<X>& a2)
{
memcpy(a2.data(), a1.data(), a1.size() * sizeof(X));
}
struct X {
int a, b, c;
std::string id;
};
Ooops!!!
void foo(const std::vector<X>& a1, std::vector<X>& a2)
{
std::copy(begin(a1), end(a1), begin(a2));
}
movq %rsi, %rax
movq 8(%rdi), %rdx
movq (%rdi), %rsi
cmpq %rsi, %rdx
je .L1
movq (%rax), %rdi
subq %rsi, %rdx
jmp memmove
// 100 lines of assembly code
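Typical standard-library implementations already perform this dispatch inside std::copy (hence the jmp memmove above); the sketch below (not from the slides) just makes the mechanism explicit with a type trait and if constexpr. It assumes the destination already has at least as many elements as the source.

#include <algorithm>
#include <cstring>
#include <type_traits>
#include <vector>

template<typename T>
void fast_copy(const std::vector<T>& from, std::vector<T>& to)
{
    if constexpr(std::is_trivially_copyable_v<T>)
        std::memcpy(to.data(), from.data(), from.size() * sizeof(T));  // raw byte copy is valid here
    else
        std::copy(begin(from), end(from), begin(to));                  // element-wise copy assignment
}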
Type traits: Compile-time contract
struct X {
int a, b, c;
int id;
};
static_assert(std::is_trivially_copyable_v<X>);
Compiler returned: 0
struct X {
int a, b, c;
std::string id;
};
static_assert(std::is_trivially_copyable_v<X>);
<source>:9:20: error: static assertion failed
9 | static_assert(std::is_trivially_copyable_v<X>);
| ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
Compiler returned: 1
Enforce compile-time contracts and class invariants with contracts or static_asserts
Type traits: Compile-time branching
enum class side { BID, ASK };
class order_book {
template<side S>
class book_side {
using compare = std::conditional_t<S == side::BID, std::greater<>, std::less<>>;
std::vector<price> levels_;
public:
void insert(order o)
{
const auto it = lower_bound(begin(levels_), end(levels_), o.price, compare{});
if(it != end(levels_) && *it != o.price)
levels_.insert(it, o.price);
// ...
}
bool match(price p) const
{
return compare{}(levels_.back(), p);
}
// ...
};
book_side<side::BID> bids_;
book_side<side::ASK> asks_;
// ...
};
What is wrong here?
constexpr int array_size = 10'000;
int array[array_size][array_size];
for(auto i = 0L; i < array_size; ++i) {
for(auto j = 0L; j < array_size; ++j) {
array[j][i] = i + j;
}
}
Reckless cache usage can cost you a lot of performance!
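For reference, a sketch of the cache-friendly variant: indexing as array[i][j] makes the inner loop walk consecutive addresses, so each cache line is fully used before it is evicted.

constexpr int array_size = 10'000;
int array[array_size][array_size];   // ~400 MB, lives in static storage

int main()
{
    for(auto i = 0L; i < array_size; ++i)
        for(auto j = 0L; j < array_size; ++j)
            array[i][j] = i + j;     // row-major traversal: sequential memory access
}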
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz
Stepping: 10
CPU MHz: 1700.035
CPU max MHz: 1700,0000
CPU min MHz: 400,0000
BogoMIPS: 3799.90
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-7
Quiz: How much slower is the bad case?
Please everybody stand up :-)
• Similar
• < 2x
• < 5x
• < 10x
• < 20x
• < 50x
• < 100x
• >= 100x
GCC 7.4.0
• column-major: 2490 ms
• row-major: 51 ms
• ratio: 47x
GCC 9.2.1
• column-major: 51 ms
• row-major: 50 ms
• ratio: Similar :-)
CPU data cache
3 274,28 msec task-clock # 0,999 CPUs utilized
5 536 529 085 cycles # 1,691 GHz (66,53%)
1 246 414 976 instructions # 0,23 insn per cycle (83,27%)
92 872 234 L1-dcache-loads # 28,364 M/sec (83,29%)
410 636 090 L1-dcache-load-misses # 442,15% of all L1-dcache hits (83,39%)
87 191 333 LLC-loads # 26,629 M/sec (83,39%)
527 256 LLC-load-misses # 0,60% of all LL-cache hits (83,41%)
97 781 page-faults # 0,030 M/sec
360,48 msec task-clock # 0,999 CPUs utilized
609 105 177 cycles # 1,690 GHz (66,71%)
891 120 837 instructions # 1,46 insn per cycle (83,36%)
90 219 862 L1-dcache-loads # 250,279 M/sec (83,35%)
14 481 217 L1-dcache-load-misses # 16,05% of all L1-dcache hits (83,35%)
167 850 LLC-loads # 0,466 M/sec (83,36%)
79 492 LLC-load-misses # 47,36% of all LL-cache hits (83,22%)
97 783 page-faults # 0,271 M/sec
Another example
struct coordinates { int x, y; };
void draw(const coordinates& coord);
void verify(int threshold);
constexpr int OBJECT_COUNT = 1'000'000;
class objectMgr;
void process(const objectMgr& mgr)
{
const auto size = mgr.size();
for(auto i = 0UL; i < size; ++i) { draw(mgr.position(i)); }
for(auto i = 0UL; i < size; ++i) { verify(mgr.threshold(i)); }
}
Naive objectMgr implementation
class objectMgr {
struct object {
coordinates coord;
std::string errorTxt_1;
std::string errorTxt_2;
std::string errorTxt_3;
int threshold;
std::array<char, 100> otherData;
};
std::vector<object> data_;
public:
explicit objectMgr(std::size_t size) : data_{size} {}
std::size_t size() const { return data_.size(); }
const coordinates& position(std::size_t idx) const { return data_[idx].coord; }
int threshold(std::size_t idx) const { return data_[idx].threshold; }
};
DOD (Data-Oriented Design)
• Program optimization approach motivated by cache coherency
• Focuses on data layout
• Results in objects decomposition
Keep data used together close to each other
Object decomposition example
class objectMgr {
std::vector<coordinates> positions_;
std::vector<int> thresholds_;
struct otherData {
struct errorData { std::string errorTxt_1, errorTxt_2, errorTxt_3; };
errorData error;
std::array<char, 100> data;
};
std::vector<otherData> coldData_;
public:
explicit objectMgr(std::size_t size) :
positions_{size}, thresholds_(size), coldData_{size} {}
std::size_t size() const { return positions_.size(); }
const coordinates& position(std::size_t idx) const { return positions_[idx]; }
int threshold(std::size_t idx) const { return thresholds_[idx]; }
};
Members order might be important
struct A {
char c; // size=1, alignment=1, padding=1
short s; // size=2, alignment=2, padding=0
int i; // size=4, alignment=4, padding=0
double d; // size=8, alignment=8, padding=0
static double dd;
};
static_assert( sizeof(A) == 16);
static_assert(alignof(A) == 8);
struct B {
char c; // size=1, alignment=1, padding=7
double d; // size=8, alignment=8, padding=0
short s; // size=2, alignment=2, padding=2
int i; // size=4, alignment=4, padding=0
static double dd;
};
static_assert( sizeof(B) == 24);
static_assert(alignof(B) == 8);
In order to satisfy the alignment requirements of all non-static members of a class, padding may be inserted after some of its members
Be aware of alignment side effects
struct A {
char c; // size=1, alignment=1, padding=1
short s; // size=2, alignment=2, padding=0
int i; // size=4, alignment=4, padding=0
double d; // size=8, alignment=8, padding=0
static double dd;
};
static_assert( sizeof(A) == 16);
static_assert(alignof(A) == 8);
using opt = std::optional<A>;
static_assert( sizeof(opt) == 24);
static_assert(alignof(opt) == 8);
using array = std::array<opt, 256>;
static_assert( sizeof(array) == 24 * 256);
static_assert(alignof(array) == 8);
Be aware of implementation details of the tools you use every day
Packing
struct A {
char c; // size=1, alignment=1, padding=1
short s; // size=2, alignment=2, padding=0
int i; // size=4, alignment=4, padding=0
double d; // size=8, alignment=8, padding=0
static double dd;
};
static_assert( sizeof(A) == 16);
static_assert(alignof(A) == 8);
struct C {
char c;
double d;
short s;
static double dd;
int i;
} __attribute__((packed));
static_assert( sizeof(C) == 15);
static_assert(alignof(C) == 1);
On modern hardware may be faster than the aligned structure
Not portable! May be slower or even crash
Latency Numbers Every Programmer Should Know
L1 cache reference 0.5 ns
Branch misprediction 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns
Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
Read 4K randomly from SSD 150,000 ns 0.15 ms ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 0.25 ms
Round trip within same datacenter 500,000 ns 0.5 ms
Read 1 MB sequentially from SSD 1,000,000 ns 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150 ms
https://gist.github.com/jboner/2841832
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Latency Numbers Every Programmer Should Know
70
class vector_downward {
uint8_t *make_space(size_t len) {
if(len > static_cast<size_t>(cur_ - buf_)) {
auto old_size = size();
auto largest_align = AlignOf<largest_scalar_t>();
reserved_ += (std::max)(len, growth_policy(reserved_));
// Round up to avoid undefined behavior from unaligned loads and stores.
reserved_ = (reserved_ + (largest_align - 1)) & ~(largest_align - 1);
auto new_buf = allocator_.allocate(reserved_);
auto new_cur = new_buf + reserved_ - old_size;
memcpy(new_cur, cur_, old_size);
cur_ = new_cur;
allocator_.deallocate(buf_);
buf_ = new_buf;
}
cur_ -= len;
// Beyond this, signed offsets may not have enough range:
// (FlatBuffers > 2GB not supported).
assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE);
return cur_;
}
// ...
};
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
What is wrong here?
71
class vector_downward {
uint8_t *make_space(size_t len) {
if(len > static_cast<size_t>(cur_ - buf_))
reallocate(len);
cur_ -= len;
// Beyond this, signed offsets may not have enough range:
// (FlatBuffers > 2GB not supported).
assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE);
return cur_;
}
void reallocate(size_t len) {
auto old_size = size();
auto largest_align = AlignOf<largest_scalar_t>();
reserved_ += (std::max)(len, growth_policy(reserved_));
// Round up to avoid undefined behavior from unaligned loads and stores.
reserved_ = (reserved_ + (largest_align - 1)) & ~(largest_align - 1);
auto new_buf = allocator_.allocate(reserved_);
auto new_cur = new_buf + reserved_ - old_size;
memcpy(new_cur, cur_, old_size);
cur_ = new_cur;
allocator_.deallocate(buf_);
buf_ = new_buf;
}
// ...
};
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Separate the code to hot and cold paths
72
class vector_downward {
uint8_t *make_space(size_t len) {
if(len > static_cast<size_t>(cur_ - buf_))
reallocate(len);
cur_ -= len;
// Beyond this, signed offsets may not have enough range:
// (FlatBuffers > 2GB not supported).
assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE);
return cur_;
}
// ...
};
Code is data too!
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Separate the code to hot and cold paths
73
class vector_downward {
uint8_t *make_space(size_t len) {
if(len > static_cast<size_t>(cur_ - buf_))
reallocate(len);
cur_ -= len;
// Beyond this, signed offsets may not have enough range:
// (FlatBuffers > 2GB not supported).
assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE);
return cur_;
}
// ...
};
Code is data too!
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Separate the code to hot and cold paths
Performance improvement of 20% thanks to better inlining
73
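Besides moving the cold code into a separate function, the compiler can also be told explicitly that the branch is rare. A simplified sketch (not the FlatBuffers class; buffer and grow() are made-up names) using C++20 [[unlikely]] together with GCC/Clang cold/noinline attributes:
#include <cstddef>
#include <cstdint>
#include <cstdlib>
class buffer {
public:
  std::uint8_t* make_space(std::size_t len) {
    if (len > capacity_ - size_) [[unlikely]]
      grow(len);                       // cold path stays out of line
    std::uint8_t* p = data_ + size_;
    size_ += len;
    return p;
  }
private:
  // 'cold' moves the function into a cold section, 'noinline' keeps it from
  // being inlined back into the hot path (GCC/Clang attributes).
  __attribute__((cold, noinline)) void grow(std::size_t len) {
    std::size_t new_cap = capacity_ ? capacity_ * 2 : 64;
    while (new_cap - size_ < len) new_cap *= 2;
    // error handling omitted in this sketch
    data_ = static_cast<std::uint8_t*>(std::realloc(data_, new_cap));
    capacity_ = new_cap;
  }
  std::uint8_t* data_ = nullptr;
  std::size_t size_ = 0;
  std::size_t capacity_ = 0;
};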
std::optional<Error> validate(const request& r)
{
switch(r.type) {
case request::type_1:
if(/* simple check */)
return std::nullopt;
return /* complex error msg generation */;
case request::type_2:
if(/* simple check */)
return std::nullopt;
return /* complex error msg generation */;
case request::type_3:
if(/* simple check */)
return std::nullopt;
return /* complex error msg generation */;
// ...
}
throw std::logic_error("");
}
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Typical validation and error handling
74
std::optional<Error> validate(const request& r)
{
if(is_valid(r))
return std::nullopt;
return make_error(r);
}
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Isolating cold path
75
bool is_valid(const request& r)
{
switch(r.type) {
case request::type_1:
return /* simple check */;
case request::type_2:
return /* simple check */;
case request::type_3:
return /* simple check */;
// ...
}
throw std::logic_error("");
}
std::optional<Error> validate(const request& r)
{
if(is_valid(r))
return std::nullopt;
return make_error(r);
}
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Isolating cold path
75
bool is_valid(const request& r)
{
switch(r.type) {
case request::type_1:
return /* simple check */;
case request::type_2:
return /* simple check */;
case request::type_3:
return /* simple check */;
// ...
}
throw std::logic_error("");
}
Error make_error(const request& r)
{
switch(r.type) {
case request::type_1:
return /* complex error msg generation */;
case request::type_2:
return /* complex error msg generation */;
case request::type_3:
return /* complex error msg generation */;
// ...
}
throw std::logic_error("");
}
std::optional<Error> validate(const request& r)
{
if(is_valid(r))
return std::nullopt;
return make_error(r);
}
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Isolating cold path
75
if(expensiveCheck() && fastCheck()) { /* ... */ }
if(fastCheck() && expensiveCheck()) { /* ... */ }
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Expression short-circuiting
WRONG
GOOD
76
if(expensiveCheck() && fastCheck()) { /* ... */ }
if(fastCheck() && expensiveCheck()) { /* ... */ }
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Expression short-circuiting
WRONG
GOOD
Bail out as early as possible and continue fast path
76
int foo(int i)
{
int k = 0;
for(int j = i; j < i + 10; ++j)
++k;
return k;
}
int foo(unsigned i)
{
int k = 0;
for(unsigned j = i; j < i + 10; ++j)
++k;
return k;
}
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Integer arithmetic
77
int foo(int i)
{
int k = 0;
for(int j = i; j < i + 10; ++j)
++k;
return k;
}
int foo(unsigned i)
{
int k = 0;
for(unsigned j = i; j < i + 10; ++j)
++k;
return k;
}
foo(int):
mov eax, 10
ret
foo(unsigned int):
cmp edi, -10
sbb eax, eax
and eax, 10
ret
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Integer arithmetic
77
int foo(int i)
{
int k = 0;
for(int j = i; j < i + 10; ++j)
++k;
return k;
}
int foo(unsigned i)
{
int k = 0;
for(unsigned j = i; j < i + 10; ++j)
++k;
return k;
}
foo(int):
mov eax, 10
ret
foo(unsigned int):
cmp edi, -10
sbb eax, eax
and eax, 10
ret
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Integer arithmetic
Integer arithmetic differs for signed and unsigned integral types: signed overflow is undefined behaviour, so the compiler may assume j < i + 10 never wraps and fold the signed loop into a constant 10, while unsigned wraparound is well-defined and forces the extra check
77
• Avoid using std::list, std::map, etc
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
78
• Avoid using std::list, std::map, etc
• std::unordered_map is also broken
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
78
• Avoid using std::list, std::map, etc
• std::unordered_map is also broken
• Prefer std::vector
– also as the underlying storage for hash tables or flat maps
– consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
78
• Avoid using std::list, std::map, etc
• std::unordered_map is also broken
• Prefer std::vector
– also as the underlying storage for hash tables or flat maps
– consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar
• Preallocate storage
– free list
– plf::colony
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
78
• Avoid using std::list, std::map, etc
• std::unordered_map is also broken
• Prefer std::vector
– also as the underlying storage for hash tables or flat maps
– consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar
• Preallocate storage
– free list
– plf::colony
• Consider storing only pointers to objects in the container
• Consider storing expensive hash values with the key
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
78
• Avoid using std::list, std::map, etc
• std::unordered_map is also broken
• Prefer std::vector
– also as the underlying storage for hash tables or flat maps
– consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar
• Preallocate storage (see the free-list sketch after this slide)
– free list
– plf::colony
• Consider storing only pointers to objects in the container
• Consider storing expensive hash values with the key
• Limit the number of type conversions
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
78
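A minimal free-list sketch of the preallocation idea (object_pool is a hypothetical name): all nodes are allocated once up front, and acquire/release only relink pointers, so the allocator is never touched on the hot path.
#include <cstddef>
#include <new>
#include <utility>
#include <vector>
template<typename T>
class object_pool {
  struct node {
    alignas(T) unsigned char storage[sizeof(T)];  // first member: value storage
    node* next = nullptr;
  };
public:
  explicit object_pool(std::size_t capacity) : nodes_(capacity) {
    // Preallocate every node once and thread them into a singly linked free list.
    for (auto& n : nodes_) { n.next = head_; head_ = &n; }
  }
  template<typename... Args>
  T* acquire(Args&&... args) {          // O(1), no heap allocation
    if (!head_) return nullptr;         // pool exhausted
    node* n = head_;
    head_ = n->next;
    return ::new (n->storage) T(std::forward<Args>(args)...);
  }
  void release(T* p) {                  // O(1), no heap deallocation
    p->~T();
    // storage is the first member of node, so the addresses coincide
    node* n = reinterpret_cast<node*>(p);
    n->next = head_;
    head_ = n;
  }
private:
  std::vector<node> nodes_;   // allocated once, never grows
  node* head_ = nullptr;
};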
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close (less or equal) to the number of available physical CPU cores
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close (less or equal) to the number of available physical CPU cores
• Separate IO threads from business logic thread (unless business logic is extremely lightweight)
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close (less or equal) to the number of available physical CPU cores
• Separate IO threads from business logic thread (unless business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close (less or equal) to the number of available physical CPU cores
• Separate IO threads from business logic thread (unless business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close (less or equal) to the number of available physical CPU cores
• Separate IO threads from business logic thread (unless business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close (less or equal) to the number of available physical CPU cores
• Separate IO threads from business logic thread (unless business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close (less or equal) to the number of available physical CPU cores
• Separate IO threads from business logic thread (unless business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
• Do not try to be smarter than the compiler
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close (less or equal) to the number of available physical CPU cores
• Separate IO threads from business logic thread (unless business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
• Do not try to be smarter than the compiler
• Know the language, tools, and libraries
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close (less or equal) to the number of available physical CPU cores
• Separate IO threads from business logic thread (unless business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
• Do not try to be smarter than the compiler
• Know the language, tools, and libraries
• Know your hardware!
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close (less or equal) to the number of available physical CPU cores
• Separate IO threads from business logic thread (unless business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
• Do not try to be smarter than the compiler
• Know the language, tools, and libraries
• Know your hardware!
• Bypass the kernel (100% user-space code)
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close (less or equal) to the number of available physical CPU cores
• Separate IO threads from business logic thread (unless business logic is extremely lightweight)
• Use fixed-size lock-free queues / busy spins to pass the data between threads (see the SPSC queue sketch after this slide)
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
• Do not try to be smarter than the compiler
• Know the language, tools, and libraries
• Know your hardware!
• Bypass the kernel (100% user-space code)
• Measure performance… ALWAYS
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
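A minimal single-producer/single-consumer fixed-size queue sketch (spsc_queue is a hypothetical name; production code would additionally pad head_ and tail_ onto separate cache lines and pin the two threads to dedicated cores):
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>
template<typename T, std::size_t N>   // N must be a power of two
class spsc_queue {
public:
  bool try_push(const T& value) {      // called only by the single producer
    const auto tail = tail_.load(std::memory_order_relaxed);
    if (tail - head_.load(std::memory_order_acquire) == N) return false;  // full
    buffer_[tail & (N - 1)] = value;
    tail_.store(tail + 1, std::memory_order_release);
    return true;
  }
  std::optional<T> try_pop() {         // called only by the single consumer (busy-spin around this)
    const auto head = head_.load(std::memory_order_relaxed);
    if (head == tail_.load(std::memory_order_acquire)) return std::nullopt;  // empty
    T value = buffer_[head & (N - 1)];
    head_.store(head + 1, std::memory_order_release);
    return value;
  }
private:
  std::array<T, N> buffer_{};
  std::atomic<std::size_t> head_{0};   // consumer index
  std::atomic<std::size_t> tail_{0};   // producer index
};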
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
The most important recommendation
80
Always measure your performance!
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
The most important recommendation
80
• Always measure Release version
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to measure the performance of your programs
81
• Always measure Release version
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
• Prefer hardware-based black-box performance measurements
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to measure the performance of your programs
81
• Always measure Release version
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
• Prefer hardware-based black-box performance measurements
• If that is not possible, or you want to debug a specific performance issue, use a profiler
• To gather meaningful stack traces, preserve the frame pointer
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" ..
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to measure the performance of your programs
81
• Always measure Release version
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
• Prefer hardware-based black-box performance measurements
• If that is not possible, or you want to debug a specific performance issue, use a profiler
• To gather meaningful stack traces, preserve the frame pointer
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" ..
• Familiarize yourself with Linux perf tools (xperf on Windows) and flame graphs
• Use tools like Intel VTune
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to measure the performance of your programs
81
• Always measure Release version
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
• Prefer hardware-based black-box performance measurements (see the timing sketch after this slide)
• If that is not possible, or you want to debug a specific performance issue, use a profiler
• To gather meaningful stack traces, preserve the frame pointer
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" ..
• Familiarize yourself with Linux perf tools (xperf on Windows) and flame graphs
• Use tools like Intel VTune
• Verify output assembly code
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to measure the performance of your programs
81
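When hardware-based measurement is not available, even a simple steady_clock-based software timer helps, as long as the measured value cannot be optimized away. A minimal sketch (do_not_optimize is a hypothetical helper mimicking what benchmark frameworks provide; the inline asm is GCC/Clang-specific):
#include <chrono>
#include <cstdint>
#include <cstdio>
// Keep the compiler from discarding the measured computation.
template<typename T>
inline void do_not_optimize(const T& value) {
  asm volatile("" : : "r,m"(value) : "memory");
}
int main() {
  using clock = std::chrono::steady_clock;
  constexpr int iterations = 1'000'000;
  const auto start = clock::now();
  std::uint64_t sum = 0;
  for (int i = 0; i < iterations; ++i) {
    sum += static_cast<std::uint64_t>(i) * 2654435761u;  // stand-in for the code under test
    do_not_optimize(sum);
  }
  const auto stop = clock::now();
  const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
  std::printf("%.2f ns / iteration\n", static_cast<double>(ns) / iterations);
}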
[Flame graph rendered from Linux perf data: interactive visualization of bash and kernel call stacks (execute_command_internal, expand_word_list_internal, sys_write, ...) with a search box]
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Flamegraph
82
  • 1. Mateusz Pusz November 30, 2019 Striving for ultimate Low Latency INTRODUCTION TO DEVELOPMENT OF LOW LATENCY SYSTEMS
  • 2. Special thanks to Carl Cook and Chris Kohlhoff C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency ISO C++ Committee (WG21) Study Group 14 (SG14) 2
  • 3. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Latency vs Throughput 3
  • 4. Latency is the time required to perform some action or to produce some result. Measured in units of time like hours, minutes, seconds, nanoseconds, or clock periods. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Latency vs Throughput 3
  • 5. Latency is the time required to perform some action or to produce some result. Measured in units of time like hours, minutes, seconds, nanoseconds, or clock periods. Throughput is the number of such actions executed or results produced per unit of time. Measured in units of whatever is being produced per unit of time. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Latency vs Throughput 3
  • 6. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What do we mean by Low Latency? 4
  • 7. Low Latency allows human-unnoticeable delays between an input being processed and the corresponding output providing real time characteristics C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What do we mean by Low Latency? 4
  • 8. Low Latency allows human-unnoticeable delays between an input being processed and the corresponding output providing real time characteristics C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What do we mean by Low Latency? Especially important for internet connections utilizing services such as trading, online gaming, and VoIP 4
  • 9. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Why do we strive for Low latency? 5
  • 10. • In VoIP substantial delays between input from conversation participants may impair their communication C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Why do we strive for Low latency? 5
  • 11. • In VoIP substantial delays between input from conversation participants may impair their communication • In online gaming a player with a high latency internet connection may show slow responses in spite of superior tactics or appropriate reaction time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Why do we strive for Low latency? 5
  • 12. • In VoIP substantial delays between input from conversation participants may impair their communication • In online gaming a player with a high latency internet connection may show slow responses in spite of superior tactics or appropriate reaction time • Within capital markets the proliferation of algorithmic trading requires firms to react to market events faster than the competition to increase profitability of trades C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Why do we strive for Low latency? 5
  • 13. A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds -- Investopedia C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency High-Frequency Trading (HFT) 6
  • 14. A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds -- Investopedia • Using complex algorithms to analyze multiple markets and execute orders based on market conditions C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency High-Frequency Trading (HFT) 6
  • 15. A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds -- Investopedia • Using complex algorithms to analyze multiple markets and execute orders based on market conditions • Buying and selling of securities many times over a period of time (often hundreds of times an hour) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency High-Frequency Trading (HFT) 6
  • 16. A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds -- Investopedia • Using complex algorithms to analyze multiple markets and execute orders based on market conditions • Buying and selling of securities many times over a period of time (often hundreds of times an hour) • Done to profit from time-sensitive opportunities that arise during trading hours C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency High-Frequency Trading (HFT) 6
  • 17. A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds -- Investopedia • Using complex algorithms to analyze multiple markets and execute orders based on market conditions • Buying and selling of securities many times over a period of time (often hundreds of times an hour) • Done to profit from time-sensitive opportunities that arise during trading hours • Implies high turnover of capital (i.e. one's entire capital or more in a single day) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency High-Frequency Trading (HFT) 6
  • 18. A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds -- Investopedia • Using complex algorithms to analyze multiple markets and execute orders based on market conditions • Buying and selling of securities many times over a period of time (often hundreds of times an hour) • Done to profit from time-sensitive opportunities that arise during trading hours • Implies high turnover of capital (i.e. one's entire capital or more in a single day) • Typically, the traders with the fastest execution speeds are more profitable C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency High-Frequency Trading (HFT) 6
  • 19. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 20. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 21. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 22. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 23. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 24. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 25. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 26. 1-10us 100-1000ns C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How fast do we do? ALL SOFTWARE APPROACH ALL HARDWARE APPROACH 8
  • 27. 1-10us 100-1000ns C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How fast do we do? ALL SOFTWARE APPROACH ALL HARDWARE APPROACH 8
  • 28. 1-10us 100-1000ns • Average human eye blink takes 350 000us (1/3s) • Millions of orders can be traded in that time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How fast do we do? ALL SOFTWARE APPROACH ALL HARDWARE APPROACH 8
  • 29. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What if something goes wrong? 9
  • 30. • In 2012 was the largest trader in U.S. equities • Market share – 17.3%on NYSE – 16.9%on NASDAQ • Had approximately $365 million in cash and equivalents • Average daily trading volume – 3.3 billion trades – trading over 21 billion dollars C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What if something goes wrong? KNIGHT CAPITAL 9
  • 31. • In 2012 was the largest trader in U.S. equities • Market share – 17.3%on NYSE – 16.9%on NASDAQ • Had approximately $365 million in cash and equivalents • Average daily trading volume – 3.3 billion trades – trading over 21 billion dollars • pre-tax loss of $440 million in 45 minutes -- LinkedIn C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What if something goes wrong? KNIGHT CAPITAL 10
  • 32. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ often not the most important part of the system 11
  • 33. 1 Low Latency network C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ often not the most important part of the system 11
  • 34. 1 Low Latency network 2 Modern hardware C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ often not the most important part of the system 11
  • 35. 1 Low Latency network 2 Modern hardware 3 BIOS profiling C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ often not the most important part of the system 11
  • 36. 1 Low Latency network 2 Modern hardware 3 BIOS profiling 4 Kernel profiling C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ often not the most important part of the system 11
  • 37. 1 Low Latency network 2 Modern hardware 3 BIOS profiling 4 Kernel profiling 5 OS profiling C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ often not the most important part of the system 11
  • 38. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Spin, pin, and drop-in 12
  • 39. • Don't sleep • Don't context switch • Prefer single-threaded scheduling • Disable locking and thread support • Disable power management • Disable C-states • Disable interrupt coalescing C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Spin, pin, and drop-in SPIN 12
  • 40. • Don't sleep • Don't context switch • Prefer single-threaded scheduling • Disable locking and thread support • Disable power management • Disable C-states • Disable interrupt coalescing • Assign CPU affinity • Assign interrupt affinity • Assign memory to NUMA nodes • Consider the physical location of NICs • Isolate cores from general OS use • Use a system with a single physical CPU C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Spin, pin, and drop-in SPIN PIN 12
  • 41. • Don't sleep • Don't context switch • Prefer single-threaded scheduling • Disable locking and thread support • Disable power management • Disable C-states • Disable interrupt coalescing • Assign CPU affinity • Assign interrupt affinity • Assign memory to NUMA nodes • Consider the physical location of NICs • Isolate cores from general OS use • Use a system with a single physical CPU • Choose NIC vendors based on performance and availability of drop-in kernel bypass libraries • Use the kernel bypass library C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Spin, pin, and drop-in SPIN PIN DROP-IN 12
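As an illustration of the PIN step, a thread can be bound to an isolated core and given a real-time scheduling policy roughly like this on Linux (a sketch, not from the slides; the core number, priority, and helper name are assumptions, and error handling is omitted):

#include <pthread.h>
#include <sched.h>

// Sketch: pin the calling thread to one isolated core and switch it to FIFO scheduling.
// Pick the core to match your isolcpus / NUMA layout.
void pin_current_thread(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    sched_param param{};
    param.sched_priority = sched_get_priority_max(SCHED_FIFO);
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
}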
  • 42. Let's scope on the software
  • 43. • Typically only a small part of code is really important (a fast/hot path) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Characteristics of Low Latency software 14
  • 44. • Typically only a small part of code is really important (a fast/hot path) • This code is not executed often • When it is executed it has to – start and finish as soon as possible – have predictable and reproducible performance (low jitter) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Characteristics of Low Latency software 14
  • 45. • Typically only a small part of code is really important (a fast/hot path) • This code is not executed often • When it is executed it has to – start and finish as soon as possible – have predictable and reproducible performance (low jitter) • Multithreading increases latency – we need low latency and not throughput – concurrency (even on different cores) thrashes CPU caches above L1, shares the memory bus, shares IO, shares the network C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Characteristics of Low Latency software 14
  • 46. • Typically only a small part of code is really important (a fast/hot path) • This code is not executed often • When it is executed it has to – start and finish as soon as possible – have predictable and reproducible performance (low jitter) • Multithreading increases latency – we need low latency and not throughput – concurrency (even on different cores) thrashes CPU caches above L1, shares the memory bus, shares IO, shares the network • Mistakes are really expensive – good error checking and recovery is mandatory – one second is 4 billion CPU instructions (a lot can happen in that time) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Characteristics of Low Latency software 14
  • 47. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop software that has predictable performance? 15
  • 48. It turns out that the more important question here is... C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop software that has predictable performance? 15
  • 49. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How NOT to develop software that has predictable performance? 16
  • 50. • In a Low Latency system we care a lot about WCET (Worst Case Execution Time) • In order to limit WCET we should limit the usage of specific C++ language features C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How NOT to develop software that has predictable performance? 17
  • 51. • In a Low Latency system we care a lot about WCET (Worst Case Execution Time) • In order to limit WCET we should limit the usage of specific C++ language features • A task not only for developers but also for code architects C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How NOT to develop software that has predictable performance? 17
  • 52. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 53. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 54. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>) 2 Throwing exceptions on a likely code path C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 55. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>) 2 Throwing exceptions on a likely code path 3 Dynamic polymorphism C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 56. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>) 2 Throwing exceptions on a likely code path 3 Dynamic polymorphism 4 Multiple inheritance C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 57. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>) 2 Throwing exceptions on a likely code path 3 Dynamic polymorphism 4 Multiple inheritance 5 RTTI C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 58. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>) 2 Throwing exceptions on a likely code path 3 Dynamic polymorphism 4 Multiple inheritance 5 RTTI 6 Dynamic memory allocations C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 59. template<class T> class shared_ptr; • Smart pointer that retains shared ownership of an object through a pointer • Several shared_ptr objects may own the same object • The shared object is destroyed and its memory deallocated when the last remaining shared_ptr owning that object is either destroyed or assigned another pointer via operator= or reset() • Supports user provided deleter C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency std::shared_ptr<T> 19
  • 60. template<class T> class shared_ptr; • Smart pointer that retains shared ownership of an object through a pointer • Several shared_ptr objects may own the same object • The shared object is destroyed and its memory deallocated when the last remaining shared_ptr owning that object is either destroyed or assigned another pointer via operator= or reset() • Supports user provided deleter C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency std::shared_ptr<T> Too often overused by C++ programmers 19
  • 61. void foo() { std::unique_ptr<int> ptr{new int{1}}; // some code using 'ptr' } void foo() { std::shared_ptr<int> ptr{new int{1}}; // some code using 'ptr' } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Question: What is the difference here? 20
  • 62. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Key std::shared_ptr<T> issues 21
  • 63. • Shared state – performance + memory footprint • Mandatory synchronization – performance • Type Erasure – performance • std::weak_ptr<T> support – memory footprint • Aliasing constructor – memory footprint C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Key std::shared_ptr<T> issues 21
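A small sketch (not from the slides) of where those costs come from: constructing a shared_ptr from a raw pointer performs a second allocation for the control block, and every copy is an atomic reference-count update, while a unique_ptr is just a raw pointer plus a destructor.

#include <memory>

void use(int&) {}

void with_unique()
{
    auto p = std::make_unique<int>(1);   // one allocation, no control block
    use(*p);
}                                        // plain delete on scope exit

void with_shared()
{
    std::shared_ptr<int> p{new int{1}};  // two allocations: the int and the control block
    auto q = p;                          // atomic increment of the use count
    use(*q);
}                                        // two atomic decrements, then deallocation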
  • 64. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency More info on code::dive 2016 22
  • 65. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ exceptions 23
  • 66. • Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++ exceptions C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ exceptions 23
  • 67. • Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++ exceptions • ... if they are not thrown C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ exceptions 23
  • 68. • Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++ exceptions • ... if they are not thrown • Throwing an exception can take significant and non-deterministic time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ exceptions 23
  • 69. • Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++ exceptions • ... if they are not thrown • Throwing an exception can take significant and non-deterministic time • Advantages of C++ exceptions usage – cannot be ignored! – simplify interfaces – (if not thrown) actually can improve application performance – make source code of likely path easier to reason about C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ exceptions 23
  • 70. • Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++ exceptions • ... if they are not thrown • Throwing an exception can take significant and non-deterministic time • Advantages of C++ exceptions usage – cannot be ignored! – simplify interfaces – (if not thrown) actually can improve application performance – make source code of likely path easier to reason about C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ exceptions Not using C++ exceptions is not an excuse to write exception-unsafe code! 23
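One way to get those advantages without paying on the likely path (a sketch; the function names and the C++20 [[unlikely]] attribute are illustrative, not from the talk): keep the throw itself in a separate cold function so the hot code stays small.

#include <cstdint>
#include <stdexcept>
#include <string>

[[noreturn]] void throw_bad_order(std::uint64_t id)   // cold, out of line
{
    throw std::runtime_error("invalid order " + std::to_string(id));
}

void process_order(std::uint64_t id, bool valid)
{
    if(!valid) [[unlikely]]
        throw_bad_order(id);   // the hot path only contains a call, no unwinding machinery
    // ... likely path continues here
}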
  • 71. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Polymorphism 24
  • 72. class base : noncopyable { virtual void setup() = 0; virtual void run() = 0; virtual void cleanup() = 0; public: virtual ~base() = default; void process() { setup(); run(); cleanup(); } }; class derived : public base { void setup() override { /* ... */ } void run() override { /* ... */ } void cleanup() override { /* ... */ } }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Polymorphism DYNAMIC 24
  • 73. class base : noncopyable { virtual void setup() = 0; virtual void run() = 0; virtual void cleanup() = 0; public: virtual ~base() = default; void process() { setup(); run(); cleanup(); } }; class derived : public base { void setup() override { /* ... */ } void run() override { /* ... */ } void cleanup() override { /* ... */ } }; • Additional pointer stored in an object • Extra indirection (pointer dereference) • Often not possible to devirtualize • Not inlined • Instruction cache miss C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Polymorphism DYNAMIC 24
  • 74. class base : noncopyable { virtual void setup() = 0; virtual void run() = 0; virtual void cleanup() = 0; public: virtual ~base() = default; void process() { setup(); run(); cleanup(); } }; class derived : public base { void setup() override { /* ... */ } void run() override { /* ... */ } void cleanup() override { /* ... */ } }; template<class Derived> class base { public: void process() { static_cast<Derived*>(this)->setup(); static_cast<Derived*>(this)->run(); static_cast<Derived*>(this)->cleanup(); } }; class derived : public base<derived> { friend class base<derived>; void setup() { /* ... */ } void run() { /* ... */ } void cleanup() { /* ... */ } }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Polymorphism DYNAMIC STATIC 25
  • 75. • this pointer adjustments needed to call a member function (for non-empty base classes) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Multiple inheritance MULTIPLE INHERITANCE 26
  • 76. • this pointer adjustments needed to call a member function (for non-empty base classes) • Virtual inheritance as an answer • virtual in C++ means "determined at runtime" • Extra indirection to access data members C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Multiple inheritance MULTIPLE INHERITANCE DIAMOND OF DREAD 27
  • 77. • this pointer adjustments needed to call a member function (for non-empty base classes) • Virtual inheritance as an answer • virtual in C++ means "determined at runtime" • Extra indirection to access data members Always prefer composition over inheritance! C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Multiple inheritance MULTIPLE INHERITANCE DIAMOND OF DREAD 27
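The "composition over inheritance" advice in a minimal sketch (types are illustrative assumptions): the reusable parts become plain members, so there are no this-pointer adjustments, no virtual bases, and the layout stays contiguous.

class logger { /* ... */ };
class statistics { /* ... */ };

// instead of: class order_handler : public logger, public statistics { ... };
class order_handler {
    logger log_;        // plain members: direct access, no pointer adjustment
    statistics stats_;
public:
    void on_order() { /* uses log_ and stats_ directly */ }
};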
  • 78. class base : noncopyable { public: virtual ~base() = default; virtual void foo() = 0; }; class derived : public base { public: void foo() override; void boo(); }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency RunTime Type Identification (RTTI) 28
  • 79. class base : noncopyable { public: virtual ~base() = default; virtual void foo() = 0; }; class derived : public base { public: void foo() override; void boo(); }; void foo(base& b) { derived* d = dynamic_cast<derived*>(&b); if(d) { d->boo(); } } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency RunTime Type Identification (RTTI) 28
  • 80. class base : noncopyable { public: virtual ~base() = default; virtual void foo() = 0; }; class derived : public base { public: void foo() override; void boo(); }; void foo(base& b) { derived* d = dynamic_cast<derived*>(&b); if(d) { d->boo(); } } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency RunTime Type Identification (RTTI) Often the sign of a smelly design 28
  • 81. class base : noncopyable { public: virtual ~base() = default; virtual void foo() = 0; }; class derived : public base { public: void foo() override; void boo(); }; void foo(base& b) { derived* d = dynamic_cast<derived*>(&b); if(d) { d->boo(); } } • Traversing an inheritance tree • Comparisons C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency RunTime Type Identification (RTTI) 29
  • 82. class base : noncopyable { public: virtual ~base() = default; virtual void foo() = 0; }; class derived : public base { public: void foo() override; void boo(); }; void foo(base& b) { derived* d = dynamic_cast<derived*>(&b); if(d) { d->boo(); } } • Traversing an inheritance tree • Comparisons void foo(base& b) { if(typeid(b) == typeid(derived)) { derived* d = static_cast<derived*>(&b); d->boo(); } } • Only one comparison of std::type_info • Often only one runtime pointer compare C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency RunTime Type Identification (RTTI) 29
  • 83. • General purpose operation • Nondeterministic execution performance • Causes memory fragmentation • May fail (error handling is needed) • Memory leaks possible if not properly handled C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Dynamic memory allocations 30
  • 84. • Address specific needs (functionality and hardware constraints) • Typically low number of dynamic memory allocations • Data structures needed to manage big chunks of memory C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Custom allocators to the rescue 31
  • 85. • Address specific needs (functionality and hardware constraints) • Typically low number of dynamic memory allocations • Data structures needed to manage big chunks of memory template<typename T> struct pool_allocator { T* allocate(std::size_t n); void deallocate(T* p, std::size_t n); }; using pool_string = std::basic_string<char, std::char_traits<char>, pool_allocator<char>>; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Custom allocators to the rescue 31
  • 86. • Address specific needs (functionality and hardware constraints) • Typically low number of dynamic memory allocations • Data structures needed to manage big chunks of memory template<typename T> struct pool_allocator { T* allocate(std::size_t n); void deallocate(T* p, std::size_t n); }; using pool_string = std::basic_string<char, std::char_traits<char>, pool_allocator<char>>; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Custom allocators to the rescue Preallocation makes allocation jitter more stable, helps keep related data together, and avoids long-term fragmentation. 31
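The slide only sketches the allocator interface; one possible (simplified) implementation is a monotonic arena that hands out memory from a preallocated buffer and frees nothing until it is reset. This is an illustrative assumption, not the allocator used in the talk:

#include <cstddef>
#include <new>
#include <string>

// Simplified bump allocator over one preallocated block; individual
// deallocations are no-ops, the whole arena is reset at once.
class arena {
    static constexpr std::size_t capacity = 64 * 1024;   // illustrative size
    alignas(std::max_align_t) std::byte buffer_[capacity];
    std::size_t used_ = 0;
public:
    void* allocate(std::size_t bytes, std::size_t align)
    {
        const std::size_t p = (used_ + align - 1) & ~(align - 1);
        if(p + bytes > capacity) throw std::bad_alloc{};
        used_ = p + bytes;
        return buffer_ + p;
    }
    void reset() { used_ = 0; }
};

template<typename T>
struct pool_allocator {
    using value_type = T;
    arena* a_;
    explicit pool_allocator(arena& a) : a_{&a} {}
    template<typename U> pool_allocator(const pool_allocator<U>& other) : a_{other.a_} {}
    T* allocate(std::size_t n) { return static_cast<T*>(a_->allocate(n * sizeof(T), alignof(T))); }
    void deallocate(T*, std::size_t) {}   // memory is reclaimed by arena::reset()
};

template<typename T, typename U>
bool operator==(const pool_allocator<T>& x, const pool_allocator<U>& y) { return x.a_ == y.a_; }
template<typename T, typename U>
bool operator!=(const pool_allocator<T>& x, const pool_allocator<U>& y) { return !(x == y); }

using pool_string = std::basic_string<char, std::char_traits<char>, pool_allocator<char>>;

// usage: all of s's heap storage comes from the arena
// arena a;
// pool_string s{pool_allocator<char>{a}};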
  • 87. • Prevent dynamic memory allocation for the (common) case of dealing with small objects C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Small Object Optimization (SOO / SSO / SBO) 32
  • 88. • Prevent dynamic memory allocation for the (common) case of dealing with small objects class sso_string { char* data_ = u_.sso_; size_t size_ = 0; union { char sso_[16] = ""; size_t capacity_; } u_; public: [[nodiscard]] size_t capacity() const { return data_ == u_.sso_ ? sizeof(u_.sso_) - 1 : u_.capacity_; } // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Small Object Optimization (SOO / SSO / SBO) 32
  • 89. template<std::size_t MaxSize> class inplace_string { std::array<value_type, MaxSize + 1> chars_; public: // string-like interface }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency No dynamic allocation 33
  • 90. template<std::size_t MaxSize> class inplace_string { std::array<value_type, MaxSize + 1> chars_; public: // string-like interface }; struct db_contact { inplace_string<7> symbol; inplace_string<15> name; inplace_string<15> surname; inplace_string<23> company; }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency No dynamic allocation 33
  • 91. template<std::size_t MaxSize> class inplace_string { std::array<value_type, MaxSize + 1> chars_; public: // string-like interface }; struct db_contact { inplace_string<7> symbol; inplace_string<15> name; inplace_string<15> surname; inplace_string<23> company; }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency No dynamic allocation No dynamic memory allocations or pointer indirections guaranteed with the cost of possibly bigger memory usage 33
  • 92. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to do on the fast path 34
  • 93. 1 Use tools that improve efficiency without sacrificing performance C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to do on the fast path 34
  • 94. 1 Use tools that improve efficiency without sacrificing performance 2 Use compile time wherever possible C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to do on the fast path 34
  • 95. 1 Use tools that improve efficiency without sacrificing performance 2 Use compile time wherever possible 3 Know your hardware C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to do on the fast path 34
  • 96. 1 Use tools that improve efficiency without sacrificing performance 2 Use compile time wherever possible 3 Know your hardware 4 Clearly isolate cold code from the fast path C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to do on the fast path 34
  • 97. • static_assert() • Automatic type deduction • Type aliases • Move semantics • noexcept • constexpr • Lambda expressions • type_traits • std::string_view • Variadic templates • and many more... C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Examples of safe to use C++ features 35
  • 98. The fastest programs are those that do nothing C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Do you agree? 36
  • 99. int factorial(int n) { int result = n; while(n > 1) result *= --n; return result; } foo(factorial(n)); // not compile-time foo(factorial(4)); // not compile-time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++98) 37
  • 100. int factorial(int n) { int result = n; while(n > 1) result *= --n; return result; } foo(factorial(n)); // not compile-time foo(factorial(4)); // not compile-time template<int N> struct Factorial { static const int value = N * Factorial<N - 1>::value; }; template<> struct Factorial<0> { static const int value = 1; }; int array[Factorial<4>::value]; const int precalc_values[] = { Factorial<0>::value, Factorial<1>::value, Factorial<2>::value }; foo(Factorial<4>::value); // compile-time // foo(Factorial<n>::value); // compile-time error C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++98) 37
  • 101. • 2 separate implementations – hard to keep in sync • Template metaprogramming is hard! • Easier to precalculate in Excel and hardcode results in the code ;-) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Just too hard! 38
  • 102. • 2 separate implementations – hard to keep in sync • Template metaprogramming is hard! • Easier to precalculate in Excel and hardcode results in the code ;-) const int precalc_values[] = { 1, 1, 2, 6, 24, 120, 720, // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Just too hard! 38
  • 103. Declares that it is possible to evaluate the value of the function or variable at compile time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency constexpr specifier 39
  • 104. Declares that it is possible to evaluate the value of the function or variable at compile time • constexpr specifier used in an object declaration implies const • constexpr specifier used in a function or static member variable declaration implies inline C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency constexpr specifier 39
  • 105. Declares that it is possible to evaluate the value of the function or variable at compile time • constexpr specifier used in an object declaration implies const • constexpr specifier used in a function or static member variable declaration implies inline • Such variables and functions can be used in compile time constant expressions (provided that appropriate function arguments are given) constexpr int factorial(int n); std::array<int, factorial(4)> array; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency constexpr specifier 39
  • 106. • A constexpr function must not be virtual • The function body may contain only – static_assert declarations – alias declarations – using declarations and directives – exactly one return statement with an expression that does not mutate any of the objects constexpr int factorial(int n) { return n <= 1 ? 1 : (n * factorial(n - 1)); } std::array<int, factorial(4)> array; constexpr std::array<int, 3> precalc_values = { factorial(0), factorial(1), factorial(2) }; static_assert(factorial(4) == 24, "Error"); foo(factorial(4)); // maybe compile-time foo(factorial(n)); // not compile-time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++11) 40
  • 107. • A constexpr function must not be virtual • The function body must not contain – a goto statement and labels other than case and default – a try block – an asm declaration – variable of non-literal type – variable of static or thread storage duration – uninitialized variable – evaluation of a lambda expression – reinterpret_cast and dynamic_cast – changing an active member of an union – a new or delete expression constexpr int factorial(int n) { int result = n; while(n > 1) result *= --n; return result; } std::array<int, factorial(4)> array; constexpr std::array<int, 3> precalc_values = { factorial(0), factorial(1), factorial(2) }; static_assert(factorial(4) == 24, "Error"); foo(factorial(4)); // maybe compile-time foo(factorial(n)); // not compile-time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++14) 41
  • 108. • A constexpr function must not be virtual • The function body must not contain – a goto statement and labels other than case and default – a try block – an asm declaration – variable of non-literal type – variable of static or thread storage duration – uninitialized variable – reinterpret_cast and dynamic_cast – changing an active member of an union – a new or delete expression constexpr int factorial(int n) { int result = n; while(n > 1) result *= --n; return result; } std::array<int, factorial(4)> array; constexpr std::array precalc_values = { factorial(0), factorial(1), factorial(2) }; static_assert(factorial(4) == 24); foo(factorial(4)); // maybe compile-time foo(factorial(n)); // not compile-time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++17) 42
  • 109. • The function body must not contain – a goto statement and labels other than case and default – variable of non-literal type – variable of static or thread storage duration – uninitialized variable – reinterpret_cast constexpr int factorial(int n) { int result = n; while(n > 1) result *= --n; return result; } std::array<int, factorial(4)> array; constexpr std::array precalc_values = { factorial(0), factorial(1), factorial(2) }; static_assert(factorial(4) == 24); foo(factorial(4)); // maybe compile-time foo(factorial(n)); // not compile-time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++20) 43
  • 110. • The consteval specifier declares a function to be an immediate function • Every call must produce a compile time constant expression • An immediate function is a constexpr function consteval int factorial(int n) { int result = n; while(n > 1) result *= --n; return result; } std::array<int, factorial(4)> array; constexpr std::array precalc_values = { factorial(0), factorial(1), factorial(2) }; static_assert(factorial(4) == 24); foo(factorial(4)); // compile-time // foo(factorial(n)); // compile-time error C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++20) 44
  • 111. template<typename T> struct is_array : std::false_type {}; template<typename T> struct is_array<T[]> : std::true_type {}; template<typename T> constexpr bool is_array_v = is_array<T>::value; static_assert(is_array_v<int> == false); static_assert(is_array_v<int[]>); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time dispatch (C++14) 45
  • 112. template<typename T> struct is_array : std::false_type {}; template<typename T> struct is_array<T[]> : std::true_type {}; template<typename T> constexpr bool is_array_v = is_array<T>::value; static_assert(is_array_v<int> == false); static_assert(is_array_v<int[]>); template<typename T> class smart_ptr { void destroy(std::true_type) noexcept { delete[] ptr_; } void destroy(std::false_type) noexcept { delete ptr_; } void destroy() noexcept { destroy(is_array<T>()); } // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time dispatch (C++14) Tag dispatch provides the possibility to select the proper function overload in compile-time based on properties of a type 45
  • 113. template<typename T> inline constexpr bool is_array = false; template<typename T> inline constexpr bool is_array<T[]> = true; static_assert(is_array<int> == false); static_assert(is_array<int[]>); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time dispatch (C++17) 46
  • 114. template<typename T> inline constexpr bool is_array = false; template<typename T> inline constexpr bool is_array<T[]> = true; static_assert(is_array<int> == false); static_assert(is_array<int[]>); template<typename T> class smart_ptr { void destroy() noexcept { if constexpr(is_array<T>) delete[] ptr_; else delete ptr_; } // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time dispatch (C++17) Usage of variable templates and if constexpr decrease compilation time 46
  • 115. struct X { int a, b, c; int id; }; void foo(const std::vector<X>& a1, std::vector<X>& a2) { memcpy(a2.data(), a1.data(), a1.size() * sizeof(X)); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: A negative overhead abstraction 47
  • 116.
  • 117.
  • 118. struct X { int a, b, c; int id; }; void foo(const std::vector<X>& a1, std::vector<X>& a2) { memcpy(a2.data(), a1.data(), a1.size() * sizeof(X)); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: A negative overhead abstraction 50
  • 119. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; void foo(const std::vector<X>& a1, std::vector<X>& a2) { memcpy(a2.data(), a1.data(), a1.size() * sizeof(X)); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: A negative overhead abstraction 51
  • 120. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; void foo(const std::vector<X>& a1, std::vector<X>& a2) { memcpy(a2.data(), a1.data(), a1.size() * sizeof(X)); } Ooops!!! C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: A negative overhead abstraction 51
  • 121. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; void foo(const std::vector<X>& a1, std::vector<X>& a2) { std::copy(begin(a1), end(a1), begin(a2)); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: A negative overhead abstraction 52
  • 122. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; movq %rsi, %rax movq 8(%rdi), %rdx movq (%rdi), %rsi cmpq %rsi, %rdx je .L1 movq (%rax), %rdi subq %rsi, %rdx jmp memmove // 100 lines of assembly code void foo(const std::vector<X>& a1, std::vector<X>& a2) { std::copy(begin(a1), end(a1), begin(a2)); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: A negative overhead abstraction 52
  • 123. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time contract 53
  • 124. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; static_assert(std::is_trivially_copyable_v<X>); static_assert(std::is_trivially_copyable_v<X>); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time contract 53
  • 125. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; static_assert(std::is_trivially_copyable_v<X>); static_assert(std::is_trivially_copyable_v<X>); Compiler returned: 0 <source>:9:20: error: static assertion failed 9 | static_assert(std::is_trivially_copyable_v<X>); | ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~ Compiler returned: 1 C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time contract 53
  • 126. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; static_assert(std::is_trivially_copyable_v<X>); static_assert(std::is_trivially_copyable_v<X>); Compiler returned: 0 <source>:9:20: error: static assertion failed 9 | static_assert(std::is_trivially_copyable_v<X>); | ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~ Compiler returned: 1 C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time contract Enforce compile-time contracts and class invariants with contracts or static_asserts 53
  • 127. enum class side { BID, ASK }; class order_book { template<side S> class book_side { std::vector<price> levels_; public: void insert(order o) { } bool match(price p) const { } // ... }; book_side<side::BID> bids_; book_side<side::ASK> asks_; // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time branching 54
  • 128. enum class side { BID, ASK }; class order_book { template<side S> class book_side { using compare = std::conditional_t<S == side::BID, std::greater<>, std::less<>>; std::vector<price> levels_; public: void insert(order o) { } bool match(price p) const { } // ... }; book_side<side::BID> bids_; book_side<side::ASK> asks_; // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time branching 54
  • 129. enum class side { BID, ASK }; class order_book { template<side S> class book_side { using compare = std::conditional_t<S == side::BID, std::greater<>, std::less<>>; std::vector<price> levels_; public: void insert(order o) { const auto it = lower_bound(begin(levels_), end(levels_), o.price, compare{}); if(it != end(levels_) && *it != o.price) levels_.insert(it, o.price); // ... } bool match(price p) const { } // ... }; book_side<side::BID> bids_; book_side<side::ASK> asks_; // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time branching 54
  • 130. enum class side { BID, ASK }; class order_book { template<side S> class book_side { using compare = std::conditional_t<S == side::BID, std::greater<>, std::less<>>; std::vector<price> levels_; public: void insert(order o) { const auto it = lower_bound(begin(levels_), end(levels_), o.price, compare{}); if(it != end(levels_) && *it != o.price) levels_.insert(it, o.price); // ... } bool match(price p) const { return compare{}(levels_.back(), p); } // ... }; book_side<side::BID> bids_; book_side<side::ASK> asks_; // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time branching 54
  • 131. constexpr int array_size = 10'000; int array[array_size][array_size]; for(auto i = 0L; i < array_size; ++i) { for(auto j = 0L; j < array_size; ++j) { array[j][i] = i + j; } } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What is wrong here? 55
  • 132. constexpr int array_size = 10'000; int array[array_size][array_size]; for(auto i = 0L; i < array_size; ++i) { for(auto j = 0L; j < array_size; ++j) { array[j][i] = i + j; } } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What is wrong here? Reckless cache usage can cost you a lot of performance! 56
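For reference, the fix is only a matter of swapping the indices in the loop above so that the inner loop walks memory contiguously (same array and array_size as in the slide):

for(auto i = 0L; i < array_size; ++i) {
  for(auto j = 0L; j < array_size; ++j) {
    array[i][j] = i + j;   // row-major traversal: consecutive addresses, cache lines fully reused
  }
}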
  • 133. Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 142 Model name: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz Stepping: 10 CPU MHz: 1700.035 CPU max MHz: 1700,0000 CPU min MHz: 400,0000 BogoMIPS: 3799.90 Virtualization: VT-x L1d cache: 128 KiB L1i cache: 128 KiB L2 cache: 1 MiB L3 cache: 6 MiB NUMA node0 CPU(s): 0-7 C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 57
  • 134. Please everybody stand up :-) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 135. Please everybody stand up :-) • Similar C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 136. Please everybody stand up :-) • Similar • < 2x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 137. Please everybody stand up :-) • Similar • < 2x • < 5x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 138. Please everybody stand up :-) • Similar • < 2x • < 5x • < 10x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 139. Please everybody stand up :-) • Similar • < 2x • < 5x • < 10x • < 20x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 140. Please everybody stand up :-) • Similar • < 2x • < 5x • < 10x • < 20x • < 50x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 141. Please everybody stand up :-) • Similar • < 2x • < 5x • < 10x • < 20x • < 50x • < 100x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 142. Please everybody stand up :-) • Similar • < 2x • < 5x • < 10x • < 20x • < 50x • < 100x • >= 100x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 143. • Similar • < 2x • < 5x • < 10x • < 20x • < 50x • < 100x • >= 100x • column-major: 2490 ms • row-major: 51 ms • ratio: 47x Please everybody stand up :-) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? GCC 7.4.0 58
  • 144. • Similar • < 2x • < 5x • < 10x • < 20x • < 50x • < 100x • >= 100x • column-major: 2490 ms • row-major: 51 ms • ratio: 47x • column-major: 51 ms • row-major: 50 ms • ratio: Similar :-) Please everybody stand up :-) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? GCC 7.4.0 GCC 9.2.1 58
  • 145. 3 274,28 msec task-clock # 0,999 CPUs utilized 5 536 529 085 cycles # 1,691 GHz (66,53%) 1 246 414 976 instructions # 0,23 insn per cycle (83,27%) 92 872 234 L1-dcache-loads # 28,364 M/sec (83,29%) 410 636 090 L1-dcache-load-misses # 442,15% of all L1-dcache hits (83,39%) 87 191 333 LLC-loads # 26,629 M/sec (83,39%) 527 256 LLC-load-misses # 0,60% of all LL-cache hits (83,41%) 97 781 page-faults # 0,030 M/sec 360,48 msec task-clock # 0,999 CPUs utilized 609 105 177 cycles # 1,690 GHz (66,71%) 891 120 837 instructions # 1,46 insn per cycle (83,36%) 90 219 862 L1-dcache-loads # 250,279 M/sec (83,35%) 14 481 217 L1-dcache-load-misses # 16,05% of all L1-dcache hits (83,35%) 167 850 LLC-loads # 0,466 M/sec (83,36%) 79 492 LLC-load-misses # 47,36% of all LL-cache hits (83,22%) 97 783 page-faults # 0,271 M/sec C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency CPU data cache 59
  • 146. 3 274,28 msec task-clock # 0,999 CPUs utilized 5 536 529 085 cycles # 1,691 GHz (66,53%) 1 246 414 976 instructions # 0,23 insn per cycle (83,27%) 92 872 234 L1-dcache-loads # 28,364 M/sec (83,29%) 410 636 090 L1-dcache-load-misses # 442,15% of all L1-dcache hits (83,39%) 87 191 333 LLC-loads # 26,629 M/sec (83,39%) 527 256 LLC-load-misses # 0,60% of all LL-cache hits (83,41%) 97 781 page-faults # 0,030 M/sec 360,48 msec task-clock # 0,999 CPUs utilized 609 105 177 cycles # 1,690 GHz (66,71%) 891 120 837 instructions # 1,46 insn per cycle (83,36%) 90 219 862 L1-dcache-loads # 250,279 M/sec (83,35%) 14 481 217 L1-dcache-load-misses # 16,05% of all L1-dcache hits (83,35%) 167 850 LLC-loads # 0,466 M/sec (83,36%) 79 492 LLC-load-misses # 47,36% of all LL-cache hits (83,22%) 97 783 page-faults # 0,271 M/sec C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency CPU data cache 59
  • 147. 3 274,28 msec task-clock # 0,999 CPUs utilized 5 536 529 085 cycles # 1,691 GHz (66,53%) 1 246 414 976 instructions # 0,23 insn per cycle (83,27%) 92 872 234 L1-dcache-loads # 28,364 M/sec (83,29%) 410 636 090 L1-dcache-load-misses # 442,15% of all L1-dcache hits (83,39%) 87 191 333 LLC-loads # 26,629 M/sec (83,39%) 527 256 LLC-load-misses # 0,60% of all LL-cache hits (83,41%) 97 781 page-faults # 0,030 M/sec 360,48 msec task-clock # 0,999 CPUs utilized 609 105 177 cycles # 1,690 GHz (66,71%) 891 120 837 instructions # 1,46 insn per cycle (83,36%) 90 219 862 L1-dcache-loads # 250,279 M/sec (83,35%) 14 481 217 L1-dcache-load-misses # 16,05% of all L1-dcache hits (83,35%) 167 850 LLC-loads # 0,466 M/sec (83,36%) 79 492 LLC-load-misses # 47,36% of all LL-cache hits (83,22%) 97 783 page-faults # 0,271 M/sec C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency CPU data cache 60
  • 148. 3 274,28 msec task-clock # 0,999 CPUs utilized 5 536 529 085 cycles # 1,691 GHz (66,53%) 1 246 414 976 instructions # 0,23 insn per cycle (83,27%) 92 872 234 L1-dcache-loads # 28,364 M/sec (83,29%) 410 636 090 L1-dcache-load-misses # 442,15% of all L1-dcache hits (83,39%) 87 191 333 LLC-loads # 26,629 M/sec (83,39%) 527 256 LLC-load-misses # 0,60% of all LL-cache hits (83,41%) 97 781 page-faults # 0,030 M/sec 360,48 msec task-clock # 0,999 CPUs utilized 609 105 177 cycles # 1,690 GHz (66,71%) 891 120 837 instructions # 1,46 insn per cycle (83,36%) 90 219 862 L1-dcache-loads # 250,279 M/sec (83,35%) 14 481 217 L1-dcache-load-misses # 16,05% of all L1-dcache hits (83,35%) 167 850 LLC-loads # 0,466 M/sec (83,36%) 79 492 LLC-load-misses # 47,36% of all LL-cache hits (83,22%) 97 783 page-faults # 0,271 M/sec C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency CPU data cache 61
  • 149. 3 274,28 msec task-clock # 0,999 CPUs utilized 5 536 529 085 cycles # 1,691 GHz (66,53%) 1 246 414 976 instructions # 0,23 insn per cycle (83,27%) 92 872 234 L1-dcache-loads # 28,364 M/sec (83,29%) 410 636 090 L1-dcache-load-misses # 442,15% of all L1-dcache hits (83,39%) 87 191 333 LLC-loads # 26,629 M/sec (83,39%) 527 256 LLC-load-misses # 0,60% of all LL-cache hits (83,41%) 97 781 page-faults # 0,030 M/sec 360,48 msec task-clock # 0,999 CPUs utilized 609 105 177 cycles # 1,690 GHz (66,71%) 891 120 837 instructions # 1,46 insn per cycle (83,36%) 90 219 862 L1-dcache-loads # 250,279 M/sec (83,35%) 14 481 217 L1-dcache-load-misses # 16,05% of all L1-dcache hits (83,35%) 167 850 LLC-loads # 0,466 M/sec (83,36%) 79 492 LLC-load-misses # 47,36% of all LL-cache hits (83,22%) 97 783 page-faults # 0,271 M/sec C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency CPU data cache 62
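The counters above look like perf stat output; an invocation along these lines could reproduce them (an assumed command, the slides do not show it, and the benchmark binary name is a placeholder):

perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,page-faults ./matrix_traversal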
  • 150. struct coordinates { int x, y; }; void draw(const coordinates& coord); void verify(int threshold); constexpr int OBJECT_COUNT = 1'000'000; class objectMgr; void process(const objectMgr& mgr) { const auto size = mgr.size(); for(auto i = 0UL; i < size; ++i) { draw(mgr.position(i)); } for(auto i = 0UL; i < size; ++i) { verify(mgr.threshold(i)); } } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Another example 63
  • 151. class objectMgr { struct object { coordinates coord; std::string errorTxt_1; std::string errorTxt_2; std::string errorTxt_3; int threshold; std::array<char, 100> otherData; }; std::vector<object> data_; public: explicit objectMgr(std::size_t size) : data_{size} {} std::size_t size() const { return data_.size(); } const coordinates& position(std::size_t idx) const { return data_[idx].coord; } int threshold(std::size_t idx) const { return data_[idx].threshold; } }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Naive objectMgr implementation 64
  • 152. class objectMgr { struct object { coordinates coord; std::string errorTxt_1; std::string errorTxt_2; std::string errorTxt_3; int threshold; std::array<char, 100> otherData; }; std::vector<object> data_; public: explicit objectMgr(std::size_t size) : data_{size} {} std::size_t size() const { return data_.size(); } const coordinates& position(std::size_t idx) const { return data_[idx].coord; } int threshold(std::size_t idx) const { return data_[idx].threshold; } }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Naive objectMgr implementation 64
  • 153. • Program optimization approach motivated by cache coherency • Focuses on data layout • Results in objects decomposition C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency DOD (Data-Oriented Design) 65
  • 154. • Program optimization approach motivated by cache coherency • Focuses on data layout • Results in objects decomposition C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency DOD (Data-Oriented Design) Keep data used together close to each other 65
  • 155. class objectMgr { std::vector<coordinates> positions_; std::vector<int> thresholds_; struct otherData { struct errorData { std::string errorTxt_1, errorTxt_2, errorTxt_3; }; errorData error; std::array<char, 100> data; }; std::vector<otherData> coldData_; public: explicit objectMgr(std::size_t size) : positions_{size}, thresholds_(size), coldData_{size} {} std::size_t size() const { return positions_.size(); } const coordinates& position(std::size_t idx) const { return positions_[idx]; } int threshold(std::size_t idx) const { return thresholds_[idx]; } }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Object decomposition example 66
  • 156. struct A { char c; short s; int i; double d; static double dd; }; static_assert( sizeof(A) == ??); static_assert(alignof(A) == ??); struct B { char c; double d; short s; int i; static double dd; }; static_assert( sizeof(B) == ??); static_assert(alignof(B) == ??); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Members order might be important 67
  • 157. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); struct B { char c; // size=1, alignment=1, padding=7 double d; // size=8, alignment=8, padding=0 short s; // size=2, alignment=2, padding=2 int i; // size=4, alignment=4, padding=0 static double dd; }; static_assert( sizeof(B) == 24); static_assert(alignof(B) == 8); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Members order might be important 67
  • 158. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); struct B { char c; // size=1, alignment=1, padding=7 double d; // size=8, alignment=8, padding=0 short s; // size=2, alignment=2, padding=2 int i; // size=4, alignment=4, padding=0 static double dd; }; static_assert( sizeof(B) == 24); static_assert(alignof(B) == 8); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Members order might be important In order to satisfy alignment requirements of all non-static members of a class, padding may be inserted after some of its members 67
  • 159. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Be aware of alignment side effects 68
  • 160. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); using opt = std::optional<A>; static_assert( sizeof(opt) == 24); static_assert(alignof(opt) == 8); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Be aware of alignment side effects 68
  • 161. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); using opt = std::optional<A>; static_assert( sizeof(opt) == 24); static_assert(alignof(opt) == 8); using array = std::array<opt, 256>; static_assert( sizeof(array) == 24 * 256); static_assert(alignof(array) == 8); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Be aware of alignment side effects 68
  • 162. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); using opt = std::optional<A>; static_assert( sizeof(opt) == 24); static_assert(alignof(opt) == 8); using array = std::array<opt, 256>; static_assert( sizeof(array) == 24 * 256); static_assert(alignof(array) == 8); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Be aware of alignment side effects Be aware of implementation details of the tools you use every day 68
  • 163. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); struct C { char c; double d; short s; static double dd; int i; } __attribute__((packed)); static_assert( sizeof(C) == 15); static_assert(alignof(C) == 1); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Packing 69
  • 164. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); struct C { char c; double d; short s; static double dd; int i; } __attribute__((packed)); static_assert( sizeof(C) == 15); static_assert(alignof(C) == 1); On modern hardware may be faster than aligned structure C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Packing 69
  • 165. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); struct C { char c; double d; short s; static double dd; int i; } __attribute__((packed)); static_assert( sizeof(C) == 15); static_assert(alignof(C) == 1); On modern hardware may be faster than aligned structure C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Packing Not portable! May be slower or even crash 69
• 166. Latency Numbers Every Programmer Should Know (https://gist.github.com/jboner/2841832)
       L1 cache reference                            0.5 ns
       Branch misprediction                            5 ns
       L2 cache reference                              7 ns               14x L1 cache
       Mutex lock/unlock                              25 ns
       Main memory reference                         100 ns               20x L2 cache, 200x L1 cache
       Compress 1K bytes with Zippy                3,000 ns
       Send 1K bytes over 1 Gbps network          10,000 ns    0.01 ms
       Read 4K randomly from SSD                 150,000 ns    0.15 ms    ~1GB/sec SSD
       Read 1 MB sequentially from memory        250,000 ns    0.25 ms
       Round trip within same datacenter         500,000 ns    0.5  ms
       Read 1 MB sequentially from SSD         1,000,000 ns    1    ms    ~1GB/sec SSD, 4X memory
       Disk seek                              10,000,000 ns   10    ms    20x datacenter roundtrip
       Read 1 MB sequentially from disk       20,000,000 ns   20    ms    80x memory, 20X SSD
       Send packet CA->Netherlands->CA       150,000,000 ns  150    ms
       C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency 70
  • 167. class vector_downward { uint8_t *make_space(size_t len) { if(len > static_cast<size_t>(cur_ - buf_)) { auto old_size = size(); auto largest_align = AlignOf<largest_scalar_t>(); reserved_ += (std::max)(len, growth_policy(reserved_)); // Round up to avoid undefined behavior from unaligned loads and stores. reserved_ = (reserved_ + (largest_align - 1)) & ~(largest_align - 1); auto new_buf = allocator_.allocate(reserved_); auto new_cur = new_buf + reserved_ - old_size; memcpy(new_cur, cur_, old_size); cur_ = new_cur; allocator_.deallocate(buf_); buf_ = new_buf; } cur_ -= len; // Beyond this, signed offsets may not have enough range: // (FlatBuffers > 2GB not supported). assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE); return cur_; } // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What is wrong here? 71
  • 168. class vector_downward { uint8_t *make_space(size_t len) { if(len > static_cast<size_t>(cur_ - buf_)) { auto old_size = size(); auto largest_align = AlignOf<largest_scalar_t>(); reserved_ += (std::max)(len, growth_policy(reserved_)); // Round up to avoid undefined behavior from unaligned loads and stores. reserved_ = (reserved_ + (largest_align - 1)) & ~(largest_align - 1); auto new_buf = allocator_.allocate(reserved_); auto new_cur = new_buf + reserved_ - old_size; memcpy(new_cur, cur_, old_size); cur_ = new_cur; allocator_.deallocate(buf_); buf_ = new_buf; } cur_ -= len; // Beyond this, signed offsets may not have enough range: // (FlatBuffers > 2GB not supported). assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE); return cur_; } // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What is wrong here? 71
  • 169. class vector_downward { uint8_t *make_space(size_t len) { if(len > static_cast<size_t>(cur_ - buf_)) reallocate(len); cur_ -= len; // Beyond this, signed offsets may not have enough range: // (FlatBuffers > 2GB not supported). assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE); return cur_; } void reallocate(size_t len) { auto old_size = size(); auto largest_align = AlignOf<largest_scalar_t>(); reserved_ += (std::max)(len, growth_policy(reserved_)); // Round up to avoid undefined behavior from unaligned loads and stores. reserved_ = (reserved_ + (largest_align - 1)) & ~(largest_align - 1); auto new_buf = allocator_.allocate(reserved_); auto new_cur = new_buf + reserved_ - old_size; memcpy(new_cur, cur_, old_size); cur_ = new_cur; allocator_.deallocate(buf_); buf_ = new_buf; } // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Separate the code to hot and cold paths 72
  • 170. class vector_downward { uint8_t *make_space(size_t len) { if(len > static_cast<size_t>(cur_ - buf_)) reallocate(len); cur_ -= len; // Beyond this, signed offsets may not have enough range: // (FlatBuffers > 2GB not supported). assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE); return cur_; } // ... }; Code is data too! C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Separate the code to hot and cold paths 73
• 171. class vector_downward { uint8_t *make_space(size_t len) { if(len > static_cast<size_t>(cur_ - buf_)) reallocate(len); cur_ -= len; // Beyond this, signed offsets may not have enough range: // (FlatBuffers > 2GB not supported). assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE); return cur_; } // ... }; Code is data too! C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Separate the code to hot and cold paths Performance improvement of 20% thanks to better inlining 73
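On top of moving the slow path into a separate function, the compiler can be told explicitly which branch is cold. A minimal sketch (my own, not the FlatBuffers source; the class and members are simplified) using the C++20 [[unlikely]] hint together with the GCC/Clang cold and noinline attributes:

#include <cstddef>
#include <cstdint>

class buffer {
public:
    std::uint8_t* make_space(std::size_t len) {
        if(len > static_cast<std::size_t>(cur_ - buf_)) [[unlikely]]  // C++20 branch hint
            reallocate(len);                                          // cold path, never inlined
        cur_ -= len;
        return cur_;
    }

private:
    [[gnu::noinline, gnu::cold]]        // GCC/Clang: keep the slow path out of the hot code
    void reallocate(std::size_t len);   // grows the buffer; definition omitted in this sketch

    std::uint8_t* buf_ = nullptr;
    std::uint8_t* cur_ = nullptr;
};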
  • 172. std::optional<Error> validate(const request& r) { switch(r.type) { case request::type_1: if(/* simple check */) return std::nullopt; return /* complex error msg generation */; case request::type_2: if(/* simple check */) return std::nullopt; return /* complex error msg generation */; case request::type_3: if(/* simple check */) return std::nullopt; return /* complex error msg generation */; // ... } throw std::logic_error(""); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Typical validation and error handling 74
• 173. std::optional<Error> validate(const request& r) { if(is_valid(r)) return std::nullopt; return make_error(r); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Isolating cold path 75
• 174. bool is_valid(const request& r) { switch(r.type) { case request::type_1: return /* simple check */; case request::type_2: return /* simple check */; case request::type_3: return /* simple check */; // ... } throw std::logic_error(""); } std::optional<Error> validate(const request& r) { if(is_valid(r)) return std::nullopt; return make_error(r); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Isolating cold path 75
• 175. bool is_valid(const request& r) { switch(r.type) { case request::type_1: return /* simple check */; case request::type_2: return /* simple check */; case request::type_3: return /* simple check */; // ... } throw std::logic_error(""); } Error make_error(const request& r) { switch(r.type) { case request::type_1: return /* complex error msg generation */; case request::type_2: return /* complex error msg generation */; case request::type_3: return /* complex error msg generation */; // ... } throw std::logic_error(""); } std::optional<Error> validate(const request& r) { if(is_valid(r)) return std::nullopt; return make_error(r); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Isolating cold path 75
  • 176. if(expensiveCheck() && fastCheck()) { /* ... */ } if(fastCheck() && expensiveCheck()) { /* ... */ } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Expression short-circuiting WRONG GOOD 76
  • 177. if(expensiveCheck() && fastCheck()) { /* ... */ } if(fastCheck() && expensiveCheck()) { /* ... */ } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Expression short-circuiting WRONG GOOD Bail out as early as possible and continue fast path 76
  • 178. int foo(int i) { int k = 0; for(int j = i; j < i + 10; ++j) ++k; return k; } int foo(unsigned i) { int k = 0; for(unsigned j = i; j < i + 10; ++j) ++k; return k; } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Integer arithmetic 77
  • 179. int foo(int i) { int k = 0; for(int j = i; j < i + 10; ++j) ++k; return k; } int foo(unsigned i) { int k = 0; for(unsigned j = i; j < i + 10; ++j) ++k; return k; } foo(int): mov eax, 10 ret foo(unsigned int): cmp edi, -10 sbb eax, eax and eax, 10 ret C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Integer arithmetic 77
  • 180. int foo(int i) { int k = 0; for(int j = i; j < i + 10; ++j) ++k; return k; } int foo(unsigned i) { int k = 0; for(unsigned j = i; j < i + 10; ++j) ++k; return k; } foo(int): mov eax, 10 ret foo(unsigned int): cmp edi, -10 sbb eax, eax and eax, 10 ret C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Integer arithmetic Integer arithmetic differs for the signed and unsigned integral types 77
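The reason for the difference (my explanation, not a slide): signed overflow is undefined behaviour, so in the int version the compiler may assume i + 10 never wraps and the loop runs exactly 10 times, which folds the whole function to the constant 10. Unsigned arithmetic wraps by definition, so when i + 10 wraps around the condition is false immediately and the function returns 0; that is why the extra compare survives in the assembly. The same code with explanatory comments:

#include <climits>

int foo_signed(int i) {         // compiler may fold this to: return 10;
    int k = 0;
    for(int j = i; j < i + 10; ++j) ++k;   // i + 10 assumed not to overflow (overflow would be UB)
    return k;
}

int foo_unsigned(unsigned i) {  // must return 0 when i + 10 wraps around
    int k = 0;
    for(unsigned j = i; j < i + 10; ++j) ++k;   // e.g. foo_unsigned(UINT_MAX - 5) == 0
    return k;
}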
  • 181. • Avoid using std::list, std::map, etc C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 78
  • 182. • Avoid using std::list, std::map, etc • std::unordered_map is also broken C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 78
  • 183. • Avoid using std::list, std::map, etc • std::unordered_map is also broken • Prefer std::vector – also as the underlying storage for hash tables or flat maps – consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 78
  • 184. • Avoid using std::list, std::map, etc • std::unordered_map is also broken • Prefer std::vector – also as the underlying storage for hash tables or flat maps – consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar • Preallocate storage – free list – plf::colony C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 78
  • 185. • Avoid using std::list, std::map, etc • std::unordered_map is also broken • Prefer std::vector – also as the underlying storage for hash tables or flat maps – consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar • Preallocate storage – free list – plf::colony • Consider storing only pointers to objects in the container • Consider storing expensive hash values with the key C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 78
  • 186. • Avoid using std::list, std::map, etc • std::unordered_map is also broken • Prefer std::vector – also as the underlying storage for hash tables or flat maps – consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar • Preallocate storage – free list – plf::colony • Consider storing only pointers to objects in the container • Consider storing expensive hash values with the key • Limit the number of type conversions C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 78
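As a follow-up to the "preallocate storage" point above, a minimal sketch (my own; the class name and interface are hypothetical) of a vector-backed pool with a free list, so that the hot path never touches the allocator. T must be default constructible here; pointers stay valid because the backing vector never reallocates after the initial reserve.

#include <cstddef>
#include <vector>

template<typename T>
class object_pool {
public:
    explicit object_pool(std::size_t capacity) {
        storage_.reserve(capacity);        // single up-front allocation
        free_.reserve(capacity);           // releases never allocate either
    }

    T* acquire() {
        if(!free_.empty()) {               // reuse a previously released slot
            T* p = free_.back();
            free_.pop_back();
            return p;
        }
        if(storage_.size() < storage_.capacity()) {
            storage_.emplace_back();       // no reallocation, pointers remain stable
            return &storage_.back();
        }
        return nullptr;                    // pool exhausted; the cold path decides what to do
    }

    void release(T* p) { free_.push_back(p); }

private:
    std::vector<T> storage_;               // contiguous, cache friendly
    std::vector<T*> free_;
};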
  • 187. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 188. • Keep the number of threads close (less or equal) to the number of available physical CPU cores C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 189. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 190. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 191. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 192. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 193. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible • The simpler the code, the faster it is likely to be C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 194. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible • The simpler the code, the faster it is likely to be • Do not try to be smarter than the compiler C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 195. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible • The simpler the code, the faster it is likely to be • Do not try to be smarter than the compiler • Know the language, tools, and libraries C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 196. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible • The simpler the code, the faster it is likely to be • Do not try to be smarter than the compiler • Know the language, tools, and libraries • Know your hardware! C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
• 197. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible • The simpler the code, the faster it is likely to be • Do not try to be smarter than the compiler • Know the language, tools, and libraries • Know your hardware! • Bypass the kernel (100% user space code) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
• 198. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible • The simpler the code, the faster it is likely to be • Do not try to be smarter than the compiler • Know the language, tools, and libraries • Know your hardware! • Bypass the kernel (100% user space code) • Measure performance… ALWAYS C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
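Following up on the "fixed size lock free queues / busy spins" recommendation above, a minimal sketch (my own, not from the talk) of a single-producer / single-consumer ring buffer where both sides busy-spin instead of blocking. Production implementations add more false-sharing padding and handle non-trivial payload types more carefully.

#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template<typename T, std::size_t N>        // N must be a power of two
class spsc_queue {
    static_assert((N & (N - 1)) == 0, "capacity must be a power of two");
public:
    bool try_push(const T& value) {
        const auto head = head_.load(std::memory_order_relaxed);
        const auto tail = tail_.load(std::memory_order_acquire);   // see consumer progress
        if(head - tail == N) return false;                         // full
        buffer_[head & (N - 1)] = value;
        head_.store(head + 1, std::memory_order_release);          // publish the new element
        return true;
    }

    std::optional<T> try_pop() {
        const auto tail = tail_.load(std::memory_order_relaxed);
        const auto head = head_.load(std::memory_order_acquire);   // see producer's writes
        if(head == tail) return std::nullopt;                      // empty
        T value = buffer_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);          // free the slot for reuse
        return value;
    }

private:
    std::array<T, N> buffer_{};
    alignas(64) std::atomic<std::size_t> head_{0};   // written only by the producer
    alignas(64) std::atomic<std::size_t> tail_{0};   // written only by the consumer
};

// Busy-spin usage on the consumer thread:
//   spsc_queue<int, 1024> q;
//   for(;;) { if(auto v = q.try_pop()) handle(*v); }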
  • 199. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The most important recommendation 80
  • 200. Always measure your performance! C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The most important recommendation 80
  • 201. • Always measure Release version cmake -DCMAKE_BUILD_TYPE=Release .. cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to measure the performance of your programs 81
• 202. • Always measure the Release version cmake -DCMAKE_BUILD_TYPE=Release .. cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. • Prefer hardware-based black-box performance measurements C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to measure the performance of your programs 81
• 203. • Always measure the Release version cmake -DCMAKE_BUILD_TYPE=Release .. cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. • Prefer hardware-based black-box performance measurements • In case that is not possible, or you want to debug a specific performance issue, use a profiler • To gather meaningful stack traces, preserve the frame pointer cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" .. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to measure the performance of your programs 81
• 204. • Always measure the Release version cmake -DCMAKE_BUILD_TYPE=Release .. cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. • Prefer hardware-based black-box performance measurements • In case that is not possible, or you want to debug a specific performance issue, use a profiler • To gather meaningful stack traces, preserve the frame pointer cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" .. • Familiarize yourself with Linux perf tools (xperf on Windows) and flame graphs • Use tools like Intel VTune C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to measure the performance of your programs 81
• 205. • Always measure the Release version cmake -DCMAKE_BUILD_TYPE=Release .. cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. • Prefer hardware-based black-box performance measurements • In case that is not possible, or you want to debug a specific performance issue, use a profiler • To gather meaningful stack traces, preserve the frame pointer cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" .. • Familiarize yourself with Linux perf tools (xperf on Windows) and flame graphs • Use tools like Intel VTune • Verify the output assembly code C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to measure the performance of your programs 81
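When a quick in-process number is useful before (not instead of) the hardware-based or perf/VTune measurements recommended above, a minimal sketch (my own example, GCC/Clang only because of the inline-asm optimization barrier; it assumes the measured callable returns a value):

#include <chrono>
#include <cstdint>
#include <utility>

// Prevents the compiler from optimizing away the measured result (GCC/Clang).
template<typename T>
inline void do_not_optimize(T const& value) {
    asm volatile("" : : "m"(value) : "memory");
}

template<typename F>
std::uint64_t measure_ns(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    auto result = std::forward<F>(f)();
    do_not_optimize(result);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Usage note: run the measurement many times and inspect the latency distribution,
// especially the tail, rather than a single run or the average.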
• 206. [Flame graph screenshot: interactive SVG of sampled bash call stacks; individual frame labels are not recoverable from this export] C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Flamegraph 82