Mateusz Pusz
November 30, 2019
Striving for ultimate Low Latency
INTRODUCTION TO DEVELOPMENT OF LOW LATENCY SYSTEMS
Special thanks to Carl Cook and Chris Kohlhoff
ISO C++ Committee (WG21) Study Group 14 (SG14)
Latency vs Throughput
Latency is the time required to perform some action or to produce
some result. Measured in units of time like hours, minutes,
seconds, nanoseconds, or clock periods.
Throughput is the number of such actions executed or results
produced per unit of time. Measured in units of whatever is being
produced per unit of time.
What do we mean by Low Latency?
Low Latency allows human-unnoticeable delays between an input being processed
and the corresponding output, providing real-time characteristics.
Especially important for internet connections utilizing services such as
trading, online gaming, and VoIP
Why do we strive for Low latency?
• In VoIP substantial delays between input from conversation participants may impair their
communication
• In online gaming a player with a high latency internet connection may show slow responses in spite of
superior tactics or appropriate reaction time
• Within capital markets the proliferation of algorithmic trading requires firms to react to market events
faster than the competition to increase profitability of trades
High-Frequency Trading (HFT)
A program trading platform that uses powerful computers to
transact a large number of orders at very fast speeds
-- Investopedia
• Using complex algorithms to analyze multiple markets and execute orders based on market conditions
• Buying and selling of securities many times over a period of time (often hundreds of times an hour)
• Done to profit from time-sensitive opportunities that arise during trading hours
• Implies high turnover of capital (i.e. one's entire capital or more in a single day)
• Typically, the traders with the fastest execution speeds are more profitable
Market data processing
How fast do we do?
ALL SOFTWARE APPROACH: 1-10 us | ALL HARDWARE APPROACH: 100-1000 ns
• Average human eye blink takes 350 000 us (about 1/3 s)
• Millions of orders can be traded in that time
What if something goes wrong?
KNIGHT CAPITAL
• In 2012 it was the largest trader in U.S. equities
• Market share
– 17.3% on NYSE
– 16.9% on NASDAQ
• Had approximately $365 million in cash
and equivalents
• Average daily trading volume
– 3.3 billion trades
– trading over 21 billion dollars
• pre-tax loss of $440 million in 45 minutes
-- LinkedIn
C++ often not the most important part of the system
1 Low Latency network
2 Modern hardware
3 BIOS profiling
4 Kernel profiling
5 OS profiling
Spin, pin, and drop-in
SPIN
• Don't sleep
• Don't context switch
• Prefer single-threaded scheduling
• Disable locking and thread support
• Disable power management
• Disable C-states
• Disable interrupt coalescing
PIN
• Assign CPU affinity (see the sketch below)
• Assign interrupt affinity
• Assign memory to NUMA nodes
• Consider the physical location of NICs
• Isolate cores from general OS use
• Use a system with a single physical CPU
DROP-IN
• Choose NIC vendors based on performance and availability of drop-in kernel bypass libraries
• Use the kernel bypass library
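A minimal sketch of the "assign CPU affinity" item above (Linux/glibc-specific; core number 3 and the isolcpus setting are illustrative assumptions, not a recommendation):

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core so the scheduler never migrates it
// (typically a core previously isolated from general OS use, e.g. via isolcpus).
bool pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main()
{
    pin_to_core(3);   // hypothetical isolated core
    // ... spin on the fast path here instead of sleeping ...
}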
Let's focus on the software
Characteristics of Low Latency software
• Typically only a small part of code is really important (a fast/hot path)
• This code is not executed often
• When it is executed it has to
– start and finish as soon as possible
– have predictable and reproducible performance (low jitter)
• Multithreading increases latency
– we need low latency and not throughput
– concurrency (even on different cores) thrashes CPU caches above L1, shares the memory bus, shares IO, shares the network
• Mistakes are really expensive
– good error checking and recovery is mandatory
– one second is 4 billion CPU instructions (a lot can happen in that time)
How to develop software that has predictable performance?
It turns out that the more important question here is...
How NOT to develop software that has predictable performance?
• In a Low Latency system we care a lot about WCET (Worst Case Execution Time)
• In order to limit WCET we should limit the usage of specific C++ language features
• A task not only for developers but also for code architects
The list of things to avoid on the fast path
1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>)
2 Throwing exceptions on a likely code path
3 Dynamic polymorphism
4 Multiple inheritance
5 RTTI
6 Dynamic memory allocations
std::shared_ptr<T>
template<class T>
class shared_ptr;
• Smart pointer that retains shared ownership of an object through a pointer
• Several shared_ptr objects may own the same object
• The shared object is destroyed and its memory deallocated when the last remaining shared_ptr owning that object is either destroyed or assigned another pointer via operator= or reset()
• Supports a user-provided deleter
Too often overused by C++ programmers
Question: What is the difference here?
void foo()
{
  std::unique_ptr<int> ptr{new int{1}};
  // some code using 'ptr'
}
void foo()
{
  std::shared_ptr<int> ptr{new int{1}};
  // some code using 'ptr'
}
Key std::shared_ptr<T> issues
• Shared state
– performance + memory footprint
• Mandatory synchronization
– performance
• Type Erasure
– performance
• std::weak_ptr<T> support
– memory footprint
• Aliasing constructor
– memory footprint
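A minimal sketch illustrating the footprint and bookkeeping difference (the exact sizes are implementation details of typical 64-bit standard libraries, not guarantees):

#include <cstdio>
#include <memory>

int main()
{
    // On mainstream 64-bit implementations unique_ptr<int> is a single pointer,
    // while shared_ptr<int> stores two (object pointer + control-block pointer).
    std::printf("unique_ptr<int>: %zu bytes\n", sizeof(std::unique_ptr<int>));
    std::printf("shared_ptr<int>: %zu bytes\n", sizeof(std::shared_ptr<int>));

    auto up = std::make_unique<int>(1); // plain allocation, no reference counting
    auto sp = std::make_shared<int>(1); // also creates a control block with atomic ref counts
}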
More info on code::dive 2016
C++ exceptions
• Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++ exceptions
• ... if they are not thrown
• Throwing an exception can take significant and non-deterministic time
• Advantages of C++ exceptions usage
– cannot be ignored!
– simplify interfaces
– (if not thrown) actually can improve application performance
– make source code of the likely path easier to reason about
Not using C++ exceptions is not an excuse to write exception-unsafe code!
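A minimal sketch of keeping throws off the likely path (the order type and parse_order function are illustrative, not from the talk): the happy path stays tight and branch-predictable, and the throw sits only on the unlikely branch:

#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string_view>

struct order { std::int64_t price; std::int64_t quantity; };

order parse_order(std::string_view msg)
{
    if(msg.size() < sizeof(order))                            // rare on the fast path
        throw std::runtime_error("truncated order message");  // cold, unlikely branch
    order o;
    std::memcpy(&o, msg.data(), sizeof(order));               // hot, no-throw work
    return o;
}

int main()
{
    char raw[sizeof(order)] = {};
    order o = parse_order({raw, sizeof(raw)});
    return static_cast<int>(o.quantity);
}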
Polymorphism
DYNAMIC
class base : noncopyable {
virtual void setup() = 0;
virtual void run() = 0;
virtual void cleanup() = 0;
public:
virtual ~base() = default;
void process()
{
setup();
run();
cleanup();
}
};
class derived : public base {
void setup() override { /* ... */ }
void run() override { /* ... */ }
void cleanup() override { /* ... */ }
};
• Additional pointer stored in an object
• Extra indirection (pointer dereference)
• Often not possible to devirtualize
• Not inlined
• Instruction cache miss
STATIC
template<class Derived>
class base {
public:
void process()
{
static_cast<Derived*>(this)->setup();
static_cast<Derived*>(this)->run();
static_cast<Derived*>(this)->cleanup();
}
};
class derived : public base<derived> {
friend class base<derived>;
void setup() { /* ... */ }
void run() { /* ... */ }
void cleanup() { /* ... */ }
};
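A short usage note (a sketch): with the CRTP version above the calls made by process() are resolved at compile time, so they can be inlined and no vtable lookup happens:

int main()
{
    derived d;    // the CRTP 'derived' from the STATIC variant above
    d.process();  // setup()/run()/cleanup() bound statically, inlinable, no indirection
}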
Multiple inheritance
• this pointer adjustments needed to call member functions (for non-empty base classes)
• Virtual inheritance as an answer to the diamond of dread
• virtual in C++ means "determined at runtime"
• Extra indirection to access data members
Always prefer composition over inheritance!
RunTime Type Identification (RTTI)
class base : noncopyable {
public:
virtual ~base() = default;
virtual void foo() = 0;
};
class derived : public base {
public:
void foo() override;
void boo();
};
void foo(base& b)
{
derived* d = dynamic_cast<derived*>(&b);
if(d) {
d->boo();
}
}
Often the sign of a smelly design
• Traversing an inheritance tree
• Comparisons
void foo(base& b)
{
if(typeid(b) == typeid(derived)) {
derived* d = static_cast<derived*>(&b);
d->boo();
}
}
• Only one comparison of std::type_info
• Often only one runtime pointer compare
Dynamic memory allocations
• General purpose operation
• Nondeterministic execution performance
• Causes memory fragmentation
• May fail (error handling is needed)
• Memory leaks possible if not properly handled
Custom allocators to the rescue
• Address specific needs (functionality and hardware constraints)
• Typically low number of dynamic memory allocations
• Data structures needed to manage big chunks of memory
template<typename T> struct pool_allocator {
  T* allocate(std::size_t n);
  void deallocate(T* p, std::size_t n);
};
using pool_string = std::basic_string<char, std::char_traits<char>, pool_allocator<char>>;
Preallocation makes the allocator's jitter more stable, helps keep related data together, and avoids long-term fragmentation.
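A minimal sketch of the preallocation idea using the standard C++17 polymorphic memory resources instead of a hand-written pool_allocator (the arena size and contents are illustrative):

#include <array>
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

int main()
{
    // Preallocated arena on the stack; null_memory_resource() as upstream means
    // we never silently fall back to the general-purpose heap.
    std::array<std::byte, 4096> arena;
    std::pmr::monotonic_buffer_resource pool{arena.data(), arena.size(),
                                             std::pmr::null_memory_resource()};

    std::pmr::string name{"order-book", &pool};            // any allocation comes from the arena
    std::pmr::vector<int> levels{{100, 101, 102}, &pool};  // nodes and buffers too
}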
Small Object Optimization (SOO / SSO / SBO)
• Prevent dynamic memory allocation for the (common) case of dealing with small objects
class sso_string {
char* data_ = u_.sso_;
size_t size_ = 0;
union {
char sso_[16] = "";
size_t capacity_;
} u_;
public:
[[nodiscard]] size_t capacity() const
{
return data_ == u_.sso_ ? sizeof(u_.sso_) - 1 : u_.capacity_;
}
// ...
};
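For reference, the standard std::string already applies this optimization on mainstream implementations (roughly 15 characters fit inline in libstdc++ and 22 in libc++; an implementation detail, not a guarantee), so short strings avoid the heap even without a custom type:

#include <string>

int main()
{
    std::string symbol{"MSFT"};  // short enough to live in the in-object SSO buffer,
                                 // so no dynamic allocation on typical implementations
}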
No dynamic allocation
template<std::size_t MaxSize>
class inplace_string {
std::array<value_type, MaxSize + 1> chars_;
public:
// string-like interface
};
struct db_contact {
inplace_string<7> symbol;
inplace_string<15> name;
inplace_string<15> surname;
inplace_string<23> company;
};
No dynamic memory allocations or pointer indirections, guaranteed, at the cost of possibly bigger memory usage
The list of things to do on the fast path
1 Use tools that improve efficiency without sacrificing performance
2 Use compile time wherever possible
3 Know your hardware
4 Clearly isolate cold code from the fast path
Examples of safe-to-use C++ features
• static_assert()
• Automatic type deduction
• Type aliases
• Move semantics
• noexcept
• constexpr
• Lambda expressions
• type_traits
• std::string_view
• Variadic templates
• and many more...
The fastest programs are those that do nothing
Do you agree?
Compile-time calculations (C++98)
int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
foo(factorial(n)); // not compile-time
foo(factorial(4)); // not compile-time
template<int N>
struct Factorial {
static const int value =
N * Factorial<N - 1>::value;
};
template<>
struct Factorial<0> {
static const int value = 1;
};
int array[Factorial<4>::value];
const int precalc_values[] = {
Factorial<0>::value,
Factorial<1>::value,
Factorial<2>::value
};
foo(Factorial<4>::value); // compile-time
// foo(Factorial<n>::value); // compile-time error
Just too hard!
• 2 separate implementations
– hard to keep in sync
• Template metaprogramming is hard!
• Easier to precalculate in Excel and hardcode the results in the code ;-)
const int precalc_values[] = {
1,
1,
2,
6,
24,
120,
720,
// ...
};
constexpr specifier
Declares that it is possible to evaluate the value of the function or
variable at compile time
• constexpr specifier used in an object declaration implies const
• constexpr specifier used in a function or static member variable declaration implies inline
• Such variables and functions can be used in compile time constant expressions (provided that
appropriate function arguments are given)
constexpr int factorial(int n);
std::array<int, factorial(4)> array;
Compile-time calculations (C++11)
• A constexpr function must not be virtual
• The function body may contain only
– static_assert declarations
– alias declarations
– using declarations and directives
– exactly one return statement with an
expression that does not mutate any of the
objects
constexpr int factorial(int n)
{
return n <= 1 ? 1 : (n * factorial(n - 1));
}
std::array<int, factorial(4)> array;
constexpr std::array<int, 3> precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24, "Error");
foo(factorial(4)); // maybe compile-time
foo(factorial(n)); // not compile-time
Compile-time calculations (C++14)
• A constexpr function must not be virtual
• The function body must not contain
– a goto statement and labels other than case
and default
– a try block
– an asm declaration
– variable of non-literal type
– variable of static or thread storage duration
– uninitialized variable
– evaluation of a lambda expression
– reinterpret_cast and dynamic_cast
– changing an active member of an union
– a new or delete expression
constexpr int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
std::array<int, factorial(4)> array;
constexpr std::array<int, 3> precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24, "Error");
foo(factorial(4)); // maybe compile-time
foo(factorial(n)); // not compile-time
Compile-time calculations (C++17)
• A constexpr function must not be virtual
• The function body must not contain
– a goto statement and labels other than case
and default
– a try block
– an asm declaration
– variable of non-literal type
– variable of static or thread storage duration
– uninitialized variable
– reinterpret_cast and dynamic_cast
– changing an active member of an union
– a new or delete expression
constexpr int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
std::array<int, factorial(4)> array;
constexpr std::array precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24);
foo(factorial(4)); // maybe compile-time
foo(factorial(n)); // not compile-time
Compile-time calculations (C++20)
• The function body must not contain
– a goto statement and labels other than case
and default
– variable of non-literal type
– variable of static or thread storage duration
– uninitialized variable
– reinterpret_cast
constexpr int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
std::array<int, factorial(4)> array;
constexpr std::array precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24);
foo(factorial(4)); // maybe compile-time
foo(factorial(n)); // not compile-time
Compile-time calculations (C++20): consteval
• The consteval specifier declares a function to
be an immediate function
• Every call must produce a compile time
constant expression
• An immediate function is a constexpr function
consteval int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
std::array<int, factorial(4)> array;
constexpr std::array precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24);
foo(factorial(4)); // compile-time
// foo(factorial(n)); // compile-time error
Compile-time dispatch (C++14)
template<typename T>
struct is_array : std::false_type {};
template<typename T>
struct is_array<T[]> : std::true_type {};
template<typename T>
constexpr bool is_array_v = is_array<T>::value;
static_assert(is_array_v<int> == false);
static_assert(is_array_v<int[]>);
template<typename T>
class smart_ptr {
void destroy(std::true_type) noexcept
{ delete[] ptr_; }
void destroy(std::false_type) noexcept
{ delete ptr_; }
void destroy() noexcept
{ destroy(is_array<T>()); }
// ...
};
Tag dispatch provides the possibility to select the proper function overload at compile time based on properties of a type
Compile-time dispatch (C++17)
template<typename T>
inline constexpr bool is_array = false;
template<typename T>
inline constexpr bool is_array<T[]> = true;
static_assert(is_array<int> == false);
static_assert(is_array<int[]>);
template<typename T>
class smart_ptr {
void destroy() noexcept
{
if constexpr(is_array<T>)
delete[] ptr_;
else
delete ptr_;
}
// ...
};
Usage of variable templates and if constexpr decreases compilation time
Type traits: A negative overhead abstraction
struct X {
int a, b, c;
int id;
};
struct X {
int a, b, c;
std::string id;
};
void foo(const std::vector<X>& a1, std::vector<X>& a2)
{
memcpy(a2.data(), a1.data(), a1.size() * sizeof(X));
}
Ooops!!!
void foo(const std::vector<X>& a1, std::vector<X>& a2)
{
std::copy(begin(a1), end(a1), begin(a2));
}
movq %rsi, %rax
movq 8(%rdi), %rdx
movq (%rdi), %rsi
cmpq %rsi, %rdx
je .L1
movq (%rax), %rdi
subq %rsi, %rdx
jmp memmove
// 100 lines of assembly code
Type traits: Compile-time contract
struct X {
int a, b, c;
int id;
};
struct X {
int a, b, c;
std::string id;
};
static_assert(std::is_trivially_copyable_v<X>);
With int id: Compiler returned: 0
With std::string id:
<source>:9:20: error: static assertion failed
    9 | static_assert(std::is_trivially_copyable_v<X>);
      |               ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
Compiler returned: 1
Enforce compile-time contracts and class invariants with contracts or
static_asserts
Type traits: Compile-time branching
enum class side { BID, ASK };
class order_book {
template<side S>
class book_side {
using compare = std::conditional_t<S == side::BID, std::greater<>, std::less<>>;
std::vector<price> levels_;
public:
void insert(order o)
{
const auto it = lower_bound(begin(levels_), end(levels_), o.price, compare{});
if(it != end(levels_) && *it != o.price)
levels_.insert(it, o.price);
// ...
}
bool match(price p) const
{
return compare{}(levels_.back(), p);
}
// ...
};
book_side<side::BID> bids_;
book_side<side::ASK> asks_;
// ...
};
What is wrong here?
constexpr int array_size = 10'000;
int array[array_size][array_size];
for(auto i = 0L; i < array_size; ++i) {
for(auto j = 0L; j < array_size; ++j) {
array[j][i] = i + j;
}
}
Reckless cache usage can cost you a lot of performance!
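A sketch of the cache-friendly traversal implied above, flattened into a single heap allocation (about 400 MB) so the example is self-contained; the key point is that the inner loop walks consecutive addresses:

#include <vector>

int main()
{
    constexpr long n = 10'000;
    std::vector<int> array(n * n);      // one contiguous block: the 2D array, flattened
    for(long i = 0; i < n; ++i)
        for(long j = 0; j < n; ++j)
            array[i * n + j] = i + j;   // row-major order matches the memory layout
}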
Quiz: How much slower is the bad case?
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz
Stepping: 10
CPU MHz: 1700.035
CPU max MHz: 1700,0000
CPU min MHz: 400,0000
BogoMIPS: 3799.90
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-7
Please everybody stand up :-)
• Similar
• < 2x
• < 5x
• < 10x
• < 20x
• < 50x
• < 100x
• >= 100x
GCC 7.4.0:
• column-major: 2490 ms
• row-major: 51 ms
• ratio: 47x
GCC 9.2.1:
• column-major: 51 ms
• row-major: 50 ms
• ratio: Similar :-)
CPU data cache
Column-major loop:
3 274,28 msec task-clock # 0,999 CPUs utilized
5 536 529 085 cycles # 1,691 GHz (66,53%)
1 246 414 976 instructions # 0,23 insn per cycle (83,27%)
92 872 234 L1-dcache-loads # 28,364 M/sec (83,29%)
410 636 090 L1-dcache-load-misses # 442,15% of all L1-dcache hits (83,39%)
87 191 333 LLC-loads # 26,629 M/sec (83,39%)
527 256 LLC-load-misses # 0,60% of all LL-cache hits (83,41%)
97 781 page-faults # 0,030 M/sec
Row-major loop:
360,48 msec task-clock # 0,999 CPUs utilized
609 105 177 cycles # 1,690 GHz (66,71%)
891 120 837 instructions # 1,46 insn per cycle (83,36%)
90 219 862 L1-dcache-loads # 250,279 M/sec (83,35%)
14 481 217 L1-dcache-load-misses # 16,05% of all L1-dcache hits (83,35%)
167 850 LLC-loads # 0,466 M/sec (83,36%)
79 492 LLC-load-misses # 47,36% of all LL-cache hits (83,22%)
97 783 page-faults # 0,271 M/sec
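These counters come from the CPU's performance monitoring unit; on Linux a typical way to collect them is perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,page-faults ./a.out (a representative invocation, not necessarily the exact command used for the numbers above).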
Another example
struct coordinates { int x, y; };
void draw(const coordinates& coord);
void verify(int threshold);
constexpr int OBJECT_COUNT = 1'000'000;
class objectMgr;
void process(const objectMgr& mgr)
{
const auto size = mgr.size();
for(auto i = 0UL; i < size; ++i) { draw(mgr.position(i)); }
for(auto i = 0UL; i < size; ++i) { verify(mgr.threshold(i)); }
}
Naive objectMgr implementation
class objectMgr {
struct object {
coordinates coord;
std::string errorTxt_1;
std::string errorTxt_2;
std::string errorTxt_3;
int threshold;
std::array<char, 100> otherData;
};
std::vector<object> data_;
public:
explicit objectMgr(std::size_t size) : data_{size} {}
std::size_t size() const { return data_.size(); }
const coordinates& position(std::size_t idx) const { return data_[idx].coord; }
int threshold(std::size_t idx) const { return data_[idx].threshold; }
};
DOD (Data-Oriented Design)
• Program optimization approach motivated by efficient cache usage
• Focuses on data layout
• Results in object decomposition
Keep data used together close to each other
Object decomposition example
class objectMgr {
std::vector<coordinates> positions_;
std::vector<int> thresholds_;
struct otherData {
struct errorData { std::string errorTxt_1, errorTxt_2, errorTxt_3; };
errorData error;
std::array<char, 100> data;
};
std::vector<otherData> coldData_;
public:
explicit objectMgr(std::size_t size) :
positions_{size}, thresholds_(size), coldData_{size} {}
std::size_t size() const { return positions_.size(); }
const coordinates& position(std::size_t idx) const { return positions_[idx]; }
int threshold(std::size_t idx) const { return thresholds_[idx]; }
};
Members order might be important
struct A {
char c; // size=1, alignment=1, padding=1
short s; // size=2, alignment=2, padding=0
int i; // size=4, alignment=4, padding=0
double d; // size=8, alignment=8, padding=0
static double dd;
};
static_assert( sizeof(A) == 16);
static_assert(alignof(A) == 8);
struct B {
char c; // size=1, alignment=1, padding=7
double d; // size=8, alignment=8, padding=0
short s; // size=2, alignment=2, padding=2
int i; // size=4, alignment=4, padding=0
static double dd;
};
static_assert( sizeof(B) == 24);
static_assert(alignof(B) == 8);
In order to satisfy alignment requirements of all non-static members of a class, padding may be inserted after some of its members
Be aware of alignment side effects
struct A {
char c; // size=1, alignment=1, padding=1
short s; // size=2, alignment=2, padding=0
int i; // size=4, alignment=4, padding=0
double d; // size=8, alignment=8, padding=0
static double dd;
};
static_assert( sizeof(A) == 16);
static_assert(alignof(A) == 8);
using opt = std::optional<A>;
static_assert( sizeof(opt) == 24);
static_assert(alignof(opt) == 8);
using array = std::array<opt, 256>;
static_assert( sizeof(array) == 24 * 256);
static_assert(alignof(array) == 8);
Be aware of implementation details of the tools you use every day
Packing
struct A {
char c; // size=1, alignment=1, padding=1
short s; // size=2, alignment=2, padding=0
int i; // size=4, alignment=4, padding=0
double d; // size=8, alignment=8, padding=0
static double dd;
};
static_assert( sizeof(A) == 16);
static_assert(alignof(A) == 8);
struct C {
char c;
double d;
short s;
static double dd;
int i;
} __attribute__((packed));
static_assert( sizeof(C) == 15);
static_assert(alignof(C) == 1);
On modern hardware may be faster than aligned structure
Not portable! May be slower or even crash
Latency Numbers Every Programmer Should Know
L1 cache reference 0.5 ns
Branch misprediction 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns
Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
Read 4K randomly from SSD 150,000 ns 0.15 ms ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 0.25 ms
Round trip within same datacenter 500,000 ns 0.5 ms
Read 1 MB sequentially from SSD 1,000,000 ns 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150 ms
https://gist.github.com/jboner/2841832
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Latency Numbers Every Programmer Should Know
70
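A rough, self-contained pointer-chasing loop (not from the slides) that can reproduce the memory-latency gaps from the table above on your own machine; the working-set size and iteration count are arbitrary assumptions. Each load depends on the previous one, so prefetchers cannot hide the latency:
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>
int main() {
  // ~128 MiB working set - far larger than typical caches, so each
  // dependent load should approximate a main-memory reference.
  constexpr std::size_t n = std::size_t{1} << 24;
  std::vector<std::size_t> next(n);
  std::iota(next.begin(), next.end(), std::size_t{0});
  std::shuffle(next.begin(), next.end(), std::mt19937_64{42});
  constexpr std::size_t iterations = 10'000'000;
  std::size_t i = 0;
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t k = 0; k < iterations; ++k) i = next[i];  // each load depends on the previous one
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  std::printf("%.1f ns per dependent load (%zu)\n", elapsed.count() / iterations, i);
}
Shrinking n so the working set fits in L1 or L2 should reproduce the corresponding rows of the table.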
class vector_downward {
uint8_t *make_space(size_t len) {
if(len > static_cast<size_t>(cur_ - buf_)) {
auto old_size = size();
auto largest_align = AlignOf<largest_scalar_t>();
reserved_ += (std::max)(len, growth_policy(reserved_));
// Round up to avoid undefined behavior from unaligned loads and stores.
reserved_ = (reserved_ + (largest_align - 1)) & ~(largest_align - 1);
auto new_buf = allocator_.allocate(reserved_);
auto new_cur = new_buf + reserved_ - old_size;
memcpy(new_cur, cur_, old_size);
cur_ = new_cur;
allocator_.deallocate(buf_);
buf_ = new_buf;
}
cur_ -= len;
// Beyond this, signed offsets may not have enough range:
// (FlatBuffers > 2GB not supported).
assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE);
return cur_;
}
// ...
};
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
What is wrong here?
71
class vector_downward {
uint8_t *make_space(size_t len) {
if(len > static_cast<size_t>(cur_ - buf_))
reallocate(len);
cur_ -= len;
// Beyond this, signed offsets may not have enough range:
// (FlatBuffers > 2GB not supported).
assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE);
return cur_;
}
void reallocate(size_t len) {
auto old_size = size();
auto largest_align = AlignOf<largest_scalar_t>();
reserved_ += (std::max)(len, growth_policy(reserved_));
// Round up to avoid undefined behavior from unaligned loads and stores.
reserved_ = (reserved_ + (largest_align - 1)) & ~(largest_align - 1);
auto new_buf = allocator_.allocate(reserved_);
auto new_cur = new_buf + reserved_ - old_size;
memcpy(new_cur, cur_, old_size);
cur_ = new_cur;
allocator_.deallocate(buf_);
buf_ = new_buf;
}
// ...
};
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Separate the code to hot and cold paths
72
class vector_downward {
uint8_t *make_space(size_t len) {
if(len > static_cast<size_t>(cur_ - buf_))
reallocate(len);
cur_ -= len;
// Beyond this, signed offsets may not have enough range:
// (FlatBuffers > 2GB not supported).
assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE);
return cur_;
}
// ...
};
Code is data too!
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Separate the code to hot and cold paths
73
class vector_downward {
uint8_t *make_space(size_t len) {
if(len > static_cast<size_t>(cur_ - buf_))
reallocate(len);
cur_ -= len;
// Beyond this, signed offsets may not have enough range:
// (FlatBuffers > 2GB not supported).
assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE);
return cur_;
}
// ...
};
Code is data too!
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Separate the code to hot and cold paths
Performance improvement of 20% thanks to better inlining
73
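A complementary trick, assumed here rather than shown on the slides, is to tell the compiler explicitly which path is cold so the hot path stays small and inlinable. [[unlikely]] is standard C++20, while noinline/cold are GCC/Clang-specific attributes; the reallocation body is left as a stub:
#include <cstddef>
#include <cstdint>
class vector_downward {
public:
  // Hot path: tiny, branch hinted, easy to inline into callers.
  std::uint8_t* make_space(std::size_t len) {
    if (len > static_cast<std::size_t>(cur_ - buf_)) [[unlikely]]
      reallocate(len);
    cur_ -= len;
    return cur_;
  }
private:
  // Cold path: kept out of line so it does not pollute the callers'
  // instruction cache and branch predictors (GCC/Clang-specific attributes).
  __attribute__((noinline, cold)) void reallocate(std::size_t /*len*/) {
    // grow the buffer as shown on the previous slides
  }
  std::uint8_t* buf_ = nullptr;
  std::uint8_t* cur_ = nullptr;
};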
std::optional<Error> validate(const request& r)
{
switch(r.type) {
case request::type_1:
if(/* simple check */)
return std::nullopt;
return /* complex error msg generation */;
case request::type_2:
if(/* simple check */)
return std::nullopt;
return /* complex error msg generation */;
case request::type_3:
if(/* simple check */)
return std::nullopt;
return /* complex error msg generation */;
// ...
}
throw std::logic_error("");
}
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Typical validation and error handling
74
std::optional<Error> validate(const request& r)
{
if(is_valid(r))
return std::nullopt;
return make_error(r);
}
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Isolating cold path
75
bool is_valid(const request& r)
{
switch(r.type) {
case request::type_1:
return /* simple check */;
case request::type_2:
return /* simple check */;
case request::type_3:
return /* simple check */;
// ...
}
throw std::logic_error("");
}
std::optional<Error> validate(const request& r)
{
if(is_valid(r))
return std::nullopt;
return make_error(r);
}
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Isolating cold path
75
bool is_valid(const request& r)
{
switch(r.type) {
case request::type_1:
return /* simple check */;
case request::type_2:
return /* simple check */;
case request::type_3:
return /* simple check */;
// ...
}
throw std::logic_error("");
}
Error make_error(const request& r)
{
switch(r.type) {
case request::type_1:
return /* complex error msg generation */;
case request::type_2:
return /* complex error msg generation */;
case request::type_3:
return /* complex error msg generation */;
// ...
}
throw std::logic_error("");
}
std::optional<Error> validate(const request& r)
{
if(is_valid(r))
return std::nullopt;
return make_error(r);
}
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Isolating cold path
75
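A runnable instantiation of the same split, using hypothetical request and Error types invented only to make the pattern compile; the real checks and messages would come from the application:
#include <optional>
#include <stdexcept>
#include <string>
// Hypothetical types for illustration only.
struct request { enum kind { type_1, type_2 } type; int size; };
struct Error { std::string message; };
// Hot path: cheap checks only, easily inlined.
inline bool is_valid(const request& r) {
  switch (r.type) {
    case request::type_1: return r.size > 0;
    case request::type_2: return r.size < 1024;
  }
  throw std::logic_error("unknown request type");
}
// Cold path: expensive message formatting kept out of the hot path.
Error make_error(const request& r) {
  return Error{"invalid request of type " + std::to_string(static_cast<int>(r.type)) +
               " with size " + std::to_string(r.size)};
}
std::optional<Error> validate(const request& r) {
  if (is_valid(r)) return std::nullopt;
  return make_error(r);
}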
if(expensiveCheck() && fastCheck()) { /* ... */ }
if(fastCheck() && expensiveCheck()) { /* ... */ }
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Expression short-circuiting
WRONG
GOOD
76
if(expensiveCheck() && fastCheck()) { /* ... */ }
if(fastCheck() && expensiveCheck()) { /* ... */ }
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Expression short-circuiting
WRONG
GOOD
Bail out as early as possible and continue fast path
76
int foo(int i)
{
int k = 0;
for(int j = i; j < i + 10; ++j)
++k;
return k;
}
int foo(unsigned i)
{
int k = 0;
for(unsigned j = i; j < i + 10; ++j)
++k;
return k;
}
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Integer arithmetic
77
int foo(int i)
{
int k = 0;
for(int j = i; j < i + 10; ++j)
++k;
return k;
}
int foo(unsigned i)
{
int k = 0;
for(unsigned j = i; j < i + 10; ++j)
++k;
return k;
}
foo(int):
mov eax, 10
ret
foo(unsigned int):
cmp edi, -10
sbb eax, eax
and eax, 10
ret
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Integer arithmetic
77
int foo(int i)
{
int k = 0;
for(int j = i; j < i + 10; ++j)
++k;
return k;
}
int foo(unsigned i)
{
int k = 0;
for(unsigned j = i; j < i + 10; ++j)
++k;
return k;
}
foo(int):
mov eax, 10
ret
foo(unsigned int):
cmp edi, -10
sbb eax, eax
and eax, 10
ret
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Integer arithmetic
Integer arithmetic differs for signed and unsigned integral types
77
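A short sketch of why the two versions differ (my explanation, not a slide): signed overflow is undefined behaviour, so the compiler may assume i + 10 never overflows and fold the signed loop to a constant, whereas unsigned arithmetic wraps and the loop may legitimately run fewer than 10 times:
#include <limits>
// Unsigned arithmetic wraps modulo 2^N, which is why foo(unsigned) cannot
// be folded to "return 10": for i near the maximum value, i + 10 wraps to a
// small number and the loop body never executes.
static_assert(std::numeric_limits<unsigned>::max() + 10u == 9u);
int foo(unsigned i) {
  int k = 0;
  for (unsigned j = i; j < i + 10; ++j) ++k;
  return k;
}
// foo(std::numeric_limits<unsigned>::max() - 4u) == 0: i + 10 wraps to 5,
// and j starts far above 5, so the condition is false on the first check.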
• Avoid using std::list, std::map, etc
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
78
• Avoid using std::list, std::map, etc
• std::unordered_map is also broken
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
78
• Avoid using std::list, std::map, etc
• std::unordered_map is also broken
• Prefer std::vector
– also as the underlying storage for hash tables or flat maps
– consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
78
• Avoid using std::list, std::map, etc
• std::unordered_map is also broken
• Prefer std::vector
– also as the underlying storage for hash tables or flat maps
– consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar
• Preallocate storage
– free list
– plf::colony
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
78
• Avoid using std::list, std::map, etc
• std::unordered_map is also broken
• Prefer std::vector
– also as the underlying storage for hash tables or flat maps
– consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar
• Preallocate storage
– free list
– plf::colony
• Consider storing only pointers to objects in the container
• Consider storing expensive hash values with the key
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
78
• Avoid using std::list, std::map, etc
• std::unordered_map is also broken
• Prefer std::vector
– also as the underlying storage for hash tables or flat maps
– consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar
• Preallocate storage (a simple free-list sketch follows this slide)
– free list
– plf::colony
• Consider storing only pointers to objects in the container
• Consider storing expensive hash values with the key
• Limit the number of type conversions
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
78
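A minimal fixed-capacity object pool with an intrusive free list, sketching the "preallocate storage / free list" bullet above: all storage is allocated up front, so acquire/release never touch the system allocator on the hot path. This is a hypothetical illustration, not a drop-in replacement for plf::colony:
#include <array>
#include <cstddef>
#include <new>
#include <utility>
template<typename T, std::size_t N>
class object_pool {
  union slot {
    slot* next;                               // used while the slot is free
    alignas(T) std::byte storage[sizeof(T)];  // used while the slot is live
  };
  std::array<slot, N> slots_;
  slot* free_head_ = nullptr;
public:
  object_pool() {
    for (std::size_t i = 0; i < N; ++i) {     // thread every slot onto the free list
      slots_[i].next = free_head_;
      free_head_ = &slots_[i];
    }
  }
  template<typename... Args>
  T* acquire(Args&&... args) {
    if (!free_head_) return nullptr;          // pool exhausted
    slot* s = free_head_;
    free_head_ = s->next;
    return new (s->storage) T(std::forward<Args>(args)...);
  }
  void release(T* p) {
    p->~T();
    slot* s = reinterpret_cast<slot*>(p);     // slot and its storage share an address
    s->next = free_head_;
    free_head_ = s;
  }
};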
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close to (less than or equal to) the number of available physical CPU cores
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close to (less than or equal to) the number of available physical CPU cores
• Separate IO threads from business logic threads (unless the business logic is extremely lightweight)
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close to (less than or equal to) the number of available physical CPU cores
• Separate IO threads from business logic threads (unless the business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close to (less than or equal to) the number of available physical CPU cores
• Separate IO threads from business logic threads (unless the business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close to (less than or equal to) the number of available physical CPU cores
• Separate IO threads from business logic threads (unless the business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close to (less than or equal to) the number of available physical CPU cores
• Separate IO threads from business logic threads (unless the business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close to (less than or equal to) the number of available physical CPU cores
• Separate IO threads from business logic threads (unless the business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
• Do not try to be smarter than the compiler
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close to (less than or equal to) the number of available physical CPU cores
• Separate IO threads from business logic threads (unless the business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
• Do not try to be smarter than the compiler
• Know the language, tools, and libraries
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close to (less than or equal to) the number of available physical CPU cores
• Separate IO threads from business logic threads (unless the business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
• Do not try to be smarter than the compiler
• Know the language, tools, and libraries
• Know your hardware!
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close to (less than or equal to) the number of available physical CPU cores
• Separate IO threads from business logic threads (unless the business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
• Do not try to be smarter than the compiler
• Know the language, tools, and libraries
• Know your hardware!
• Bypass the kernel (100% user space code)
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
• Keep the number of threads close to (less than or equal to) the number of available physical CPU cores
• Separate IO threads from business logic threads (unless the business logic is extremely lightweight)
• Use fixed size lock free queues / busy spins to pass the data between threads (see the single-producer/single-consumer sketch after this slide)
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
• Do not try to be smarter than the compiler
• Know the language, tools, and libraries
• Know your hardware!
• Bypass the kernel (100% user space code)
• Measure performance… ALWAYS
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to develop system with low-latency constraints
79
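A minimal single-producer/single-consumer ring buffer, sketching the "fixed size lock free queue / busy spin" bullet above. It is an illustrative assumption, not production code: no cache-line padding, T must be default-constructible and copyable, and capacity is fixed at compile time:
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>
template<typename T, std::size_t Capacity>
class spsc_queue {
  std::array<T, Capacity + 1> buffer_;   // one slot is always kept empty
  std::atomic<std::size_t> head_{0};     // written only by the consumer
  std::atomic<std::size_t> tail_{0};     // written only by the producer
  static std::size_t next(std::size_t i) { return (i + 1) % (Capacity + 1); }
public:
  bool try_push(const T& value) {
    const auto tail = tail_.load(std::memory_order_relaxed);
    const auto n = next(tail);
    if (n == head_.load(std::memory_order_acquire)) return false;  // full
    buffer_[tail] = value;
    tail_.store(n, std::memory_order_release);
    return true;
  }
  std::optional<T> try_pop() {
    const auto head = head_.load(std::memory_order_relaxed);
    if (head == tail_.load(std::memory_order_acquire)) return std::nullopt;  // empty
    T value = buffer_[head];
    head_.store(next(head), std::memory_order_release);
    return value;
  }
};
Busy-spin usage on the producer side would simply retry: while (!queue.try_push(msg)) { /* spin */ }.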
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
The most important recommendation
80
Always measure your performance!
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
The most important recommendation
80
• Always measure Release version
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to measure the performance of your programs
81
• Always measure Release version
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
• Prefer hardware-based black box performance measurements
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to measure the performance of your programs
81
• Always measure Release version
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
• Prefer hardware-based black box performance measurements
• If that is not possible, or you want to debug a specific performance issue, use a profiler
• To gather meaningful stack traces, preserve the frame pointer
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" ..
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to measure the performance of your programs
81
• Always measure Release version
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
• Prefer hardware-based black box performance measurements
• If that is not possible, or you want to debug a specific performance issue, use a profiler
• To gather meaningful stack traces, preserve the frame pointer
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" ..
• Familiarize yourself with Linux perf tools (xperf on Windows) and flame graphs
• Use tools like Intel VTune
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to measure the performance of your programs
81
• Always measure Release version
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
• Prefer hardware-based black box performance measurements
• If that is not possible, or you want to debug a specific performance issue, use a profiler
• To gather meaningful stack traces, preserve the frame pointer
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" ..
• Familiarize yourself with Linux perf tools (xperf on Windows) and flame graphs
• Use tools like Intel VTune
• Verify output assembly code
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
How to measure the performance of your programs
81
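A minimal timing harness, offered as an illustrative sketch rather than the measurement setup used in the talk, for quick checks when a hardware-based setup or profiler is not at hand; process_message and the iteration count are placeholders:
#include <chrono>
#include <cstdint>
#include <cstdio>
// Placeholder for the code path under test.
std::uint64_t process_message(std::uint64_t seed) {
  return seed * 6364136223846793005ull + 1442695040888963407ull;
}
int main() {
  constexpr int iterations = 1'000'000;
  std::uint64_t sink = 0;  // accumulated and printed so the work is not optimized away
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) sink += process_message(static_cast<std::uint64_t>(i));
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  std::printf("%.1f ns per call (sink=%llu)\n", elapsed.count() / iterations,
              static_cast<unsigned long long>(sink));
}
For low-latency work, prefer recording per-call samples and examining tail latencies (p99, p99.9) rather than only the mean reported by a loop like this.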
[Flame graph figure: interactive SVG of sampled call stacks (a bash workload in this example); the individual frame labels are not reproducible in text]
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
Flamegraph
82
Low Latency Systems C
Low Latency Systems C

More Related Content

What's hot

What are latest new features that DPDK brings into 2018?
What are latest new features that DPDK brings into 2018?What are latest new features that DPDK brings into 2018?
What are latest new features that DPDK brings into 2018?Michelle Holley
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Intel® Software
 
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APU
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APUHot Chips: AMD Next Gen 7nm Ryzen 4000 APU
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APUAMD
 
CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編
CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編
CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編Fixstars Corporation
 
素数判定法
素数判定法素数判定法
素数判定法DEGwer
 
いまさら聞けないarmを使ったNEONの基礎と活用事例
いまさら聞けないarmを使ったNEONの基礎と活用事例いまさら聞けないarmを使ったNEONの基礎と活用事例
いまさら聞けないarmを使ったNEONの基礎と活用事例Fixstars Corporation
 
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール智啓 出川
 
/etc/network/interfaces について
/etc/network/interfaces について/etc/network/interfaces について
/etc/network/interfaces についてKazuhiro Nishiyama
 
Prometheus Monitoring Mixins
Prometheus Monitoring MixinsPrometheus Monitoring Mixins
Prometheus Monitoring MixinsGrafana Labs
 
ceph optimization on ssd ilsoo byun-short
ceph optimization on ssd ilsoo byun-shortceph optimization on ssd ilsoo byun-short
ceph optimization on ssd ilsoo byun-shortNAVER D2
 
Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)
Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)
Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)Shinya Takamaeda-Y
 
AVX-512(フォーマット)詳解
AVX-512(フォーマット)詳解AVX-512(フォーマット)詳解
AVX-512(フォーマット)詳解MITSUNARI Shigeo
 
C++のビルド高速化について
C++のビルド高速化についてC++のビルド高速化について
C++のビルド高速化についてAimingStudy
 
3GPP LTE introduction 1(Architecture & Identification)
3GPP LTE introduction  1(Architecture & Identification)3GPP LTE introduction  1(Architecture & Identification)
3GPP LTE introduction 1(Architecture & Identification)Ryuichi Yasunaga
 
FPGAスタートアップ資料
FPGAスタートアップ資料FPGAスタートアップ資料
FPGAスタートアップ資料marsee101
 
Vivado hls勉強会3(axi4 lite slave)
Vivado hls勉強会3(axi4 lite slave)Vivado hls勉強会3(axi4 lite slave)
Vivado hls勉強会3(axi4 lite slave)marsee101
 
Zynq mp勉強会資料
Zynq mp勉強会資料Zynq mp勉強会資料
Zynq mp勉強会資料一路 川染
 

What's hot (20)

What are latest new features that DPDK brings into 2018?
What are latest new features that DPDK brings into 2018?What are latest new features that DPDK brings into 2018?
What are latest new features that DPDK brings into 2018?
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
 
E cpri protocol stack
E cpri protocol stackE cpri protocol stack
E cpri protocol stack
 
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APU
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APUHot Chips: AMD Next Gen 7nm Ryzen 4000 APU
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APU
 
CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編
CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編
CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編
 
Asterisk: the future is at REST
Asterisk: the future is at RESTAsterisk: the future is at REST
Asterisk: the future is at REST
 
素数判定法
素数判定法素数判定法
素数判定法
 
astricon2018
astricon2018astricon2018
astricon2018
 
いまさら聞けないarmを使ったNEONの基礎と活用事例
いまさら聞けないarmを使ったNEONの基礎と活用事例いまさら聞けないarmを使ったNEONの基礎と活用事例
いまさら聞けないarmを使ったNEONの基礎と活用事例
 
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
 
/etc/network/interfaces について
/etc/network/interfaces について/etc/network/interfaces について
/etc/network/interfaces について
 
Prometheus Monitoring Mixins
Prometheus Monitoring MixinsPrometheus Monitoring Mixins
Prometheus Monitoring Mixins
 
ceph optimization on ssd ilsoo byun-short
ceph optimization on ssd ilsoo byun-shortceph optimization on ssd ilsoo byun-short
ceph optimization on ssd ilsoo byun-short
 
Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)
Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)
Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)
 
AVX-512(フォーマット)詳解
AVX-512(フォーマット)詳解AVX-512(フォーマット)詳解
AVX-512(フォーマット)詳解
 
C++のビルド高速化について
C++のビルド高速化についてC++のビルド高速化について
C++のビルド高速化について
 
3GPP LTE introduction 1(Architecture & Identification)
3GPP LTE introduction  1(Architecture & Identification)3GPP LTE introduction  1(Architecture & Identification)
3GPP LTE introduction 1(Architecture & Identification)
 
FPGAスタートアップ資料
FPGAスタートアップ資料FPGAスタートアップ資料
FPGAスタートアップ資料
 
Vivado hls勉強会3(axi4 lite slave)
Vivado hls勉強会3(axi4 lite slave)Vivado hls勉強会3(axi4 lite slave)
Vivado hls勉強会3(axi4 lite slave)
 
Zynq mp勉強会資料
Zynq mp勉強会資料Zynq mp勉強会資料
Zynq mp勉強会資料
 

Similar to Low Latency Systems C

Striving for ultimate Low Latency
Striving for ultimate Low LatencyStriving for ultimate Low Latency
Striving for ultimate Low LatencyMateusz Pusz
 
Leverage event streaming framework to build intelligent applications
Leverage event streaming framework to build intelligent applicationsLeverage event streaming framework to build intelligent applications
Leverage event streaming framework to build intelligent applicationsLuca Mattia Ferrari
 
[Data Meetup] Data Science in Finance - Building a Quant ML pipeline
[Data Meetup] Data Science in Finance -  Building a Quant ML pipeline[Data Meetup] Data Science in Finance -  Building a Quant ML pipeline
[Data Meetup] Data Science in Finance - Building a Quant ML pipelineData Science Society
 
Unlock Cost Savings for your IBM z Systems Environment
Unlock Cost Savings for your IBM z Systems EnvironmentUnlock Cost Savings for your IBM z Systems Environment
Unlock Cost Savings for your IBM z Systems EnvironmentPrecisely
 
New Mainframe Developments and How Syncsort Helps You Exploit Them
New Mainframe Developments and How Syncsort Helps You Exploit ThemNew Mainframe Developments and How Syncsort Helps You Exploit Them
New Mainframe Developments and How Syncsort Helps You Exploit ThemPrecisely
 
Build Blockchain dApps using JavaScript, Python and C - ATO.pdf
Build Blockchain dApps using JavaScript, Python and C - ATO.pdfBuild Blockchain dApps using JavaScript, Python and C - ATO.pdf
Build Blockchain dApps using JavaScript, Python and C - ATO.pdfRussFustino
 
Deliver Unrivaled End-User Experience With Confidence - How Synthetic Monitor...
Deliver Unrivaled End-User Experience With Confidence - How Synthetic Monitor...Deliver Unrivaled End-User Experience With Confidence - How Synthetic Monitor...
Deliver Unrivaled End-User Experience With Confidence - How Synthetic Monitor...DevOps.com
 
Presentation CISCO & HP
Presentation CISCO & HPPresentation CISCO & HP
Presentation CISCO & HPAbhijat Dhawal
 
Mainframe Optimization in 2017
Mainframe Optimization in 2017Mainframe Optimization in 2017
Mainframe Optimization in 2017Precisely
 
Time Series Databases and Pandas DataFrames
Time Series Databases and Pandas DataFrames Time Series Databases and Pandas DataFrames
Time Series Databases and Pandas DataFrames Anais Jackie Dotis
 
What’s Going on in Your 4HRA?
What’s Going on in Your 4HRA?What’s Going on in Your 4HRA?
What’s Going on in Your 4HRA?Precisely
 
Mainframe Optimization in 2017
Mainframe Optimization in 2017Mainframe Optimization in 2017
Mainframe Optimization in 2017Precisely
 
Building Scalable Applications using Pivotal Gemfire/Apache Geode
Building Scalable Applications using Pivotal Gemfire/Apache GeodeBuilding Scalable Applications using Pivotal Gemfire/Apache Geode
Building Scalable Applications using Pivotal Gemfire/Apache Geodeimcpune
 
CisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area NetworksCisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area NetworksAreaNetworking.it
 
Dsl programmable engine for high frequency trading by Heiner Litz et al. (Pre...
Dsl programmable engine for high frequency trading by Heiner Litz et al. (Pre...Dsl programmable engine for high frequency trading by Heiner Litz et al. (Pre...
Dsl programmable engine for high frequency trading by Heiner Litz et al. (Pre...Pasan De Silva
 
Kubernetes is hard! Lessons learned taking our apps to Kubernetes - Eldad Ass...
Kubernetes is hard! Lessons learned taking our apps to Kubernetes - Eldad Ass...Kubernetes is hard! Lessons learned taking our apps to Kubernetes - Eldad Ass...
Kubernetes is hard! Lessons learned taking our apps to Kubernetes - Eldad Ass...Cloud Native Day Tel Aviv
 
Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...
Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...
Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...QuantInsti
 
Garance 100% dostupnosti dat! Kdo z vás to má?
Garance 100% dostupnosti dat! Kdo z vás to má?Garance 100% dostupnosti dat! Kdo z vás to má?
Garance 100% dostupnosti dat! Kdo z vás to má?MarketingArrowECS_CZ
 
Nagios Conference 2014 - Paloma Galan - Monitoring Financial Protocols With N...
Nagios Conference 2014 - Paloma Galan - Monitoring Financial Protocols With N...Nagios Conference 2014 - Paloma Galan - Monitoring Financial Protocols With N...
Nagios Conference 2014 - Paloma Galan - Monitoring Financial Protocols With N...Nagios
 

Similar to Low Latency Systems C (20)

Striving for ultimate Low Latency
Striving for ultimate Low LatencyStriving for ultimate Low Latency
Striving for ultimate Low Latency
 
Leverage event streaming framework to build intelligent applications
Leverage event streaming framework to build intelligent applicationsLeverage event streaming framework to build intelligent applications
Leverage event streaming framework to build intelligent applications
 
[Data Meetup] Data Science in Finance - Building a Quant ML pipeline
[Data Meetup] Data Science in Finance -  Building a Quant ML pipeline[Data Meetup] Data Science in Finance -  Building a Quant ML pipeline
[Data Meetup] Data Science in Finance - Building a Quant ML pipeline
 
Unlock Cost Savings for your IBM z Systems Environment
Unlock Cost Savings for your IBM z Systems EnvironmentUnlock Cost Savings for your IBM z Systems Environment
Unlock Cost Savings for your IBM z Systems Environment
 
New Mainframe Developments and How Syncsort Helps You Exploit Them
New Mainframe Developments and How Syncsort Helps You Exploit ThemNew Mainframe Developments and How Syncsort Helps You Exploit Them
New Mainframe Developments and How Syncsort Helps You Exploit Them
 
Build Blockchain dApps using JavaScript, Python and C - ATO.pdf
Build Blockchain dApps using JavaScript, Python and C - ATO.pdfBuild Blockchain dApps using JavaScript, Python and C - ATO.pdf
Build Blockchain dApps using JavaScript, Python and C - ATO.pdf
 
Deliver Unrivaled End-User Experience With Confidence - How Synthetic Monitor...
Deliver Unrivaled End-User Experience With Confidence - How Synthetic Monitor...Deliver Unrivaled End-User Experience With Confidence - How Synthetic Monitor...
Deliver Unrivaled End-User Experience With Confidence - How Synthetic Monitor...
 
Presentation CISCO & HP
Presentation CISCO & HPPresentation CISCO & HP
Presentation CISCO & HP
 
Mainframe Optimization in 2017
Mainframe Optimization in 2017Mainframe Optimization in 2017
Mainframe Optimization in 2017
 
InfiniBox z pohledu zákazníka
InfiniBox z pohledu zákazníkaInfiniBox z pohledu zákazníka
InfiniBox z pohledu zákazníka
 
Time Series Databases and Pandas DataFrames
Time Series Databases and Pandas DataFrames Time Series Databases and Pandas DataFrames
Time Series Databases and Pandas DataFrames
 
What’s Going on in Your 4HRA?
What’s Going on in Your 4HRA?What’s Going on in Your 4HRA?
What’s Going on in Your 4HRA?
 
Mainframe Optimization in 2017
Mainframe Optimization in 2017Mainframe Optimization in 2017
Mainframe Optimization in 2017
 
Building Scalable Applications using Pivotal Gemfire/Apache Geode
Building Scalable Applications using Pivotal Gemfire/Apache GeodeBuilding Scalable Applications using Pivotal Gemfire/Apache Geode
Building Scalable Applications using Pivotal Gemfire/Apache Geode
 
CisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area NetworksCisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area Networks
 
Dsl programmable engine for high frequency trading by Heiner Litz et al. (Pre...
Dsl programmable engine for high frequency trading by Heiner Litz et al. (Pre...Dsl programmable engine for high frequency trading by Heiner Litz et al. (Pre...
Dsl programmable engine for high frequency trading by Heiner Litz et al. (Pre...
 
Kubernetes is hard! Lessons learned taking our apps to Kubernetes - Eldad Ass...
Kubernetes is hard! Lessons learned taking our apps to Kubernetes - Eldad Ass...Kubernetes is hard! Lessons learned taking our apps to Kubernetes - Eldad Ass...
Kubernetes is hard! Lessons learned taking our apps to Kubernetes - Eldad Ass...
 
Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...
Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...
Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch...
 
Garance 100% dostupnosti dat! Kdo z vás to má?
Garance 100% dostupnosti dat! Kdo z vás to má?Garance 100% dostupnosti dat! Kdo z vás to má?
Garance 100% dostupnosti dat! Kdo z vás to má?
 
Nagios Conference 2014 - Paloma Galan - Monitoring Financial Protocols With N...
Nagios Conference 2014 - Paloma Galan - Monitoring Financial Protocols With N...Nagios Conference 2014 - Paloma Galan - Monitoring Financial Protocols With N...
Nagios Conference 2014 - Paloma Galan - Monitoring Financial Protocols With N...
 

More from corehard_by

C++ CoreHard Autumn 2018. Создание пакетов для открытых библиотек через conan...
C++ CoreHard Autumn 2018. Создание пакетов для открытых библиотек через conan...C++ CoreHard Autumn 2018. Создание пакетов для открытых библиотек через conan...
C++ CoreHard Autumn 2018. Создание пакетов для открытых библиотек через conan...corehard_by
 
C++ CoreHard Autumn 2018. Что должен знать каждый C++ программист или Как про...
C++ CoreHard Autumn 2018. Что должен знать каждый C++ программист или Как про...C++ CoreHard Autumn 2018. Что должен знать каждый C++ программист или Как про...
C++ CoreHard Autumn 2018. Что должен знать каждый C++ программист или Как про...corehard_by
 
C++ CoreHard Autumn 2018. Actors vs CSP vs Tasks vs ... - Евгений Охотников
C++ CoreHard Autumn 2018. Actors vs CSP vs Tasks vs ... - Евгений ОхотниковC++ CoreHard Autumn 2018. Actors vs CSP vs Tasks vs ... - Евгений Охотников
C++ CoreHard Autumn 2018. Actors vs CSP vs Tasks vs ... - Евгений Охотниковcorehard_by
 
C++ CoreHard Autumn 2018. Знай свое "железо": иерархия памяти - Александр Титов
C++ CoreHard Autumn 2018. Знай свое "железо": иерархия памяти - Александр ТитовC++ CoreHard Autumn 2018. Знай свое "железо": иерархия памяти - Александр Титов
C++ CoreHard Autumn 2018. Знай свое "железо": иерархия памяти - Александр Титовcorehard_by
 
C++ CoreHard Autumn 2018. Информационная безопасность и разработка ПО - Евген...
C++ CoreHard Autumn 2018. Информационная безопасность и разработка ПО - Евген...C++ CoreHard Autumn 2018. Информационная безопасность и разработка ПО - Евген...
C++ CoreHard Autumn 2018. Информационная безопасность и разработка ПО - Евген...corehard_by
 
C++ CoreHard Autumn 2018. Заглядываем под капот «Поясов по C++» - Илья Шишков
C++ CoreHard Autumn 2018. Заглядываем под капот «Поясов по C++» - Илья ШишковC++ CoreHard Autumn 2018. Заглядываем под капот «Поясов по C++» - Илья Шишков
C++ CoreHard Autumn 2018. Заглядываем под капот «Поясов по C++» - Илья Шишковcorehard_by
 
C++ CoreHard Autumn 2018. Ускорение сборки C++ проектов, способы и последстви...
C++ CoreHard Autumn 2018. Ускорение сборки C++ проектов, способы и последстви...C++ CoreHard Autumn 2018. Ускорение сборки C++ проектов, способы и последстви...
C++ CoreHard Autumn 2018. Ускорение сборки C++ проектов, способы и последстви...corehard_by
 
C++ CoreHard Autumn 2018. Метаклассы: воплощаем мечты в реальность - Сергей С...
C++ CoreHard Autumn 2018. Метаклассы: воплощаем мечты в реальность - Сергей С...C++ CoreHard Autumn 2018. Метаклассы: воплощаем мечты в реальность - Сергей С...
C++ CoreHard Autumn 2018. Метаклассы: воплощаем мечты в реальность - Сергей С...corehard_by
 
C++ CoreHard Autumn 2018. Что не умеет оптимизировать компилятор - Александр ...
C++ CoreHard Autumn 2018. Что не умеет оптимизировать компилятор - Александр ...C++ CoreHard Autumn 2018. Что не умеет оптимизировать компилятор - Александр ...
C++ CoreHard Autumn 2018. Что не умеет оптимизировать компилятор - Александр ...corehard_by
 
C++ CoreHard Autumn 2018. Кодогенерация C++ кроссплатформенно. Продолжение - ...
C++ CoreHard Autumn 2018. Кодогенерация C++ кроссплатформенно. Продолжение - ...C++ CoreHard Autumn 2018. Кодогенерация C++ кроссплатформенно. Продолжение - ...
C++ CoreHard Autumn 2018. Кодогенерация C++ кроссплатформенно. Продолжение - ...corehard_by
 
C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...
C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...
C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...corehard_by
 
C++ CoreHard Autumn 2018. Обработка списков на C++ в функциональном стиле - В...
C++ CoreHard Autumn 2018. Обработка списков на C++ в функциональном стиле - В...C++ CoreHard Autumn 2018. Обработка списков на C++ в функциональном стиле - В...
C++ CoreHard Autumn 2018. Обработка списков на C++ в функциональном стиле - В...corehard_by
 
C++ Corehard Autumn 2018. Обучаем на Python, применяем на C++ - Павел Филонов
C++ Corehard Autumn 2018. Обучаем на Python, применяем на C++ - Павел ФилоновC++ Corehard Autumn 2018. Обучаем на Python, применяем на C++ - Павел Филонов
C++ Corehard Autumn 2018. Обучаем на Python, применяем на C++ - Павел Филоновcorehard_by
 
C++ CoreHard Autumn 2018. Asynchronous programming with ranges - Ivan Čukić
C++ CoreHard Autumn 2018. Asynchronous programming with ranges - Ivan ČukićC++ CoreHard Autumn 2018. Asynchronous programming with ranges - Ivan Čukić
C++ CoreHard Autumn 2018. Asynchronous programming with ranges - Ivan Čukićcorehard_by
 
C++ CoreHard Autumn 2018. Debug C++ Without Running - Anastasia Kazakova
C++ CoreHard Autumn 2018. Debug C++ Without Running - Anastasia KazakovaC++ CoreHard Autumn 2018. Debug C++ Without Running - Anastasia Kazakova
C++ CoreHard Autumn 2018. Debug C++ Without Running - Anastasia Kazakovacorehard_by
 
C++ CoreHard Autumn 2018. Полезный constexpr - Антон Полухин
C++ CoreHard Autumn 2018. Полезный constexpr - Антон ПолухинC++ CoreHard Autumn 2018. Полезный constexpr - Антон Полухин
C++ CoreHard Autumn 2018. Полезный constexpr - Антон Полухинcorehard_by
 
C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...
C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...
C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...corehard_by
 
Исключительная модель памяти. Алексей Ткаченко ➠ CoreHard Autumn 2019
Исключительная модель памяти. Алексей Ткаченко ➠ CoreHard Autumn 2019Исключительная модель памяти. Алексей Ткаченко ➠ CoreHard Autumn 2019
Исключительная модель памяти. Алексей Ткаченко ➠ CoreHard Autumn 2019corehard_by
 
Как помочь и как помешать компилятору. Андрей Олейников ➠ CoreHard Autumn 2019
Как помочь и как помешать компилятору. Андрей Олейников ➠  CoreHard Autumn 2019Как помочь и как помешать компилятору. Андрей Олейников ➠  CoreHard Autumn 2019
Как помочь и как помешать компилятору. Андрей Олейников ➠ CoreHard Autumn 2019corehard_by
 
Автоматизируй это. Кирилл Тихонов ➠ CoreHard Autumn 2019
Автоматизируй это. Кирилл Тихонов ➠  CoreHard Autumn 2019Автоматизируй это. Кирилл Тихонов ➠  CoreHard Autumn 2019
Автоматизируй это. Кирилл Тихонов ➠ CoreHard Autumn 2019corehard_by
 

More from corehard_by (20)

C++ CoreHard Autumn 2018. Создание пакетов для открытых библиотек через conan...
C++ CoreHard Autumn 2018. Создание пакетов для открытых библиотек через conan...C++ CoreHard Autumn 2018. Создание пакетов для открытых библиотек через conan...
C++ CoreHard Autumn 2018. Создание пакетов для открытых библиотек через conan...
 
C++ CoreHard Autumn 2018. Что должен знать каждый C++ программист или Как про...
C++ CoreHard Autumn 2018. Что должен знать каждый C++ программист или Как про...C++ CoreHard Autumn 2018. Что должен знать каждый C++ программист или Как про...
C++ CoreHard Autumn 2018. Что должен знать каждый C++ программист или Как про...
 
C++ CoreHard Autumn 2018. Actors vs CSP vs Tasks vs ... - Евгений Охотников
C++ CoreHard Autumn 2018. Actors vs CSP vs Tasks vs ... - Евгений ОхотниковC++ CoreHard Autumn 2018. Actors vs CSP vs Tasks vs ... - Евгений Охотников
C++ CoreHard Autumn 2018. Actors vs CSP vs Tasks vs ... - Евгений Охотников
 
C++ CoreHard Autumn 2018. Знай свое "железо": иерархия памяти - Александр Титов
C++ CoreHard Autumn 2018. Знай свое "железо": иерархия памяти - Александр ТитовC++ CoreHard Autumn 2018. Знай свое "железо": иерархия памяти - Александр Титов
C++ CoreHard Autumn 2018. Знай свое "железо": иерархия памяти - Александр Титов
 
C++ CoreHard Autumn 2018. Информационная безопасность и разработка ПО - Евген...
C++ CoreHard Autumn 2018. Информационная безопасность и разработка ПО - Евген...C++ CoreHard Autumn 2018. Информационная безопасность и разработка ПО - Евген...
C++ CoreHard Autumn 2018. Информационная безопасность и разработка ПО - Евген...
 
C++ CoreHard Autumn 2018. Заглядываем под капот «Поясов по C++» - Илья Шишков
C++ CoreHard Autumn 2018. Заглядываем под капот «Поясов по C++» - Илья ШишковC++ CoreHard Autumn 2018. Заглядываем под капот «Поясов по C++» - Илья Шишков
C++ CoreHard Autumn 2018. Заглядываем под капот «Поясов по C++» - Илья Шишков
 
C++ CoreHard Autumn 2018. Ускорение сборки C++ проектов, способы и последстви...
C++ CoreHard Autumn 2018. Ускорение сборки C++ проектов, способы и последстви...C++ CoreHard Autumn 2018. Ускорение сборки C++ проектов, способы и последстви...
C++ CoreHard Autumn 2018. Ускорение сборки C++ проектов, способы и последстви...
 
C++ CoreHard Autumn 2018. Метаклассы: воплощаем мечты в реальность - Сергей С...
C++ CoreHard Autumn 2018. Метаклассы: воплощаем мечты в реальность - Сергей С...C++ CoreHard Autumn 2018. Метаклассы: воплощаем мечты в реальность - Сергей С...
C++ CoreHard Autumn 2018. Метаклассы: воплощаем мечты в реальность - Сергей С...
 
C++ CoreHard Autumn 2018. Что не умеет оптимизировать компилятор - Александр ...
C++ CoreHard Autumn 2018. Что не умеет оптимизировать компилятор - Александр ...C++ CoreHard Autumn 2018. Что не умеет оптимизировать компилятор - Александр ...
C++ CoreHard Autumn 2018. Что не умеет оптимизировать компилятор - Александр ...
 
C++ CoreHard Autumn 2018. Кодогенерация C++ кроссплатформенно. Продолжение - ...
C++ CoreHard Autumn 2018. Кодогенерация C++ кроссплатформенно. Продолжение - ...C++ CoreHard Autumn 2018. Кодогенерация C++ кроссплатформенно. Продолжение - ...
C++ CoreHard Autumn 2018. Кодогенерация C++ кроссплатформенно. Продолжение - ...
 
C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...
C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...
C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...
 
C++ CoreHard Autumn 2018. Обработка списков на C++ в функциональном стиле - В...
C++ CoreHard Autumn 2018. Обработка списков на C++ в функциональном стиле - В...C++ CoreHard Autumn 2018. Обработка списков на C++ в функциональном стиле - В...
C++ CoreHard Autumn 2018. Обработка списков на C++ в функциональном стиле - В...
 
C++ Corehard Autumn 2018. Обучаем на Python, применяем на C++ - Павел Филонов
C++ Corehard Autumn 2018. Обучаем на Python, применяем на C++ - Павел ФилоновC++ Corehard Autumn 2018. Обучаем на Python, применяем на C++ - Павел Филонов
C++ Corehard Autumn 2018. Обучаем на Python, применяем на C++ - Павел Филонов
 
C++ CoreHard Autumn 2018. Asynchronous programming with ranges - Ivan Čukić
C++ CoreHard Autumn 2018. Asynchronous programming with ranges - Ivan ČukićC++ CoreHard Autumn 2018. Asynchronous programming with ranges - Ivan Čukić
C++ CoreHard Autumn 2018. Asynchronous programming with ranges - Ivan Čukić
 
C++ CoreHard Autumn 2018. Debug C++ Without Running - Anastasia Kazakova
C++ CoreHard Autumn 2018. Debug C++ Without Running - Anastasia KazakovaC++ CoreHard Autumn 2018. Debug C++ Without Running - Anastasia Kazakova
C++ CoreHard Autumn 2018. Debug C++ Without Running - Anastasia Kazakova
 
C++ CoreHard Autumn 2018. Полезный constexpr - Антон Полухин
C++ CoreHard Autumn 2018. Полезный constexpr - Антон ПолухинC++ CoreHard Autumn 2018. Полезный constexpr - Антон Полухин
C++ CoreHard Autumn 2018. Полезный constexpr - Антон Полухин
 
C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...
C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...
C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...
 
Исключительная модель памяти. Алексей Ткаченко ➠ CoreHard Autumn 2019
Исключительная модель памяти. Алексей Ткаченко ➠ CoreHard Autumn 2019Исключительная модель памяти. Алексей Ткаченко ➠ CoreHard Autumn 2019
Исключительная модель памяти. Алексей Ткаченко ➠ CoreHard Autumn 2019
 
Как помочь и как помешать компилятору. Андрей Олейников ➠ CoreHard Autumn 2019
Как помочь и как помешать компилятору. Андрей Олейников ➠  CoreHard Autumn 2019Как помочь и как помешать компилятору. Андрей Олейников ➠  CoreHard Autumn 2019
Как помочь и как помешать компилятору. Андрей Олейников ➠ CoreHard Autumn 2019
 
Автоматизируй это. Кирилл Тихонов ➠ CoreHard Autumn 2019
Автоматизируй это. Кирилл Тихонов ➠  CoreHard Autumn 2019Автоматизируй это. Кирилл Тихонов ➠  CoreHard Autumn 2019
Автоматизируй это. Кирилл Тихонов ➠ CoreHard Autumn 2019
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 

Low Latency Systems C

  • 1. Mateusz Pusz November 30, 2019 Striving for ultimate Low Latency INTRODUCTION TO DEVELOPMENT OF LOW LATENCY SYSTEMS
  • 2. Special thanks to Carl Cook and Chris Kohlhoff C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency ISO C++ Committee (WG21) Study Group 14 (SG14) 2
  • 3. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Latency vs Throughput 3
  • 4. Latency is the time required to perform some action or to produce some result. Measured in units of time like hours, minutes, seconds, nanoseconds, or clock periods. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Latency vs Throughput 3
  • 5. Latency is the time required to perform some action or to produce some result. Measured in units of time like hours, minutes, seconds, nanoseconds, or clock periods. Throughput is the number of such actions executed or results produced per unit of time. Measured in units of whatever is being produced per unit of time. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Latency vs Throughput 3
  • 6. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What do we mean by Low Latency? 4
  • 7. Low Latency allows human-unnoticeable delays between an input being processed and the corresponding output providing real time characteristics C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What do we mean by Low Latency? 4
  • 8. Low Latency allows human-unnoticeable delays between an input being processed and the corresponding output providing real time characteristics C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What do we mean by Low Latency? Especially important for internet connections utilizing services such as trading, online gaming, and VoIP 4
  • 9. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Why do we strive for Low latency? 5
  • 10. • In VoIP substantial delays between input from conversation participants may impair their communication C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Why do we strive for Low latency? 5
  • 11. • In VoIP substantial delays between input from conversation participants may impair their communication • In online gaming a player with a high latency internet connection may show slow responses in spite of superior tactics or appropriate reaction time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Why do we strive for Low latency? 5
  • 12. • In VoIP substantial delays between input from conversation participants may impair their communication • In online gaming a player with a high latency internet connection may show slow responses in spite of superior tactics or appropriate reaction time • Within capital markets the proliferation of algorithmic trading requires firms to react to market events faster than the competition to increase profitability of trades C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Why do we strive for Low latency? 5
  • 13. A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds -- Investopedia C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency High-Frequency Trading (HFT) 6
  • 14. A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds -- Investopedia • Using complex algorithms to analyze multiple markets and execute orders based on market conditions C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency High-Frequency Trading (HFT) 6
  • 15. A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds -- Investopedia • Using complex algorithms to analyze multiple markets and execute orders based on market conditions • Buying and selling of securities many times over a period of time (often hundreds of times an hour) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency High-Frequency Trading (HFT) 6
  • 16. A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds -- Investopedia • Using complex algorithms to analyze multiple markets and execute orders based on market conditions • Buying and selling of securities many times over a period of time (often hundreds of times an hour) • Done to profit from time-sensitive opportunities that arise during trading hours C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency High-Frequency Trading (HFT) 6
  • 17. A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds -- Investopedia • Using complex algorithms to analyze multiple markets and execute orders based on market conditions • Buying and selling of securities many times over a period of time (often hundreds of times an hour) • Done to profit from time-sensitive opportunities that arise during trading hours • Implies high turnover of capital (i.e. one's entire capital or more in a single day) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency High-Frequency Trading (HFT) 6
  • 18. A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds -- Investopedia • Using complex algorithms to analyze multiple markets and execute orders based on market conditions • Buying and selling of securities many times over a period of time (often hundreds of times an hour) • Done to profit from time-sensitive opportunities that arise during trading hours • Implies high turnover of capital (i.e. one's entire capital or more in a single day) • Typically, the traders with the fastest execution speeds are more profitable C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency High-Frequency Trading (HFT) 6
  • 19. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 20. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 21. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 22. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 23. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 24. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 25. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Market data processing 7
  • 26. 1-10us 100-1000ns C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How fast do we do? ALL SOFTWARE APPROACH ALL HARDWARE APPROACH 8
  • 27. 1-10us 100-1000ns C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How fast do we do? ALL SOFTWARE APPROACH ALL HARDWARE APPROACH 8
  • 28. 1-10us 100-1000ns • Average human eye blink takes 350 000us (1/3s) • Millions of orders can be traded in that time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How fast do we do? ALL SOFTWARE APPROACH ALL HARDWARE APPROACH 8
  • 29. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What if something goes wrong? 9
  • 30. • In 2012 was the largest trader in U.S. equities • Market share – 17.3% on NYSE – 16.9% on NASDAQ • Had approximately $365 million in cash and equivalents • Average daily trading volume – 3.3 billion trades – trading over 21 billion dollars C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What if something goes wrong? KNIGHT CAPITAL 9
  • 31. • In 2012 was the largest trader in U.S. equities • Market share – 17.3% on NYSE – 16.9% on NASDAQ • Had approximately $365 million in cash and equivalents • Average daily trading volume – 3.3 billion trades – trading over 21 billion dollars • pre-tax loss of $440 million in 45 minutes -- LinkedIn C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What if something goes wrong? KNIGHT CAPITAL 10
  • 32. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ often not the most important part of the system 11
  • 33. 1 Low Latency network C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ often not the most important part of the system 11
  • 34. 1 Low Latency network 2 Modern hardware C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ often not the most important part of the system 11
  • 35. 1 Low Latency network 2 Modern hardware 3 BIOS profiling C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ often not the most important part of the system 11
  • 36. 1 Low Latency network 2 Modern hardware 3 BIOS profiling 4 Kernel profiling C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ often not the most important part of the system 11
  • 37. 1 Low Latency network 2 Modern hardware 3 BIOS profiling 4 Kernel profiling 5 OS profiling C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ often not the most important part of the system 11
  • 38. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Spin, pin, and drop-in 12
  • 39. • Don't sleep • Don't context switch • Prefer single-threaded scheduling • Disable locking and thread support • Disable power management • Disable C-states • Disable interrupt coalescing C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Spin, pin, and drop-in SPIN 12
  • 40. • Don't sleep • Don't context switch • Prefer single-threaded scheduling • Disable locking and thread support • Disable power management • Disable C-states • Disable interrupt coalescing • Assign CPU affinity • Assign interrupt affinity • Assign memory to NUMA nodes • Consider the physical location of NICs • Isolate cores from general OS use • Use a system with a single physical CPU C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Spin, pin, and drop-in SPIN PIN 12
  • 41. • Don't sleep • Don't context switch • Prefer single-threaded scheduling • Disable locking and thread support • Disable power management • Disable C-states • Disable interrupt coalescing • Assign CPU affinity • Assign interrupt affinity • Assign memory to NUMA nodes • Consider the physical location of NICs • Isolate cores from general OS use • Use a system with a single physical CPU • Choose NIC vendors based on performance and availability of drop-in kernel bypass libraries • Use the kernel bypass library C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Spin, pin, and drop-in SPIN PIN DROP-IN 12
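As an illustration of the PIN bullets above, here is a minimal, Linux/glibc-specific sketch (not taken from the slides; pin_to_core and the chosen core id are hypothetical) of assigning CPU affinity so the hot-path thread never migrates and keeps its caches warm:

    #include <pthread.h>   // pthread_setaffinity_np (GNU extension)
    #include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
    #include <stdexcept>

    // Pin the calling thread to a single core, e.g. one isolated from the
    // general-purpose scheduler with the isolcpus= kernel parameter.
    void pin_to_core(int core_id)
    {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(core_id, &set);
      if(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        throw std::runtime_error{"failed to set CPU affinity"};
    }

Interrupt affinity and NUMA placement are configured outside the program (for example via /proc/irq/*/smp_affinity and numactl), so the sketch above only covers the thread-pinning part of the list.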
  • 42. Let's scope on the software
  • 43. • Typically only a small part of code is really important (a fast/hot path) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Characteristics of Low Latency software 14
  • 44. • Typically only a small part of code is really important (a fast/hot path) • This code is not executed often • When it is executed it has to – start and finish as soon as possible – have predictable and reproducible performance (low jitter) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Characteristics of Low Latency software 14
  • 45. • Typically only a small part of code is really important (a fast/hot path) • This code is not executed often • When it is executed it has to – start and finish as soon as possible – have predictable and reproducible performance (low jitter) • Multithreading increases latency – we need low latency and not throughput – concurrency (even on different cores) trashes CPU caches above L1, shares memory bus, shares IO, shares network C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Characteristics of Low Latency software 14
  • 46. • Typically only a small part of code is really important (a fast/hot path) • This code is not executed often • When it is executed it has to – start and finish as soon as possible – have predictable and reproducible performance (low jitter) • Multithreading increases latency – we need low latency and not throughput – concurrency (even on different cores) trashes CPU caches above L1, shares memory bus, shares IO, shares network • Mistakes are really expensive – good error checking and recovery is mandatory – one second is 4 billion CPU instructions (a lot can happen in that time) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Characteristics of Low Latency software 14
  • 47. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop software that has predictable performance? 15
  • 48. It turns out that the more important question here is... C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop software that has predictable performance? 15
  • 49. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How NOT to develop software that has predictable performance? 16
  • 50. • In Low Latency systems we care a lot about WCET (Worst Case Execution Time) • In order to limit WCET we should limit the usage of specific C++ language features C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How NOT to develop software that has predictable performance? 17
  • 51. • In Low Latency systems we care a lot about WCET (Worst Case Execution Time) • In order to limit WCET we should limit the usage of specific C++ language features • A task not only for developers but also for code architects C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How NOT to develop software that has predictable performance? 17
  • 52. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 53. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 54. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>) 2 Throwing exceptions on a likely code path C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 55. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>) 2 Throwing exceptions on a likely code path 3 Dynamic polymorphism C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 56. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>) 2 Throwing exceptions on a likely code path 3 Dynamic polymorphism 4 Multiple inheritance C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 57. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>) 2 Throwing exceptions on a likely code path 3 Dynamic polymorphism 4 Multiple inheritance 5 RTTI C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 58. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>) 2 Throwing exceptions on a likely code path 3 Dynamic polymorphism 4 Multiple inheritance 5 RTTI 6 Dynamic memory allocations C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to avoid on fast path 18
  • 59. template<class T> class shared_ptr; • Smart pointer that retains shared ownership of an object through a pointer • Several shared_ptr objects may own the same object • The shared object is destroyed and its memory deallocated when the last remaining shared_ptr owning that object is either destroyed or assigned another pointer via operator= or reset() • Supports user provided deleter C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency std::shared_ptr<T> 19
  • 60. template<class T> class shared_ptr; • Smart pointer that retains shared ownership of an object through a pointer • Several shared_ptr objects may own the same object • The shared object is destroyed and its memory deallocated when the last remaining shared_ptr owning that object is either destroyed or assigned another pointer via operator= or reset() • Supports user provided deleter C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency std::shared_ptr<T> Too often overused by C++ programmers 19
  • 61. void foo() { std::unique_ptr<int> ptr{new int{1}}; // some code using 'ptr' } void foo() { std::shared_ptr<int> ptr{new int{1}}; // some code using 'ptr' } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Question: What is the difference here? 20
  • 62. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Key std::shared_ptr<T> issues 21
  • 63. • Shared state – performance + memory footprint • Mandatory synchronization – performance • Type Erasure – performance • std::weak_ptr<T> support – memory footprint • Aliasing constructor – memory footprint C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Key std::shared_ptr<T> issues 21
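A small sketch of why these issues matter on a fast path (not from the slides; the function names are made up): every copy of a std::shared_ptr touches an atomic reference counter, while passing by reference or owning through std::unique_ptr does not.

    #include <memory>

    struct order { /* ... */ };

    // Copies the shared_ptr: atomic ref-count increment on entry, decrement on exit.
    void process_shared(std::shared_ptr<order> o);

    // The caller keeps ownership; the callee merely observes the object.
    void process_observed(const order& o);

    // Exclusive ownership transferred with a single pointer move, no ref-counting.
    void process_owned(std::unique_ptr<order> o);

On the hot path the second and third signatures are usually all that is needed; shared ownership should be the exception, not the default.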
  • 64. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency More info on code::dive 2016 22
  • 65. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ exceptions 23
  • 66. • Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++ exceptions C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ exceptions 23
  • 67. • Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++ exceptions • ... if they are not thrown C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ exceptions 23
  • 68. • Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++ exceptions • ... if they are not thrown • Throwing an exception can take significant and non-deterministic time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ exceptions 23
  • 69. • Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++ exceptions • ... if they are not thrown • Throwing an exception can take significant and non-deterministic time • Advantages of C++ exceptions usage – cannot be ignored! – simplify interfaces – (if not thrown) actually can improve application performance – make source code of likely path easier to reason about C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ exceptions 23
  • 70. • Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++ exceptions • ... if they are not thrown • Throwing an exception can take significant and non-deterministic time • Advantages of C++ exceptions usage – cannot be ignored! – simplify interfaces – (if not thrown) actually can improve application performance – make source code of likely path easier to reason about C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency C++ exceptions Not using C++ exceptions is not an excuse to write exception-unsafe code! 23
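One way to act on both observations (a sketch, not the talk's code; the names are illustrative): keep the throw in a separate cold function so the likely path remains straight-line code and the cost of an exception is only paid when something is genuinely wrong.

    #include <cstdint>
    #include <stdexcept>
    #include <string>

    // Cold helper: message formatting and throwing stay off the fast path.
    [[noreturn]] void report_bad_quantity(std::int64_t qty)
    {
      throw std::invalid_argument{"order quantity must be positive: " + std::to_string(qty)};
    }

    void submit_order(std::int64_t qty)
    {
      if(qty <= 0)
        report_bad_quantity(qty);  // unlikely branch; the likely path never throws
      // ... fast path: build and send the order ...
    }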
  • 71. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Polymorphism 24
  • 72. class base : noncopyable { virtual void setup() = 0; virtual void run() = 0; virtual void cleanup() = 0; public: virtual ~base() = default; void process() { setup(); run(); cleanup(); } }; class derived : public base { void setup() override { /* ... */ } void run() override { /* ... */ } void cleanup() override { /* ... */ } }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Polymorphism DYNAMIC 24
  • 73. class base : noncopyable { virtual void setup() = 0; virtual void run() = 0; virtual void cleanup() = 0; public: virtual ~base() = default; void process() { setup(); run(); cleanup(); } }; class derived : public base { void setup() override { /* ... */ } void run() override { /* ... */ } void cleanup() override { /* ... */ } }; • Additional pointer stored in an object • Extra indirection (pointer dereference) • Often not possible to devirtualize • Not inlined • Instruction cache miss C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Polymorphism DYNAMIC 24
  • 74. class base : noncopyable { virtual void setup() = 0; virtual void run() = 0; virtual void cleanup() = 0; public: virtual ~base() = default; void process() { setup(); run(); cleanup(); } }; class derived : public base { void setup() override { /* ... */ } void run() override { /* ... */ } void cleanup() override { /* ... */ } }; template<class Derived> class base { public: void process() { static_cast<Derived*>(this)->setup(); static_cast<Derived*>(this)->run(); static_cast<Derived*>(this)->cleanup(); } }; class derived : public base<derived> { friend class base<derived>; void setup() { /* ... */ } void run() { /* ... */ } void cleanup() { /* ... */ } }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Polymorphism DYNAMIC STATIC 25
  • 75. • this pointer adjustments needed to call member function (for non-empty base classes) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Multiple inheritance MULTIPLE INHERITANCE 26
  • 76. • this pointer adjustments needed to call member function (for non-empty base classes) • Virtual inheritance as an answer • virtual in C++ means "determined at runtime" • Extra indirection to access data members C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Multiple inheritance MULTIPLE INHERITANCE DIAMOND OF DREAD 27
  • 77. • this pointer adjustments needed to call member function (for non-empty base classes) • Virtual inheritance as an answer • virtual in C++ means "determined at runtime" • Extra indirection to access data members Always prefer composition over inheritance! C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Multiple inheritance MULTIPLE INHERITANCE DIAMOND OF DREAD 27
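A tiny illustration of that advice (a sketch with made-up class names): holding the collaborators as members gives the same reuse as inheriting from several bases, without this-pointer adjustments or virtual bases.

    // Instead of:  class session : public logger, public timer_client { ... };
    class logger { public: void log(const char* msg); };
    class timer  { public: void arm(int seconds); };

    class session {
      logger log_;   // plain members, laid out contiguously inside session
      timer  timer_;
    public:
      void start() { timer_.arm(10); log_.log("session started"); }
    };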
  • 78. class base : noncopyable { public: virtual ~base() = default; virtual void foo() = 0; }; class derived : public base { public: void foo() override; void boo(); }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency RunTime Type Identification (RTTI) 28
  • 79. class base : noncopyable { public: virtual ~base() = default; virtual void foo() = 0; }; class derived : public base { public: void foo() override; void boo(); }; void foo(base& b) { derived* d = dynamic_cast<derived*>(&b); if(d) { d->boo(); } } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency RunTime Type Identification (RTTI) 28
  • 80. class base : noncopyable { public: virtual ~base() = default; virtual void foo() = 0; }; class derived : public base { public: void foo() override; void boo(); }; void foo(base& b) { derived* d = dynamic_cast<derived*>(&b); if(d) { d->boo(); } } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency RunTime Type Identification (RTTI) Often the sign of a smelly design 28
  • 81. class base : noncopyable { public: virtual ~base() = default; virtual void foo() = 0; }; class derived : public base { public: void foo() override; void boo(); }; void foo(base& b) { derived* d = dynamic_cast<derived*>(&b); if(d) { d->boo(); } } • Traversing an inheritance tree • Comparisons C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency RunTime Type Identification (RTTI) 29
  • 82. class base : noncopyable { public: virtual ~base() = default; virtual void foo() = 0; }; class derived : public base { public: void foo() override; void boo(); }; void foo(base& b) { derived* d = dynamic_cast<derived*>(&b); if(d) { d->boo(); } } • Traversing an inheritance tree • Comparisons void foo(base& b) { if(typeid(b) == typeid(derived)) { derived* d = static_cast<derived*>(&b); d->boo(); } } • Only one comparison of std::type_info • Often only one runtime pointer compare C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency RunTime Type Identification (RTTI) 29
  • 83. • General purpose operation • Nondeterministic execution performance • Causes memory fragmentation • May fail (error handling is needed) • Memory leaks possible if not properly handled C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Dynamic memory allocations 30
  • 84. • Address specific needs (functionality and hardware constraints) • Typically low number of dynamic memory allocations • Data structures needed to manage big chunks of memory C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Custom allocators to the rescue 31
  • 85. • Address specific needs (functionality and hardware constraints) • Typically low number of dynamic memory allocations • Data structures needed to manage big chunks of memory template<typename T> struct pool_allocator { T* allocate(std::size_t n); void deallocate(T* p, std::size_t n); }; using pool_string = std::basic_string<char, std::char_traits<char>, pool_allocator<char>>; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Custom allocators to the rescue 31
  • 86. • Address specific needs (functionality and hardware constraints) • Typically low number of dynamic memory allocations • Data structures needed to manage big chunks of memory template<typename T> struct pool_allocator { T* allocate(std::size_t n); void deallocate(T* p, std::size_t n); }; using pool_string = std::basic_string<char, std::char_traits<char>, pool_allocator<char>>; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Custom allocators to the rescue Preallocation makes the allocator jitter more stable, helps in keeping related data together and avoiding long term fragmentation. 31
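The slide only declares the allocator's interface. Below is a minimal sketch of one possible shape of such a preallocating allocator (assumptions: a single-threaded, monotonic arena whose individual deallocations are no-ops and which is recycled in bulk; this is not the slide's pool_allocator):

    #include <cstddef>
    #include <cstdint>
    #include <new>
    #include <string>

    // Monotonic arena: hands out memory from one preallocated buffer.
    class arena {
      std::byte* begin_;
      std::byte* ptr_;
      std::byte* end_;
    public:
      arena(std::byte* buf, std::size_t size) noexcept : begin_{buf}, ptr_{buf}, end_{buf + size} {}
      void* allocate(std::size_t bytes, std::size_t align)
      {
        auto p = reinterpret_cast<std::uintptr_t>(ptr_);
        p = (p + align - 1) & ~(align - 1);                 // round up to the requested alignment
        auto* aligned = reinterpret_cast<std::byte*>(p);
        if(aligned + bytes > end_) throw std::bad_alloc{};  // arena exhausted
        ptr_ = aligned + bytes;
        return aligned;
      }
      void release() noexcept { ptr_ = begin_; }            // recycle the whole arena in O(1)
    };

    template<typename T>
    class arena_allocator {
      arena* a_;
    public:
      using value_type = T;
      explicit arena_allocator(arena& a) noexcept : a_{&a} {}
      template<typename U>
      arena_allocator(const arena_allocator<U>& other) noexcept : a_{other.resource()} {}
      T* allocate(std::size_t n) { return static_cast<T*>(a_->allocate(n * sizeof(T), alignof(T))); }
      void deallocate(T*, std::size_t) noexcept {}          // intentionally a no-op
      arena* resource() const noexcept { return a_; }
    };

    template<typename T, typename U>
    bool operator==(const arena_allocator<T>& l, const arena_allocator<U>& r) noexcept { return l.resource() == r.resource(); }
    template<typename T, typename U>
    bool operator!=(const arena_allocator<T>& l, const arena_allocator<U>& r) noexcept { return !(l == r); }

    using arena_string = std::basic_string<char, std::char_traits<char>, arena_allocator<char>>;

The buffer handed to the arena can live on the fast-path thread's stack or in a dedicated, NUMA-local region, so a burst of allocations costs only a little pointer arithmetic instead of a trip to the general-purpose heap.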
  • 87. • Prevent dynamic memory allocation for the (common) case of dealing with small objects C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Small Object Optimization (SOO / SSO / SBO) 32
  • 88. • Prevent dynamic memory allocation for the (common) case of dealing with small objects class sso_string { char* data_ = u_.sso_; size_t size_ = 0; union { char sso_[16] = ""; size_t capacity_; } u_; public: [[nodiscard]] size_t capacity() const { return data_ == u_.sso_ ? sizeof(u_.sso_) - 1 : u_.capacity_; } // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Small Object Optimization (SOO / SSO / SBO) 32
  • 89. template<std::size_t MaxSize> class inplace_string { std::array<value_type, MaxSize + 1> chars_; public: // string-like interface }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency No dynamic allocation 33
  • 90. template<std::size_t MaxSize> class inplace_string { std::array<value_type, MaxSize + 1> chars_; public: // string-like interface }; struct db_contact { inplace_string<7> symbol; inplace_string<15> name; inplace_string<15> surname; inplace_string<23> company; }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency No dynamic allocation 33
  • 91. template<std::size_t MaxSize> class inplace_string { std::array<value_type, MaxSize + 1> chars_; public: // string-like interface }; struct db_contact { inplace_string<7> symbol; inplace_string<15> name; inplace_string<15> surname; inplace_string<23> company; }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency No dynamic allocation No dynamic memory allocations or pointer indirections guaranteed with the cost of possibly bigger memory usage 33
  • 92. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to do on the fast path 34
  • 93. 1 Use tools that improve efficiency without sacrificing performance C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to do on the fast path 34
  • 94. 1 Use tools that improve efficiency without sacrificing performance 2 Use compile time wherever possible C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to do on the fast path 34
  • 95. 1 Use tools that improve efficiency without sacrificing performance 2 Use compile time wherever possible 3 Know your hardware C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to do on the fast path 34
  • 96. 1 Use tools that improve efficiency without sacrificing performance 2 Use compile time wherever possible 3 Know your hardware 4 Clearly isolate cold code from the fast path C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The list of things to do on the fast path 34
  • 97. • static_assert() • Automatic type deduction • Type aliases • Move semantics • noexcept • constexpr • Lambda expressions • type_traits • std::string_view • Variadic templates • and many more... C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Examples of safe to use C++ features 35
  • 98. The fastest programs are those that do nothing C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Do you agree? 36
  • 99. int factorial(int n) { int result = n; while(n > 1) result *= --n; return result; } foo(factorial(n)); // not compile-time foo(factorial(4)); // not compile-time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++98) 37
  • 100. int factorial(int n) { int result = n; while(n > 1) result *= --n; return result; } foo(factorial(n)); // not compile-time foo(factorial(4)); // not compile-time template<int N> struct Factorial { static const int value = N * Factorial<N - 1>::value; }; template<> struct Factorial<0> { static const int value = 1; }; int array[Factorial<4>::value]; const int precalc_values[] = { Factorial<0>::value, Factorial<1>::value, Factorial<2>::value }; foo(Factorial<4>::value); // compile-time // foo(Factorial<n>::value); // compile-time error C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++98) 37
  • 101. • 2 separate implementations – hard to keep in sync • Template metaprogramming is hard! • Easier to precalculate in Excel and hardcode results in the code ;-) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Just too hard! 38
  • 102. • 2 separate implementations – hard to keep in sync • Template metaprogramming is hard! • Easier to precalculate in Excel and hardcode results in the code ;-) const int precalc_values[] = { 1, 1, 2, 6, 24, 120, 720, // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Just too hard! 38
  • 103. Declares that it is possible to evaluate the value of the function or variable at compile time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency constexpr specifier 39
  • 104. Declares that it is possible to evaluate the value of the function or variable at compile time • constexpr specifier used in an object declaration implies const • constexpr specifier used in a function or static member variable declaration implies inline C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency constexpr specifier 39
  • 105. Declares that it is possible to evaluate the value of the function or variable at compile time • constexpr specifier used in an object declaration implies const • constexpr specifier used in a function or static member variable declaration implies inline • Such variables and functions can be used in compile time constant expressions (provided that appropriate function arguments are given) constexpr int factorial(int n); std::array<int, factorial(4)> array; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency constexpr specifier 39
  • 106. • A constexpr function must not be virtual • The function body may contain only – static_assert declarations – alias declarations – using declarations and directives – exactly one return statement with an expression that does not mutate any of the objects constexpr int factorial(int n) { return n <= 1 ? 1 : (n * factorial(n - 1)); } std::array<int, factorial(4)> array; constexpr std::array<int, 3> precalc_values = { factorial(0), factorial(1), factorial(2) }; static_assert(factorial(4) == 24, "Error"); foo(factorial(4)); // maybe compile-time foo(factorial(n)); // not compile-time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++11) 40
  • 107. • A constexpr function must not be virtual • The function body must not contain – a goto statement and labels other than case and default – a try block – an asm declaration – variable of non-literal type – variable of static or thread storage duration – uninitialized variable – evaluation of a lambda expression – reinterpret_cast and dynamic_cast – changing an active member of an union – a new or delete expression constexpr int factorial(int n) { int result = n; while(n > 1) result *= --n; return result; } std::array<int, factorial(4)> array; constexpr std::array<int, 3> precalc_values = { factorial(0), factorial(1), factorial(2) }; static_assert(factorial(4) == 24, "Error"); foo(factorial(4)); // maybe compile-time foo(factorial(n)); // not compile-time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++14) 41
  • 108. • A constexpr function must not be virtual • The function body must not contain – a goto statement and labels other than case and default – a try block – an asm declaration – variable of non-literal type – variable of static or thread storage duration – uninitialized variable – reinterpret_cast and dynamic_cast – changing an active member of an union – a new or delete expression constexpr int factorial(int n) { int result = n; while(n > 1) result *= --n; return result; } std::array<int, factorial(4)> array; constexpr std::array precalc_values = { factorial(0), factorial(1), factorial(2) }; static_assert(factorial(4) == 24); foo(factorial(4)); // maybe compile-time foo(factorial(n)); // not compile-time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++17) 42
  • 109. • The function body must not contain – a goto statement and labels other than case and default – variable of non-literal type – variable of static or thread storage duration – uninitialized variable – reinterpret_cast constexpr int factorial(int n) { int result = n; while(n > 1) result *= --n; return result; } std::array<int, factorial(4)> array; constexpr std::array precalc_values = { factorial(0), factorial(1), factorial(2) }; static_assert(factorial(4) == 24); foo(factorial(4)); // maybe compile-time foo(factorial(n)); // not compile-time C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++20) 43
  • 110. • The consteval specifier declares a function to be an immediate function • Every call must produce a compile time constant expression • An immediate function is a constexpr function consteval int factorial(int n) { int result = n; while(n > 1) result *= --n; return result; } std::array<int, factorial(4)> array; constexpr std::array precalc_values = { factorial(0), factorial(1), factorial(2) }; static_assert(factorial(4) == 24); foo(factorial(4)); // compile-time // foo(factorial(n)); // compile-time error C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time calculations (C++20) 44
  • 111. template<typename T> struct is_array : std::false_type {}; template<typename T> struct is_array<T[]> : std::true_type {}; template<typename T> constexpr bool is_array_v = is_array<T>::value; static_assert(is_array_v<int> == false); static_assert(is_array_v<int[]>); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time dispatch (C++14) 45
  • 112. template<typename T> struct is_array : std::false_type {}; template<typename T> struct is_array<T[]> : std::true_type {}; template<typename T> constexpr bool is_array_v = is_array<T>::value; static_assert(is_array_v<int> == false); static_assert(is_array_v<int[]>); template<typename T> class smart_ptr { void destroy(std::true_type) noexcept { delete[] ptr_; } void destroy(std::false_type) noexcept { delete ptr_; } void destroy() noexcept { destroy(is_array<T>()); } // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time dispatch (C++14) Tag dispatch provides the possibility to select the proper function overload in compile-time based on properties of a type 45
  • 113. template<typename T> inline constexpr bool is_array = false; template<typename T> inline constexpr bool is_array<T[]> = true; static_assert(is_array<int> == false); static_assert(is_array<int[]>); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time dispatch (C++17) 46
  • 114. template<typename T> inline constexpr bool is_array = false; template<typename T> inline constexpr bool is_array<T[]> = true; static_assert(is_array<int> == false); static_assert(is_array<int[]>); template<typename T> class smart_ptr { void destroy() noexcept { if constexpr(is_array<T>) delete[] ptr_; else delete ptr_; } // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Compile-time dispatch (C++17) Usage of variable templates and if constexpr decreases compilation time 46
  • 115. struct X { int a, b, c; int id; }; void foo(const std::vector<X>& a1, std::vector<X>& a2) { memcpy(a2.data(), a1.data(), a1.size() * sizeof(X)); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: A negative overhead abstraction 47
  • 116.
  • 117.
  • 118. struct X { int a, b, c; int id; }; void foo(const std::vector<X>& a1, std::vector<X>& a2) { memcpy(a2.data(), a1.data(), a1.size() * sizeof(X)); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: A negative overhead abstraction 50
  • 119. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; void foo(const std::vector<X>& a1, std::vector<X>& a2) { memcpy(a2.data(), a1.data(), a1.size() * sizeof(X)); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: A negative overhead abstraction 51
  • 120. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; void foo(const std::vector<X>& a1, std::vector<X>& a2) { memcpy(a2.data(), a1.data(), a1.size() * sizeof(X)); } Ooops!!! C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: A negative overhead abstraction 51
  • 121. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; void foo(const std::vector<X>& a1, std::vector<X>& a2) { std::copy(begin(a1), end(a1), begin(a2)); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: A negative overhead abstraction 52
  • 122. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; movq %rsi, %rax movq 8(%rdi), %rdx movq (%rdi), %rsi cmpq %rsi, %rdx je .L1 movq (%rax), %rdi subq %rsi, %rdx jmp memmove // 100 lines of assembly code void foo(const std::vector<X>& a1, std::vector<X>& a2) { std::copy(begin(a1), end(a1), begin(a2)); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: A negative overhead abstraction 52
  • 123. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time contract 53
  • 124. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; static_assert(std::is_trivially_copyable_v<X>); static_assert(std::is_trivially_copyable_v<X>); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time contract 53
  • 125. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; static_assert(std::is_trivially_copyable_v<X>); static_assert(std::is_trivially_copyable_v<X>); Compiler returned: 0 <source>:9:20: error: static assertion failed 9 | static_assert(std::is_trivially_copyable_v<X>); | ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~ Compiler returned: 1 C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time contract 53
  • 126. struct X { int a, b, c; int id; }; struct X { int a, b, c; std::string id; }; static_assert(std::is_trivially_copyable_v<X>); static_assert(std::is_trivially_copyable_v<X>); Compiler returned: 0 <source>:9:20: error: static assertion failed 9 | static_assert(std::is_trivially_copyable_v<X>); | ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~ Compiler returned: 1 C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time contract Enforce compile-time contracts and class invariants with contracts or static_asserts 53
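The same trait can also steer the implementation rather than just reject bad types. A sketch (not from the slides) of a copy helper that selects memcpy only when it is provably safe:

    #include <algorithm>
    #include <cstring>
    #include <type_traits>
    #include <vector>

    template<typename T>
    void copy_vector(const std::vector<T>& from, std::vector<T>& to)
    {
      to.resize(from.size());
      if(from.empty()) return;
      if constexpr(std::is_trivially_copyable_v<T>)
        std::memcpy(to.data(), from.data(), from.size() * sizeof(T));  // bitwise copy is valid
      else
        std::copy(from.begin(), from.end(), to.begin());               // element-wise copy
    }

In practice std::copy already performs this dispatch internally, which is exactly why the earlier slide's std::copy version compiled down to a memmove call for the trivially copyable X.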
  • 127. enum class side { BID, ASK }; class order_book { template<side S> class book_side { std::vector<price> levels_; public: void insert(order o) { } bool match(price p) const { } // ... }; book_side<side::BID> bids_; book_side<side::ASK> asks_; // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time branching 54
  • 128. enum class side { BID, ASK }; class order_book { template<side S> class book_side { using compare = std::conditional_t<S == side::BID, std::greater<>, std::less<>>; std::vector<price> levels_; public: void insert(order o) { } bool match(price p) const { } // ... }; book_side<side::BID> bids_; book_side<side::ASK> asks_; // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time branching 54
  • 129. enum class side { BID, ASK }; class order_book { template<side S> class book_side { using compare = std::conditional_t<S == side::BID, std::greater<>, std::less<>>; std::vector<price> levels_; public: void insert(order o) { const auto it = lower_bound(begin(levels_), end(levels_), o.price, compare{}); if(it != end(levels_) && *it != o.price) levels_.insert(it, o.price); // ... } bool match(price p) const { } // ... }; book_side<side::BID> bids_; book_side<side::ASK> asks_; // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time branching 54
  • 130. enum class side { BID, ASK }; class order_book { template<side S> class book_side { using compare = std::conditional_t<S == side::BID, std::greater<>, std::less<>>; std::vector<price> levels_; public: void insert(order o) { const auto it = lower_bound(begin(levels_), end(levels_), o.price, compare{}); if(it != end(levels_) && *it != o.price) levels_.insert(it, o.price); // ... } bool match(price p) const { return compare{}(levels_.back(), p); } // ... }; book_side<side::BID> bids_; book_side<side::ASK> asks_; // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Type traits: Compile-time branching 54
  • 131. constexpr int array_size = 10'000; int array[array_size][array_size]; for(auto i = 0L; i < array_size; ++i) { for(auto j = 0L; j < array_size; ++j) { array[j][i] = i + j; } } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What is wrong here? 55
  • 132. constexpr int array_size = 10'000; int array[array_size][array_size]; for(auto i = 0L; i < array_size; ++i) { for(auto j = 0L; j < array_size; ++j) { array[j][i] = i + j; } } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What is wrong here? Reckless cache usage can cost you a lot of performance! 56
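For reference, the cache-friendly variant of the slide's loop simply swaps the indices so the innermost loop walks through contiguous memory (row-major order); the fill() wrapper is added here only to make the sketch self-contained:

    constexpr int array_size = 10'000;
    int array[array_size][array_size];  // roughly 400 MB, so it has to live in static storage

    void fill()
    {
      for(auto i = 0L; i < array_size; ++i) {
        for(auto j = 0L; j < array_size; ++j) {
          array[i][j] = i + j;          // last index varies fastest: every loaded cache line is fully used
        }
      }
    }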
  • 133. Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 142 Model name: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz Stepping: 10 CPU MHz: 1700.035 CPU max MHz: 1700,0000 CPU min MHz: 400,0000 BogoMIPS: 3799.90 Virtualization: VT-x L1d cache: 128 KiB L1i cache: 128 KiB L2 cache: 1 MiB L3 cache: 6 MiB NUMA node0 CPU(s): 0-7 C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 57
  • 134. Please everybody stand up :-) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 135. Please everybody stand up :-) • Similar C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 136. Please everybody stand up :-) • Similar • < 2x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 137. Please everybody stand up :-) • Similar • < 2x • < 5x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 138. Please everybody stand up :-) • Similar • < 2x • < 5x • < 10x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 139. Please everybody stand up :-) • Similar • < 2x • < 5x • < 10x • < 20x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 140. Please everybody stand up :-) • Similar • < 2x • < 5x • < 10x • < 20x • < 50x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 141. Please everybody stand up :-) • Similar • < 2x • < 5x • < 10x • < 20x • < 50x • < 100x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 142. Please everybody stand up :-) • Similar • < 2x • < 5x • < 10x • < 20x • < 50x • < 100x • >= 100x C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? 58
  • 143. • Similar • < 2x • < 5x • < 10x • < 20x • < 50x • < 100x • >= 100x • column-major: 2490 ms • row-major: 51 ms • ratio: 47x Please everybody stand up :-) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? GCC 7.4.0 58
  • 144. • Similar • < 2x • < 5x • < 10x • < 20x • < 50x • < 100x • >= 100x • column-major: 2490 ms • row-major: 51 ms • ratio: 47x • column-major: 51 ms • row-major: 50 ms • ratio: Similar :-) Please everybody stand up :-) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Quiz: How much slower is the bad case? GCC 7.4.0 GCC 9.2.1 58
  • 145. 3 274,28 msec task-clock # 0,999 CPUs utilized 5 536 529 085 cycles # 1,691 GHz (66,53%) 1 246 414 976 instructions # 0,23 insn per cycle (83,27%) 92 872 234 L1-dcache-loads # 28,364 M/sec (83,29%) 410 636 090 L1-dcache-load-misses # 442,15% of all L1-dcache hits (83,39%) 87 191 333 LLC-loads # 26,629 M/sec (83,39%) 527 256 LLC-load-misses # 0,60% of all LL-cache hits (83,41%) 97 781 page-faults # 0,030 M/sec 360,48 msec task-clock # 0,999 CPUs utilized 609 105 177 cycles # 1,690 GHz (66,71%) 891 120 837 instructions # 1,46 insn per cycle (83,36%) 90 219 862 L1-dcache-loads # 250,279 M/sec (83,35%) 14 481 217 L1-dcache-load-misses # 16,05% of all L1-dcache hits (83,35%) 167 850 LLC-loads # 0,466 M/sec (83,36%) 79 492 LLC-load-misses # 47,36% of all LL-cache hits (83,22%) 97 783 page-faults # 0,271 M/sec C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency CPU data cache 59
  • 146. 3 274,28 msec task-clock # 0,999 CPUs utilized 5 536 529 085 cycles # 1,691 GHz (66,53%) 1 246 414 976 instructions # 0,23 insn per cycle (83,27%) 92 872 234 L1-dcache-loads # 28,364 M/sec (83,29%) 410 636 090 L1-dcache-load-misses # 442,15% of all L1-dcache hits (83,39%) 87 191 333 LLC-loads # 26,629 M/sec (83,39%) 527 256 LLC-load-misses # 0,60% of all LL-cache hits (83,41%) 97 781 page-faults # 0,030 M/sec 360,48 msec task-clock # 0,999 CPUs utilized 609 105 177 cycles # 1,690 GHz (66,71%) 891 120 837 instructions # 1,46 insn per cycle (83,36%) 90 219 862 L1-dcache-loads # 250,279 M/sec (83,35%) 14 481 217 L1-dcache-load-misses # 16,05% of all L1-dcache hits (83,35%) 167 850 LLC-loads # 0,466 M/sec (83,36%) 79 492 LLC-load-misses # 47,36% of all LL-cache hits (83,22%) 97 783 page-faults # 0,271 M/sec C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency CPU data cache 59
  • 147. 3 274,28 msec task-clock # 0,999 CPUs utilized 5 536 529 085 cycles # 1,691 GHz (66,53%) 1 246 414 976 instructions # 0,23 insn per cycle (83,27%) 92 872 234 L1-dcache-loads # 28,364 M/sec (83,29%) 410 636 090 L1-dcache-load-misses # 442,15% of all L1-dcache hits (83,39%) 87 191 333 LLC-loads # 26,629 M/sec (83,39%) 527 256 LLC-load-misses # 0,60% of all LL-cache hits (83,41%) 97 781 page-faults # 0,030 M/sec 360,48 msec task-clock # 0,999 CPUs utilized 609 105 177 cycles # 1,690 GHz (66,71%) 891 120 837 instructions # 1,46 insn per cycle (83,36%) 90 219 862 L1-dcache-loads # 250,279 M/sec (83,35%) 14 481 217 L1-dcache-load-misses # 16,05% of all L1-dcache hits (83,35%) 167 850 LLC-loads # 0,466 M/sec (83,36%) 79 492 LLC-load-misses # 47,36% of all LL-cache hits (83,22%) 97 783 page-faults # 0,271 M/sec C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency CPU data cache 60
  • 148. 3 274,28 msec task-clock # 0,999 CPUs utilized 5 536 529 085 cycles # 1,691 GHz (66,53%) 1 246 414 976 instructions # 0,23 insn per cycle (83,27%) 92 872 234 L1-dcache-loads # 28,364 M/sec (83,29%) 410 636 090 L1-dcache-load-misses # 442,15% of all L1-dcache hits (83,39%) 87 191 333 LLC-loads # 26,629 M/sec (83,39%) 527 256 LLC-load-misses # 0,60% of all LL-cache hits (83,41%) 97 781 page-faults # 0,030 M/sec 360,48 msec task-clock # 0,999 CPUs utilized 609 105 177 cycles # 1,690 GHz (66,71%) 891 120 837 instructions # 1,46 insn per cycle (83,36%) 90 219 862 L1-dcache-loads # 250,279 M/sec (83,35%) 14 481 217 L1-dcache-load-misses # 16,05% of all L1-dcache hits (83,35%) 167 850 LLC-loads # 0,466 M/sec (83,36%) 79 492 LLC-load-misses # 47,36% of all LL-cache hits (83,22%) 97 783 page-faults # 0,271 M/sec C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency CPU data cache 61
  • 149. 3 274,28 msec task-clock # 0,999 CPUs utilized 5 536 529 085 cycles # 1,691 GHz (66,53%) 1 246 414 976 instructions # 0,23 insn per cycle (83,27%) 92 872 234 L1-dcache-loads # 28,364 M/sec (83,29%) 410 636 090 L1-dcache-load-misses # 442,15% of all L1-dcache hits (83,39%) 87 191 333 LLC-loads # 26,629 M/sec (83,39%) 527 256 LLC-load-misses # 0,60% of all LL-cache hits (83,41%) 97 781 page-faults # 0,030 M/sec 360,48 msec task-clock # 0,999 CPUs utilized 609 105 177 cycles # 1,690 GHz (66,71%) 891 120 837 instructions # 1,46 insn per cycle (83,36%) 90 219 862 L1-dcache-loads # 250,279 M/sec (83,35%) 14 481 217 L1-dcache-load-misses # 16,05% of all L1-dcache hits (83,35%) 167 850 LLC-loads # 0,466 M/sec (83,36%) 79 492 LLC-load-misses # 47,36% of all LL-cache hits (83,22%) 97 783 page-faults # 0,271 M/sec C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency CPU data cache 62
  • 150. struct coordinates { int x, y; }; void draw(const coordinates& coord); void verify(int threshold); constexpr int OBJECT_COUNT = 1'000'000; class objectMgr; void process(const objectMgr& mgr) { const auto size = mgr.size(); for(auto i = 0UL; i < size; ++i) { draw(mgr.position(i)); } for(auto i = 0UL; i < size; ++i) { verify(mgr.threshold(i)); } } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Another example 63
  • 151. class objectMgr { struct object { coordinates coord; std::string errorTxt_1; std::string errorTxt_2; std::string errorTxt_3; int threshold; std::array<char, 100> otherData; }; std::vector<object> data_; public: explicit objectMgr(std::size_t size) : data_{size} {} std::size_t size() const { return data_.size(); } const coordinates& position(std::size_t idx) const { return data_[idx].coord; } int threshold(std::size_t idx) const { return data_[idx].threshold; } }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Naive objectMgr implementation 64
  • 152. class objectMgr { struct object { coordinates coord; std::string errorTxt_1; std::string errorTxt_2; std::string errorTxt_3; int threshold; std::array<char, 100> otherData; }; std::vector<object> data_; public: explicit objectMgr(std::size_t size) : data_{size} {} std::size_t size() const { return data_.size(); } const coordinates& position(std::size_t idx) const { return data_[idx].coord; } int threshold(std::size_t idx) const { return data_[idx].threshold; } }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Naive objectMgr implementation 64
  • 153. • Program optimization approach motivated by cache coherency • Focuses on data layout • Results in objects decomposition C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency DOD (Data-Oriented Design) 65
  • 154. • Program optimization approach motivated by cache coherency • Focuses on data layout • Results in objects decomposition C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency DOD (Data-Oriented Design) Keep data used together close to each other 65
  • 155. class objectMgr { std::vector<coordinates> positions_; std::vector<int> thresholds_; struct otherData { struct errorData { std::string errorTxt_1, errorTxt_2, errorTxt_3; }; errorData error; std::array<char, 100> data; }; std::vector<otherData> coldData_; public: explicit objectMgr(std::size_t size) : positions_{size}, thresholds_(size), coldData_{size} {} std::size_t size() const { return positions_.size(); } const coordinates& position(std::size_t idx) const { return positions_[idx]; } int threshold(std::size_t idx) const { return thresholds_[idx]; } }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Object decomposition example 66
  • 156. struct A { char c; short s; int i; double d; static double dd; }; static_assert( sizeof(A) == ??); static_assert(alignof(A) == ??); struct B { char c; double d; short s; int i; static double dd; }; static_assert( sizeof(B) == ??); static_assert(alignof(B) == ??); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Members order might be important 67
  • 157. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); struct B { char c; // size=1, alignment=1, padding=7 double d; // size=8, alignment=8, padding=0 short s; // size=2, alignment=2, padding=2 int i; // size=4, alignment=4, padding=0 static double dd; }; static_assert( sizeof(B) == 24); static_assert(alignof(B) == 8); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Members order might be important 67
  • 158. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); struct B { char c; // size=1, alignment=1, padding=7 double d; // size=8, alignment=8, padding=0 short s; // size=2, alignment=2, padding=2 int i; // size=4, alignment=4, padding=0 static double dd; }; static_assert( sizeof(B) == 24); static_assert(alignof(B) == 8); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Members order might be important In order to satisfy alignment requirements of all non-static members of a class, padding may be inserted after some of its members 67
  • 159. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Be aware of alignment side effects 68
  • 160. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); using opt = std::optional<A>; static_assert( sizeof(opt) == 24); static_assert(alignof(opt) == 8); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Be aware of alignment side effects 68
  • 161. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); using opt = std::optional<A>; static_assert( sizeof(opt) == 24); static_assert(alignof(opt) == 8); using array = std::array<opt, 256>; static_assert( sizeof(array) == 24 * 256); static_assert(alignof(array) == 8); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Be aware of alignment side effects 68
  • 162. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); using opt = std::optional<A>; static_assert( sizeof(opt) == 24); static_assert(alignof(opt) == 8); using array = std::array<opt, 256>; static_assert( sizeof(array) == 24 * 256); static_assert(alignof(array) == 8); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Be aware of alignment side effects Be aware of implementation details of the tools you use every day 68
  • 163. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); struct C { char c; double d; short s; static double dd; int i; } __attribute__((packed)); static_assert( sizeof(C) == 15); static_assert(alignof(C) == 1); C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Packing 69
  • 164. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); struct C { char c; double d; short s; static double dd; int i; } __attribute__((packed)); static_assert( sizeof(C) == 15); static_assert(alignof(C) == 1); On modern hardware may be faster than aligned structure C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Packing 69
  • 165. struct A { char c; // size=1, alignment=1, padding=1 short s; // size=2, alignment=2, padding=0 int i; // size=4, alignment=4, padding=0 double d; // size=8, alignment=8, padding=0 static double dd; }; static_assert( sizeof(A) == 16); static_assert(alignof(A) == 8); struct C { char c; double d; short s; static double dd; int i; } __attribute__((packed)); static_assert( sizeof(C) == 15); static_assert(alignof(C) == 1); On modern hardware may be faster than aligned structure C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Packing Not portable! May be slower or even crash 69
  • 166. L1 cache reference 0.5 ns Branch misprediction 5 ns L2 cache reference 7 ns 14x L1 cache Mutex lock/unlock 25 ns Main memory reference 100 ns 20x L2 cache, 200x L1 cache Compress 1K bytes with Zippy 3,000 ns Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms Read 4K randomly from SSD 150,000 ns 0.15 ms ~1GB/sec SSD Read 1 MB sequentially from memory 250,000 ns 0.25 ms Round trip within same datacenter 500,000 ns 0.5 ms Read 1 MB sequentially from SSD 1,000,000 ns 1 ms ~1GB/sec SSD, 4X memory Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD Send packet CA->Netherlands->CA 150,000,000 ns 150 ms https://gist.github.com/jboner/2841832 C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Latency Numbers Every Programmer Should Know 70
  • 167. class vector_downward { uint8_t *make_space(size_t len) { if(len > static_cast<size_t>(cur_ - buf_)) { auto old_size = size(); auto largest_align = AlignOf<largest_scalar_t>(); reserved_ += (std::max)(len, growth_policy(reserved_)); // Round up to avoid undefined behavior from unaligned loads and stores. reserved_ = (reserved_ + (largest_align - 1)) & ~(largest_align - 1); auto new_buf = allocator_.allocate(reserved_); auto new_cur = new_buf + reserved_ - old_size; memcpy(new_cur, cur_, old_size); cur_ = new_cur; allocator_.deallocate(buf_); buf_ = new_buf; } cur_ -= len; // Beyond this, signed offsets may not have enough range: // (FlatBuffers > 2GB not supported). assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE); return cur_; } // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What is wrong here? 71
  • 168. class vector_downward { uint8_t *make_space(size_t len) { if(len > static_cast<size_t>(cur_ - buf_)) { auto old_size = size(); auto largest_align = AlignOf<largest_scalar_t>(); reserved_ += (std::max)(len, growth_policy(reserved_)); // Round up to avoid undefined behavior from unaligned loads and stores. reserved_ = (reserved_ + (largest_align - 1)) & ~(largest_align - 1); auto new_buf = allocator_.allocate(reserved_); auto new_cur = new_buf + reserved_ - old_size; memcpy(new_cur, cur_, old_size); cur_ = new_cur; allocator_.deallocate(buf_); buf_ = new_buf; } cur_ -= len; // Beyond this, signed offsets may not have enough range: // (FlatBuffers > 2GB not supported). assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE); return cur_; } // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency What is wrong here? 71
  • 169. class vector_downward { uint8_t *make_space(size_t len) { if(len > static_cast<size_t>(cur_ - buf_)) reallocate(len); cur_ -= len; // Beyond this, signed offsets may not have enough range: // (FlatBuffers > 2GB not supported). assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE); return cur_; } void reallocate(size_t len) { auto old_size = size(); auto largest_align = AlignOf<largest_scalar_t>(); reserved_ += (std::max)(len, growth_policy(reserved_)); // Round up to avoid undefined behavior from unaligned loads and stores. reserved_ = (reserved_ + (largest_align - 1)) & ~(largest_align - 1); auto new_buf = allocator_.allocate(reserved_); auto new_cur = new_buf + reserved_ - old_size; memcpy(new_cur, cur_, old_size); cur_ = new_cur; allocator_.deallocate(buf_); buf_ = new_buf; } // ... }; C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Separate the code to hot and cold paths 72
  • 170. class vector_downward { uint8_t *make_space(size_t len) { if(len > static_cast<size_t>(cur_ - buf_)) reallocate(len); cur_ -= len; // Beyond this, signed offsets may not have enough range: // (FlatBuffers > 2GB not supported). assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE); return cur_; } // ... }; Code is data too! C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Separate the code to hot and cold paths 73
  • 171. class vector_downward { uint8_t *make_space(size_t len) { if(len > static_cast<size_t>(cur_ - buf_)) reallocate(len); cur_ -= len; // Beyond this, signed offsets may not have enough range: // (FlatBuffers > 2GB not supported). assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE); return cur_; } // ... }; Code is data too! C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Separate the code to hot and cold paths Performance improvement of 20% thanks to better inlining 73
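The same separation can also be stated explicitly. A minimal sketch, assuming GCC/Clang attribute spellings and a C++20 compiler; the names are invented for illustration and are not part of the FlatBuffers code above:

#include <cstdio>

[[gnu::noinline, gnu::cold]]          // keep the rarely taken path out of line
void grow_buffer() { std::puts("cold path: reallocate, copy, maybe log"); }

inline int consume(int& bytes_left, int len) {
  if (len > bytes_left) [[unlikely]]  // C++20 hint: this branch is the cold one
    grow_buffer();
  bytes_left -= len;
  return bytes_left;
}

int main() {
  int left = 4096;
  return consume(left, 64) == 4032 ? 0 : 1;
}

Keeping the cold path out of the hot function's body both helps the inliner and keeps the hot instructions contiguous in the instruction cache, which is exactly why "code is data too".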
  • 172. std::optional<Error> validate(const request& r) { switch(r.type) { case request::type_1: if(/* simple check */) return std::nullopt; return /* complex error msg generation */; case request::type_2: if(/* simple check */) return std::nullopt; return /* complex error msg generation */; case request::type_3: if(/* simple check */) return std::nullopt; return /* complex error msg generation */; // ... } throw std::logic_error(""); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Typical validation and error handling 74
  • 173. std::optional<Error> validate(const request& r) { if(is_valid(r)) return std::nullopt; return make_error(r); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Isolating cold path 75
  • 174. bool is_valid(const request& r) { switch(r.type) { case request::type_1: return /* simple check */; case request::type_2: return /* simple check */; case request::type_3: return /* simple check */; // ... } throw std::logic_error(""); } std::optional<Error> validate(const request& r) { if(is_valid(r)) return std::nullopt; return make_error(r); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Isolating cold path 75
  • 175. bool is_valid(const request& r) { switch(r.type) { case request::type_1: return /* simple check */; case request::type_2: return /* simple check */; case request::type_3: return /* simple check */; // ... } throw std::logic_error(""); } Error make_error(const request& r) { switch(r.type) { case request::type_1: return /* complex error msg generation */; case request::type_2: return /* complex error msg generation */; case request::type_3: return /* complex error msg generation */; // ... } throw std::logic_error(""); } std::optional<Error> validate(const request& r) { if(is_valid(r)) return std::nullopt; return make_error(r); } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Isolating cold path 75
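A hypothetical call site for the split above (process() and report() are invented stand-ins, not from the slides): the hot path pays only for the cheap is_valid() check, and the error-formatting code runs only for rejected requests.

void report(const Error&);
void process(const request&);

void on_request(const request& r) {
  if (auto err = validate(r)) {  // engaged only on the cold, rejected path
    report(*err);
    return;
  }
  process(r);                    // hot path: no error-message construction
}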
  • 176. if(expensiveCheck() && fastCheck()) { /* ... */ } if(fastCheck() && expensiveCheck()) { /* ... */ } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Expression short-circuiting WRONG GOOD 76
  • 177. if(expensiveCheck() && fastCheck()) { /* ... */ } if(fastCheck() && expensiveCheck()) { /* ... */ } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Expression short-circuiting WRONG GOOD Bail out as early as possible and continue fast path 76
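A concrete version of the same rule, with invented helper functions: put the cheap, highly selective test first so the expensive one is evaluated only for the few inputs that survive it.

#include <string>

// Trivial tag check: rejects the vast majority of messages.
bool is_trade_message(char type) { return type == 'T'; }

// Stand-in for an expensive operation such as parsing the message body.
bool matches_symbol(const std::string& msg) { return msg.find("AAPL") != std::string::npos; }

bool relevant(char type, const std::string& msg) {
  // && short-circuits: matches_symbol() is never called for non-trade messages
  return is_trade_message(type) && matches_symbol(msg);
}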
  • 178. int foo(int i) { int k = 0; for(int j = i; j < i + 10; ++j) ++k; return k; } int foo(unsigned i) { int k = 0; for(unsigned j = i; j < i + 10; ++j) ++k; return k; } C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Integer arithmetic 77
  • 179. int foo(int i) { int k = 0; for(int j = i; j < i + 10; ++j) ++k; return k; } int foo(unsigned i) { int k = 0; for(unsigned j = i; j < i + 10; ++j) ++k; return k; } foo(int): mov eax, 10 ret foo(unsigned int): cmp edi, -10 sbb eax, eax and eax, 10 ret C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Integer arithmetic 77
  • 180. int foo(int i) { int k = 0; for(int j = i; j < i + 10; ++j) ++k; return k; } int foo(unsigned i) { int k = 0; for(unsigned j = i; j < i + 10; ++j) ++k; return k; } foo(int): mov eax, 10 ret foo(unsigned int): cmp edi, -10 sbb eax, eax and eax, 10 ret C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Integer arithmetic Integer arithmetic differs for the signed and unsigned integral types: signed overflow is undefined behavior, so the compiler may assume i + 10 never wraps and fold the loop into the constant 10, while unsigned arithmetic wraps around, so the compiler must also handle the case where i + 10 wraps and the loop body never executes 77
  • 181. • Avoid using std::list, std::map, etc C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 78
  • 182. • Avoid using std::list, std::map, etc • std::unordered_map is also broken C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 78
  • 183. • Avoid using std::list, std::map, etc • std::unordered_map is also broken • Prefer std::vector – also as the underlying storage for hash tables or flat maps – consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 78
  • 184. • Avoid using std::list, std::map, etc • std::unordered_map is also broken • Prefer std::vector – also as the underlying storage for hash tables or flat maps – consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar • Preallocate storage – free list – plf::colony C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 78
  • 185. • Avoid using std::list, std::map, etc • std::unordered_map is also broken • Prefer std::vector – also as the underlying storage for hash tables or flat maps – consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar • Preallocate storage – free list – plf::colony • Consider storing only pointers to objects in the container • Consider storing expensive hash values with the key C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 78
  • 186. • Avoid using std::list, std::map, etc • std::unordered_map is also broken • Prefer std::vector – also as the underlying storage for hash tables or flat maps – consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar • Preallocate storage – free list – plf::colony • Consider storing only pointers to objects in the container • Consider storing expensive hash values with the key • Limit the number of type conversions C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 78
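A minimal sketch of two of the bullets above (preallocated std::vector storage and caching the expensive hash next to the key). This is a toy lookup table for illustration, not a production hash map:

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct entry {
  std::size_t hash;  // cached so probing rarely touches the key's characters
  std::string key;
  int value;
};

class flat_table {
public:
  explicit flat_table(std::size_t capacity) { entries_.reserve(capacity); }  // preallocate once

  void insert(std::string key, int value) {
    entries_.push_back({std::hash<std::string>{}(key), std::move(key), value});
  }

  int* find(const std::string& key) {
    const std::size_t h = std::hash<std::string>{}(key);
    for (entry& e : entries_)                       // contiguous, cache-friendly scan
      if (e.hash == h && e.key == key) return &e.value;
    return nullptr;
  }

private:
  std::vector<entry> entries_;
};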
  • 187. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 188. • Keep the number of threads close (less or equal) to the number of available physical CPU cores C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 189. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 190. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 191. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 192. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 193. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible • The simpler the code, the faster it is likely to be C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 194. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible • The simpler the code, the faster it is likely to be • Do not try to be smarter than the compiler C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 195. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible • The simpler the code, the faster it is likely to be • Do not try to be smarter than the compiler • Know the language, tools, and libraries C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 196. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible • The simpler the code, the faster it is likely to be • Do not try to be smarter than the compiler • Know the language, tools, and libraries • Know your hardware! C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 197. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible • The simpler the code, the faster it is likely to be • Do not try to be smarter than the compiler • Know the language, tools, and libraries • Know your hardware! • Bypass the kernel (100% user space code) C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
  • 198. • Keep the number of threads close (less or equal) to the number of available physical CPU cores • Separate IO threads from business logic thread (unless business logic is extremely lightweight) • Use fixed size lock free queues / busy spins to pass the data between threads • Use optimal algorithms/data structures and data locality principle • Precompute, use compile time instead of runtime whenever possible • The simpler the code, the faster it is likely to be • Do not try to be smarter than the compiler • Know the language, tools, and libraries • Know your hardware! • Bypass the kernel (100% user space code) • Measure performance… ALWAYS C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to develop system with low-latency constraints 79
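A minimal sketch of the "fixed size lock free queue / busy spin" bullet, assuming a single producer and a single consumer. This is a toy; a production queue would additionally pad head_ and tail_ onto separate cache lines to avoid false sharing:

#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template<typename T, std::size_t N>  // N must be a power of two
class spsc_queue {
public:
  bool try_push(const T& value) {    // called from the producer thread only
    const std::size_t head = head_.load(std::memory_order_relaxed);
    if (head - tail_.load(std::memory_order_acquire) == N) return false;  // full
    buffer_[head & (N - 1)] = value;
    head_.store(head + 1, std::memory_order_release);
    return true;
  }

  std::optional<T> try_pop() {       // called from the consumer thread only
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    if (head_.load(std::memory_order_acquire) == tail) return std::nullopt;  // empty
    T value = buffer_[tail & (N - 1)];
    tail_.store(tail + 1, std::memory_order_release);
    return value;
  }

private:
  std::array<T, N> buffer_{};
  std::atomic<std::size_t> head_{0};  // written only by the producer
  std::atomic<std::size_t> tail_{0};  // written only by the consumer
};

// Busy-spin consumer loop: while (running) if (auto v = queue.try_pop()) handle(*v);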
  • 199. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The most important recommendation 80
  • 200. Always measure your performance! C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency The most important recommendation 80
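One inexpensive way to act on this, sketched under the assumption that Google Benchmark is available (the library is not mentioned in the slides): wrap the code path of interest in a micro-benchmark and let the framework handle iteration counts and timing.

#include <benchmark/benchmark.h>
#include <cstddef>
#include <cstdint>
#include <vector>

static void BM_vector_fill(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<int> v;
    v.reserve(static_cast<std::size_t>(state.range(0)));
    for (std::int64_t i = 0; i < state.range(0); ++i) v.push_back(static_cast<int>(i));
    benchmark::DoNotOptimize(v.data());  // keep the compiler from removing the work
  }
}
BENCHMARK(BM_vector_fill)->Arg(1024);

BENCHMARK_MAIN();

Micro-benchmarks complement, rather than replace, the hardware-based black-box measurements recommended on the following slides.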
  • 201. • Always measure Release version cmake -DCMAKE_BUILD_TYPE=Release .. cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to measure the performance of your programs 81
  • 202. • Always measure Release version cmake -DCMAKE_BUILD_TYPE=Release .. cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. • Prefer hardware-based black-box performance measurements C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to measure the performance of your programs 81
  • 203. • Always measure Release version cmake -DCMAKE_BUILD_TYPE=Release .. cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. • Prefer hardware-based black-box performance measurements • If that is not possible, or you want to debug a specific performance issue, use a profiler • To gather meaningful stack traces, preserve the frame pointer cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" .. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to measure the performance of your programs 81
  • 204. • Always measure Release version cmake -DCMAKE_BUILD_TYPE=Release .. cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. • Prefer hardware-based black-box performance measurements • If that is not possible, or you want to debug a specific performance issue, use a profiler • To gather meaningful stack traces, preserve the frame pointer cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" .. • Familiarize yourself with Linux perf tools (xperf on Windows) and flame graphs • Use tools like Intel VTune C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to measure the performance of your programs 81
  • 205. • Always measure Release version cmake -DCMAKE_BUILD_TYPE=Release .. cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. • Prefer hardware-based black-box performance measurements • If that is not possible, or you want to debug a specific performance issue, use a profiler • To gather meaningful stack traces, preserve the frame pointer cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" .. • Familiarize yourself with Linux perf tools (xperf on Windows) and flame graphs • Use tools like Intel VTune • Verify the output assembly code C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency How to measure the performance of your programs 81
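A typical Linux invocation for the perf and flame-graph workflow above (my_app is a placeholder; the flame graph scripts come from Brendan Gregg's FlameGraph repository and are assumed to be on PATH):

# Record call-graph samples for the whole run, then browse them interactively
perf record -g -- ./my_app
perf report

# Or render the same samples as a flame graph (SVG)
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg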
  • 206. [Flame graph of stack samples (bash call stacks rendered as an SVG); the individual frame labels are not meaningfully recoverable as text] C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency Flamegraph 82