These are the slides from my keynote talk at CoreHard Autumn 2019. You can find the recording here: https://www.youtube.com/watch?v=nqf53MlnMpo. All my other talks can be found here: https://train-it.eu/resources.
---
This talk presents the C++ world as seen from the low-latency domain: a world where dynamic allocations are not welcome, C++ exceptions are hardly ever used, STL containers are often not enough, and developers often need to go down to the assembly level to verify that the code really does its best.
These are the slides from my 4Developers 2017 talk. You can find the recording here: https://www.youtube.com/watch?v=pU0VRMqM5vs. All my other talks can be found here: https://train-it.eu/resources.
---
This talk presents the C++ world as seen from the low-latency domain: a world where dynamic allocations are not welcome, C++ exceptions are hardly ever used, STL containers are often not enough, and developers often need to go down to the assembly level to verify that the code really does its best.
[Data Meetup] Data Science in Finance - Building a Quant ML pipeline (Data Science Society)
Georgi Kirov shares a common market-neutral statistical arbitrage framework. It will help showcase the many different ways to structure a systematic research project. From data reconciliation and signal backtesting to optimization and execution, what are some principled ways to evaluate and compare ML ideas? This process inevitably depends on the characteristics of a specific strategy, for instance, if it is liquidity-taking or liquidity-making.
Unlock Cost Savings for your IBM z Systems Environment (Precisely)
For many larger enterprises, the pressure to squeeze more performance out of existing mainframes, while avoiding or deferring major upgrades, is a top priority. Routine batch processing tasks such as sort, copy, merge, and compression take up a large and disproportionate share of CPU processing time and related mainframe resources. They are thus obvious areas to examine when seeking relief from mainframe cost and performance pressures.
When evaluating potential solutions, many organizations want to “try before they buy” to ensure strong ROI. We have developed tools to help you understand your batch processing and calculate your expected ROI.
Watch this on-demand webinar to hear about:
• How our Syncsort MFX solution drives cost savings
• What a “try before you buy” evaluation looks like
• How to analyze results for your specific situation
New Mainframe Developments and How Syncsort Helps You Exploit Them (Precisely)
Mainframes have been the workhorses of the corporate data center for more than half a century. More than 70% of Fortune 500 enterprises continue to use mainframes for their most crucial business functions. Syncsort has been the leader in mainframe innovations since the early years. Recently, IBM made two major announcements for their mainframe customers.
IBM has recently made two major IBM z announcements: a new Tailored Fit Pricing model and the next-generation z15 mainframe. In this on-demand webinar, we’ll help you understand the opportunities and challenges that the new pricing model and z15 systems present. We’ll also share exciting sorting technology innovations from Syncsort that drive cost savings on the new z15.
You will learn:
• How to evaluate the decision to move to Tailored Fit pricing or to stay on 4HRA pricing
• How Syncsort's Elevate MFSort (formerly MFX) reduces workloads on the new z15 mainframes
• How we plan to exploit IBM’s new instructions for sort acceleration
Build Blockchain dApps using JavaScript, Python and C - ATO.pdf (RussFustino)
Have we found nirvana for blockchain developers? This session focuses on building blockchain dApps (decentralized apps) and deploying them to the blockchain! The session covers getting started building dApps with PyTeal & Beaker, Reach, and C#. We will cover how to set up your development environment and walk through a simple app frontend and backend. Finally, we will look at one of the huge benefits of Reach: the built-in verification process. Reach provides automatic verification to ensure that your program does not lose, lock away, or overspend funds, guaranteeing that your applications are free from this entire category of errors. Also covered are building dApps with Python using PyTeal, and the AlgoStudio Visual Studio extension for C#.
In this session you will learn how to...
PyTeal
Use Reach to deploy on multiple blockchains
Set up development environment
Create a simple dApp
Verify a dApp
Algostudio Visual Studio extension for C#/.NET
Deliver Unrivaled End-User Experience With Confidence - How Synthetic Monitor... (DevOps.com)
DevOps engineers would love to measure the performance and availability of every page of a website or application. It’s easy enough to do technically. Just attach a Synthetic User Monitor to each page and then see the results in an easy-to-use data presentation tool. The snag is that it can be expensive since Synthetic User Monitoring tools usually come as part of a large Application Performance Monitoring package.
In this webinar, learn how you can use open source tools as a powerful option for Synthetic Monitoring and be on the road to building observability into your applications.
One of the most CPU-intensive and heavily used functions within every IT environment is sorting. Virtually every application needs to sort data, as well as copy information from one location to another. Utilization of zIIP engines for sort, copy, and compression provides an organization with additional processing capacity and can limit the expansion of the 4-Hour Rolling Average of LPAR Hourly MSU Utilization (4HRA), which can prevent an increase in Monthly License Charges (MLC).
View this IBM Systems Magazine webcast to learn:
• How to assess what's driving peaks in your 4HRA
• 5 advantages of offloading workloads to zIIP engines
• Examples of Fortune 500 companies maximizing their mainframe through zIIP exploitation
In this educational webinar, the heads of mainframe development and product management from Syncsort will discuss key mainframe optimization problems, opportunities and use cases for 2017, spanning DB2 and network management on z/OS, as well as new ways to save on your monthly IBM MLC charges and new options for long-standing mainframe issues.
Watch this webcast on-demand to learn:
o How the latest innovations for zIIP and sort will save you time and money each month
o How workload-centric database optimization changes the game for DB2 and IDMS
o New options for long-standing network management and IMS cost and resource issues
Speaker: Fausto Vaninetti, Consulting Systems Engineer at Cisco Systems, Inc.
The advent of all-flash storage, new regulations such as GDPR, and the constant pressure toward an always-available datacenter have made analytics for optimizing IT management an indispensable element for many companies. The new 32G Fibre Channel switches of the Cisco MDS 9000 series, combined with the most recent versions of the NX-OS operating system, provide integrated analytics and telemetry capabilities that can be exposed through the command-line interface, the Datacenter Network Manager graphical interface, or third-party tools. This talk aims to shed light on this new capability, unique of its kind, briefly exploring its use cases and implementation options.
DSL programmable engine for high frequency trading by Heiner Litz et al. (Pre... (Pasan De Silva)
The paper presented (H. Litz, C. Leber, and B. Geib, "DSL programmable engine for high frequency trading acceleration," in Proceedings of the Fourth Workshop on High Performance Computational Finance (WHPCF '11), ACM, New York, USA, November 13, 2011, pp. 31-38) addresses the domain of high-frequency trading acceleration using FPGA solutions from the perspective of an HFT firm.
Kubernetes is hard! Lessons learned taking our apps to Kubernetes - Eldad Ass... (Cloud Native Day Tel Aviv)
You might think taking your application to Kubernetes is easy. Just pack it in a Docker container, deploy, and you're done!
In reality, the challenges of taking your existing application to the cloud native environment of Kubernetes are huge! They require changes in the way your applications behave and the way you administer them.
Do you really know how to get up and running with your existing applications in Kubernetes?
In this talk I will share my lessons learned taking JFrog's existing applications, prepping and deploying them to Kubernetes.
I'll go over some best practices of preparing your application for Kubernetes with some examples for what we did.
Technology Edge in Algo Trading: Traditional Vs Automated Trading System Arch... (QuantInsti)
This presentation was delivered by QuantInsti director and co-founder Mr Sameer Kumar at a workshop on 'Technology Edge in Algorithmic Trading' organized at Finbridge Expo 2015, Mumbai on 14th March 2015.
This event was India’s first ever integrated financial service providers’ exhibition that brought together top mutual funds, insurance companies, leading brokers, banks or financial institutions, fund managers, decision makers along with thousands of retail financial market participants, distributors, investors and vendors.
In this workshop, Mr Sameer Kumar explained how the latest technology has changed the trading environment and how you can increase your profits by taking advantage of it. He thoroughly discussed how traditional trading systems evolved into automated trading systems over the past few years. The following topics are covered in this presentation:
1) Significance of Latency
2) Traditional Trading System Architecture
3) Automated Trading System Architecture
4) Low Latency
5) Current Scenario
6) Performance Degradation with Throughput
7) Causes of Degradation
8) Parallel computing
9) Network Effects
10) Latency by Distance
11) Spread Networks
12) Microbursts
13) Latency Breakdown
14) Technology Mix
15) Tips
16) Sample Solution Path
This presentation will help you understand traditional and automated trading system architecture, explain the significance of latency in an automated trading environment, and offer a few tips along with a sample solution path.
Nagios Conference 2014 - Paloma Galan - Monitoring Financial Protocols With N... (Nagios)
Paloma Galan's presentation on Monitoring Financial Protocols With Nagios.
The presentation was given during the Nagios World Conference North America held Oct 13th - Oct 16th, 2014 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/conference
GPGPU: What It Is and What It Is For. Alexander Titov. CoreHard Spring 2019 (corehard_by)
GPGPU is the use of a graphics processing unit (GPU) to perform general-purpose computations that are usually carried out by the central processing unit (CPU). Thanks to the GPU's large computational resources, this approach can speed up some applications by tens of times compared to a traditional CPU. Considering that GPUs are present in many modern devices, this approach can become a useful tool for a programmer who cares about the performance of their programs. The talk is an introduction to GPGPU technology. The presentation discusses the differences between CPUs and GPUs at the hardware level and explains how these differences led to different programming models for these devices. It covers the classes of problems that are accelerated well by GPGPU, and the cases when a GPU may turn out to be slower than a CPU. The talk does not focus on any particular GPGPU API (OpenCL, CUDA, etc.) and does not require prior knowledge of GPU or CPU hardware from the audience.
Big Data means big hardware, and the less of it we can use to do the job properly, the better the bottom line. Apache Kafka makes up the core of our data pipelines at many organizations, including LinkedIn, and we are on a perpetual quest to squeeze as much as we can out of our systems, from Zookeeper, to the brokers, to the various client applications. This means we need to know how well the system is running, and only then can we start turning the knobs to optimize it. In this talk, we will explore how best to monitor Kafka and its clients to assure they are working well. Then we will dive into how to get the best performance from Kafka, including how to pick hardware and the effect of a variety of configurations in both the broker and clients. We’ll also talk about setting up Kafka for no data loss.
Building Your Data Streams for all the IoT (DevOps.com)
Apache NiFi and OPC-UA have been game changers in the world of IoT, allowing you to automate the flow of data from IoT sensors to just about anywhere you want. In this presentation, Craig Hobbs of InfluxData will demonstrate how you can collect and process data streams of millions of values per second using an OPC-UA server and automate them into an InfluxDB stream processing engine using NiFi. You'll learn how to set up your workflow to gain a highly available, fast, scalable, and reliable data stream in a real-world application.
Introducing MFX for z/OS 2.1 & ZPSaver Suite (Precisely)
Syncsort MFX for z/OS 2.1 leverages the latest advancements in mainframe technology, including IBM’s zHPF architecture and deeper exploitation of zIIP engines, to help you overcome the challenges of modern mainframe systems.
Free Lunch is Over: Why is C++ so Important in the Modern World? (Mateusz Pusz)
"Don’t pay for what you don’t use" and "Don’t leave room for a language below C++ (except assembler)" are the C++ design and evolution priorities that have stuck with us since the beginning of the language. CPU clock frequencies stopped growing more than 10 years ago, and since that time our processor cores are no longer getting faster. Moreover, the demand for power savings is higher than ever, both for small form factor devices, like mobile phones, and for huge server farms. In the domains where each CPU cycle counts, C++ is the fastest and the most robust solution on the market. The C++ programming language is evolving faster now than ever before, and EPAM is an active contributor to this transformation. This talk will briefly summarize the history of C++, describe the benefits of the language, highlight the business domains where its usage is essential, and provide some insights into the future of the language and its standardization process.
A Physical Units Library for the Next C++ (Mateusz Pusz)
These are the slides from my CppCon 2020 talk. You can find the recording here: https://www.youtube.com/watch?v=7dExYGSOJzo. All my other talks can be found here: https://train-it.eu/resources.
---
During CppCon 2019, Mateusz provided an overview of the current solutions on the market as well as the challenges of implementing a modern C++ physical units library. This year's talk will focus on 'mp-units', the library developed by Mateusz and contributors, which is proposed for C++ standardization. During the tutorial, we will get familiar with the building blocks of the library's framework and its most important concepts. Numerous code examples will present how to use the library and solve real-life problems with it. The audience will learn how easy it is to extend it with new units, dimensions, or even whole new systems of quantities. Last but not least, Mateusz will provide a fair comparison of how well this library performs compared to other products on the market.
Rethinking the Way We do Templates in C++ (Mateusz Pusz)
These are the slides from my CppCon 2019 talk. You can find the recording here: https://www.youtube.com/watch?v=oNBnYhLxlTU. All my other talks can be found here: https://train-it.eu/resources.
---
Template metaprogramming is hard. If it is hard only for the library implementer, that is not so bad. The problem arises when it also affects the users of that library.
This talk summarizes my experience and thoughts gathered during the implementation of the Physical Units Library for C++. The way we do template metaprogramming now results in inscrutable compile-time errors and really long type names observed while debugging our code. That is mostly caused by aliasing class template specializations in non-trivial metaprogramming interface designs. Compilation times of such code also leave a lot of room for improvement, and the practices we chose to use in the Standard Library are often suboptimal. Taking the Rule of Chiel into account while designing templated code makes a huge difference in compile times. This talk presents a few simple examples (including practices from the C++ Standard Library) of achieving the same goal in different ways and provides benchmark results for the time needed to compile such source code.
These are the slides from my talk on C++ London meetup. You can find the recording here: https://www.youtube.com/watch?v=HsC1EoFjf8U. All my other talks can be found here: https://train-it.eu/resources.
---
C++ is the fastest programming language in the world, leaving no room for other languages below it (except assembler). That is why it gets more attention now than ever before. As a result, we observe an increasing speed of C++ language evolution. C++11 was a game changer, but it is already considered an "old" language. C++14 and C++17 provided many improvements that allow us to write portable, safer, and faster code in a shorter time. The resulting source code is easier to reason about and maintain. Moreover, C++20 is coming soon, and it is going to be a game changer again.
In my talk, I provide a few typical legacy code examples that most of us write every day. I describe how those samples evolve with the changes introduced in each C++ language release, from C++11 up to the major groundbreaking features of the upcoming C++20. During the talk, the audience will learn how much easier, more robust, and safer code becomes when written with the modern tools we get in every new C++ release.
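The kind of evolution the talk describes can be sketched like this (an illustrative example of my own, not taken from the slides; the helper names are hypothetical):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// C++98 style: explicit iterator types and a manual loop.
int sum_legacy(const std::vector<int>& v) {
    int sum = 0;
    for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it)
        sum += *it;
    return sum;
}

// C++11 style: auto and range-for express the same intent directly.
int sum_modern(const std::vector<int>& v) {
    int sum = 0;
    for (auto x : v) sum += x;
    return sum;
}

// C++17 style: structured bindings make map traversal readable.
std::string keys_joined(const std::map<std::string, int>& m) {
    std::string out;
    for (const auto& [key, value] : m) {
        (void)value;  // only the keys are needed here
        out += key;
    }
    return out;
}
```

Each revision of the same logic gets shorter and harder to get wrong, which is the point the talk makes release by release.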
Implementing a Physical Units Library for C++Mateusz Pusz
These are the slides from my keynote talk at C++ CoreHard Spring 2019. You can find the recording here: https://www.youtube.com/watch?v=hcgL8QBmh2I. All my other talks can be found here: https://train-it.eu/resources.
---
This talk will present the current state of my work on designing and implementing a Physical Units Library for C++. I will present all the challenges, design tradeoffs, and potential solutions to those problems. During the lecture, we will also see how new C++20 features help make the library interface easier to use, maintain, and extend. Among other things, we will see how we can benefit from class types provided as non-type template parameters, how the new class template argument deduction rules simplify the interfaces, and the full power of using concepts to constrain template types.
Effective replacement of dynamic polymorphism with std::variantMateusz Pusz
These are the slides from my ACCU 2019 talk. You can find the recording here: https://www.youtube.com/watch?v=JGYxOieiZnY. All my other talks can be found here: https://train-it.eu/resources.
---
This short talk presents how easy it is to replace some cases of dynamic polymorphism with std::variant. During the lecture, we will analyze and compare a few implementations of the same simple Finite State Machine. It turns out that variant-based code is not only much faster but also gives us the opportunity to define our interfaces and program flow much better. The talk ends with a discussion of the pros and cons of each approach and tries to give guidelines on when to use them.
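A minimal sketch of the variant-based idea (a hypothetical connection FSM of my own, not the example from the talk): each state is a plain struct, the current state is a std::variant, and transitions are ordinary function overloads — no virtual dispatch and no heap allocation.

```cpp
#include <cassert>
#include <variant>

// Hypothetical states of a connection FSM.
struct Disconnected {};
struct Connecting { int attempts = 0; };
struct Connected {};

using State = std::variant<Disconnected, Connecting, Connected>;

// Hypothetical events.
struct ConnectEvent {};
struct SuccessEvent {};

// Transitions: unknown (state, event) pairs simply keep the current state.
State on_event(const State& s, ConnectEvent) {
    if (std::holds_alternative<Disconnected>(s)) return Connecting{1};
    return s;
}
State on_event(const State& s, SuccessEvent) {
    if (std::holds_alternative<Connecting>(s)) return Connected{};
    return s;
}
```

Because every state is a value type stored inline in the variant, the compiler can see the whole transition table, which is where the speedup over virtual-function-based FSMs tends to come from.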
Implementing Physical Units Library for C++Mateusz Pusz
These are the slides from my 4Developers 2019 talk. You can find the recording here: https://www.youtube.com/watch?v=O1vlFZcaMKo. All my other talks can be found here: https://train-it.eu/resources.
---
This talk will present the current state of my work on designing and implementing a Physical Units Library for C++. I will present all the challenges, design tradeoffs, and potential solutions to those problems. During the lecture, we will also see how new C++20 features help make the library interface easier to use, maintain, and extend. Among other things, we will see how we can benefit from class types provided as non-type template parameters, how the new class template argument deduction rules simplify the interfaces, and the full power of using concepts to constrain template types.
C++ Concepts and Ranges - How to use them?Mateusz Pusz
These are the slides from my Meeting C++ 2018 talk. You can find the recording here: https://www.youtube.com/watch?v=pe05ZWdh0N0. All my other talks can be found here: https://train-it.eu/resources.
---
Most of the Concepts TS and the Ranges TS has been merged into the C++20 standard draft document. The talk will present the current design of those features and will provide suggestions on how to use them in our source code. The presentation is meant to inspire discussion on how we should use those long-awaited features while building tools and the C++ Standard Library.
Effective replacement of dynamic polymorphism with std::variantMateusz Pusz
These are the slides from my CppCon 2018 talk. You can find the recording here: https://www.youtube.com/watch?v=gKbORJtnVu8. All my other talks can be found here: https://train-it.eu/resources.
---
This short talk presents how easy it is to replace some cases of dynamic polymorphism with std::variant. During the lecture, we will analyze and compare two implementations of the same simple Finite State Machine. It turns out that variant-based code is not only much faster but also gives us the opportunity to define our interfaces and program flow much better. The talk ends with a discussion of the pros and cons of each approach and tries to give guidelines on when to use them.
Git, CMake, Conan - How to ship and reuse our C++ projects?Mateusz Pusz
These are the slides from my CppCon 2018 talk. You can find the recording here: https://www.youtube.com/watch?v=S4QSKLXdTtA. All my other talks can be found here: https://train-it.eu/resources.
---
Git and CMake are already established standards in our community. However, it is uncommon to see them being used in an efficient way. As a result, many C++ projects have big problems with either importing other dependencies or making themselves easy to import by others. It gets even worse when we need to build many different configurations of our package on one machine.
That presentation lists and addresses the problems of build system and packaging that we have with large, multi-platform, C++ projects that often have many external and internal dependencies. The talk will present how proper use of CMake and Conan package manager can address those use cases. It will also describe current issues with the cooperation of those tools.
If you attended or have seen my talk at C++Now 2018, this time you can expect much more information on Conan and package creation using that tool. I will also describe how the integration of CMake and Conan has changed over the last few months.
These are the slides from my C++Now 2018 talk. You can find the recording here: https://www.youtube.com/watch?v=y7PBciQp0B8. All my other talks can be found here: https://train-it.eu/resources.
---
A presentation of features already voted into the C++20 Standard Draft at the Toronto, Albuquerque, and Jacksonville ISO C++ Committee meetings, as well as an overview of other really promising proposals that have a high chance of arriving in C++20.
Pointless Pointers - How to make our interfaces efficient?Mateusz Pusz
These are the slides from my code::dive 2017 talk. You can find the recording here: https://www.youtube.com/watch?v=qrifyjQW9gA. All my other talks can be found here: https://train-it.eu/resources.
---
C++ is not C. C++ developers too often forget about that, and the effects are often disastrous. nullptr dereferences, buffer overflows, and resource leaks are problems often seen in the bug trackers of C++ applications. Does it have to be like that? The talk presents a few simple rules, tested in production, that will make most of those issues go away and never appear again in C++ software. Interested? Come and see :-)
These are the slides from my code::dive 2016 talk. You can find the recording here: https://www.youtube.com/watch?v=yCfFt061SU4. All my other talks can be found here: https://train-it.eu/resources.
---
Writing fast C++ applications is a really complex subject. It often turns out that deep but isolated knowledge of the ISO C++ standard and of the algorithmic complexity of operations does not guarantee success. Often the bottleneck of our applications happens to be the performance of the computer's memory, or our code's misuse of it. A lack of knowledge in that area can ruin all our ambitions to create a high-performance implementation.
std::shared_ptr<T> - (not so) Smart hammer for every pointy nailMateusz Pusz
These are the slides from my code::dive 2016 talk. You can find the recording here: https://www.youtube.com/watch?v=hHQS-Q7aMzg. All my other talks can be found here: https://train-it.eu/resources.
---
A C++ rule of thumb is "you do not pay for what you do not use". However, it turns out that this is not the case for some of the utilities in the C++ Standard Library. The key example here is the favorite tool of many developers – std::shared_ptr. The talk describes the problems related to it in detail. It also tries to answer the question of how they could have been avoided.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We ended with a lovely workshop in which the participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply AI to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could be beneficial for or limiting to your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already got working for real.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Striving for ultimate Low Latency
1. Mateusz Pusz
November 30, 2019
Striving for ultimate Low Latency
INTRODUCTION TO DEVELOPMENT OF LOW LATENCY SYSTEMS
2. Special thanks to Carl Cook and Chris Kohlhoff
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
ISO C++ Committee (WG21) Study Group 14 (SG14)
5. Latency vs Throughput
Latency is the time required to perform some action or to produce some result. Measured in units of time: hours, minutes, seconds, nanoseconds, or clock periods.
Throughput is the number of such actions executed or results produced per unit of time. Measured in units of whatever is being produced per unit of time.
8. What do we mean by Low Latency?
Low Latency allows human-unnoticeable delays between an input being processed and the corresponding output, providing real-time characteristics.
Especially important for internet connections utilizing services such as trading, online gaming, and VoIP.
12. Why do we strive for Low Latency?
• In VoIP, substantial delays between inputs from conversation participants may impair their communication
• In online gaming, a player with a high-latency internet connection may show slow responses in spite of superior tactics or appropriate reaction time
• Within capital markets, the proliferation of algorithmic trading requires firms to react to market events faster than the competition to increase the profitability of trades
18. High-Frequency Trading (HFT)
A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds
-- Investopedia
• Using complex algorithms to analyze multiple markets and execute orders based on market conditions
• Buying and selling securities many times over a period of time (often hundreds of times an hour)
• Done to profit from time-sensitive opportunities that arise during trading hours
• Implies high turnover of capital (i.e. one's entire capital or more in a single day)
• Typically, the traders with the fastest execution speeds are the most profitable
19. Market data processing
28. How fast do we do?
ALL SOFTWARE APPROACH: 1-10 us
ALL HARDWARE APPROACH: 100-1000 ns
• An average human eye blink takes 350 000 us (1/3 s)
• Millions of orders can be traded in that time
31. What if something goes wrong?
KNIGHT CAPITAL
• In 2012 it was the largest trader in U.S. equities
• Market share: 17.3% on NYSE, 16.9% on NASDAQ
• Had approximately $365 million in cash and equivalents
• Average daily trading volume: 3.3 billion trades, trading over 21 billion dollars
• Pre-tax loss of $440 million in 45 minutes
-- LinkedIn
37. C++ often not the most important part of the system
1 Low Latency network
2 Modern hardware
3 BIOS profiling
4 Kernel profiling
5 OS profiling
41. Spin, pin, and drop-in
SPIN
• Don't sleep
• Don't context switch
• Prefer single-threaded scheduling
• Disable locking and thread support
• Disable power management
• Disable C-states
• Disable interrupt coalescing
PIN
• Assign CPU affinity
• Assign interrupt affinity
• Assign memory to NUMA nodes
• Consider the physical location of NICs
• Isolate cores from general OS use
• Use a system with a single physical CPU
DROP-IN
• Choose NIC vendors based on performance and availability of drop-in kernel-bypass libraries
• Use the kernel-bypass library
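The "spin" idea above can be sketched in a few lines (my own illustrative names, not code from the slides): instead of blocking on a mutex or condition variable, which sleeps and context switches, the consumer busy-spins on an atomic flag. In a real system the producer and consumer would run on separate, pinned cores.

```cpp
#include <atomic>
#include <cassert>

// Shared state: a payload published through an atomic flag.
std::atomic<int> payload{0};
std::atomic<bool> ready{false};

void produce() {
    payload.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);  // publish the payload
}

int consume_spin() {
    while (!ready.load(std::memory_order_acquire)) {
        // busy-spin: no syscall, no scheduler wake-up latency
    }
    return payload.load(std::memory_order_relaxed);
}
```

The cost is a core burned at 100% while waiting; the benefit is that the wake-up path is a single cache-line transfer rather than a trip through the kernel scheduler.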
46. Characteristics of Low Latency software
• Typically only a small part of the code is really important (a fast/hot path)
• This code is not executed often
• When it is executed it has to
– start and finish as soon as possible
– have predictable and reproducible performance (low jitter)
• Multithreading increases latency
– we need low latency, not throughput
– concurrency (even on different cores) thrashes CPU caches above L1, shares the memory bus, shares IO, shares the network
• Mistakes are really expensive
– good error checking and recovery is mandatory
– one second is 4 billion CPU instructions (a lot can happen in that time)
47. How to develop software that has predictable performance?
48. It turns out that the more important question here is...
49. How NOT to develop software that has predictable performance?
51. How NOT to develop software that has predictable performance?
• In a Low Latency system we care a lot about WCET (Worst-Case Execution Time)
• In order to limit WCET we should limit the usage of specific C++ language features
• A task not only for developers but also for code architects
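Since WCET is about the worst run rather than the average one, a benchmark harness for this kind of code should track the extremes of the latency distribution. A minimal sketch (the helper name is mine, not from the talk):

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <utility>

// Run f() repeatedly and return {best, worst} observed latency in
// nanoseconds. Reporting the worst case exposes jitter that an average
// would hide.
template <typename F>
std::pair<std::int64_t, std::int64_t> best_worst_ns(F&& f, int iterations) {
    using clock = std::chrono::steady_clock;
    std::int64_t best = INT64_MAX;
    std::int64_t worst = 0;
    for (int i = 0; i < iterations; ++i) {
        const auto start = clock::now();
        f();
        const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                            clock::now() - start).count();
        if (ns < best) best = ns;
        if (ns > worst) worst = ns;
    }
    return {best, worst};
}
```

In practice the gap between best and worst is often far more interesting than either number alone: a wide gap means some language or OS feature is injecting unpredictable cost into the hot path.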
52. C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
The list of things to avoid on fast path
18
53. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>)
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
The list of things to avoid on fast path
18
54. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>)
2 Throwing exceptions on a likely code path
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
The list of things to avoid on fast path
18
55. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>)
2 Throwing exceptions on a likely code path
3 Dynamic polymorphism
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
The list of things to avoid on fast path
18
56. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>)
2 Throwing exceptions on a likely code path
3 Dynamic polymorphism
4 Multiple inheritance
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
The list of things to avoid on fast path
18
57. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>)
2 Throwing exceptions on a likely code path
3 Dynamic polymorphism
4 Multiple inheritance
5 RTTI
C++ CoreHard Autumn 2019 | Striving for ultimate Low Latency
The list of things to avoid on fast path
18
58. 1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>)
2 Throwing exceptions on a likely code path
3 Dynamic polymorphism
4 Multiple inheritance
5 RTTI
6 Dynamic memory allocations
The list of things to avoid on the fast path
template<class T>
class shared_ptr;
• Smart pointer that retains shared ownership of an object through a pointer
• Several shared_ptr objects may own the same object
• The shared object is destroyed and its memory deallocated when the last remaining shared_ptr
owning that object is either destroyed or assigned another pointer via operator= or reset()
• Supports user provided deleter
std::shared_ptr<T>
Too often overused by C++ programmers
void foo()
{
std::unique_ptr<int> ptr{new int{1}};
// some code using 'ptr'
}
void foo()
{
std::shared_ptr<int> ptr{new int{1}};
// some code using 'ptr'
}
Question: What is the difference here?
• Shared state
– performance + memory footprint
• Mandatory synchronization
– performance
• Type Erasure
– performance
• std::weak_ptr<T> support
– memory footprint
• Aliasing constructor
– memory footprint
Key std::shared_ptr<T> issues
More info on code::dive 2016
• Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++
exceptions
• ... if they are not thrown
• Throwing an exception can take significant and nondeterministic time
• Advantages of C++ exceptions usage
– cannot be ignored!
– simplify interfaces
– (if not thrown) actually can improve application performance
– make source code of likely path easier to reason about
C++ exceptions
Not using C++ exceptions is not an excuse to write exception-unsafe code!
Polymorphism
• this pointer adjustments needed to call a member function (for non-empty base classes)
• Virtual inheritance as an answer
• virtual in C++ means "determined at runtime"
• Extra indirection to access data members
Always prefer composition over inheritance!
Multiple inheritance (diagram: the diamond of dread)
class base : noncopyable {
public:
virtual ~base() = default;
virtual void foo() = 0;
};
class derived : public base {
public:
void foo() override;
void boo();
};
void foo(base& b)
{
derived* d = dynamic_cast<derived*>(&b);
if(d) {
d->boo();
}
}
• Traversing an inheritance tree
• Comparisons
void foo(base& b)
{
if(typeid(b) == typeid(derived)) {
derived* d = static_cast<derived*>(&b);
d->boo();
}
}
• Only one comparison of std::type_info
• Often only one runtime pointer compare
RunTime Type Identification (RTTI)
Often the sign of a smelly design
• General purpose operation
• Nondeterministic execution performance
• Causes memory fragmentation
• May fail (error handling is needed)
• Memory leaks possible if not properly handled
Dynamic memory allocations
• Address specific needs (functionality and hardware constraints)
• Typically low number of dynamic memory allocations
• Data structures needed to manage big chunks of memory
template<typename T> struct pool_allocator {
T* allocate(std::size_t n);
void deallocate(T* p, std::size_t n);
};
using pool_string = std::basic_string<char, std::char_traits<char>, pool_allocator<char>>;
Custom allocators to the rescue
Preallocation makes allocator jitter more stable, helps keep related data together, and avoids long-term fragmentation.
• Prevent dynamic memory allocation for the (common) case of dealing with small objects
class sso_string {
char* data_ = u_.sso_;
size_t size_ = 0;
union {
char sso_[16] = "";
size_t capacity_;
} u_;
public:
[[nodiscard]] size_t capacity() const
{
return data_ == u_.sso_ ? sizeof(u_.sso_) - 1 : u_.capacity_;
}
// ...
};
Small Object Optimization (SOO / SSO / SBO)
template<std::size_t MaxSize>
class inplace_string {
std::array<value_type, MaxSize + 1> chars_;
public:
// string-like interface
};
struct db_contact {
inplace_string<7> symbol;
inplace_string<15> name;
inplace_string<15> surname;
inplace_string<23> company;
};
No dynamic allocation
Guaranteed no dynamic memory allocations or pointer indirections, at the cost of possibly larger memory usage
1 Use tools that improve efficiency without sacrificing performance
2 Use compile time wherever possible
3 Know your hardware
4 Clearly isolate cold code from the fast path
The list of things to do on the fast path
• static_assert()
• Automatic type deduction
• Type aliases
• Move semantics
• noexcept
• constexpr
• Lambda expressions
• type_traits
• std::string_view
• Variadic templates
• and many more...
Examples of safe-to-use C++ features
The fastest programs are those that do nothing
Do you agree?
int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
foo(factorial(n)); // not compile-time
foo(factorial(4)); // not compile-time
template<int N>
struct Factorial {
static const int value =
N * Factorial<N - 1>::value;
};
template<>
struct Factorial<0> {
static const int value = 1;
};
int array[Factorial<4>::value];
const int precalc_values[] = {
Factorial<0>::value,
Factorial<1>::value,
Factorial<2>::value
};
foo(Factorial<4>::value); // compile-time
// foo(Factorial<n>::value); // compile-time error
Compile-time calculations (C++98)
• 2 separate implementations
– hard to keep in sync
• Template metaprogramming is hard!
• Easier to precalculate in Excel and hardcode
results in the code ;-)
const int precalc_values[] = {
1,
1,
2,
6,
24,
120,
720,
// ...
};
Just too hard!
Declares that it is possible to evaluate the value of the function or
variable at compile time
• constexpr specifier used in an object declaration implies const
• constexpr specifier used in a function or static member variable declaration implies inline
• Such variables and functions can be used in compile time constant expressions (provided that
appropriate function arguments are given)
constexpr int factorial(int n);
std::array<int, factorial(4)> array;
constexpr specifier
• A constexpr function must not be virtual
• The function body may contain only
– static_assert declarations
– alias declarations
– using declarations and directives
– exactly one return statement with an
expression that does not mutate any of the
objects
constexpr int factorial(int n)
{
return n <= 1 ? 1 : (n * factorial(n - 1));
}
std::array<int, factorial(4)> array;
constexpr std::array<int, 3> precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24, "Error");
foo(factorial(4)); // maybe compile-time
foo(factorial(n)); // not compile-time
Compile-time calculations (C++11)
• A constexpr function must not be virtual
• The function body must not contain
– a goto statement and labels other than case
and default
– a try block
– an asm declaration
– variable of non-literal type
– variable of static or thread storage duration
– uninitialized variable
– evaluation of a lambda expression
– reinterpret_cast and dynamic_cast
– changing an active member of a union
– a new or delete expression
constexpr int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
std::array<int, factorial(4)> array;
constexpr std::array<int, 3> precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24, "Error");
foo(factorial(4)); // maybe compile-time
foo(factorial(n)); // not compile-time
Compile-time calculations (C++14)
• A constexpr function must not be virtual
• The function body must not contain
– a goto statement and labels other than case
and default
– a try block
– an asm declaration
– variable of non-literal type
– variable of static or thread storage duration
– uninitialized variable
– reinterpret_cast and dynamic_cast
– changing an active member of a union
– a new or delete expression
constexpr int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
std::array<int, factorial(4)> array;
constexpr std::array precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24);
foo(factorial(4)); // maybe compile-time
foo(factorial(n)); // not compile-time
Compile-time calculations (C++17)
• The function body must not contain
– a goto statement and labels other than case
and default
– variable of non-literal type
– variable of static or thread storage duration
– uninitialized variable
– reinterpret_cast
constexpr int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
std::array<int, factorial(4)> array;
constexpr std::array precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24);
foo(factorial(4)); // maybe compile-time
foo(factorial(n)); // not compile-time
Compile-time calculations (C++20)
• The consteval specifier declares a function to
be an immediate function
• Every call must produce a compile time
constant expression
• An immediate function is a constexpr function
consteval int factorial(int n)
{
int result = n;
while(n > 1)
result *= --n;
return result;
}
std::array<int, factorial(4)> array;
constexpr std::array precalc_values = {
factorial(0),
factorial(1),
factorial(2)
};
static_assert(factorial(4) == 24);
foo(factorial(4)); // compile-time
// foo(factorial(n)); // compile-time error
Compile-time calculations (C++20): consteval
template<typename T>
inline constexpr bool is_array = false;
template<typename T>
inline constexpr bool is_array<T[]> = true;
static_assert(is_array<int> == false);
static_assert(is_array<int[]>);
template<typename T>
class smart_ptr {
void destroy() noexcept
{
if constexpr(is_array<T>)
delete[] ptr_;
else
delete ptr_;
}
// ...
};
Compile-time dispatch (C++17)
Using variable templates and if constexpr decreases compilation time
struct X {
int a, b, c;
int id;
};
struct X {
int a, b, c;
std::string id;
};
void foo(const std::vector<X>& a1, std::vector<X>& a2)
{
memcpy(a2.data(), a1.data(), a1.size() * sizeof(X));
}
Ooops!!! (X with std::string id is not trivially copyable; memcpy is undefined behavior)
Type traits: A negative overhead abstraction
struct X {
int a, b, c;
int id;
};
struct X {
int a, b, c;
std::string id;
};
movq %rsi, %rax
movq 8(%rdi), %rdx
movq (%rdi), %rsi
cmpq %rsi, %rdx
je .L1
movq (%rax), %rdi
subq %rsi, %rdx
jmp memmove
// 100 lines of assembly code
void foo(const std::vector<X>& a1, std::vector<X>& a2)
{
std::copy(begin(a1), end(a1), begin(a2));
}
Type traits: A negative overhead abstraction
struct X {
int a, b, c;
int id;
};
struct X {
int a, b, c;
std::string id;
};
static_assert(std::is_trivially_copyable_v<X>);

// X with int id:
// Compiler returned: 0

// X with std::string id:
// <source>:9:20: error: static assertion failed
//  9 | static_assert(std::is_trivially_copyable_v<X>);
//    |               ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
// Compiler returned: 1
Type traits: Compile-time contract
Enforce compile-time contracts and class invariants with contracts or static_asserts
enum class side { BID, ASK };
class order_book {
template<side S>
class book_side {
using compare = std::conditional_t<S == side::BID, std::greater<>, std::less<>>;
std::vector<price> levels_;
public:
void insert(order o)
{
const auto it = lower_bound(begin(levels_), end(levels_), o.price, compare{});
if(it != end(levels_) && *it != o.price)
levels_.insert(it, o.price);
// ...
}
bool match(price p) const
{
return compare{}(levels_.back(), p);
}
// ...
};
book_side<side::BID> bids_;
book_side<side::ASK> asks_;
// ...
};
Type traits: Compile-time branching
constexpr int array_size = 10'000;
int array[array_size][array_size];
for(auto i = 0L; i < array_size; ++i) {
for(auto j = 0L; j < array_size; ++j) {
array[j][i] = i + j;
}
}
What is wrong here?
Reckless cache usage can cost you a lot of performance!
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz
Stepping: 10
CPU MHz: 1700.035
CPU max MHz: 1700,0000
CPU min MHz: 400,0000
BogoMIPS: 3799.90
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-7
Quiz: How much slower is the bad case?
• Similar
• < 2x
• < 5x
• < 10x
• < 20x
• < 50x
• < 100x
• >= 100x
Please everybody stand up :-)
• GCC 7.4.0: column-major 2490 ms, row-major 51 ms, ratio 47x
• GCC 9.2.1: column-major 51 ms, row-major 50 ms, ratio similar :-)
Quiz: How much slower is the bad case?
• Program optimization approach motivated by cache locality
• Focuses on data layout
• Results in objects decomposition
DOD (Data-Oriented Design)
Keep data used together close to each other
struct A {
char c;
short s;
int i;
double d;
static double dd;
};
static_assert( sizeof(A) == ??);
static_assert(alignof(A) == ??);
struct B {
char c;
double d;
short s;
int i;
static double dd;
};
static_assert( sizeof(B) == ??);
static_assert(alignof(B) == ??);
Member order might be important
struct A {
char c; // size=1, alignment=1, padding=1
short s; // size=2, alignment=2, padding=0
int i; // size=4, alignment=4, padding=0
double d; // size=8, alignment=8, padding=0
static double dd;
};
static_assert( sizeof(A) == 16);
static_assert(alignof(A) == 8);
struct B {
char c; // size=1, alignment=1, padding=7
double d; // size=8, alignment=8, padding=0
short s; // size=2, alignment=2, padding=2
int i; // size=4, alignment=4, padding=0
static double dd;
};
static_assert( sizeof(B) == 24);
static_assert(alignof(B) == 8);
Member order might be important
In order to satisfy alignment requirements of all non-static members of a class, padding may be inserted after some of its members
struct A {
char c; // size=1, alignment=1, padding=1
short s; // size=2, alignment=2, padding=0
int i; // size=4, alignment=4, padding=0
double d; // size=8, alignment=8, padding=0
static double dd;
};
static_assert( sizeof(A) == 16);
static_assert(alignof(A) == 8);
using opt = std::optional<A>;
static_assert( sizeof(opt) == 24);
static_assert(alignof(opt) == 8);
using array = std::array<opt, 256>;
static_assert( sizeof(array) == 24 * 256);
static_assert(alignof(array) == 8);
Be aware of alignment side effects
Be aware of the implementation details of the tools you use every day
struct A {
char c; // size=1, alignment=1, padding=1
short s; // size=2, alignment=2, padding=0
int i; // size=4, alignment=4, padding=0
double d; // size=8, alignment=8, padding=0
static double dd;
};
static_assert( sizeof(A) == 16);
static_assert(alignof(A) == 8);
struct C {
char c;
double d;
short s;
static double dd;
int i;
} __attribute__((packed));
static_assert( sizeof(C) == 15);
static_assert(alignof(C) == 1);
On modern hardware this may be faster than the aligned structure
Packing
Not portable! May be slower or even crash
69
166. L1 cache reference 0.5 ns
Branch misprediction 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns
Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
Read 4K randomly from SSD 150,000 ns 0.15 ms ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 0.25 ms
Round trip within same datacenter 500,000 ns 0.5 ms
Read 1 MB sequentially from SSD 1,000,000 ns 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150 ms
https://gist.github.com/jboner/2841832
Latency Numbers Every Programmer Should Know
70
167. class vector_downward {
uint8_t *make_space(size_t len) {
if(len > static_cast<size_t>(cur_ - buf_)) {
auto old_size = size();
auto largest_align = AlignOf<largest_scalar_t>();
reserved_ += (std::max)(len, growth_policy(reserved_));
// Round up to avoid undefined behavior from unaligned loads and stores.
reserved_ = (reserved_ + (largest_align - 1)) & ~(largest_align - 1);
auto new_buf = allocator_.allocate(reserved_);
auto new_cur = new_buf + reserved_ - old_size;
memcpy(new_cur, cur_, old_size);
cur_ = new_cur;
allocator_.deallocate(buf_);
buf_ = new_buf;
}
cur_ -= len;
// Beyond this, signed offsets may not have enough range:
// (FlatBuffers > 2GB not supported).
assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE);
return cur_;
}
// ...
};
What is wrong here?
71
169. class vector_downward {
uint8_t *make_space(size_t len) {
if(len > static_cast<size_t>(cur_ - buf_))
reallocate(len);
cur_ -= len;
// Beyond this, signed offsets may not have enough range:
// (FlatBuffers > 2GB not supported).
assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE);
return cur_;
}
void reallocate(size_t len) {
auto old_size = size();
auto largest_align = AlignOf<largest_scalar_t>();
reserved_ += (std::max)(len, growth_policy(reserved_));
// Round up to avoid undefined behavior from unaligned loads and stores.
reserved_ = (reserved_ + (largest_align - 1)) & ~(largest_align - 1);
auto new_buf = allocator_.allocate(reserved_);
auto new_cur = new_buf + reserved_ - old_size;
memcpy(new_cur, cur_, old_size);
cur_ = new_cur;
allocator_.deallocate(buf_);
buf_ = new_buf;
}
// ...
};
Separate the code into hot and cold paths
72
171. class vector_downward {
uint8_t *make_space(size_t len) {
if(len > static_cast<size_t>(cur_ - buf_))
reallocate(len);
cur_ -= len;
// Beyond this, signed offsets may not have enough range:
// (FlatBuffers > 2GB not supported).
assert(size() < FLATBUFFERS_MAX_BUFFER_SIZE);
return cur_;
}
// ...
};
Code is data too!
Separate the code into hot and cold paths
Performance improvement of 20% thanks to better inlining
73
177. // WRONG
if(expensiveCheck() && fastCheck()) { /* ... */ }
// GOOD
if(fastCheck() && expensiveCheck()) { /* ... */ }
Expression short-circuiting
Bail out as early as possible and continue on the fast path
76
180. int foo(int i)
{
int k = 0;
for(int j = i; j < i + 10; ++j)
++k;
return k;
}
int foo(unsigned i)
{
int k = 0;
for(unsigned j = i; j < i + 10; ++j)
++k;
return k;
}
foo(int):
mov eax, 10
ret
foo(unsigned int):
cmp edi, -10
sbb eax, eax
and eax, 10
ret
Integer arithmetic
Integer arithmetic differs for signed and unsigned integral types
77
186. • Avoid using std::list, std::map, etc.
• std::unordered_map is also broken
• Prefer std::vector
– also as the underlying storage for hash tables or flat maps
– consider using tsl::hopscotch_map, Swiss Tables, F14 hash map, or similar
• Preallocate storage
– free list
– plf::colony
• Consider storing only pointers to objects in the container
• Consider storing expensive hash values with the key
• Limit the number of type conversions
How to develop systems with low-latency constraints
78
198. • Keep the number of threads close (less or equal) to the number of available physical CPU cores
• Separate IO threads from the business logic thread (unless the business logic is extremely lightweight)
• Use fixed-size lock-free queues / busy spinning to pass data between threads
• Use optimal algorithms/data structures and data locality principle
• Precompute, use compile time instead of runtime whenever possible
• The simpler the code, the faster it is likely to be
• Do not try to be smarter than the compiler
• Know the language, tools, and libraries
• Know your hardware!
• Bypass the kernel (100% user-space code)
• Measure performance… ALWAYS
How to develop systems with low-latency constraints
79
200. Always measure your performance!
The most important recommendation
80
205. • Always measure the Release version
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
• Prefer hardware-based black-box performance measurements
• If that is not possible, or you want to debug a specific performance issue, use a profiler
• To gather meaningful stack traces, preserve the frame pointer
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" ..
• Familiarize yourself with Linux perf tools (xperf on Windows) and flame graphs
• Use tools like Intel VTune
• Verify output assembly code
How to measure the performance of your programs
81