Optimizing for performance and reducing latency is a hard problem. Typical approaches include choosing different algorithms and data structures, improving SQL queries, adding a cache, serving requests asynchronously, or applying low-level optimizations that require a deep understanding of the OS, kernel, compiler, or network stack. The engineering effort is usually nontrivial, and you'll see tangible results only if you're lucky.
That said, some performance optimization techniques take only a few lines of code, and some even exist in the built-in library, yet they can lead to surprisingly noticeable results. One of these techniques is "fail fast, retry soon". These techniques are often neglected or taken for granted.
In distributed systems, a service or a database consists of a fleet of nodes that function as one unit. It is not uncommon for some nodes to go down, usually for a short time. When this occurs, failures can surface on the client side and can lead to an outage. To build resilient systems and reduce the probability of failure, we're going to explore three topics: timeouts, backoff, and jitter. We'll talk about which timeout to set, the pitfalls of retries, how backoff improves resource utilization, and how jitter reduces congestion. Furthermore, we're going to see an adaptive mechanism that adjusts these configurations dynamically.
This is inspired by a real production use case where DynamoDB p99 and max latency went down from > 10s to ~500ms after employing these three techniques: timeouts, backoff, and jitter.
AWS articles, specifically M. Brooker's writings, and the SDKs' code have been great resources for diving into these techniques:
- Timeouts, retries and backoff with jitter in the AWS Builder's Library, 2019 (https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/)
- Exponential Backoff and Jitter on the AWS Architecture Blog, 2016 (https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/)
- Fixing retries with token buckets and circuit breakers, Marc's Blog, 2022 (https://brooker.co.za/blog/2022/02/28/retries.html)
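The three techniques named above (timeouts, backoff, and jitter) fit in a few lines. The following is a minimal sketch, not the code from the talk or from any AWS SDK: `backoff_delays`, `call_with_retries`, the `operation(timeout)` callable, and all default values are hypothetical and chosen only for illustration. It assumes the operation enforces the timeout itself and raises `TimeoutError` or `ConnectionError` on failure, and it uses capped exponential backoff with "full jitter", the variant described in the AWS Architecture Blog article listed above.

```python
import random
import time


def backoff_delays(base=0.05, cap=2.0, attempts=5, rng=random.random):
    """Yield one sleep duration (seconds) per retry.

    Capped exponential backoff with full jitter. All defaults are
    illustrative assumptions, not recommended production values.
    """
    for attempt in range(attempts):
        # Exponential growth, capped so a wait never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: scale by a random factor in [0, 1) so that many
        # clients retrying at once do not hit the service in lockstep.
        yield rng() * ceiling


def call_with_retries(operation, timeout=0.5, retries=3,
                      base=0.05, cap=2.0, sleep=time.sleep):
    """Call `operation(timeout)`, retrying on timeout or connection errors.

    `operation` is a hypothetical callable that enforces the timeout
    itself (fail fast). At most `retries` retries follow the first try.
    """
    delays = backoff_delays(base=base, cap=cap, attempts=retries)
    while True:
        try:
            return operation(timeout)
        except (TimeoutError, ConnectionError):
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: surface the last error
            sleep(delay)  # retry soon, but not immediately
```

A short timeout makes a request to a struggling node fail fast, and the jittered backoff spaces out the retries. Retrying is only safe for idempotent operations, and the retry count should stay small so that retries do not amplify load during an outage.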
Deadlock happens when two threads are waiting for a mutex owned by the other (circular deadlock between multiple threads is also possible). Therefore, we need to check for deadlock only when a thread fails to lock a mutex. At that point, the Thread Manager needs to suspend all threads and take over to perform a cycle check on mutex dependency. Finding such a cycle is easily done by performing a tree traversal of the dependencies, and marking threads and mutexes along the way. Using this method, we can detect deadlock and identify all threads and mutexes involved in the deadlock.
Optimizing NoSQL Performance Through ObservabilityScyllaDB
ScyllaDB has the potential to deliver impressive performance and scalability. The better you understand how it works, the more you can squeeze out of it. But before you squeeze, make sure you know what to monitor!
Watch our experienced Postgres developer work through monitoring and performance strategies that help him understand what mistakes he’s made moving to NoSQL. And learn with him as our database performance expert offers friendly guidance on how to use monitoring and performance tuning to get his sample Rust application on the right track.
This webinar focuses on using monitoring and performance tuning to discover and correct mistakes that commonly occur when developers move from SQL to NoSQL. For example:
- Common issues getting up and running with the monitoring stack
- Using the CQL optimizations dashboard
- Common issues causing high latency in a node
- Common issues causing replica imbalance
- What a healthy system looks like in terms of memory
- Key metrics to keep an eye on
This isn’t “Death-by-Powerpoint.” We’ll walk through problems encountered while migrating a real application from Postgres to ScyllaDB – and try to fix them live as well.
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingScyllaDB
Discuss the core tradeoffs and considerations involved in order-free and ordered stream processing. Brian Taylor walks through the pros and cons of three different approaches: no data dependency, deferred inter-event data dependency, and streaming inter-event data dependency.
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...ScyllaDB
We start by setting up a common ground introducing why relational databases fall short, addressing common EDA characteristics such as the need for real-time response times and schemaless approaches to address recurring changes to adapt and on-board new use cases. Next, interact with a sample Rust-based application: a social network app demonstrating an integration of both ScyllaDB and Redpanda.
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...ScyllaDB
Discover how to avoid common pitfalls when shifting to an event-driven architecture (EDA) in order to boost system recovery and scalability. We cover Kafka Schema Registry, in-broker transformations, event sourcing, and more.
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
See where an RDBMS-pro’s intuition leads him astray – and learn practical tips for the data modeling transition
ScyllaDB has the potential to deliver impressive performance and scalability. The better you understand how it works, the more you can squeeze out of it. However, developers new to high-performance NoSQL intuitively shoot themselves in the foot with respect to things like table design, query design, indexing, and partitioning.
Watch where our experienced Postgres developer intuitively falls into traps that hurt performance and scalability. And learn with him as our database performance expert offers friendly guidance on navigating all the unexpected behaviors that tend to trip up RDBMS experts.
This webinar focuses on common data modeling and querying mistakes that occur when developers move from SQL to NoSQL. For example:
- Understanding query first design principles
- Planning for schema evolution
- Steering clear of common pitfalls and anti-patterns
- Assessing data access patterns
This isn’t “Death-by-Powerpoint.” We’ll walk through problems encountered while migrating a real application from Postgres to ScyllaDB – and try to fix them live as well.
What Developers Need to Unlearn for High Performance NoSQLScyllaDB
See where an RDBMS-pro’s intuition leads him astray – and learn practical tips for the transition
ScyllaDB has the potential to deliver impressive performance and scalability. The better you understand how it works, the more you can squeeze out of it. However, developers new to high-performance NoSQL intuitively shoot themselves in the foot with respect to things like table design, query design, indexing, and partitioning.
Watch where our experienced Postgres developer intuitively falls into traps that hurt performance and scalability. And learn with him as our database performance expert offers friendly guidance on navigating all the unexpected behaviors that tend to trip up RDBMS experts.
Our first webinar of this series will cover common mistakes with practices such as:
- Translating the data model to NoSQL
- Optimizing table design
- Optimizing query performance
- Planning for partitioning
This isn’t “Death-by-Powerpoint.” We’ll walk through problems encountered while migrating a real application from Postgres to ScyllaDB – and try to fix them live as well.
Low Latency at Extreme Scale: Proven Practices & PitfallsScyllaDB
Expert tips on how to maximize your database performance at scale
Untangle the complexity of achieving database performance at scale. Join this webinar to discover commonly overlooked ways to get predictable low latency, even at extreme scale. Our Solution Architects will walk you through the strategies and pitfalls learned by working on thousands of real-world distributed database projects, many reaching 1M OPS with single-digit MS latencies.
In addition to offering clear recommendations, we’ll also explain the process behind how we arrived at them – so you can benefit from the lessons learned by other teams.
We’ll cover how to:
- Design and deploy a large-scale distributed database cluster
- Optimize your clients’ interactions with it
- Expand the cluster horizontally and globally
- Ensure it survives whatever disasters the world throws at it
Tackling your own database performance challenges is serious business. For a change of pace, let’s have some fun learning from other teams’ performance predicaments.
Join us for an interactive session where we dissect four specific database performance challenges faced by teams considering or using ScyllaDB. For each dilemma, we'll:
- Examine the context and technical requirements
- Talk about potential solutions and cover the pros and cons of each
- Disclose what approach the team took, and how it worked out
About the speaker:
Felipe is an IT specialist with years of experience on distributed systems and open-source technologies. He is one of the co-authors of "Database Performance at Scale", an Open Access, freely available publication for individuals interested on improving database performance. At ScyllaDB, he works as a Solution Architect.
Beyond Linear Scaling: A New Path for Performance with ScyllaDBScyllaDB
Linear scaling (sometimes near linear scaling) is often mentioned in several benchmarks, articles and product comparisons as proof that a given technology and algorithmic optimizations perform better than another. But is that really what performance is all about, and should you even care?
This webinar discusses performance beyond linear scalability, including what typically matters more when running high throughput and low latency workloads at scale. We'll cover how ScyllaDB offers unparalleled performance and share our insights on:
- The hidden aspects of linear scaling
- When linear scaling matters most and when it’s simply irrelevant
- Often overlooked considerations for optimizing and measuring distributed systems performance
Watch now to learn from our experience (and lessons learned) in building the fastest NoSQL database in the world.
Navigating Complex Database Performance Hurdles
Tackling your own database performance challenges is serious business. For a change of pace, let’s have some fun learning from other teams’ performance predicaments.
Join us for an interactive session where we dissect 4 specific database performance challenges faced by teams considering or using ScyllaDB. For each dilemma:
- The presenters will describe the context and technical requirements
- Together, we’ll talk about potential solutions and cover the pros and cons of each
- Finally, we’ll disclose what approach the team took, and how it worked out
Throughout the event, we’ll have opportunities to win ScyllaDB swag and prizes! Come prepared to engage in lively discussions and gain valuable insight into database performance strategies.
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...ScyllaDB
Felipe Cardeneti Mendes, Solutions Architect at ScyllaDB
Navigating workload-specific performance challenges and tradeoffs.
Felipe Mendes covers how to navigate the top performance challenges and tradeoffs that you’re likely to face with your project’s specific workload characteristics and technical/business requirements.
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaScyllaDB
Piotr Sarna, Software Engineer at Turso
Understanding and tapping your driver’s performance potential.
Piotr Sarna discusses how to get the most out of a driver, particularly from the performance perspective, and select a driver that’s a good fit for your needs.
Technical risks of putting a cache in front of your database– and what to do instead
Teams experiencing subpar latency commonly turn to an external cache to meet the required SLAs. Placing a cache in front of your database might seem like a fast and easy fix, but it often ends up introducing unanticipated complexity, costs, and risks. External caches can be one of the more problematic components of distributed application architecture.
Join this webinar for a technical discussion of the risks associated with using an external cache and a look at how ScyllaDB’s cache implementation simplifies your architecture without compromising latency. We’ll cover:
- Different approaches to caching (pre-caching vs. caching, side cache vs. transparent cache)
- 7 specific reasons why external caching ia a bad choice
- Why Linux’s default caching doesn’t work well for databases
- The advantages & architecture of ScyllaDB's specialized row-based cache
- Real-world examples of why and how teams eliminated their external cache with ScyllaDB
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityScyllaDB
Discover how your team can achieve low latency at the extreme scale that your data-intensive applications require. We’ll walk you through an example of how ScyllaDB scales linearly to achieve 1M and then 2M OPS – with <1ms P99 latency. We’ll cover how this works on a sample realtime app (an ML feature store), share best practices for performance, and talk about the most important tradeoffs you’ll need to negotiate.
Join us to learn:
- Why and how to ensure your database takes full advantage of your cloud infrastructure
- What architectural considerations matter most for high throughput and low latency
- Key factors to consider when selecting a high-performance database
7 Reasons Not to Put an External Cache in Front of Your Database.pptxScyllaDB
Teams experiencing subpar latency commonly turn to an external cache to meet the required SLAs. Placing a cache in front of your database might seem like a fast and easy fix, but it often ends up introducing unanticipated complexity, costs, and risks. Caches can be one of the more problematic components of distributed application architecture.
Join this webinar for a technical discussion of the risks associated with using an external cache and a look at an alternative strategy that simplifies your architecture without compromising latency. We’ll cover:
- Different approaches to caching (pre-caching vs. caching, side cache vs. transparent cache)
- 7 specific reasons why external caching can be a bad choice
- Why Linux’s default caching doesn’t work well for databases
- The advantages & architecture of specialized row-based caches
- Real-world examples of why and how teams eliminated their external cache
Expert tips on how to maximize your database potential
If you’re considering or getting started with ScyllaDB, you’re probably intrigued by its potential to achieve high throughput and predictable low latency at a reasonable cost. So how do you ensure that you’re maximizing that potential for your team’s specific workloads and use case?
This webinar offers practical advice for navigating the various decision points you’ll face as you assess whether ScyllaDB is a good fit for your team and later roll it out into production. We’ll cover the most critical considerations, tradeoffs, and recommendations related to:
- Infrastructure selection
- ScyllaDB configuration
- Client-side setup
- Data modeling
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationScyllaDB
In this talk, Felipe Mendes, Solutions Architect at ScyllaDB, shares how 4 companies managed their migration. He covers:
Disney+ – No migration needed!
Discord – Shadow cluster
OpenWeb – TTL expiration, cover Load and Stream
MyHeritage – Counters
ShareChat – Bonus: A bit of everything
In this talk, Lubos discusses tools and methods for a successful migration. He covers:
Methods
Data (re)modeling
APIs
Spark Migrator
DS bulk
Tuning
Testing/monitoring
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesScyllaDB
In this talk, Jon discusses practical strategies and issues to consider. He covers:
Reasons for Migrations
DB Functionality
Cost/Licensing
Outdated Technology
Scaling Problems
Technology Evolution
SQL to NoSQL
Build the foundation for success with ScyllaDB
Ready to try out ScyllaDB and want to make sure you’re “doing it right?” We’ll help you get up and running, fast. Spend an hour with our architects for a crash course in what ScyllaDB is all about, the core concepts you need to know, and a step-by-step demonstration of how to get started.
During the live, interactive one-hour session, you will learn:
- Critical considerations for designing a NoSQL system and NoSQL data model
- The technology underlying ScyllaDB’s high performance, availability, and scalability – and best practices for taking advantage of it
- How to install, deploy and operate a full working ScyllaDB system, including multi-data center deployment, monitoring, and connecting an application to the ScyllaDB cluster
By the end of the session, you’ll have the knowledge and tools you need to get ScyllaDB running on your laptop, connect your application to it, and see what it’s like to use ScyllaDB for your specific use case.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
1. Fail Fast, Retry Soon
Omar Elgabry
Software Engineer at Square
2. Omar Elgabry
Software Engineer at Square
■ A software engineer, a writer, a hackathon winner, with a polymorphic personality.
■ Born in Egypt, lived and worked in India, Turkey, and currently in Canada (Vancouver
and Toronto).
■ Other jobs I like to do: teaching, farming, gardening, wood work, and babysitting!
■ Blog: https://medium.com/@OmarElgabry
3. Intro
In distributed systems, a service consists of a fleet of nodes that function as one unit. It is not uncommon for some nodes to go down, usually for a short time. When this occurs, failures can surface on the client side and lead to wide-ranging problems.
To build resilient systems, reduce the probability of failure, and improve application performance, we're going to talk about:
■ Timeouts
■ Retries
■ Backoff
■ Jitters
■ Adaptive retries
5. timeouts
A timeout is the maximum amount of time a client will wait for something to happen, e.g. a request to complete.
■ Why should we use timeouts?
● No timeouts, or long ones, eat resources. While a client waits for a request to complete, it holds on to limited resources (memory, threads, connections).
● The server can run out of these resources if many client requests hold on to them for a long time.
■ Timeouts are a best practice not only for remote calls but also for internal calls across processes on the same machine.
6. timeouts
■ What timeout to set?
● Too high → not useful, almost like no timeout
● Too low → requests are terminated early, increasing the error rate
One approach is to use the p99 latency of the downstream service as a starting point for the client's timeout. But, …
● p99 might fluctuate and might not be consistent
● p99/max may be much higher than p95/p99 due to some outliers
● p99 may be almost the same as p95
● client network latency may be high
Goal: Reduce the % of timed-out requests that could have eventually succeeded (false timeouts)
7. timeouts
■ Why is the request timing out?
● Maybe it is not because the client timeout is too short or the downstream service is slow
● Maybe the code is establishing a new connection on each request
■ Timeouts can cut off long-hanging requests, and thereby reduce consumption of limited resources and overall latency, but timeouts don't reduce the error rate.
9. retries
■ Retrying the same (failed) request often succeeds
● Behind the scenes, systems rarely fail as a single unit. Instead, partial or transient failures are more common.
■ Retrying is less useful for deterministic errors, where the retried request will almost always fail.
● In eventually consistent systems, however, a client error might succeed if retried later, as the system state propagates.
■ Retrying is only safe if an operation is idempotent.
10. timeouts+retries
A real-production use case where the DB max latency went down from > 10s to ~500ms and success rates increased
after employing timeouts+retries.
source: https://medium.com/textnowengineering/the-whacking-game-ee3af79c6e13
11. retries
When partial and transient failures are rare, and the overall number of retried
requests is small, timeouts+retries can improve availability, reduce latency, and
increase success rate.
But these are the same things that retries can put at risk if not used wisely.
■ Retries consume resources
● Retries trade limited server resources (memory, CPU, connections) for higher success rates.
● In almost all cases, we should limit the number of client retries.
■ Retries increase load on the downstream service
● … as a result of retrying the failed and timed-out requests. If failures are due to the service being overloaded, retrying can delay recovery by keeping the downstream service under high load for longer.
12. retries
■ Retries increase load on the downstream service (continued). Examples:
● Hot partition
■ Retrying failures might not work, as we are still overwhelming the hot partition
● Multiple service layers
■ When the backend consists of multiple layers of microservices, each retrying independently, the load multiplies: with 4 layers each making 3 attempts, a single client request can turn into 3^4 = 81 requests at the deepest layer.
● Rate Limiting
■ Services such as AWS S3 and Cloudflare have rate limits, so excessive requests will be throttled.
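The multi-layer figure can be checked with a few lines of arithmetic. This sketch assumes that "retrying 3 times" means 3 attempts per layer in total, which is what yields 81; the function name `amplification` is hypothetical.

```go
package main

import "fmt"

// amplification (hypothetical helper) returns the worst-case number of
// requests reaching the deepest layer for one client request, when every
// layer independently makes attemptsPerLayer attempts.
func amplification(layers, attemptsPerLayer int) int {
	total := 1
	for i := 0; i < layers; i++ {
		total *= attemptsPerLayer
	}
	return total
}

func main() {
	fmt.Println(amplification(4, 3)) // 3 * 3 * 3 * 3
}
```

Retrying at a single layer only keeps this factor linear in the number of attempts rather than exponential in the number of layers.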
14. backoff
A solution to retries in succession on a service failing because it’s overloaded.
Instead of retrying immediately and aggressively, the client waits for some period
between retries.
■ What is the benefit?
● Retrying immediately when the likely outcome is another failure wastes resources.
● Backoff gives an already overloaded downstream service some breathing room to heal,
so it is not flooded.
■ How long should we wait?
● The most common algorithm is the exponential backoff, where the wait time increases
exponentially after every retry.
● Implementations typically cap their backoff to a maximum value to avoid long backoff times.
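A small sketch of capped exponential backoff, assuming a wait of base * 2^retries. The function `backoff` and the demo values (100ms base, 2s cap) are illustrative choices, not values from the talk.

```go
package main

import (
	"fmt"
	"time"
)

// Demo values, chosen only for illustration.
const (
	demoBase = 100 * time.Millisecond
	demoCap  = 2 * time.Second
)

// backoff returns base * 2^retries, capped at maxDelay.
func backoff(base, maxDelay time.Duration, retries int) time.Duration {
	d := base
	// Stop doubling once the cap is reached, which also avoids overflow
	// for large retry counts.
	for i := 0; i < retries && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

func main() {
	// Prints 100ms, 200ms, 400ms, 800ms, 1.6s, 2s (one per line).
	for retries := 0; retries <= 5; retries++ {
		fmt.Println(backoff(demoBase, demoCap, retries))
	}
}
```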
15. backoff
■ Backoff just "delays" the retries
● Backoff is insufficient when a service is under constant overload or in case of contention.
● When failed requests all back off to the same time, they cause contention or overload again when
they are retried.
17. jitter
Adds randomness to the backoff (wait time) when retrying a request to spread out the load
and reduce contention.
■ What jitter to use?
● Option A: add jitter to the backoff value (most common)
● Option B: pick a value between zero and the backoff value
■ Sleep duration
● A: (2^retries * delay) +/- random_number, or (2^retries * delay) * randomization_interval
● B: random_between(0, 2^retries * delay)
■ Resource utilization
● Fewer resources, because work is spread out due to randomization.
■ Time to complete
● A: takes longer to complete, has longer sleep durations.
● B: takes less time to complete; the minimum sleep duration is 0, i.e. the range is [0, backoff].
■ When to use
● A: if backing off retries helps give the downstream service time to heal.
● B: if most failures are due to contention and spreading out retries is just what we need.
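The two jitter options compared above can be sketched in Go as follows. The names `jitterAround` and `fullJitter` are hypothetical, the +/- 50% spread in the first option is an assumption (it matches the [backoff*0.5, backoff*1.5] interval used in the code slide), and both functions assume a non-negative backoff.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// demoBackoff is an illustrative backoff value.
const demoBackoff = 100 * time.Millisecond

// jitterAround adds jitter to the backoff value: it returns a random
// duration in [backoff/2, backoff/2 + backoff], i.e. backoff +/- 50%.
func jitterAround(backoff time.Duration) time.Duration {
	return backoff/2 + time.Duration(rand.Int63n(int64(backoff)+1))
}

// fullJitter returns a random duration between zero and the backoff value,
// i.e. in [0, backoff].
func fullJitter(backoff time.Duration) time.Duration {
	return time.Duration(rand.Int63n(int64(backoff) + 1))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(jitterAround(demoBackoff), fullJitter(demoBackoff))
	}
}
```

`rand.Int63n(n)` returns a value in [0, n), hence the +1 to make the upper bound inclusive.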
18. jitter
■ Jitter isn't only for retries
● Spreads out spikes of work by periodic jobs, or any repeated work scheduled at regular
intervals, e.g. expiring cache keys around the same time.
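For the cache example, a hedged sketch: give each key a TTL of a base value plus a random extra, so keys written together do not all expire together. The helper `jitteredTTL` and the 10 minute / 1 minute values are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Illustrative values only.
const (
	demoTTL    = 10 * time.Minute
	demoSpread = 1 * time.Minute
)

// jitteredTTL returns base plus a random extra in [0, spread], so that keys
// cached at the same moment expire at slightly different times.
func jitteredTTL(base, spread time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(spread)+1))
}

func main() {
	for _, key := range []string{"user:1", "user:2", "user:3"} {
		fmt.Println(key, "expires in", jitteredTTL(demoTTL, demoSpread))
	}
}
```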
20. Code: timeouts, retries, backoff + jitter
// … to be continued
// get a random sleep duration in the interval [backoff*0.5, backoff*1.5], in ms
// e.g. if backoff = 100ms, sleep is any value in the range [50ms, 150ms]
minInterval := backoff / 2 // backoff*0.5
maxInterval := backoff + (backoff / 2) // backoff*1.5
// rand.Intn(n) returns a random int in [0, n), so we add 1 to include maxInterval
sleep := time.Millisecond *
time.Duration(minInterval + rand.Intn(maxInterval - minInterval + 1))
time.Sleep(sleep) // wait until retry sleep duration has elapsed
retries++ // increment retries for the next retry attempt
}
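The fragment above is only the tail of a retry loop. A self-contained sketch of the whole idea (per-attempt timeout, bounded retries, capped exponential backoff, and jitter in [backoff*0.5, backoff*1.5]) might look like the following. The names `retryConfig`, `doWithRetries`, and `runDemo` are hypothetical, and the sketch assumes the operation is idempotent and that every error is retryable.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryConfig groups the knobs discussed so far. All names are hypothetical.
type retryConfig struct {
	maxRetries int           // retries allowed after the first attempt
	timeout    time.Duration // per-attempt timeout
	baseDelay  time.Duration // backoff before the first retry
	maxDelay   time.Duration // cap on the backoff
}

// doWithRetries runs op until it succeeds or the retry budget is exhausted.
func doWithRetries(ctx context.Context, cfg retryConfig, op func(context.Context) error) error {
	backoff := cfg.baseDelay
	for retries := 0; ; retries++ {
		attemptCtx, cancel := context.WithTimeout(ctx, cfg.timeout)
		err := op(attemptCtx)
		cancel() // release the timer as soon as the attempt finishes
		if err == nil {
			return nil
		}
		if retries >= cfg.maxRetries {
			return fmt.Errorf("giving up after %d retries: %w", retries, err)
		}
		// Jitter: sleep for a random duration in [backoff*0.5, backoff*1.5].
		sleep := backoff/2 + time.Duration(rand.Int63n(int64(backoff)+1))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err() // the caller cancelled while we were waiting
		}
		// Exponential backoff, capped at maxDelay.
		backoff *= 2
		if backoff > cfg.maxDelay {
			backoff = cfg.maxDelay
		}
	}
}

// runDemo runs an operation that fails `failures` times before succeeding
// and reports how many times it was called.
func runDemo(failures, maxRetries int) (calls int, err error) {
	cfg := retryConfig{maxRetries: maxRetries, timeout: 50 * time.Millisecond,
		baseDelay: time.Millisecond, maxDelay: 8 * time.Millisecond}
	err = doWithRetries(context.Background(), cfg, func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("transient failure")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := runDemo(2, 3)
	fmt.Println(calls, err) // 3 <nil>
}
```

A real operation would pass the attempt context to its network call so that the timeout actually interrupts it.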
22. adaptive
When a large percentage of requests are failing and retries are unsuccessful, as
in longer-running incidents, the techniques we talked about aren't sufficient.
This signals that further retries are not currently welcome, and that we need to
throttle any unwelcome retries for some period of time.
■ How to do that?
● We use the token bucket algorithm! This algorithm is widely used in rate limiting to determine
when it is safe to transmit data that complies with the rate limits.
● We’ll also compare token bucket algorithm vs circuit breaker.
23. adaptive
Token bucket (standard) algorithm
Algorithm: an in-memory bucket holding tokens (just a counter), and periodically, a fixed
number of tokens is added into the bucket (by increasing the counter).
■ On each request, the client removes token(s) from the bucket, and completes
the request.
■ If there aren't sufficient tokens, it throttles the request and either drops it or
waits until there are enough tokens to make the request.
Goal: rate limit the “total number of requests” to the downstream service, i.e. when the error
rate is high, retries drain the token bucket, and future requests are throttled until the bucket
slowly begins to refill.
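A minimal sketch of the standard token bucket described above, in Go. The type and method names are hypothetical. To keep the example deterministic, `refill` is called by hand here; a real client would call it periodically, for example from a `time.Ticker` loop.

```go
package main

import (
	"fmt"
	"sync"
)

// tokenBucket is a counter with a fixed capacity.
type tokenBucket struct {
	mu       sync.Mutex
	tokens   int
	capacity int
}

// newTokenBucket returns a bucket that starts full.
func newTokenBucket(capacity int) *tokenBucket {
	return &tokenBucket{tokens: capacity, capacity: capacity}
}

// refill adds n tokens, never exceeding the capacity.
func (b *tokenBucket) refill(n int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens += n
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
}

// tryTake removes n tokens and reports true, or reports false (throttle the
// request) when the bucket does not hold enough tokens.
func (b *tokenBucket) tryTake(n int) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens < n {
		return false
	}
	b.tokens -= n
	return true
}

func main() {
	b := newTokenBucket(2)
	fmt.Println(b.tryTake(1), b.tryTake(1), b.tryTake(1)) // true true false
	b.refill(1)
	fmt.Println(b.tryTake(1)) // true
}
```

This sketch drops a throttled request by returning false; waiting for tokens instead would need a blocking variant.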
24. adaptive
Token bucket variation algorithm
Algorithm: instead of adding a fixed number of tokens periodically, we add token(s) on
successful attempts.
■ The client can make initial requests, regardless of token availability.
■ If it succeeds, it adds part of a token into the token bucket, say 0.1 token.
■ If the call fails, retry up to N times as long as there are one or more (whole)
token(s) in the bucket.
Goal: rate limit "retries" when the error rate is above a threshold by throttling "retries" that
exceed that threshold, i.e. max number of retries = only 10% of successful attempts.
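A sketch of this variation in Go, under a few assumptions: tokens are stored as integer tenths to avoid floating-point drift when adding 0.1 repeatedly, a retry costs one whole token, and any success (including a successful retry) earns 0.1 token. All names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

const (
	retryCost     = 10 // units needed for one retry (one whole token)
	successCredit = 1  // units earned per success (0.1 token)
)

// errDown is a stand-in failure for the demo.
var errDown = errors.New("downstream unavailable")

// retryBucket counts tokens in tenths of a token.
type retryBucket struct {
	mu       sync.Mutex
	units    int
	capacity int
}

// onSuccess adds part of a token, never exceeding the capacity.
func (b *retryBucket) onSuccess() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.units += successCredit; b.units > b.capacity {
		b.units = b.capacity
	}
}

// tryRetry spends one whole token if the bucket holds one.
func (b *retryBucket) tryRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.units < retryCost {
		return false
	}
	b.units -= retryCost
	return true
}

// do always makes the initial attempt, then retries up to maxRetries times
// while the bucket can pay for each retry.
func (b *retryBucket) do(maxRetries int, op func() error) error {
	err := op()
	for retries := 0; err != nil && retries < maxRetries && b.tryRetry(); retries++ {
		err = op()
	}
	if err == nil {
		b.onSuccess()
	}
	return err
}

func main() {
	b := &retryBucket{capacity: 100}
	for i := 0; i < 10; i++ {
		b.do(3, func() error { return nil }) // 10 successes earn 1 whole token
	}
	calls := 0
	b.do(3, func() error { calls++; return errDown })
	fmt.Println(calls) // 2
}
```

With these constants, ten successes pay for one retry, which is the "10% of successful attempts" ceiling from the slide.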
25. adaptive
■ Circuit breaker (CB)
● suffers from modality: it's either retrying or not retrying, which can add time to recovery.
● adds no additional load at high failure rates, but has lower success rates past the threshold, as it stops all future retries.
■ Token Bucket (TB)
● adds some (tunable) additional load at high failure rates, but has higher success rates, as its bucket doesn't deplete
fast enough to stop all retries.
■ Both behave like N retries (without throttling) under low error rates.
26. adaptive
Can we design a better algorithm?
Client libraries have inconsistent behaviour for retries and rate limits across different languages:
■ Rate limits
● The client relies on its limited knowledge (requests succeeded or failed) to guess the best action to take.
Yet the server knows a bit more.
■ Error rate:
● The client doesn't know the true failure rate; it relies on its local sampling of the failure rate, which may differ
from the true rate on the server, e.g. in serverless and container-based applications, where clients are short-lived,
with each sending fewer requests.
How can we expose some of that server knowledge to clients so that clients can make informed
decisions, thereby, having consistent behaviour, without increasing complexity? I'll leave that
exercise for you!
27. Recap
■ Timeouts keep client requests from hanging for long while holding on to
limited resources.
■ Retries can survive partial and transient failures, and therefore increase the
success rate.
■ Backoff + Jitter can improve resource utilization and reduce congestion.
■ Adaptive Retries dynamically adjust request rates in response to high error
rates and unsuccessful retries.
28. Final Words
What seemed to be an easy problem turned out to be quite hard in distributed
systems, and it really depends on the nature and the requirements of the system.
Getting the happy path working is the easy part; going beyond that is when the
REAL ENGINEERING WORK BEGINS!