Talk by Brendan Gregg for USENIX LISA 2021. https://www.youtube.com/watch?v=5nN1wjA_S30 . "The future of computer performance involves clouds with hardware hypervisors and custom processors, servers running a new type of BPF software to allow high-speed applications and kernel customizations, observability of everything in production, new Linux kernel technologies, and more. This talk covers interesting developments in systems and computing performance, their challenges, and where things are headed."
The Linux Block Layer - Built for Fast Storage - Kernel TLV
The arrival of flash storage introduced a radical change in the performance profiles of direct-attached devices. At the time, it was obvious that the Linux I/O stack needed to be redesigned to support devices capable of millions of IOPS with extremely low latency.
In this talk we revisit the changes made to the Linux block layer over the last decade or so that made it what it is today: a performant, scalable, robust, and NUMA-aware subsystem. In addition, we cover the new NVMe over Fabrics support in Linux.
Sagi Grimberg
Sagi is Principal Architect and co-founder at LightBits Labs.
Talk by Brendan Gregg for USENIX LISA 2019: Linux Systems Performance. Abstract: "
Systems performance is an effective discipline for performance analysis and tuning, and can help you find performance wins for your applications and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes the topic for everyone, touring six important areas of Linux systems performance: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events) and tracing (Ftrace, bcc/BPF, and bpftrace/BPF), and much advice about what is and isn't important to learn. This talk is aimed at everyone: developers, operations, sysadmins, etc, and in any environment running Linux, bare metal or the cloud."
Linux Performance Analysis: New Tools and Old Secrets - Brendan Gregg
Talk for USENIX LISA 2014 by Brendan Gregg, Netflix. At Netflix performance is crucial, and we use many high- to low-level tools to analyze our stack in different ways. In this talk, I will introduce new system observability tools we are using at Netflix, which I've ported from my DTraceToolkit and which are intended for our Linux 3.2 cloud instances. These show that Linux can do more than you may think, by using creative hacks and workarounds with existing kernel features (ftrace, perf_events). While these are solving issues on current versions of Linux, I'll also briefly summarize the future in this space: eBPF, ktap, SystemTap, sysdig, etc.
Broken benchmarks, misleading metrics, and terrible tools. This talk will help you navigate the treacherous waters of Linux performance tools, touring common problems with system tools, metrics, statistics, visualizations, measurement overhead, and benchmarks. You might discover that tools you have been using for years are, in fact, misleading, dangerous, or broken.
The speaker, Brendan Gregg, has given many talks on tools that work, including giving the Linux Performance Tools talk originally at SCALE. This is an anti-version of that talk, focusing on broken tools and metrics instead of the working ones. Metrics can be misleading, and counters can be counter-intuitive! This talk will include advice for verifying new performance tools, understanding how they work, and using them successfully.
Using the new extended Berkeley Packet Filter capabilities in Linux to improve the performance of auditing security-relevant kernel events around network, file, and process actions.
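To make that concrete, here is a minimal sketch (my own illustration, not taken from the talk) of the approach using the bcc Python front-end: it audits one security-relevant event, process execution, by attaching a kprobe to the execve() syscall. It assumes bcc is installed and root privileges.

```python
# Minimal bcc sketch: audit process execution with an eBPF kprobe.
from bcc import BPF

prog = r"""
int trace_exec(void *ctx) {
    // Runs in kernel context on every execve(); writes to the trace pipe.
    bpf_trace_printk("execve observed\n");
    return 0;
}
"""

b = BPF(text=prog)
# get_syscall_fnname() resolves the kernel's (version-dependent) symbol
# name for the execve syscall entry point.
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_exec")
print("Auditing execve()... Ctrl-C to end")
b.trace_print()
```

Production auditing tools emit structured per-event data through perf buffers rather than the shared trace pipe, but the attach-and-filter-in-kernel pattern is the same.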
Talk for PerconaLive 2016 by Brendan Gregg. Video: https://www.youtube.com/watch?v=CbmEDXq7es0 . "Systems performance provides a different perspective for analysis and tuning, and can help you find performance wins for your databases, applications, and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes six important areas of Linux systems performance in 50 minutes: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events), static tracing (tracepoints), and dynamic tracing (kprobes, uprobes), and much advice about what is and isn't important to learn. This talk is aimed at everyone: DBAs, developers, operations, etc, and in any environment running Linux, bare-metal or the cloud."
Performance Wins with BPF: Getting Started - Brendan Gregg
Keynote by Brendan Gregg for the eBPF summit, 2020. How to get started finding performance wins using the BPF (eBPF) technology. This short talk covers the quickest and easiest way to find performance wins using BPF observability tools on Linux.
Talk by Brendan Gregg for YOW! 2021. "The pursuit of faster performance in computing is the driving reason for many new technologies and updates. This talk discusses performance improvements now underway that you will likely be adopting soon, for processors (including 3D stacking and cloud vendor CPUs), memory (including DDR5 and high-bandwidth memory [HBM]), disks (including 3D Xpoint as a 3D NAND accelerator), networking (including QUIC and eXpress Data Path [XDP]), runtimes, hypervisors, and more. The future of performance is increasingly cloud-based, with hardware hypervisors and custom processors, meaningful observability of everything down to cycle stalls (even as cloud guests), and high-speed syscall-avoiding applications that use eBPF, FPGAs, and io_uring. The talk also discusses where future performance improvements might be expected, with predictions for new technologies."
Delivered as a plenary at USENIX LISA 2013. Video: https://www.youtube.com/watch?v=nZfNehCzGdw and https://www.usenix.org/conference/lisa13/technical-sessions/plenary/gregg . "How did we ever analyze performance before Flame Graphs?" This new visualization invented by Brendan can help you quickly understand application and kernel performance, especially CPU usage, where stacks (call graphs) can be sampled and then visualized as an interactive flame graph. Flame Graphs are now used for a growing variety of targets: for applications and kernels on Linux, SmartOS, Mac OS X, and Windows; for languages including C, C++, node.js, Ruby, and Lua; and in WebKit Web Inspector. This talk will explain them and provide use cases and new visualizations for other event types, including I/O, memory usage, and latency.
Talk for YOW! by Brendan Gregg. "Systems performance studies the performance of computing systems, including all physical components and the full software stack to help you find performance wins for your application and kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes the topic for everyone, touring six important areas: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events) and tracing (ftrace, bcc/BPF, and bpftrace/BPF), advice about what is and isn't important to learn, and case studies to see how it is applied. This talk is aimed at everyone: developers, operations, sysadmins, etc, and in any environment running Linux, bare metal or the cloud."
The BPF, or Berkeley Packet Filter, mechanism was first introduced in Linux in 1997, in version 2.1.75. It has seen a number of extensions over the years. Recently, in versions 3.15 - 3.19, it received a major overhaul which drastically expanded its applicability. This talk will cover how the instruction set looks today and why: its architecture, capabilities, interfaces, and just-in-time compilers. We will also talk about how it is being used in different areas of the kernel, such as tracing and networking, and about future plans.
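For a feel of what that instruction set looks like, the following sketch hand-encodes the smallest valid eBPF program ("r0 = 0; exit") in the fixed 8-byte instruction format: an 8-bit opcode, 4-bit destination and source registers, a signed 16-bit offset, and a signed 32-bit immediate. The opcode constants match the kernel's UAPI definitions; this is an illustration of the encoding, not a loader.

```python
# Hand-encode "r0 = 0; exit" in the fixed 8-byte eBPF instruction format.
import struct

def insn(opcode, dst=0, src=0, off=0, imm=0):
    # Little-endian struct bpf_insn: opcode byte, then the src register in
    # the high nibble and dst register in the low nibble, signed off and imm.
    return struct.pack("<BBhi", opcode, (src << 4) | dst, off, imm)

BPF_ALU64, BPF_MOV, BPF_K = 0x07, 0xB0, 0x00  # class / operation / source bits
BPF_JMP, BPF_EXIT = 0x05, 0x90

prog = (
    insn(BPF_ALU64 | BPF_MOV | BPF_K, dst=0, imm=0)  # r0 = 0 (return value)
    + insn(BPF_JMP | BPF_EXIT)                       # exit
)
assert len(prog) == 16   # two fixed-width 64-bit instructions
print(prog.hex(" ", 8))  # b700000000000000 9500000000000000
```

The fixed-width encoding and the restricted register file (r0-r10) are what make the verifier and the just-in-time compilers discussed in the talk tractable.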
Ariel Waizel discusses the Data Plane Development Kit (DPDK), an API for developing fast packet processing code in user space.
* Who needs this library? Why bypass the kernel?
* How does it work?
* How good is it? What are the benchmarks?
* Pros and cons
Ariel worked on kernel development at the IDF, Ben Gurion University, and several companies. He is interested in networking, security, machine learning, and basically everything except UI development. Currently a Solution Architect at ConteXtream (an HPE company), which specializes in SDN solutions for the telecom industry.
Talk for Facebook Systems@Scale 2021 by Brendan Gregg: "BPF (eBPF) tracing is the superpower that can analyze everything, helping you find performance wins, troubleshoot software, and more. But with many different front-ends and languages, and years of evolution, finding the right starting point can be hard. This talk will make it easy, showing how to install and run selected BPF tools in the bcc and bpftrace open source projects for some quick wins. Think like a sysadmin, not like a programmer."
Velocity 2017 Performance analysis superpowers with Linux eBPF - Brendan Gregg
Talk for Velocity 2017 by Brendan Gregg: Performance analysis superpowers with Linux eBPF.
"Advanced performance observability and debugging have arrived built into the Linux 4.x series, thanks to enhancements to Berkeley Packet Filter (BPF, or eBPF) and the repurposing of its sandboxed virtual machine to provide programmatic capabilities to system tracing. Netflix has been investigating its use for new observability tools, monitoring, security uses, and more. This talk will investigate this new technology, which sooner or later will be available to everyone who uses Linux. The talk will dive deep on these new tracing, observability, and debugging capabilities. Whether you’re doing analysis over an ssh session, or via a monitoring GUI, BPF can be used to provide an efficient, custom, and deep level of detail into system and application performance.
This talk will also demonstrate the new open source tools that have been developed, which make use of kernel- and user-level dynamic tracing (kprobes and uprobes), and kernel- and user-level static tracing (tracepoints). These tools provide new insights for file system and storage performance, CPU scheduler performance, TCP performance, and a whole lot more. This is a major turning point for Linux systems engineering, as custom advanced performance instrumentation can be used safely in production environments, powering a new generation of tools and visualizations."
Video: https://www.youtube.com/watch?v=FJW8nGV4jxY and https://www.youtube.com/watch?v=zrr2nUln9Kk . Tutorial slides for O'Reilly Velocity SC 2015, by Brendan Gregg.
There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? This tutorial explains methodologies for using these tools, and provides a tour of four tool types: observability, benchmarking, tuning, and static tuning. Many tools will be discussed, including top, iostat, tcpdump, sar, perf_events, ftrace, SystemTap, sysdig, and others, as well as observability frameworks in the Linux kernel: PMCs, tracepoints, kprobes, and uprobes.
This tutorial updates and extends an earlier talk that summarized the Linux performance tool landscape. The value of this tutorial is not just learning that these tools exist and what they do, but hearing when and how they are used by a performance engineer to solve real-world problems — important context that is typically not included in the standard documentation.
Video: https://www.youtube.com/watch?v=JRFNIKUROPE . Talk for linux.conf.au 2017 (LCA2017) by Brendan Gregg, about Linux enhanced BPF (eBPF). Abstract:
A world of new capabilities is emerging for the Linux 4.x series, thanks to enhancements that have been included in Linux to Berkeley Packet Filter (BPF): an in-kernel virtual machine that can execute user space-defined programs. It is finding uses for security auditing and enforcement, enhancing networking (including eXpress Data Path), and performance observability and troubleshooting. Many new open source tools that use BPF have been written in the past 12 months for performance analysis. Tracing superpowers have finally arrived for Linux!
For its use with tracing, BPF provides the programmable capabilities to the existing tracing frameworks: kprobes, uprobes, and tracepoints. In particular, BPF allows timestamps to be recorded and compared from custom events, allowing latency to be studied in many new places: kernel and application internals. It also allows data to be efficiently summarized in-kernel, including as histograms. This has allowed dozens of new observability tools to be developed so far, including measuring latency distributions for file system I/O and run queue latency, printing details of storage device I/O and TCP retransmits, investigating blocked stack traces and memory leaks, and a whole lot more.
This talk will summarize BPF capabilities and use cases so far, and then focus on its use to enhance Linux tracing, especially with the open source bcc collection. bcc includes BPF versions of old classics, and many new tools, including execsnoop, opensnoop, funccount, ext4slower, and more (many of which I developed). Perhaps you'd like to develop new tools, or use the existing tools to find performance wins large and small, especially when instrumenting areas that previously had zero visibility. I'll also summarize how we intend to use these new capabilities to enhance systems analysis at Netflix.
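As an illustration of the in-kernel summarization this abstract describes, here is a sketch in the style of bcc's classic bitehist example: block I/O sizes are aggregated into a power-of-2 histogram in kernel context, and user space reads only the finished summary. It assumes bcc and root privileges, and the kprobe target (blk_account_io_done) varies across kernel versions, so treat it as an assumption.

```python
# bitehist-style sketch: log2 histogram of block I/O sizes, kept in-kernel.
from bcc import BPF
from time import sleep

prog = r"""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HISTOGRAM(dist);

int trace_req_done(struct pt_regs *ctx, struct request *req) {
    // Aggregate in kernel context: one map update per I/O, no per-event copy.
    dist.increment(bpf_log2l(req->__data_len / 1024));
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="blk_account_io_done", fn_name="trace_req_done")
print("Tracing block I/O sizes... Ctrl-C to print the histogram")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("kbytes")
```

Because only the histogram buckets cross the kernel/user boundary, the overhead stays low even at high event rates, which is what makes tools like ext4slower practical in production.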
re:Invent 2019 BPF Performance Analysis at Netflix - Brendan Gregg
Talk by Brendan Gregg at AWS re:Invent 2019. Abstract: "Extended BPF (eBPF) is an open source Linux technology that powers a whole new class of software: mini programs that run on events. Among its many uses, BPF can be used to create powerful performance analysis tools capable of analyzing everything: CPUs, memory, disks, file systems, networking, languages, applications, and more. In this session, Netflix's Brendan Gregg tours BPF tracing capabilities, including many new open source performance analysis tools he developed for his new book "BPF Performance Tools: Linux System and Application Observability." The talk includes examples of using these tools in the Amazon EC2 cloud."
Architecture exploration of recent GPUs to analyze the efficiency of hardware... - journalBEEI
This study analyzes the efficiency of parallel computational applications with the adoption of recent graphics processing units (GPUs). We investigate the impact of the additional resources of the recent architecture on popular benchmarks compared with the previous architecture. Our simulation results demonstrate that the Pascal GPU architecture improves performance by 273% on average compared to the old-fashioned Fermi architecture. To evaluate the performance improvement depending on specific hardware resources, we divide the hardware resources into two types: computing and memory resources. Computing resources have a bigger impact on performance improvement than memory resources in most of the benchmarks. For Hotspot and B+ tree, an architecture adopting only enhanced computing resources can achieve performance gains similar to those of an architecture adopting both computing and memory resources. We also evaluate the influence of the number of warp schedulers in the SM (Streaming Multiprocessor) on GPU performance, in relation to barrier waiting time. Based on these analyses, we propose a development direction for future generations of GPUs.
Modular by Design: Supermicro’s New Standards-Based Universal GPU Server - Rebekah Rodriguez
In this webinar, members of the Server Solution Team, as well as a member of Supermicro’s Product Office, will discuss Supermicro’s Universal GPU Server: the server’s modular, standards-based design; the important role of the OCP Accelerator Module (OAM) form factor and Universal Baseboard (UBB) in the system; and AMD's next generation HPC accelerator. In addition, we will get some insights into trends in the HPC and AI/Machine Learning space, including the different software platforms and best practices that are driving innovation in our industry and daily lives. In particular:
* Tools to enable use of the high performance hardware for HPC and Deep Learning applications
* Tools to enable use of multiple GPUs, including RDMA, to solve highly demanding HPC and deep learning models, such as BERT
* Running applications in containers with AMD’s next generation GPU system
Introduction to Software Defined Visualization (SDVis) - Intel® Software
Software defined visualization (SDVis) is an open-source initiative from Intel and industry collaborators. It improves the visual fidelity, performance, and efficiency of prominent visualization solutions, while supporting rapidly growing big-data use on workstations and through high-performance computing (HPC) on supercomputing clusters, without the memory limitations and cost of GPU-based solutions.
Backend.AI Technical Introduction (19.09 / 2019 Autumn) - Lablup Inc.
This slide introduces technical specs and details about Backend.AI 19.09.
* On-premise clustering / container orchestration / scaling on cloud
* Container-level fractional GPU technology to use one GPU as many GPUs in many containers at the same time.
* NVIDIA GPU Cloud integrations
* Enterprise features
The Supermicro X12 product line, powered by 3rd Gen Intel® Xeon® Scalable processors, contains many innovations that give organizations more performance for a variety of workloads.
Join this webinar to learn more about the outstanding performance you can get by using Supermicro X12 servers and storage systems using the latest technologies from Intel®.
Watch the webinar: https://www.brighttalk.com/webcast/17278/514618
Supermicro’s Universal GPU: Modular, Standards Based and Built for the Future - Rebekah Rodriguez
The Universal GPU system architecture combines the latest technologies that support multiple GPU form factors, CPU choices, storage, and networking options. Together, these components are optimized to deliver high performance in a balanced architecture in a highly scalable system. Systems can be optimized for each customer’s specific Artificial Intelligence (AI), Machine Learning (ML), or High Performance Computing (HPC) applications. Organizations worldwide are demanding new options for their future computing environments, which have the thermal headroom for the next generation of CPUs and GPUs.
Join this webinar to learn how to leverage Supermicro's Universal GPU system to simplify customer deployments, deliver ultimate modularity and customization options for AI to Omniverse environments.
Trends in Systems and How to Get Efficient Performance - inside-BigData.com
In this video from Switzerland HPC Conference, Martin Hilgeman from Dell presents: HPC Workload Efficiency and the Challenges for System Builders.
"With all the advances in massively parallel and multi-core computing with CPUs and accelerators it is often overlooked whether the computational work is being done in an efficient manner. This efficiency is largely being determined at the application level and therefore puts the responsibility of sustaining a certain performance trajectory into the hands of the user. It is observed that the adoption rate of new hardware capabilities is decreasing and lead to a feeling of diminishing returns. This presentation shows the well-known laws of parallel performance from the perspective of a system builder. It also covers through the use of real case studies, examples of how to program for energy efficient parallel application performance."
Watch the video: http://wp.me/p3RLHQ-gIS
Learn more: http://dell.com
and
http://www.hpcadvisorycouncil.com/events/2017/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
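The "well-known laws of parallel performance" mentioned in that abstract include Amdahl's law, which bounds the speedup of a partially parallel workload by its serial fraction. A quick worked example, with values invented for illustration:

```python
# Amdahl's law: speedup(N) = 1 / ((1 - p) + p/N), where p is the fraction
# of the work that can be parallelized across N cores.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for cores in (8, 64, 1024):
    print(f"{cores:5d} cores: {amdahl_speedup(0.95, cores):5.1f}x speedup")
# 8 cores: ~5.9x, 64: ~15.4x, 1024: ~19.6x -- capped at 20x by the 5%
# serial fraction: the diminishing returns the presentation describes.
```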
Join us for an exciting and informative preview of the broadest range of next-generation systems optimized for tomorrow’s data center workloads, Powered by 4th Gen Intel® Xeon® Scalable Processors (formerly codenamed Sapphire Rapids).
Experts from Supermicro and Intel will discuss how the upcoming Supermicro X13 systems will enable new performance levels utilizing state-of-the-art technology, including DDR5, PCIe 5.0, Compute Express Link™ 1.1, and Intel® Advanced Matrix Extensions (Intel AMX).
USENIX LISA2021 talk by Brendan Gregg (https://www.youtube.com/watch?v=_5Z2AU7QTH4). This talk is a deep dive that describes how BPF (eBPF) works internally on Linux, and dissects some modern performance observability tools. Details covered include the kernel BPF implementation: the verifier, JIT compilation, and the BPF execution environment; the BPF instruction set; different event sources; and how BPF is used by user space, using bpftrace programs as an example. This includes showing how bpftrace is compiled to LLVM IR and then BPF bytecode, and how per-event data and aggregated map data are fetched from the kernel.
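To illustrate the "aggregated map data" path that this talk dissects, here is a sketch (using the bcc front-end rather than bpftrace) in which events are counted in a BPF hash map in kernel context and the aggregate is fetched from user space afterwards. clone() is an arbitrary example event, and bcc plus root privileges are assumed.

```python
# Count events per process in a BPF map, then fetch the aggregate.
from bcc import BPF
from time import sleep

prog = r"""
BPF_HASH(counts, u32, u64);

int count_event(void *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;  // upper 32 bits = tgid/pid
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="count_event")
sleep(5)  # let the in-kernel aggregation run for a while
for pid, count in sorted(b["counts"].items(), key=lambda kv: kv[1].value):
    print(f"pid {pid.value}: {count.value} clone() calls")
```

This is the same map read path the talk walks through for bpftrace: the kernel updates the map on every event, and the front-end only pulls the summarized values.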
UM2019 Extended BPF: A New Type of Software - Brendan Gregg
Keynote for Ubuntu Masters 2019 by Brendan Gregg, Netflix. Video: https://www.youtube.com/watch?v=7pmXdG8-7WU&feature=youtu.be . "Extended BPF is a new type of software, and the first fundamental change to how kernels are used in 50 years. This new type of software is already in use by major companies: Netflix has 14 BPF programs running by default on all of its cloud servers, which run Ubuntu Linux. Facebook has 40 BPF programs running by default. Extended BPF is composed of an in-kernel runtime for executing a virtual BPF instruction set through a safety verifier and with JIT compilation. So far it has been used for software defined networking, performance tools, security policies, and device drivers, with more uses planned and more we have yet to think of. It is changing how we use and think about systems. This talk explores the past, present, and future of BPF, with BPF performance tools as a use case."
YOW2018 Cloud Performance Root Cause Analysis at Netflix - Brendan Gregg
Keynote by Brendan Gregg for YOW! 2018. Video: https://www.youtube.com/watch?v=03EC8uA30Pw . Description: "At Netflix, improving the performance of our cloud means happier customers and lower costs, and involves root cause analysis of applications, runtimes, operating systems, and hypervisors, in an environment of 150k cloud instances that undergo numerous production changes each week. Apart from the developers who regularly optimize their own code, we also have a dedicated performance team to help with any issue across the cloud, and to build tooling to aid in this analysis. In this session we will summarize the Netflix environment, procedures, and tools we use and build to do root cause analysis on cloud performance issues. The analysis performed may be cloud-wide, using self-service GUIs such as our open source Atlas tool, or focused on individual instances, and use our open source Vector tool, flame graphs, Java debuggers, and tooling that uses Linux perf, ftrace, and bcc/eBPF. You can use these open source tools in the same way to find performance wins in your own environment."
Talk by Brendan Gregg and Martin Spier for the LinkedIn Performance Engineering meetup on Nov 8, 2018. FlameScope is a visualization for performance profiles that helps you study periodic activity, variance, and perturbations, with a heat map for navigation and flame graphs for code analysis.
Talk by Brendan Gregg for All Things Open 2018. "At over one thousand code commits per week, it's hard to keep up with Linux developments. This keynote will summarize recent Linux performance features, for a wide audience: the KPTI patches for Meltdown, eBPF for performance observability and the new open source tools that use it, Kyber for disk I/O scheduling, BBR for TCP congestion control, and more. This is about exposure: knowing what exists, so you can learn and use it later when needed. Get the most out of your systems with the latest Linux kernels and exciting features."
Linux Performance 2018 (PerconaLive keynote) - Brendan Gregg
Keynote for PerconaLive 2018 by Brendan Gregg. Video: https://youtu.be/sV3XfrfjrPo?t=30m51s . "At over one thousand code commits per week, it's hard to keep up with Linux developments. This keynote will summarize recent Linux performance features, for a wide audience: the KPTI patches for Meltdown, eBPF for performance observability, Kyber for disk I/O scheduling, BBR for TCP congestion control, and more. This is about exposure: knowing what exists, so you can learn and use it later when needed. Get the most out of your systems, whether they are databases or application servers, with the latest Linux kernels and exciting features."
How Netflix Tunes EC2 Instances for Performance - Brendan Gregg
CMP325 talk for AWS re:Invent 2017, by Brendan Gregg. "
At Netflix we make the best use of AWS EC2 instance types and features to create a high performance cloud, achieving near bare metal speed for our workloads. This session will summarize the configuration, tuning, and activities for delivering the fastest possible EC2 instances, and will help other EC2 users improve performance, reduce latency outliers, and make better use of EC2 features. We'll show how we choose EC2 instance types, how we choose between EC2 Xen modes: HVM, PV, and PVHVM, and the importance of EC2 features such as SR-IOV for bare-metal performance. SR-IOV is used by EC2 enhanced networking, and recently for the new i3 instance type for enhanced disk performance as well. We'll also cover kernel tuning and observability tools, from basic to advanced. Advanced performance analysis includes the use of Java and Node.js flame graphs, and the new EC2 Performance Monitoring Counter (PMC) feature released this year."
Talk for USENIX LISA17: "Containers pose interesting challenges for performance monitoring and analysis, requiring new analysis methodologies and tooling. Resource-oriented analysis, as is common with systems performance tools and GUIs, must now account for both hardware limits and soft limits, as implemented using cgroups. A reverse diagnosis methodology can be applied to identify whether a container is resource constrained, and by which hard or soft resource. The interaction between the host and containers can also be examined, and noisy neighbors identified or exonerated. Performance tooling can need special usage or workarounds to function properly from within a container or on the host, to deal with different privilege levels and namespaces. At Netflix, we're using containers for some microservices, and care very much about analyzing and tuning our containers to be as fast and efficient as possible. This talk will show you how to identify bottlenecks in the host or container configuration, in the applications by profiling in a container environment, and how to dig deeper into kernel and container internals."
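As one concrete instance of checking a cgroup soft limit, here is a minimal sketch (not from the talk) that reads the kernel's CFS throttling counters to ask whether a container is CPU-constrained. The path assumes a cgroup v1 CPU controller mounted at /sys/fs/cgroup/cpu; cgroup v2 exposes similar counters in a unified cpu.stat file, so adjust for your system.

```python
# Is this container CPU-throttled by its cgroup soft limit?
def cpu_throttle_stats(path="/sys/fs/cgroup/cpu/cpu.stat"):
    # cpu.stat holds nr_periods, nr_throttled, and throttled_time (ns).
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

s = cpu_throttle_stats()
if s.get("nr_periods", 0) > 0:
    pct = 100.0 * s.get("nr_throttled", 0) / s["nr_periods"]
    print(f"throttled in {pct:.1f}% of CFS periods "
          f"({s.get('throttled_time', 0) / 1e9:.1f}s total)")
```

A high throttled percentage is the "soft limit" diagnosis the talk describes: the host has idle CPU, yet the container is being paused by its own quota.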
Kernel Recipes 2017: Using Linux perf at Netflix - Brendan Gregg
Talk for Kernel Recipes 2017 by Brendan Gregg. "Linux perf is a crucial performance analysis tool at Netflix, and is used by a self-service GUI for generating CPU flame graphs and other reports. This sounds like an easy task, however, getting perf to work properly in VM guests running Java, Node.js, containers, and other software, has been at times a challenge. This talk summarizes Linux perf, how we use it at Netflix, the various gotchas we have encountered, and a summary of advanced features."
Kernel Recipes 2017: Performance Analysis with BPF - Brendan Gregg
Talk by Brendan Gregg at Kernel Recipes 2017 (Paris): "The in-kernel Berkeley Packet Filter (BPF) has been enhanced in recent kernels to do much more than just filtering packets. It can now run user-defined programs on events, such as on tracepoints, kprobes, uprobes, and perf_events, allowing advanced performance analysis tools to be created. These can be used in production as the BPF virtual machine is sandboxed and will reject unsafe code, and are already in use at Netflix.
Beginning with the bpf() syscall in 3.18, enhancements have been added in many kernel versions since, with major features for BPF analysis landing in Linux 4.1, 4.4, 4.7, and 4.9. Specific capabilities these provide include custom in-kernel summaries of metrics, custom latency measurements, and frequency counting kernel and user stack traces on events. One interesting case involves saving stack traces on wake up events, and associating them with the blocked stack trace: so that we can see the blocking stack trace and the waker together, merged in kernel by a BPF program (that particular example is in the kernel as samples/bpf/offwaketime).
This talk will discuss the new BPF capabilities for performance analysis and debugging, and demonstrate the new open source tools that have been developed to use it, many of which are in the Linux Foundation iovisor bcc (BPF Compiler Collection) project. These include tools to analyze the CPU scheduler, TCP performance, file system performance, block I/O, and more."
EuroBSDcon 2017 System Performance Analysis Methodologies - Brendan Gregg
Keynote by Brendan Gregg. "Traditional performance monitoring makes do with vendor-supplied metrics, often involving interpretation and inference, and with numerous blind spots. Much in the field of systems performance is still living in the past: documentation, procedures, and analysis GUIs built upon the same old metrics. Modern BSD has advanced tracers and PMC tools, providing virtually endless metrics to aid performance analysis. It's time we really used them, but the problem becomes which metrics to use, and how to navigate them quickly to locate the root cause of problems.
There's a new way to approach performance analysis that can guide you through the metrics. Instead of starting with traditional metrics and figuring out their use, you start with the questions you want answered then look for metrics to answer them. Methodologies can provide these questions, as well as a starting point for analysis and guidance for locating the root cause. They also pose questions that the existing metrics may not yet answer, which may be critical in solving the toughest problems. System methodologies include the USE method, workload characterization, drill-down analysis, off-CPU analysis, chain graphs, and more.
This talk will discuss various system performance issues, and the methodologies, tools, and processes used to solve them. Many methodologies will be discussed, from the production proven to the cutting edge, along with recommendations for their implementation on BSD systems. In general, you will learn to think differently about analyzing your systems, and make better use of the modern tools that BSD provides."
OSSNA 2017 Performance Analysis Superpowers with Linux BPF - Brendan Gregg
Talk by Brendan Gregg for OSSNA 2017. "Advanced performance observability and debugging have arrived built into the Linux 4.x series, thanks to enhancements to Berkeley Packet Filter (BPF, or eBPF) and the repurposing of its sandboxed virtual machine to provide programmatic capabilities to system tracing. Netflix has been investigating its use for new observability tools, monitoring, security uses, and more. This talk will be a deep dive on these new tracing, observability, and debugging capabilities, which sooner or later will be available to everyone who uses Linux. Whether you’re doing analysis over an ssh session, or via a monitoring GUI, BPF can be used to provide an efficient, custom, and deep level of detail into system and application performance.
This talk will also demonstrate the new open source tools that have been developed, which make use of kernel- and user-level dynamic tracing (kprobes and uprobes), and kernel- and user-level static tracing (tracepoints). These tools provide new insights for file system and storage performance, CPU scheduler performance, TCP performance, and a whole lot more. This is a major turning point for Linux systems engineering, as custom advanced performance instrumentation can be used safely in production environments, powering a new generation of tools and visualizations."
USENIX ATC 2017 Performance Superpowers with Enhanced BPF - Brendan Gregg
Talk for USENIX ATC 2017 by Brendan Gregg
"The Berkeley Packet Filter (BPF) in Linux has been enhanced in very recent versions to do much more than just filter packets, and has become a hot area of operating systems innovation, with much more yet to be discovered. BPF is a sandboxed virtual machine that runs user-level defined programs in kernel context, and is part of many kernels. The Linux enhancements allow it to run custom programs on other events, including kernel- and user-level dynamic tracing (kprobes and uprobes), static tracing (tracepoints), and hardware events. This is finding uses for the generation of new performance analysis tools, network acceleration technologies, and security intrusion detection systems.
This talk will explain the BPF enhancements, then discuss the new performance observability tools that are in use and being created, especially from the BPF compiler collection (bcc) open source project. These tools provide new insights for file system and storage performance, CPU scheduler performance, TCP performance, and much more. This is a major turning point for Linux systems engineering, as custom advanced performance instrumentation can be used safely in production environments, powering a new generation of tools and visualizations.
Because these BPF enhancements are only in very recent Linux (such as Linux 4.9), most companies are not yet running new enough kernels to be exploring BPF yet. This will change in the next year or two, as companies including Netflix upgrade their kernels. This talk will give you a head start on this growing technology, and also discuss areas of future work and unsolved problems."
USENIX ATC 2017: Visualizing Performance with Flame Graphs - Brendan Gregg
Talk by Brendan Gregg for USENIX ATC 2017.
"Flame graphs are a simple stack trace visualization that helps answer an everyday problem: how is software consuming resources, especially CPUs, and how did this change since the last software version? Flame graphs have been adopted by many languages, products, and companies, including Netflix, and have become a standard tool for performance analysis. They were published in "The Flame Graph" article in the June 2016 issue of Communications of the ACM, by their creator, Brendan Gregg.
This talk describes the background for this work, and the challenges encountered when profiling stack traces and resolving symbols for different languages, including for just-in-time compiler runtimes. Instructions will be included for generating mixed-mode flame graphs on Linux, with examples from our use at Netflix with Java. Advanced flame graph types will be described, including differential, off-CPU, chain graphs, memory, and TCP events. Finally, future work and unsolved problems in this area will be discussed."
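Flame graphs are generated from "folded" stack samples: each unique stack becomes one line of semicolon-joined frames followed by a sample count, which is the input format that flame graph generators such as flamegraph.pl consume. Here is a sketch of that aggregation step, with invented sample data:

```python
# Fold sampled stacks into the "frame;frame;frame count" lines that flame
# graph generators consume; frame widths end up proportional to counts.
from collections import Counter

samples = [  # invented stack samples, root frame first
    ("main", "parse", "read_file"),
    ("main", "parse", "read_file"),
    ("main", "render", "draw"),
    ("main", "parse", "tokenize"),
]

folded = Counter(";".join(stack) for stack in samples)
for stack, count in sorted(folded.items()):
    print(stack, count)
# main;parse;read_file 2
# main;parse;tokenize 1
# main;render;draw 1
```

Differential and off-CPU flame graphs, mentioned above, reuse this same folded format; only the source of the samples and the coloring change.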
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen developers implement features on the front-end just by following the standard rules of a framework, thinking that this is enough to successfully launch the project, and then the project fails. How can this be prevented, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. Finally, we had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an “infrastructure container Kubernetes guy”, how does this fancy AI technology get managed from an infrastructure operations view? Is it possible to apply our lovely cloud native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you with a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply them to our own infrastructure and get them to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I have already got working for real.
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GraphRAG is All You need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
2. 2
Computing Performance: On the Horizon (Brendan Gregg)
This is:
● a performance engineer's views about industry-wide server performance
This isn't:
● necessarily about my employer, my employer's views, or USENIX's views
● an endorsement of any company/product, or sponsored by anyone
● professional market predictions (various companies sell such reports)
● based on confidential materials
● necessarily correct or fit for any purpose
About this talk
My predictions may be wrong! They will be thought-provoking.
3. 3
Computing Performance: On the Horizon (Brendan Gregg)
1. Processors
2. Memory
3. Disks
4. Networking
5. Runtimes
6. Kernels
7. Hypervisors
8. Observability
Not covering: Databases, file systems, front-end, mobile, desktop.
Agenda
Slides are online and include extra details as fine print
Slides: http://www.brendangregg.com/Slides/LISA2021_ComputingPerformance.pdf
Video: https://www.usenix.org/conference/lisa21/presentation/gregg-computing
5. 5
Computing Performance: On the Horizon (Brendan Gregg)
Clock rate
Year Processor GHz
1978 Intel 8086 0.008
1985 Intel 386 DX 0.02
1993 Intel Pentium 0.06
1995 Pentium Pro 0.20
1999 Pentium III 0.50
2001 Intel Xeon 1.70
[Chart: "Early Intel Processors" — hmm — max GHz by year, 1978 to 1998]
6. 6
Computing Performance: On the Horizon (Brendan Gregg)
Clock rate
Increase has leveled off due to power/efficiency
● Workstation processors are higher; e.g., 2020 Xeon W-1270P @ 5.1 GHz
Horizontal scaling instead
● More CPU cores, hardware threads, and server instances
Year Processor Cores/T. Max GHz
2009 Xeon X5550 4/8 3.06
2012 Xeon E5-2665 0 8/16 3.10
2013 Xeon E5-2680 v2 10/20 3.60
2017 Platinum 8175M 24/48 3.50
2019 Platinum 8259CL 24/48 3.50
Server Processor Examples (AWS EC2)
[Chart: hardware threads and max GHz by year, 2009 to 2019]
7. 7
Computing Performance: On the Horizon (Brendan Gregg)
Interconnect rates
10 years:
● 3.25x bus rate
● 6x core count
Memory bus (covered later) is also lagging

Year  CPU Interconnect  Bandwidth (Gbytes/s)
2007  Intel FSB         12.8
2008  Intel QPI         25.6
2017  Intel UPI         41.6

Source: Systems Performance 2nd Edition, Figure 6.10 [Gregg 20]
8. 8
Computing Performance: On the Horizon (Brendan Gregg)
CPU Utilization is Wrong
It includes stall cycles
● Memory stalls (local and remote over an interconnect), resource stalls.
● It's a fundamental metric. We shouldn't need to "well, actually"-explain it.
Workloads are often memory bound
● I see instructions-per-cycle (IPC) of 0.2 - 0.8 (practical max 1.5, current microbenchmark max 4.0).
Faster CPUs just mean more stalls
[Diagram: a "90% CPU" utilization bar ...may mean mostly busy, or mostly stalled]
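One way to quantify the stall problem is to read IPC from the PMCs. A minimal sketch using Linux perf(1) (the 10-second window is an arbitrary choice):
# perf stat -a -e cycles,instructions -- sleep 10
perf reports the ratio as "insn per cycle"; values well below 1 suggest a stall-cycle-bound workload.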
10. 10
Computing Performance: On the Horizon (Brendan Gregg)
Lithography
[Chart: semiconductor nanometer process by year, from 32nm (2009) down to 3nm and 2nm (2022-2023). Source: Semiconductor device fabrication [Wikipedia 21a]]
TSMC expects volume production of 3nm in 2022 [Quach 21a]. IBM has already built a 2nm chip [Quach 21b] (it has a 12nm gate length).
Lithography limits are expected to be reached by 2029, switching to stacked CPUs. [Moore 20]
BTW: Silicon atom radius ~0.1 nm [Wikipedia 21b]
"Nanometer process" since 2010 should be considered a marketing term. New terms proposed include:
● GMT (gate pitch, metal pitch, tiers)
● LMC (logic, memory, interconnects)
[Moore 20]
11. 11
Computing Performance: On the Horizon (Brendan Gregg)
Other processor scaling
Special instructions
● E.g., AVX-512 Vector Neural Network Instructions (VNNI)
Connected chiplets
● Using embedded multi-die interconnect bridge (EMIB) [Alcorn 17b]
3D stacking
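As a small sketch of how to check for such instructions on a given machine, the kernel-reported CPU flags can be grepped (flag names such as avx512_vnni are what Linux prints in /proc/cpuinfo):
# grep -o 'avx512[a-z0-9_]*' /proc/cpuinfo | sort -u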
12. 12
Computing Performance: On the Horizon (Brendan Gregg)
Latest server processor examples
Vendor     Processor                      Process  Clock (GHz)  Cores/Threads  LLC (Mbytes)  Date
Intel      Xeon Platinum 8380 (Ice Lake)  "10nm"   2.3 - 3.4    40/80          60            Apr 2021
AMD        EPYC 7713P                     "7nm"    2.0 - 3.675  64/128         256           Mar 2021
ARM-based  Ampere Altra Q80-33            "7nm"    3.3          80/80          32            Dec 2020
Coming soon to a datacenter near you
(Although there is a TSMC chip shortage that may last through to 2022/2023 [Quach 21a][Ridley 21])
13. 13
Computing Performance: On the Horizon (Brendan Gregg)
Cloud chip race
Amazon ARM/Graviton2
● ARM Neoverse N1, 64 cores, 2.5 GHz
Microsoft ARM
● ARM-based something coming soon [Warren 20]
Google SoC
● Systems-on-Chip (SoC) coming soon [Vahdat 21]
[Diagram: the cloud chip race: cloud processors (Graviton2; Google and Microsoft TBD) versus generic processors (x86, AMD, ARM)]
14. 14
Computing Performance: On the Horizon (Brendan Gregg)
Latest benchmark results
Graviton2 is promising
● "We tested the new M6g instances using industry standard LMbench and certain Java benchmarks and saw up to 50% improvement over M5 instances." – Ed Hunter, Director of performance and operating systems at Netflix
Full results still under construction
● It's weeks of work to understand the latest differences and fix various things so that the performance as it will be in production can be measured, and also to verify results with real production workloads.
15. 15
Computing Performance: On the Horizon (Brendan Gregg)
Accelerators
GPUs
● Parallel workloads, thousands of GPU cores. Widespread adoption in machine learning.
FPGAs
● Reprogrammable semiconductors
● Great potential, but needs specialists to program
● Good for algorithms: compression, cryptocurrency, video encoding, genomics, search, etc.
● Microsoft FPGA-based configurable cloud [Russinovich 17]
Also TPUs
● Tensor processing units [Google 21]
The "other" CPUs you may not be monitoring
[Diagram: ease of use versus performance potential: CPU, then GPU, then FPGA]
16. 16
Computing Performance: On the Horizon (Brendan Gregg)
Latest GPU examples
NVIDIA GeForce RTX 3090: 10,496 CUDA cores, 2020 [Burnes 20]
Cerebras Gen2 WSE: 850,000 AI-optimized cores, 2021
● Uses most of the silicon wafer for one chip. 2.6 trillion transistors, 23 kW. [Trader 21]
● The previous version was already the "Largest chip ever built," and US$2M. [insideHPC 20]
SM: Streaming multiprocessor
SP: Streaming processor
[Diagram: a GPU with many SMs, each with control logic, an L1 cache, and cores (SPs), sharing an L2 cache]
17. 17
Computing Performance: On the Horizon (Brendan Gregg)
Latest FPGA examples
Xilinx Virtex UltraScale+ VU19P, 8,938,000 logic cells, 2019
● Using 35B transistors. Also has 4.5 Tbit/s transceiver bandwidth (bidirectional), and 1.5 Tbit/s DDR4 bandwidth [Cutress 19]
Xilinx Virtex UltraScale+ VU9P, 2,586,000 logic cells, 2016
● Deploy right now: AWS EC2 F1 instance type (up to 8 of these FPGAs per instance)
BPF (covered later) already in FPGAs
● E.g., 400 Gbit/s packet filter FFShark [Vega 20]
19. 19
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Multi-socket is doomed
● Single socket is getting big enough (cores)
● Already scaling horizontally (cloud)
– And in datacenters, via "blades" or "microservers"
● Why pay NUMA costs?
– Two single-socket instances should out-perform one two-socket instance
Multi-socket future is mixed: one socket for cores, one GPU socket, one FPGA socket, etc., EMIB connected.
[Diagram: one socket (1 memory hop, <100 cores) versus two sockets (2 hops, >100 cores)]
20. 20
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: SMT future unclear
Simultaneous multithreading (SMT) == hardware threads
● Performance variation
● ARM cores competitive
● Post Meltdown/Spectre
– Some people turn them off
Possibilities:
● SMT becomes "free"
– A processor feature, not a cost basis
– Turns "oh no! hardware threads" into "great! bonus hardware threads!"
● No more hardware threads
– Future investment elsewhere
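For those who do turn hardware threads off, recent Linux (4.19+) exposes a runtime SMT control in sysfs; a minimal sketch (requires root):
# cat /sys/devices/system/cpu/smt/active
# echo off > /sys/devices/system/cpu/smt/control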
21. 21
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Core count limits
Imagine an 850,000-core server processor in today’s systems...
22. 22
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Core count limits
Worsening problems:
● Memory-bound workloads
● Kernel/app lock contention
● False sharing (see the perf c2c sketch after this slide)
● Power consumption
● etc.
General-purpose computing will hit a practical core limit
● For a given memory subsystem & kernel, and running multiple applications
● E.g., 1024 cores (except GPUs/ML/AI)
● Apps themselves will hit an even smaller practical limit (some already have by design, e.g., Node.js and 2 CPUs)
Source: Figure 2.16 [Gregg 20]
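As a hedged sketch for the false-sharing bullet above: on processors that expose the needed events, Linux perf(1) has a cache-to-cache analysis mode for finding contended cache lines:
# perf c2c record -a -- sleep 10
# perf c2c report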
23. 23
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: 3 Eras of processor scaling
Delivered processor characteristics:
Era 1: Clock frequency
Era 2: Core/thread count
Era 3: Cache size & policy
24. 24
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: 3 Eras of processor scaling
Practical server limits:
Era 1: Clock frequency → already reached by ~2005 (3.5 GHz)
Era 2: Core/thread count → limited by mid 2030s (e.g., 1024)
Era 3: Cache size & policy → limited by end of 2030s
Mid-century will need an entirely new computer hardware architecture, kernel memory architecture, or logic gate technology to progress further.
● E.g., use of graphene, carbon nanotubes [Hruska 12]
● This is after moving more to stacked processors
25. 25
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: More processor vendors
ARM licensed
Era of CPU choice
Beware: "optimizing for the benchmark"
● Don't believe microbenchmarks without doing "active benchmarking": root-cause perf analysis while the benchmark is still running.
[Joke diagram: a "DogeCPU +AggressiveOpts" processor with 8 cores, LLC, MMU, and a confidential "Benchmark Optimizer Unit"]
26. 26
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Cloud CPU advantage
Large cloud vendors can analyze >100,000 workloads directly
● Via PMCs and other processor features.
Vast real-world detail to aid processor design
● More detail than traditional processor vendors have, and detail available immediately whenever they want.
● Will processor vendors offer their own clouds just to get the same data?
Machine-learning-aided processor design
● Based on the vast detail. Please point it at real-world workloads and not microbenchmarks.
Vast detail example: processor trace showing timestamped instructions:
# perf script --insn-trace --xed
date 31979 [003] 653971.670163672: ... (/lib/x86_64-linux-gnu/ld-2.27.so) mov %rsp, %rdi
date 31979 [003] 653971.670163672: ... (/lib/x86_64-linux-gnu/ld-2.27.so) callq 0x7f3bfbf4dea0
date 31979 [003] 653971.670163672: ... (/lib/x86_64-linux-gnu/ld-2.27.so) pushq %rbp
[...]
27. 27
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: FPGA turning point
Little adoption (outside cryptocurrency) until major app support
● Solves the ease of use issue: developers just configure the app (which may fetch and deploy an FMI)
● BPF use cases are welcome, but still specialized/narrow
● Needs runtime support, e.g., the JVM. There is already work in this area (e.g., [TornadoVM 21]).
[Diagram: JVM offloading to an FPGA accelerator]
Hypothetical commands (none of this is real, yet):
apt install openjdk-21
apt install openjdk-21-fpga
java -XX:+UseFPGA
32. 32
Computing Performance: On the Horizon (Brendan Gregg)
DDR latency
Hasn't changed in 20 years
● This is single-access latency
● Same memory clock (200 MHz) [Greenberg 11]
● Also see [Cutress 20][Goering 11]
Low-latency DDR does exist
● Reduced Latency DRAM (RLDRAM) by Infineon and Micron: lower latency but lower density
● Not seeing widespread server use (I've seen it marketed towards HFT)
Year Memory Latency (ns)
2000 DDR-333 15
2003 DDR2-800 15
2007 DDR3-1600 13.75
2012 DDR4-3200 13.75
2020 DDR5-6400 14.38
[Chart: DDR single-access latency by year, flat at ~14-15 ns from DDR-333 (2000) to DDR5-6400 (2020) :-(]
33. 33
Computing Performance: On the Horizon (Brendan Gregg)
HBM
High bandwidth memory, 3D stacked
● Target use cases include high performance computing and virtual reality graphics processing [Macri 15]
GPUs already use it
Can be provided on-package
● Intel Sapphire Rapids is rumored to include 64 Gbytes of HBM2E [Shilov 21d]
34. 34
Computing Performance: On the Horizon (Brendan Gregg)
Server DRAM size
SuperMicro SuperServer B12SPE-CPU-25G [SuperMicro 21]
● Single socket (see earlier slides)
● 16 DIMM slots
● 4 TB DDR-4
Facebook Delta Lake (1S) OCP [Haken 21]
● 6 DIMM slots
● 96 Gbytes DDR-4
● Price/optimal for a typical WSS?
[Diagram: B12SPE-CPU-25G board layout: processor socket flanked by DIMM slots]
35. 35
Computing Performance: On the Horizon (Brendan Gregg)
Additional memory tier
3D XPoint (next section) in memory mode:
– Can also operate in application-direct mode and storage mode [Intel 21]
Main memory: DRAM (<0.1us)
Persistent memory: 3D XPoint (<1us)
Storage devices: SSDs (<100us), HDDs (<1ms)
36. 36
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Extrapolation
Not a JEDEC announcement; assumes miraculous engineering work
● For various challenges see [Peterson 20]
But will single-access latency drop in DDR-6?
● I'd guess not; DDR internals are already at their cost sweet spot, leaving low latency to other memory technologies

Year  Memory      Peak Bandwidth (Gbytes/s)
2000  DDR-333     2.67
2003  DDR2-800    6.4
2007  DDR3-1600   12.8
2012  DDR4-3200   25.6
2020  DDR5-6400   51.2
2028  DDR6-12800  102.4
2036  DDR7-25600  204.8
2044  DDR8-51200  409.6
(DDR6 and later rows extrapolated: bandwidth doubling each generation)
37. 37
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: DDR5 “up to 2x” Wins
E.g., IPC 0.1 → ~0.2 for bandwidth-bound workloads
If DDR-6 gets a latency drop, more frequent wins
[Scale: IPC <0.2 is stall-cycle bound ("bad"); IPC >2.0 is instruction bound ("good*"). * probably; exceptions include spin locks]
38. 38
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: HBM-only servers
Clouds offering "high bandwidth memory" HBM-only instances
● HBM on-processor
● Finally helping memory catch up to core scaling
RLDRAM on-package as another option?
● A "low latency memory" instance
39. 39
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Extra tier too late
Competition isn't disks, it's Tbytes of DRAM
● SuperMicro's single socket should hit 8 Tbytes with DDR-5
● AWS EC2 p4.24xl has 1.1 Tbytes of DRAM (deploy now!)
How often does your working set size (WSS) not fit? Across several of these for redundancy?
● The next tier needs to get much bigger than DRAM (10+x) and much cheaper to find an extra-tier use case (e.g., cost based).
● Meanwhile, DRAM is still getting bigger and faster
● I developed the first cache tier between main memory and disks to see widespread use: the ZFS L2ARC [Gregg 08]
[Diagram: memory tiers (main memory <0.1us, persistent memory <1us, storage devices <100us) with the WSS near the top and "cold" data below; it's more like a trapezoid]
42. 42
Computing Performance: On the Horizon (Brendan Gregg)
Recent timeline for rotational disks
2005: Perpendicular magnetic recording (PMR)
● Writes vertically using a shaped magnetic field for higher density
2013: Shingled magnetic recording (SMR)
● (next slide)
2019: Multi-actuator technology (MAT)
● Two sets of heads and actuators; like 2-drive RAID 0 [Alcorn 17].
2020: Energy-assisted magnetic recording (EAMR)
● Western Digital 18TB & 20TB [Salter 20]
2021: Heat-assisted magnetic recording (HAMR)
● Seagate 20TB HAMR drives [Shilov 21b]
I don't know their perf characteristics yet
43. 43
Computing Performance: On the Horizon (Brendan Gregg)
SMR
11-25% more storage, worse performance
● Writes tracks in an overlapping way, like shingles on a roof. [Shimpi 13]
● Overwritten data must be rewritten. Suited for archival (write-once) workloads.
Look out for 18TB/20TB-with-SMR drive releases
[Diagram: overlapping written tracks, wider than the read head]
44. 44
Computing Performance: On the Horizon (Brendan Gregg)
Flash memory-based disks
Single-Level Cell (SLC)
Multi-Level Cell (MLC)
Enterprise MLC (eMLC)
2009: Tri-Level Cell (TLC)
2009: Quad-Level Cell (QLC)
● QLC is only rated for around 1,000 block-erase cycles [Liu 20].
2013: 3D NAND / Vertical NAND (V-NAND)
● SK Hynix envisions 600-layer 3D NAND [Shilov 21c]. Should be multi-Tbyte.
SSD performance pathologies: latency from aging, wear-leveling, fragmentation, internal compression, etc.
45. 45
Computing Performance: On the Horizon (Brendan Gregg)
Persistent memory-based disks
2017: 3D XPoint (Intel/Micron) Optane
● Low and consistent latency (e.g., 14 us access latency) [Hady 18]
● App-direct mode, memory mode, and as storage
DRAM: trapped electrons in a capacitor, requires refreshing
3D XPoint: resistance change; layers of wordlines + memory cells + bitlines keep stacking vertically
[Diagram: 3D XPoint cell selection: wordlines + memory cells + bitlines]
46. 46
Computing Performance: On the Horizon (Brendan Gregg)
Latest storage device example
2021: Intel Optane Memory H20
● QLC 3D NAND storage (512 Gbytes / 1 Tbyte) +
● 3D XPoint as an accelerator (32 Gbytes)
● Currently M.2 2280 form factor (laptops)
● (Announced while I was developing these slides)
47. 47
Computing Performance: On the Horizon (Brendan Gregg)
Storage Interconnects
SAS-4 cards in development
● (Serial Attached SCSI)
PCIe 5.0 coming soon
● (Peripheral Component Interconnect Express)
● Intel has already demoed it on Sapphire Rapids [Hruska 20]
NVMe 1.4 is the latest
● (Non-Volatile Memory Express)
● Storage over the PCIe bus
● Supports zoned namespace SSDs (ZNS) [ZonedStorage 21]
● Bandwidth bounded by the PCIe bus
These have features other than speed
● Reliability, power management, virtualization support, etc.

Year Specified  Interface  Bandwidth (Gbit/s)
2003            SAS-1      3
2009            SAS-2      6
2012            SAS-3      12
2017            SAS-4      22.5
202?            SAS-5      45

Year Specified  Interface  16-lane Bandwidth (Gbyte/s)
2003            PCIe 1     4
2007            PCIe 2     8
2010            PCIe 3     16
2017            PCIe 4     31.5
2019            PCIe 5     63
48. 48
Computing Performance: On the Horizon (Brendan Gregg)
Linux Kyber I/O scheduler
Multi-queue, targets read & write latency
● Up to 300x lower 99th percentile latencies [Gregg 18]
● Linux 4.12 [Corbet 17]
[Diagram: Kyber (simplified): read (sync) and write (async) dispatch queues, with queue sizes adjusted based on completions]
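Kyber can be selected per block device via sysfs; a minimal sketch, assuming a hypothetical device nvme0n1 and a kernel built with the Kyber scheduler:
# cat /sys/block/nvme0n1/queue/scheduler
# echo kyber > /sys/block/nvme0n1/queue/scheduler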
49. 49
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Slower rotational
Archive focus
● There's ever-increasing demand for storage (incl. social video today; social VR tomorrow?)
● Needed for archives
● More "weird" pathologies. SMR is just the start.
● Even less tolerant to shouting
Bigger, slower, and weirder
50. 50
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: 3D XPoint
As a rotational disk accelerator
As petabyte storage
● Layers keep stacking
● 3D NAND could get to petabytes too, but consumes more power
● 1 Pbyte = ~700M 3.5-inch floppies!
And not really as a memory tier (DRAM is too good) or widespread application-direct use (too much work when 3D XPoint storage accelerators exist, so apps can get benefits without changing anything)
51. 51
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: More flash pathologies
● Worse internal lifetime
● More wear-leveling & logic
● More latency outliers
Bigger, faster, and weirder
We need more observability of flash drive internals
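What we can see today is mostly the device's own health counters; for example, with nvme-cli (the device name is a placeholder):
# nvme smart-log /dev/nvme0
This reports wear indicators such as percentage_used and media_errors, but not the internal wear-leveling activity behind latency outliers.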
53. 53
Computing Performance: On the Horizon (Brendan Gregg)
Latest Hardware
400 Gbit/s in use
● E.g., 400 Gbit/s switches/routers by Cisco and Juniper, transceivers by Arista and Intel
● AWS EC2 P4 instance type (deploy now!)
● On PCIe, needs PCIe 5
800 Gbit/s next [Charlene 20]
● Terabit Ethernet (1 Tbit/s) is not far away
More NIC features
● E.g., inline kTLS (TLS offload to the NIC), e.g., Mellanox ConnectX-6 Dx [Gallatin 19]
54. 54
Computing Performance: On the Horizon (Brendan Gregg)
Protocols
QUIC / HTTP/3
● TCP-like sessions over (fast) UDP.
● 0-RTT connection handshakes, for clients that have previously communicated.
MP-TCP
● Multipath TCP: use multiple paths in parallel to improve throughput and reliability. RFC 8684 [Ford 20]
● Linux support starting in 5.6.
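A quick check for MPTCP support on a 5.6+ kernel (a sketch; it may be off by default depending on the distro):
# sysctl net.mptcp.enabled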
55. 55
Computing Performance: On the Horizon (Brendan Gregg)
Linux TCP Congestion Control Algorithms
DCTCP
● Data Center TCP. Linux 3.18. [Borkmann 14]
TCP NV
● New Vegas. Linux 4.8.
TCP BBR
● Bottleneck Bandwidth and RTT (BBR): improves performance on packet-loss networks [Cardwell 16]
● With 1% packet loss, Netflix sees 3x better throughput [Gregg 18]
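These are runtime-selectable; a minimal sketch (BBR requires the tcp_bbr module to be available):
# sysctl net.ipv4.tcp_available_congestion_control
# sysctl -w net.ipv4.tcp_congestion_control=bbr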
56. 56
Computing Performance: On the Horizon (Brendan Gregg)
Linux Network Stack
[Figure: Linux network stack, showing its queues and tuning points. Source: Systems Performance 2nd Edition, Figure 10.8 [Gregg 20]]
57. 57
Computing Performance: On the Horizon (Brendan Gregg)
Linux TCP send path
Keeps adding performance features
[Figure: Linux TCP send path. Source: Systems Performance 2nd Edition, Figure 10.11 [Gregg 20]]
58. 58
Computing Performance: On the Horizon (Brendan Gregg)
Software
eXpress Data Path (XDP) (uses eBPF)
● A programmable fast lane for networking, in the Linux kernel.
● A role previously served by DPDK and kernel bypass.
Next plenary session: "Performance Analysis of XDP Programs" – Zachary H. Jones, Verizon Media
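Attaching an XDP program is a one-liner with iproute2; a sketch, where eth0 and xdp_prog.o are placeholders for your NIC and a compiled BPF object (the second command detaches it):
# ip link set dev eth0 xdp obj xdp_prog.o sec xdp
# ip link set dev eth0 xdp off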
59. 59
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: BPF in FPGAs
Massive I/O transceiver capabilities
Netronome has already done BPF in hardware
60. 60
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Cheap BPF routers
Linux + BPF + 400 GbE NIC
● Cheap == commodity hardware
● A use case from the beginning of eBPF (PLUMgrid)
61. 61
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: More demand for network perf
Apps are increasingly networked
Netflix 4K content
Remote work & video conferencing
VR tourism
63. 63
Computing Performance: On the Horizon (Brendan Gregg)
Latest Java
Sep 2018: Java 11 (LTS)
● JEP 333: ZGC, A Scalable Low-Latency Garbage Collector
● JEP 331: Low-Overhead Heap Profiling
● GC adaptive thread scaling
Sep 2021: Java 17 (LTS)
● JEP 338: Vector API (JDK 16)
● Parallel GC improvements (JDK 14)
● Various other perf improvements (JDK 12-17)
Java 11 includes JMH JDK microbenchmarks [Redestad 19]
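As a small usage sketch, ZGC is opt-in via JVM flags (experimental in JDK 11, production in JDK 15+; MyApp is a placeholder):
# java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC MyApp
On JDK 15 and later, just -XX:+UseZGC is needed.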
64. 64
Computing Performance: On the Horizon (Brendan Gregg)
My Predictions: Runtime features
FPGA as a compiler target
● E.g., JVM C2 or Graal adding it as a compiler target, and it becoming a compiler "killer" feature.
io_uring I/O libraries
● Massively accelerate some I/O-bound workloads by switching libraries.
Adaptive runtime internals
● I don't want to pick between C2 and Graal. Let the runtime do both and pick the fastest methods; ditto for testing GC algorithms.
– Not unlike the ZFS ARC shadow-testing different cache algorithms.
1000-core scalability support
● Runtime/library/model support to help programmers write code that scales to hundreds of cores
66. 66
Computing Performance: On the Horizon (Brendan Gregg)
Latest Kernels/OSes
Apr 2021: FreeBSD 13.0
May 2021: Linux 5.12
May 2021: Windows 10.0.19043.985
Jul? 2021: Linux 5.13 (in development)
67. 67
Computing Performance: On the Horizon (Brendan Gregg)
Recent Linux perf features
2021: Syscall user dispatch (5.11)
2020: Static calls to improve Spectre-fix (5.10)
2020: BPF on socket lookups (5.9)
2020: Thermal pressure (5.7)
2020: MultiPath TCP (5.6)
2019: MADV_COLD, MADV_PAGEOUT (5.4)
2019: io_uring (5.1)
2019: UDP GRO (5.0)
2019: Multi-queue I/O default (5.0)
2018: TCP EDT (4.20)
2018: PSI (4.20)
For 2016-2018, see my summary [Gregg 18]. Includes CPU schedulers (thermal, topology); block I/O qdiscs; the Kyber scheduler (earlier slide); TCP congestion control algorithms (earlier slide); etc.
69. 69
Computing Performance: On the Horizon (Brendan Gregg)
io_uring
Faster syscalls using shared ring buffers
● Submission and completion ring buffers
● Allows I/O to be batched and async
● Primary use cases: network and disk I/O
[Diagram: apps using io_uring ring buffers alongside classic syscalls into the kernel]
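For a feel of the difference without writing code, fio(1) has an io_uring engine; a minimal sketch (file path and sizes are placeholders):
# fio --name=randread --filename=/tmp/testfile --size=1g --rw=randread --bs=4k --direct=1 --iodepth=32 --ioengine=io_uring
Comparing against --ioengine=sync on the same file can show the benefit of batched, async submission.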
70. 70
Computing Performance: On the Horizon (Brendan Gregg)
eBPF Everywhere
[Thaler 21]
Plus eBPF-for-BSD projects have already started.
71. 71
Computing Performance: On the Horizon (Brendan Gregg)
eBPF == BPF
2015:
● BPF: Berkeley Packet Filter
● eBPF: extended BPF
2021:
● "Classic BPF": Berkeley Packet Filter
● BPF: a technology name (aka eBPF)
– Kernel engineers like to use "BPF"; companies, "eBPF".
This is what happens when you don't have marketing professionals help name your product
73. Computing Performance: On the Horizon (Brendan Gregg) 73
https://twitter.com/srostedt/status/1177147373283418112
74. 74
Computing Performance: On the Horizon (Brendan Gregg)
Emerging BPF uses
Observability agents
Security agents
TCP congestion control algorithms
Kernel drivers
75. 75
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Future BPF Uses
File system buffering/readahead policies
CPU scheduler policies
Lightweight I/O-bound applications (e.g., proxies)
● Or such apps can go to io_uring or FPGAs. "Three buses arrived at once."
– When I did engineering at university: "people ride buses and electrons ride busses." Unfortunately that usage has gone out of fashion, otherwise it would have been clear which bus I was referring to!
76. 76
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Kernels become JITed
PGO/AutoFDO shows ~10% wins, but is hard to manage
● Profile-guided optimization (PGO) / auto feedback-directed optimization (AutoFDO)
● Some companies already do kernel PGO (Google [Tolvanen 20], Microsoft [Bearman 20])
● We can't leave 10% on the table forever
Kernels gain PGO/JIT support by default, so it "just works."
77. 77
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Kernel emulation often slow
"I can run <kernel> apps under <other kernel> by emulating <a bare-minimal set of> syscalls!"
Cool project, but:
● Missing the latest kernel and perf features (e.g., Linux's BPF, io_uring, WireGuard, etc. Plus certain syscall flags return ENOTSUP. So it's like a weird old fork of Linux.)
– Some exceptions: e.g., another kernel may have better hardware support, which may benefit apps more than the loss of kernel capabilities.
● Debugging and security challenges. Better ROI with lightweight VMs.
In other words, WSL2 >> WSL1
78. 78
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: OS performance
Linux: increasing complexity & worse perf defaults
● Becomes so complex that it takes an OS team to make it perform well. This assumes that the defaults rot, because no perf teams are running the defaults anymore to notice (e.g., high-speed network engineers configure XDP and QUIC, and aren't looking at defaults with TCP). A bit more room for a lightweight kernel (e.g., BSD) with better perf defaults to compete. Similarities: Oracle DB vs MySQL; MULTICS vs UNIX.
BSD: high perf for narrow uses
● Still serving some companies (including Netflix) very well thanks to tuned performance (see the footnote on p124 of [Gregg 20]). The path to growth is better EC2/Azure performance support, but it may take years before a big customer (with a perf team) migrates and gets everything fixed. There are over a dozen perf engineers working on Linux on EC2; BSD needs at least one full-time senior EC2 (not metal) perf engineer.
Windows: community perf improvements
● BPF tracing support allows outsiders to root-cause kernel problems like never before (beyond ETW/Xperf). Will have a wave of finding "low-hanging fruit" to begin with, improving perf and reliability.
79. 79
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Unikernels
Finally gets one compelling published use case: "2x perf for X"
But few people run X
● Needs to be really kernel-heavy, and not many workloads are. And there's already a lot of competition for reducing kernel overhead (BPF, io_uring, FPGAs, DPDK, etc.)
● Once one use case is found, it may form a valuable community around X and unikernels. But it needs the published use case to start, preferably from a FAANG.
● It does need to be 2x or more, not 20%, to overcome the cost of retooling everything, redoing all observability metrics, profilers, etc. It's not impossible, but not easy [Gregg 16].
● More OS-research-style wins found from hybrid- and micro-kernels.
81. 81
Computing Performance: On the Horizon (Brendan Gregg)
Containers
Cgroup v2 rollout
Container scheduler adoption
● Kubernetes, OpenStack, and more
● Netflix develops its own, called "Titus" [Joshi 18]
● Price/performance gains: "Tetris-packing" workloads without too much interference (clever scheduler)
Many perf tools are still not "container aware"
● Usage in a container is not restricted to the container, or not permitted by default (needs CAP_PERFMON, CAP_SYS_PTRACE, CAP_SYS_ADMIN)
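A quick way to check whether a host has the cgroup v2 unified hierarchy mounted (a sketch):
# cat /sys/fs/cgroup/cgroup.controllers
If the file exists and lists controllers (cpu, io, memory, ...), cgroup v2 is in use.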
82. 82
Computing Performance: On the Horizon (Brendan Gregg)
Hardware Hypervisors
Source: Systems Performance 2nd Edition, Figure 11.17 [Gregg 20]
84. 84
Computing Performance: On the Horizon (Brendan Gregg)
Lightweight VMs
Source: Systems Performance 2nd Edition, Figure 11.4 [Gregg 20]
E.g., AWS "Firecracker"
85. 85
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Containers
Perf tools take several years to be fully "container aware"
● Includes non-root BPF work.
● It's a lot of work, and not enough engineers are working on it. We'll use workarounds in the meantime (e.g., Kyle Anderson and Sargun Dhillon have made perf tools work in containers at Netflix).
● It was the same with Solaris Zones (a long, slow process).
86. 86
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Landscape
Short term:
● Containers everywhere
Long term:
● More containers than VMs
● More lightweight-VM cores than container cores
– The hottest workloads switch to dedicated kernels (no kernel resource sharing, no seccomp overhead, no overlay overhead, full perf tool access, PGO kernels, etc.)
87. 87
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Evolution
1. FaaS (light workload)
2. Container
3. Lightweight VM
4. Metal (heavy workload)
Many apps aren't heavy
Metal can also mean a single container on metal
88. 88
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Cloud Computing
Microservice IPC cost drives the need for:
● Container schedulers co-locating chatty services
– With BPF-based accelerated networking between them (e.g., Cilium)
● Cloud-wide runtime schedulers co-locating apps
– Multiple apps under one JVM roof and process address space
90. 90
Computing Performance: On the Horizon (Brendan Gregg)
USENIX LISA 2010: Heat maps
2021: Latency heat maps everywhere
[Gregg 10]
91. 91
Computing Performance: On the Horizon (Brendan Gregg)
USENIX LISA 2013: Flame graphs
2021: Flame graphs everywhere
[Gregg 13]
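The usual recipe uses perf(1) with the FlameGraph tools (99 Hz and 30 seconds are arbitrary choices):
# git clone https://github.com/brendangregg/FlameGraph && cd FlameGraph
# perf record -F 99 -a -g -- sleep 30
# perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg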
92. 92
Computing Performance: On the Horizon (Brendan Gregg)
USENIX LISA 2016: BPF
2021: BPF heading everywhere
[Gregg 16b]
93. 93
Computing Performance: On the Horizon (Brendan Gregg)
2014
I began working on Linux on EC2 perf
● No DTrace
● No PMCs
Another perf expert questioned my move: how will I do advanced perf if I can't see anything?
Answer: I'll be motivated to fix it.
94. 94
Computing Performance: On the Horizon (Brendan Gregg)
2021: Age of Seeing
● BPF & bpftrace
● PMCs in the cloud
99. Computing Performance: On the Horizon (Brendan Gregg) 99
libbpf-tools
# ./opensnoop
PID COMM FD ERR PATH
27974 opensnoop 28 0 /etc/localtime
1482 redis-server 7 0 /proc/1482/stat
[…]
# ldd opensnoop
linux-vdso.so.1 (0x00007ffddf3f1000)
libelf.so.1 => /usr/lib/x86_64-linux-gnu/libelf.so.1 (0x00007f9fb7836000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f9fb7619000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9fb7228000)
/lib64/ld-linux-x86-64.so.2 (0x00007f9fb7c76000)
# ls -lh opensnoop opensnoop.stripped
-rwxr-xr-x 1 root root 645K Feb 28 23:18 opensnoop
-rwxr-xr-x 1 root root 151K Feb 28 23:33 opensnoop.stripped
● 151 Kbytes for a stand-alone BPF program!
● (Note: a static bpftrace/BTF + scripts will also have a small average tool size)
100. 100
Computing Performance: On the Horizon (Brendan Gregg)
Modern Observability Stack
OpenTelemetry
● Standard for monitoring and tracing
Prometheus
● Monitoring database
Grafana
● UI with dashboards
Source: Figure 1.4 [Gregg 20]
101. 101
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: BPF tool front-ends
bpftrace
● For one-liners and hacking up new tools
● When you want to spend an afternoon developing some custom BPF tracing
libbpf-tools
● For packaged BPF binary tools and BPF products
● When you want to spend weeks developing BPF
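For example, a bpftrace one-liner that traces file opens system-wide (an illustration; openat covers most modern opens):
# bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'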
102. 102
Computing Performance: On the Horizon (Brendan Gregg)
My Prediction: Too many BPF tools
(I'm partly to blame)
2014: I have no tools for this problem
2024: I have too many tools for this problem
Tool creators: focus on solving something no other tool can. Necessity is the mother of good BPF tools.
103. Computing Performance: On the Horizon (Brendan Gregg) 103
My Prediction: BPF perf tool future
GUIs, not CLI tools
Tool output, visualized
This GUI is in development by Susie Xia, Netflix
The end user may not even know it’s using BPF
104. Computing Performance: On the Horizon (Brendan Gregg) 104
My Prediction: Flame scope adoption
Analyze variance and perturbations: a subsecond-offset heat map for selecting a time range, plus a flame graph [Spier 20]
107. Computing Performance: On the Horizon (Brendan Gregg) 107
References
[Gregg 08] Brendan Gregg, "ZFS L2ARC," http://www.brendangregg.com/blog/2008-07-22/zfs-l2arc.html, Jul 2008
[Gregg 10] Brendan Gregg, "Visualizations for Performance Analysis (and More)," https://www.usenix.org/conference/lisa10/visualizations-performance-analysis-and-more, 2010
[Greenberg 11] Marc Greenberg, "DDR4: Double the speed, double the latency? Make sure your system can handle next-generation DRAM," https://www.chipestimate.com/DDR4-Double-the-speed-double-the-latencyMake-sure-your-system-can-handle-next-generation-DRAM/Cadence/Technical-Article/2011/11/22, Nov 2011
[Hruska 12] Joel Hruska, "The future of CPU scaling: Exploring options on the cutting edge," https://www.extremetech.com/computing/184946-14nm-7nm-5nm-how-low-can-cmos-go-it-depends-if-you-ask-the-engineers-or-the-economists, Feb 2012
[Gregg 13] Brendan Gregg, "Blazing Performance with Flame Graphs," https://www.usenix.org/conference/lisa13/technical-sessions/plenary/gregg, 2013
[Shimpi 13] Anand Lal Shimpi, "Seagate to Ship 5TB HDD in 2014 using Shingled Magnetic Recording," https://www.anandtech.com/show/7290/seagate-to-ship-5tb-hdd-in-2014-using-shingled-magnetic-recording, Sep 2013
[Borkmann 14] Daniel Borkmann, "net: tcp: add DCTCP congestion control algorithm," https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e3118e8359bb7c59555aca60c725106e6d78c5ce, 2014
[Macri 15] Joe Macri, "Introducing HBM," https://www.amd.com/en/technologies/hbm, Jul 2015
[Cardwell 16] Neal Cardwell, et al., "BBR: Congestion-Based Congestion Control," https://queue.acm.org/detail.cfm?id=3022184, 2016
[Gregg 16] Brendan Gregg, "Unikernel Profiling: Flame Graphs from dom0," http://www.brendangregg.com/blog/2016-01-27/unikernel-profiling-from-dom0.html, Jan 2016
108. Computing Performance: On the Horizon (Brendan Gregg) 108
References (2)
[Gregg 16b] Brendan Gregg, "Linux 4.X Tracing Tools: Using BPF Superpowers," https://www.usenix.org/conference/lisa16/conference-program/presentation/linux-4x-tracing-tools-using-bpf-superpowers, 2016
[Alcorn 17] Paul Alcorn, "Seagate To Double HDD Speed With Multi-Actuator Technology," https://www.tomshardware.com/news/hdd-multi-actuator-heads-seagate,36132.html, 2017
[Alcorn 17b] Paul Alcorn, "Hot Chips 2017: Intel Deep Dives Into EMIB," https://www.tomshardware.com/news/intel-emib-interconnect-fpga-chiplet,35316.html#xenforo-comments-3112212, 2017
[Corbet 17] Jonathan Corbet, "Two new block I/O schedulers for 4.12," https://lwn.net/Articles/720675, Apr 2017
[Gregg 17] Brendan Gregg, "AWS EC2 Virtualization 2017: Introducing Nitro," http://www.brendangregg.com/blog/2017-11-29/aws-ec2-virtualization-2017.html, Nov 2017
[Russinovich 17] Mark Russinovich, "Inside the Microsoft FPGA-based configurable cloud," https://www.microsoft.com/en-us/research/video/inside-microsoft-fpga-based-configurable-cloud, 2017
[Gregg 18] Brendan Gregg, "Linux Performance 2018," http://www.brendangregg.com/Slides/Percona2018_Linux_Performance.pdf, 2018
[Hady 18] Frank Hady, "Achieve Consistent Low Latency for Your Storage-Intensive Workloads," https://www.intel.com/content/www/us/en/architecture-and-technology/optane-technology/low-latency-for-storage-intensive-workloads-article-brief.html, 2018
[Joshi 18] Amit Joshi, et al., "Titus, the Netflix container management platform, is now open source," https://netflixtechblog.com/titus-the-netflix-container-management-platform-is-now-open-source-f868c9fb5436, Apr 2018
109. Computing Performance: On the Horizon (Brendan Gregg) 109
References (3)
[Cutress 19] Dr. Ian Cutress, "Xilinx Announces World Largest FPGA: Virtex Ultrascale+ VU19P with 9m Cells," https://www.anandtech.com/show/14798/xilinx-announces-world-largest-fpga-virtex-ultrascale-vu19p-with-9m-cells, Aug 2019
[Gallatin 19] Drew Gallatin, "Kernel TLS and hardware TLS offload in FreeBSD 13," https://people.freebsd.org/~gallatin/talks/euro2019-ktls.pdf, 2019
[Redestad 19] Claes Redestad, Staffan Friberg, Aleksey Shipilev, "JEP 230: Microbenchmark Suite," http://openjdk.java.net/jeps/230, updated 2019
[Bearman 20] Ian Bearman, "Exploring Profile Guided Optimization of the Linux Kernel," https://linuxplumbersconf.org/event/7/contributions/771, 2020
[Burnes 20] Andrew Burnes, "GeForce RTX 30 Series Graphics Cards: The Ultimate Play," https://www.nvidia.com/en-us/geforce/news/introducing-rtx-30-series-graphics-cards, Sep 2020
[Charlene 20] Charlene, "800G Is Coming: Set Pace to More Higher Speed Applications," https://community.fs.com/blog/800-gigabit-ethernet-and-optics.html, May 2020
[Cutress 20] Dr. Ian Cutress, "Insights into DDR5 Sub-timings and Latencies," https://www.anandtech.com/show/16143/insights-into-ddr5-subtimings-and-latencies, Oct 2020
[Ford 20] A. Ford, et al., "TCP Extensions for Multipath Operation with Multiple Addresses," https://datatracker.ietf.org/doc/html/rfc8684, Mar 2020
[Gregg 20] Brendan Gregg, "Systems Performance: Enterprise and the Cloud, Second Edition," Addison-Wesley, 2020
[Hruska 20] Joel Hruska, "Intel Demos PCIe 5.0 on Upcoming Sapphire Rapids CPUs," https://www.extremetech.com/computing/316257-intel-demos-pcie-5-0-on-upcoming-sapphire-rapids-cpus, Oct 2020
110. Computing Performance: On the Horizon (Brendan Gregg) 110
References (4)
[Liu 20] Linda Liu, "Samsung QVO vs EVO vs PRO: What's the Difference? [Clone Disk]," https://www.partitionwizard.com/clone-disk/samsung-qvo-vs-evo.html, 2020
[Moore 20] Samuel K. Moore, "A Better Way to Measure Progress in Semiconductors," https://spectrum.ieee.org/semiconductors/devices/a-better-way-to-measure-progress-in-semiconductors, Jul 2020
[Peterson 20] Zachariah Peterson, "DDR5 vs. DDR6: Here's What to Expect in RAM Modules," https://resources.altium.com/p/ddr5-vs-ddr6-heres-what-expect-ram-modules, Nov 2020
[Salter 20] Jim Salter, "Western Digital releases new 18TB, 20TB EAMR drives," https://arstechnica.com/gadgets/2020/07/western-digital-releases-new-18tb-20tb-eamr-drives, Jul 2020
[Spier 20] Martin Spier, Brendan Gregg, et al., "FlameScope," https://github.com/Netflix/flamescope, 2020
[Tolvanen 20] Sami Tolvanen, Bill Wendling, and Nick Desaulniers, "LTO, PGO, and AutoFDO in the Kernel," Linux Plumber's Conference, https://linuxplumbersconf.org/event/7/contributions/798, 2020
[Vega 20] Juan Camilo Vega, Marco Antonio Merlini, Paul Chow, "FFShark: A 100G FPGA Implementation of BPF Filtering for Wireshark," IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2020
[Warren 20] Tom Warren, "Microsoft reportedly designing its own ARM-based chips for servers and Surface PCs," https://www.theverge.com/2020/12/18/22189450/microsoft-arm-processors-chips-servers-surface-report, Dec 2020
[Google 21] Google, "Cloud TPU," https://cloud.google.com/tpu, 2021
[Haken 21] Michael Haken, et al., "Delta Lake 1S Server Design Specification 1v05," https://www.opencompute.org/documents/delta-lake-1s-server-design-specification-1v05-pdf, 2021
[Intel 21] Intel Corporation, "Intel® Optane™ Technology," https://www.intel.com/content/www/us/en/products/docs/storage/optane-technology-brief.html, 2021
[Google 21] Google, “Cloud TPU,” https://cloud.google.com/tpu, 2021
[Haken 21] Michael Haken, et al., “Delta Lake 1S Server Design Specification 1v05, https://www.opencompute.org/documents/delta-
lake-1s-server-design-specification-1v05-pdf, 2021
[Intel 21] Intel corporation, “Intel® OptaneTM Technology,” https://www.intel.com/content/www/us/en/products/docs/storage/optane-
technology-brief.html, 2021
111. Computing Performance: On the Horizon (Brendan Gregg) 111
References (5)
[Quach 21a] Katyanna Quach, "Global chip shortage probably won't let up until 2023, warns TSMC: CEO 'still expects capacity to tighten more'," https://www.theregister.com/2021/04/16/tsmc_chip_forecast, Apr 2021
[Quach 21b] Katyanna Quach, "IBM says it's built the world's first 2nm semiconductor chips," https://www.theregister.com/2021/05/06/ibm_2nm_semiconductor_chips, May 2021
[Ridley 21] Jacob Ridley, "IBM agrees with Intel and TSMC: this chip shortage isn't going to end anytime soon," https://www.pcgamer.com/ibm-agrees-with-intel-and-tsmc-this-chip-shortage-isnt-going-to-end-anytime-soon, May 2021
[Shilov 21] Anton Shilov, "Samsung Develops 512GB DDR5 Module with HKMG DDR5 Chips," https://www.tomshardware.com/news/samsung-512gb-ddr5-memory-module, Mar 2021
[Shilov 21b] Anton Shilov, "Seagate Ships 20TB HAMR HDDs Commercially, Increases Shipments of Mach.2 Drives," https://www.tomshardware.com/news/seagate-ships-hamr-hdds-increases-dual-actuator-shipments, 2021
[Shilov 21c] Anton Shilov, "SK Hynix Envisions 600-Layer 3D NAND & EUV-Based DRAM," https://www.tomshardware.com/news/sk-hynix-600-layer-3d-nand-euv-dram, Mar 2021
[Shilov 21d] Anton Shilov, "Sapphire Rapids Uncovered: 56 Cores, 64GB HBM2E, Multi-Chip Design," https://www.tomshardware.com/news/intel-sapphire-rapids-xeon-scalable-specifications-and-features, Apr 2021
[SuperMicro 21] SuperMicro, "B12SPE-CPU-25G (For SuperServer Only)," https://www.supermicro.com/en/products/motherboard/B12SPE-CPU-25G, 2021
[Thaler 21] Dave Thaler, Poorna Gaddehosur, "Making eBPF work on Windows," https://cloudblogs.microsoft.com/opensource/2021/05/10/making-ebpf-work-on-windows, May 2021
[TornadoVM 21] TornadoVM, "TornadoVM Run your software faster and simpler!" https://www.tornadovm.org, 2021
112. Computing Performance: On the Horizon (Brendan Gregg) 112
References (6)
[Trader 21] Tiffany Trader, "Cerebras Second-Gen 7nm Wafer Scale Engine Doubles AI Performance Over First-Gen Chip," https://www.enterpriseai.news/2021/04/21/latest-cerebras-second-gen-7nm-wafer-scale-engine-doubles-ai-performance-over-first-gen-chip, Apr 2021
[Vahdat 21] Amin Vahdat, "The past, present and future of custom compute at Google," https://cloud.google.com/blog/topics/systems/the-past-present-and-future-of-custom-compute-at-google, Mar 2021
[Wikipedia 21a] "Semiconductor device fabrication," https://en.wikipedia.org/wiki/Semiconductor_device_fabrication, 2021
[Wikipedia 21b] "Silicon," https://en.wikipedia.org/wiki/Silicon, 2021
[ZonedStorage 21] Zoned Storage, "Zoned Namespaces (ZNS) SSDs," https://zonedstorage.io/introduction/zns, 2021
113. 113
Computing Performance: On the Horizon (Brendan Gregg)
Thanks
Thanks for watching!
Slides: http://www.brendangregg.com/Slides/LISA2021_ComputingPerformance.pdf
Video: https://www.usenix.org/conference/lisa21/presentation/gregg-computing
Thanks to colleagues Jason Koch, Sargun Dhillon, and Drew Gallatin for their
performance engineering expertise.
Thanks to USENIX and the LISA organizers!
USENIX
Jun, 2021