Talk for PerconaLive 2016 by Brendan Gregg. Video: https://www.youtube.com/watch?v=CbmEDXq7es0 . "Systems performance provides a different perspective for analysis and tuning, and can help you find performance wins for your databases, applications, and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes six important areas of Linux systems performance in 50 minutes: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events), static tracing (tracepoints), and dynamic tracing (kprobes, uprobes), and much advice about what is and isn't important to learn. This talk is aimed at everyone: DBAs, developers, operations, etc, and in any environment running Linux, bare-metal or the cloud."
LOPSA SD 2014.03.27 Presentation on Linux Performance Analysis
An introduction using the USE method and showing how several tools fit into those resource evaluations.
Talk for QConSF 2015: "Broken benchmarks, misleading metrics, and terrible tools. This talk will help you navigate the treacherous waters of system performance tools, touring common problems with system metrics, monitoring, statistics, visualizations, measurement overhead, and benchmarks. This will likely involve some unlearning, as you discover tools you have been using for years, are in fact, misleading, dangerous, or broken.
The speaker, Brendan Gregg, has given many popular talks on operating system performance tools. This is an anti-version of these talks, to focus on broken tools and metrics instead of the working ones. Metrics can be misleading, and counters can be counter-intuitive! This talk will include advice and methodologies for verifying new performance tools, understanding how they work, and using them successfully."
Systems Performance: Enterprise and the CloudBrendan Gregg
My talk for BayLISA, Oct 2013, launching the Systems Performance book. Operating system performance analysis and tuning leads to a better end-user experience and lower costs, especially for cloud computing environments that pay by the operating system instance. This book covers concepts, strategy, tools and tuning for Unix operating systems, with a focus on Linux- and Solaris-based systems. The book covers the latest tools and techniques, including static and dynamic tracing, to get the most out of your systems.
How Netflix Tunes EC2 Instances for PerformanceBrendan Gregg
CMP325 talk for AWS re:Invent 2017, by Brendan Gregg. "
At Netflix we make the best use of AWS EC2 instance types and features to create a high performance cloud, achieving near bare metal speed for our workloads. This session will summarize the configuration, tuning, and activities for delivering the fastest possible EC2 instances, and will help other EC2 users improve performance, reduce latency outliers, and make better use of EC2 features. We'll show how we choose EC2 instance types, how we choose between EC2 Xen modes: HVM, PV, and PVHVM, and the importance of EC2 features such SR-IOV for bare-metal performance. SR-IOV is used by EC2 enhanced networking, and recently for the new i3 instance type for enhanced disk performance as well. We'll also cover kernel tuning and observability tools, from basic to advanced. Advanced performance analysis includes the use of Java and Node.js flame graphs, and the new EC2 Performance Monitoring Counter (PMC) feature released this year."
Talk for PerconaLive 2016 by Brendan Gregg. Video: https://www.youtube.com/watch?v=CbmEDXq7es0 . "Systems performance provides a different perspective for analysis and tuning, and can help you find performance wins for your databases, applications, and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes six important areas of Linux systems performance in 50 minutes: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events), static tracing (tracepoints), and dynamic tracing (kprobes, uprobes), and much advice about what is and isn't important to learn. This talk is aimed at everyone: DBAs, developers, operations, etc, and in any environment running Linux, bare-metal or the cloud."
LOPSA SD 2014.03.27 Presentation on Linux Performance Analysis
An introduction using the USE method and showing how several tools fit into those resource evaluations.
Talk for QConSF 2015: "Broken benchmarks, misleading metrics, and terrible tools. This talk will help you navigate the treacherous waters of system performance tools, touring common problems with system metrics, monitoring, statistics, visualizations, measurement overhead, and benchmarks. This will likely involve some unlearning, as you discover tools you have been using for years, are in fact, misleading, dangerous, or broken.
The speaker, Brendan Gregg, has given many popular talks on operating system performance tools. This is an anti-version of these talks, to focus on broken tools and metrics instead of the working ones. Metrics can be misleading, and counters can be counter-intuitive! This talk will include advice and methodologies for verifying new performance tools, understanding how they work, and using them successfully."
Systems Performance: Enterprise and the CloudBrendan Gregg
My talk for BayLISA, Oct 2013, launching the Systems Performance book. Operating system performance analysis and tuning leads to a better end-user experience and lower costs, especially for cloud computing environments that pay by the operating system instance. This book covers concepts, strategy, tools and tuning for Unix operating systems, with a focus on Linux- and Solaris-based systems. The book covers the latest tools and techniques, including static and dynamic tracing, to get the most out of your systems.
How Netflix Tunes EC2 Instances for PerformanceBrendan Gregg
CMP325 talk for AWS re:Invent 2017, by Brendan Gregg. "
At Netflix we make the best use of AWS EC2 instance types and features to create a high performance cloud, achieving near bare metal speed for our workloads. This session will summarize the configuration, tuning, and activities for delivering the fastest possible EC2 instances, and will help other EC2 users improve performance, reduce latency outliers, and make better use of EC2 features. We'll show how we choose EC2 instance types, how we choose between EC2 Xen modes: HVM, PV, and PVHVM, and the importance of EC2 features such SR-IOV for bare-metal performance. SR-IOV is used by EC2 enhanced networking, and recently for the new i3 instance type for enhanced disk performance as well. We'll also cover kernel tuning and observability tools, from basic to advanced. Advanced performance analysis includes the use of Java and Node.js flame graphs, and the new EC2 Performance Monitoring Counter (PMC) feature released this year."
OSSNA 2017 Performance Analysis Superpowers with Linux BPFBrendan Gregg
Talk by Brendan Gregg for OSSNA 2017. "Advanced performance observability and debugging have arrived built into the Linux 4.x series, thanks to enhancements to Berkeley Packet Filter (BPF, or eBPF) and the repurposing of its sandboxed virtual machine to provide programmatic capabilities to system tracing. Netflix has been investigating its use for new observability tools, monitoring, security uses, and more. This talk will be a dive deep on these new tracing, observability, and debugging capabilities, which sooner or later will be available to everyone who uses Linux. Whether you’re doing analysis over an ssh session, or via a monitoring GUI, BPF can be used to provide an efficient, custom, and deep level of detail into system and application performance.
This talk will also demonstrate the new open source tools that have been developed, which make use of kernel- and user-level dynamic tracing (kprobes and uprobes), and kernel- and user-level static tracing (tracepoints). These tools provide new insights for file system and storage performance, CPU scheduler performance, TCP performance, and a whole lot more. This is a major turning point for Linux systems engineering, as custom advanced performance instrumentation can be used safely in production environments, powering a new generation of tools and visualizations."
A brief talk on systems performance for the July 2013 meetup "A Midsummer Night's System", video: http://www.youtube.com/watch?v=P3SGzykDE4Q. This summarizes how systems performance has changed from the 1990's to today. This was the reason for writing a new book on systems performance, to provide a reference that is up to date, covering new tools, technologies, and methodologies.
zfsday talk (a video is on the last slide). The performance of the file system, or disks, is often the target of blame, especially in multi-tenant cloud environments. At Joyent we deploy a public cloud on ZFS-based systems, and frequently investigate performance with a wide variety of applications in growing environments. This talk is about ZFS performance observability, showing the tools and approaches we use to quickly show what ZFS is doing. This includes observing ZFS I/O throttling, an enhancement added to illumos-ZFS to isolate performance between neighbouring tenants, and the use of DTrace and heat maps to examine latency distributions and locate outliers.
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Amazon Web Services
Your AMI is one of the core foundations for running applications and services effectively on Amazon EC2. In this session, you'll learn how to optimize your AMI, including how you can measure and diagnose system performance and tune parameters for improved CPU and network performance. We'll cover application-specific examples from Netflix on how optimized AMIs can lead to improved performance.
Broken benchmarks, misleading metrics, and terrible tools. This talk will help you navigate the treacherous waters of Linux performance tools, touring common problems with system tools, metrics, statistics, visualizations, measurement overhead, and benchmarks. You might discover that tools you have been using for years, are in fact, misleading, dangerous, or broken.
The speaker, Brendan Gregg, has given many talks on tools that work, including giving the Linux PerformanceTools talk originally at SCALE. This is an anti-version of that talk, to focus on broken tools and metrics instead of the working ones. Metrics can be misleading, and counters can be counter-intuitive! This talk will include advice for verifying new performance tools, understanding how they work, and using them successfully.
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedAnne Nicolas
Ftrace’s most powerful feature is the function tracer (and function graph tracer which is built from it). But to have this enabled on production systems, it had to have its overhead be negligible when disabled. As the function tracer uses gcc’s profiling mechanism, which adds a call to “mcount” (or more recently fentry, don’t worry if you don’t know what this is, it will all be explained) at the start of almost all functions, it had to do something about the overhead that causes. The solution was to turn those calls into “nops” (an instruction that the CPU simply ignores). But this was no easy feat. It took a lot to come up with a solution (and also turning a few network cards into bricks). This talk will explain the history of how ftrace came about implementing the function tracer, and brought with it the possibility of static branches and soon static calls!
Steven Rostedt
Performance Analysis: new tools and concepts from the cloudBrendan Gregg
Talk delivered at SCaLE10x, Los Angeles 2012.
Cloud Computing introduces new challenges for performance
analysis, for both customers and operators of the cloud. Apart from
monitoring a scaling environment, issues within a system can be
complicated when tenants are competing for the same resources, and are
invisible to each other. Other factors include rapidly changing
production code and wildly unpredictable traffic surges. For
performance analysis in the Joyent public cloud, we use a variety of
tools including Dynamic Tracing, which allows us to create custom
tools and metrics and to explore new concepts. In this presentation
I'll discuss a collection of these tools and the metrics that they
measure. While these are DTrace-based, the focus of the talk is on
which metrics are proving useful for analyzing real cloud issues.
"Performance Analysis Methodologies", USENIX/LISA12, San Diego, 2012
Performance analysis methodologies provide guidance, save time, and can find issues that are otherwise overlooked. Example issues include hardware bus saturation, lock contention, recoverable device errors, kernel scheduling issues, and unnecessary workloads. The talk will focus on the USE Method: a simple strategy for all staff for performing a complete check of system performance health, identifying common bottlenecks and errors. Other analysis methods discussed include workload characterization, drill-down analysis, and latency analysis, with example applications from enterprise and cloud computing. Don’t just reach for tools—use a method!
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven RostedtAnne Nicolas
Ftrace is the official tracer of the Linux kernel. It has been apart of Linux since 2.6.31, and has grown tremendously ever since. Ftrace’s name comes from its most powerful feature: function tracing. But the ftrace infrastructure is much more than that. It also encompasses the trace events that are used by perf, as well as kprobes that can dynamically add trace events that the user defines.
This talk will focus on learning how the kernel works by using the ftrace infrastructure. It will show how to see what happens within the kernel during a system call; learn how interrupts work; see how ones processes are being scheduled, and more. A quick introduction to some tools like trace-cmd and KernelShark will also be demonstrated.
Steven Rostedt, VMware
Delivered as plenary at USENIX LISA 2013. video here: https://www.youtube.com/watch?v=nZfNehCzGdw and https://www.usenix.org/conference/lisa13/technical-sessions/plenary/gregg . "How did we ever analyze performance before Flame Graphs?" This new visualization invented by Brendan can help you quickly understand application and kernel performance, especially CPU usage, where stacks (call graphs) can be sampled and then visualized as an interactive flame graph. Flame Graphs are now used for a growing variety of targets: for applications and kernels on Linux, SmartOS, Mac OS X, and Windows; for languages including C, C++, node.js, ruby, and Lua; and in WebKit Web Inspector. This talk will explain them and provide use cases and new visualizations for other event types, including I/O, memory usage, and latency.
Tracing Summit 2014, Düsseldorf. What can Linux learn from DTrace: what went well, and what didn't go well, on its path to success? This talk will discuss not just the DTrace software, but lessons from the marketing and adoption of a system tracer, and an inside look at how DTrace was really deployed and used in production environments. It will also cover ongoing problems with DTrace, and how Linux may surpass them and continue to advance the field of system tracing. A world expert and core contributor to DTrace, Brendan now works at Netflix on Linux performance with the various Linux tracers (ftrace, perf_events, eBPF, SystemTap, ktap, sysdig, LTTng, and the DTrace Linux ports), and will summarize his experiences and suggestions for improvements. He has also been contributing to various tracers: recently promoting ftrace and perf_events adoption through articles and front-end scripts, and testing eBPF.
From USENIX LISA 2010, San Jose.
Visualizations that include heat maps can be an effective way to present performance data: I/O latency, resource utilization, and more. Patterns can emerge that would be difficult to notice from columns of numbers or line graphs, which are revealing previously unknown behavior. These visualizations are used in a product as a replacement for traditional metrics such as %CPU and are allowing end users to identify more issues much more easily (and some issues are becoming nearly impossible to identify with tools such as vmstat(1)). This talk covers what has been learned, crazy heat map discoveries, and thoughts for future applications beyond performance analysis.
Computing Performance: On the Horizon (2021)Brendan Gregg
Talk by Brendan Gregg for USENIX LISA 2021. https://www.youtube.com/watch?v=5nN1wjA_S30 . "The future of computer performance involves clouds with hardware hypervisors and custom processors, servers running a new type of BPF software to allow high-speed applications and kernel customizations, observability of everything in production, new Linux kernel technologies, and more. This talk covers interesting developments in systems and computing performance, their challenges, and where things are headed."
Video: https://www.youtube.com/watch?v=FJW8nGV4jxY and https://www.youtube.com/watch?v=zrr2nUln9Kk . Tutorial slides for O'Reilly Velocity SC 2015, by Brendan Gregg.
There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? This tutorial explains methodologies for using these tools, and provides a tour of four tool types: observability, benchmarking, tuning, and static tuning. Many tools will be discussed, including top, iostat, tcpdump, sar, perf_events, ftrace, SystemTap, sysdig, and others, as well observability frameworks in the Linux kernel: PMCs, tracepoints, kprobes, and uprobes.
This tutorial is updated and extended on an earlier talk that summarizes the Linux performance tool landscape. The value of this tutorial is not just learning that these tools exist and what they do, but hearing when and how they are used by a performance engineer to solve real world problems — important context that is typically not included in the standard documentation.
Delivered at the FISL13 conference in Brazil: http://www.youtube.com/watch?v=K9w2cipqfvc
This talk introduces the USE Method: a simple strategy for performing a complete check of system performance health, identifying common bottlenecks and errors. This methodology can be used early in a performance investigation to quickly identify the most severe system performance issues, and is a methodology the speaker has used successfully for years in both enterprise and cloud computing environments. Checklists have been developed to show how the USE Method can be applied to Solaris/illumos-based and Linux-based systems.
Many hardware and software resource types have been commonly overlooked, including memory and I/O busses, CPU interconnects, and kernel locks. Any of these can become a system bottleneck. The USE Method provides a way to find and identify these.
This approach focuses on the questions to ask of the system, before reaching for the tools. Tools that are ultimately used include all the standard performance tools (vmstat, iostat, top), and more advanced tools, including dynamic tracing (DTrace), and hardware performance counters.
Other performance methodologies are included for comparison: the Problem Statement Method, Workload Characterization Method, and Drill-Down Analysis Method.
Velocity 2017 Performance analysis superpowers with Linux eBPFBrendan Gregg
Talk by for Velocity 2017 by Brendan Gregg: Performance analysis superpowers with Linux eBPF.
"Advanced performance observability and debugging have arrived built into the Linux 4.x series, thanks to enhancements to Berkeley Packet Filter (BPF, or eBPF) and the repurposing of its sandboxed virtual machine to provide programmatic capabilities to system tracing. Netflix has been investigating its use for new observability tools, monitoring, security uses, and more. This talk will investigate this new technology, which sooner or later will be available to everyone who uses Linux. The talk will dive deep on these new tracing, observability, and debugging capabilities. Whether you’re doing analysis over an ssh session, or via a monitoring GUI, BPF can be used to provide an efficient, custom, and deep level of detail into system and application performance.
This talk will also demonstrate the new open source tools that have been developed, which make use of kernel- and user-level dynamic tracing (kprobes and uprobes), and kernel- and user-level static tracing (tracepoints). These tools provide new insights for file system and storage performance, CPU scheduler performance, TCP performance, and a whole lot more. This is a major turning point for Linux systems engineering, as custom advanced performance instrumentation can be used safely in production environments, powering a new generation of tools and visualizations."
OSSNA 2017 Performance Analysis Superpowers with Linux BPFBrendan Gregg
Talk by Brendan Gregg for OSSNA 2017. "Advanced performance observability and debugging have arrived built into the Linux 4.x series, thanks to enhancements to Berkeley Packet Filter (BPF, or eBPF) and the repurposing of its sandboxed virtual machine to provide programmatic capabilities to system tracing. Netflix has been investigating its use for new observability tools, monitoring, security uses, and more. This talk will be a dive deep on these new tracing, observability, and debugging capabilities, which sooner or later will be available to everyone who uses Linux. Whether you’re doing analysis over an ssh session, or via a monitoring GUI, BPF can be used to provide an efficient, custom, and deep level of detail into system and application performance.
This talk will also demonstrate the new open source tools that have been developed, which make use of kernel- and user-level dynamic tracing (kprobes and uprobes), and kernel- and user-level static tracing (tracepoints). These tools provide new insights for file system and storage performance, CPU scheduler performance, TCP performance, and a whole lot more. This is a major turning point for Linux systems engineering, as custom advanced performance instrumentation can be used safely in production environments, powering a new generation of tools and visualizations."
A brief talk on systems performance for the July 2013 meetup "A Midsummer Night's System", video: http://www.youtube.com/watch?v=P3SGzykDE4Q. This summarizes how systems performance has changed from the 1990's to today. This was the reason for writing a new book on systems performance, to provide a reference that is up to date, covering new tools, technologies, and methodologies.
zfsday talk (a video is on the last slide). The performance of the file system, or disks, is often the target of blame, especially in multi-tenant cloud environments. At Joyent we deploy a public cloud on ZFS-based systems, and frequently investigate performance with a wide variety of applications in growing environments. This talk is about ZFS performance observability, showing the tools and approaches we use to quickly show what ZFS is doing. This includes observing ZFS I/O throttling, an enhancement added to illumos-ZFS to isolate performance between neighbouring tenants, and the use of DTrace and heat maps to examine latency distributions and locate outliers.
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Amazon Web Services
Your AMI is one of the core foundations for running applications and services effectively on Amazon EC2. In this session, you'll learn how to optimize your AMI, including how you can measure and diagnose system performance and tune parameters for improved CPU and network performance. We'll cover application-specific examples from Netflix on how optimized AMIs can lead to improved performance.
Broken benchmarks, misleading metrics, and terrible tools. This talk will help you navigate the treacherous waters of Linux performance tools, touring common problems with system tools, metrics, statistics, visualizations, measurement overhead, and benchmarks. You might discover that tools you have been using for years, are in fact, misleading, dangerous, or broken.
The speaker, Brendan Gregg, has given many talks on tools that work, including giving the Linux PerformanceTools talk originally at SCALE. This is an anti-version of that talk, to focus on broken tools and metrics instead of the working ones. Metrics can be misleading, and counters can be counter-intuitive! This talk will include advice for verifying new performance tools, understanding how they work, and using them successfully.
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedAnne Nicolas
Ftrace’s most powerful feature is the function tracer (and function graph tracer which is built from it). But to have this enabled on production systems, it had to have its overhead be negligible when disabled. As the function tracer uses gcc’s profiling mechanism, which adds a call to “mcount” (or more recently fentry, don’t worry if you don’t know what this is, it will all be explained) at the start of almost all functions, it had to do something about the overhead that causes. The solution was to turn those calls into “nops” (an instruction that the CPU simply ignores). But this was no easy feat. It took a lot to come up with a solution (and also turning a few network cards into bricks). This talk will explain the history of how ftrace came about implementing the function tracer, and brought with it the possibility of static branches and soon static calls!
Steven Rostedt
Performance Analysis: new tools and concepts from the cloudBrendan Gregg
Talk delivered at SCaLE10x, Los Angeles 2012.
Cloud Computing introduces new challenges for performance
analysis, for both customers and operators of the cloud. Apart from
monitoring a scaling environment, issues within a system can be
complicated when tenants are competing for the same resources, and are
invisible to each other. Other factors include rapidly changing
production code and wildly unpredictable traffic surges. For
performance analysis in the Joyent public cloud, we use a variety of
tools including Dynamic Tracing, which allows us to create custom
tools and metrics and to explore new concepts. In this presentation
I'll discuss a collection of these tools and the metrics that they
measure. While these are DTrace-based, the focus of the talk is on
which metrics are proving useful for analyzing real cloud issues.
"Performance Analysis Methodologies", USENIX/LISA12, San Diego, 2012
Performance analysis methodologies provide guidance, save time, and can find issues that are otherwise overlooked. Example issues include hardware bus saturation, lock contention, recoverable device errors, kernel scheduling issues, and unnecessary workloads. The talk will focus on the USE Method: a simple strategy for all staff for performing a complete check of system performance health, identifying common bottlenecks and errors. Other analysis methods discussed include workload characterization, drill-down analysis, and latency analysis, with example applications from enterprise and cloud computing. Don’t just reach for tools—use a method!
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven RostedtAnne Nicolas
Ftrace is the official tracer of the Linux kernel. It has been apart of Linux since 2.6.31, and has grown tremendously ever since. Ftrace’s name comes from its most powerful feature: function tracing. But the ftrace infrastructure is much more than that. It also encompasses the trace events that are used by perf, as well as kprobes that can dynamically add trace events that the user defines.
This talk will focus on learning how the kernel works by using the ftrace infrastructure. It will show how to see what happens within the kernel during a system call; learn how interrupts work; see how ones processes are being scheduled, and more. A quick introduction to some tools like trace-cmd and KernelShark will also be demonstrated.
Steven Rostedt, VMware
Delivered as plenary at USENIX LISA 2013. video here: https://www.youtube.com/watch?v=nZfNehCzGdw and https://www.usenix.org/conference/lisa13/technical-sessions/plenary/gregg . "How did we ever analyze performance before Flame Graphs?" This new visualization invented by Brendan can help you quickly understand application and kernel performance, especially CPU usage, where stacks (call graphs) can be sampled and then visualized as an interactive flame graph. Flame Graphs are now used for a growing variety of targets: for applications and kernels on Linux, SmartOS, Mac OS X, and Windows; for languages including C, C++, node.js, ruby, and Lua; and in WebKit Web Inspector. This talk will explain them and provide use cases and new visualizations for other event types, including I/O, memory usage, and latency.
Tracing Summit 2014, Düsseldorf. What can Linux learn from DTrace: what went well, and what didn't go well, on its path to success? This talk will discuss not just the DTrace software, but lessons from the marketing and adoption of a system tracer, and an inside look at how DTrace was really deployed and used in production environments. It will also cover ongoing problems with DTrace, and how Linux may surpass them and continue to advance the field of system tracing. A world expert and core contributor to DTrace, Brendan now works at Netflix on Linux performance with the various Linux tracers (ftrace, perf_events, eBPF, SystemTap, ktap, sysdig, LTTng, and the DTrace Linux ports), and will summarize his experiences and suggestions for improvements. He has also been contributing to various tracers: recently promoting ftrace and perf_events adoption through articles and front-end scripts, and testing eBPF.
From USENIX LISA 2010, San Jose.
Visualizations that include heat maps can be an effective way to present performance data: I/O latency, resource utilization, and more. Patterns can emerge that would be difficult to notice from columns of numbers or line graphs, which are revealing previously unknown behavior. These visualizations are used in a product as a replacement for traditional metrics such as %CPU and are allowing end users to identify more issues much more easily (and some issues are becoming nearly impossible to identify with tools such as vmstat(1)). This talk covers what has been learned, crazy heat map discoveries, and thoughts for future applications beyond performance analysis.
Computing Performance: On the Horizon (2021)Brendan Gregg
Talk by Brendan Gregg for USENIX LISA 2021. https://www.youtube.com/watch?v=5nN1wjA_S30 . "The future of computer performance involves clouds with hardware hypervisors and custom processors, servers running a new type of BPF software to allow high-speed applications and kernel customizations, observability of everything in production, new Linux kernel technologies, and more. This talk covers interesting developments in systems and computing performance, their challenges, and where things are headed."
Video: https://www.youtube.com/watch?v=FJW8nGV4jxY and https://www.youtube.com/watch?v=zrr2nUln9Kk . Tutorial slides for O'Reilly Velocity SC 2015, by Brendan Gregg.
There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? This tutorial explains methodologies for using these tools, and provides a tour of four tool types: observability, benchmarking, tuning, and static tuning. Many tools will be discussed, including top, iostat, tcpdump, sar, perf_events, ftrace, SystemTap, sysdig, and others, as well observability frameworks in the Linux kernel: PMCs, tracepoints, kprobes, and uprobes.
This tutorial is updated and extended on an earlier talk that summarizes the Linux performance tool landscape. The value of this tutorial is not just learning that these tools exist and what they do, but hearing when and how they are used by a performance engineer to solve real world problems — important context that is typically not included in the standard documentation.
Delivered at the FISL13 conference in Brazil: http://www.youtube.com/watch?v=K9w2cipqfvc
This talk introduces the USE Method: a simple strategy for performing a complete check of system performance health, identifying common bottlenecks and errors. This methodology can be used early in a performance investigation to quickly identify the most severe system performance issues, and is a methodology the speaker has used successfully for years in both enterprise and cloud computing environments. Checklists have been developed to show how the USE Method can be applied to Solaris/illumos-based and Linux-based systems.
Many hardware and software resource types have been commonly overlooked, including memory and I/O busses, CPU interconnects, and kernel locks. Any of these can become a system bottleneck. The USE Method provides a way to find and identify these.
This approach focuses on the questions to ask of the system, before reaching for the tools. Tools that are ultimately used include all the standard performance tools (vmstat, iostat, top), and more advanced tools, including dynamic tracing (DTrace), and hardware performance counters.
Other performance methodologies are included for comparison: the Problem Statement Method, Workload Characterization Method, and Drill-Down Analysis Method.
Velocity 2017 Performance analysis superpowers with Linux eBPFBrendan Gregg
Talk by for Velocity 2017 by Brendan Gregg: Performance analysis superpowers with Linux eBPF.
"Advanced performance observability and debugging have arrived built into the Linux 4.x series, thanks to enhancements to Berkeley Packet Filter (BPF, or eBPF) and the repurposing of its sandboxed virtual machine to provide programmatic capabilities to system tracing. Netflix has been investigating its use for new observability tools, monitoring, security uses, and more. This talk will investigate this new technology, which sooner or later will be available to everyone who uses Linux. The talk will dive deep on these new tracing, observability, and debugging capabilities. Whether you’re doing analysis over an ssh session, or via a monitoring GUI, BPF can be used to provide an efficient, custom, and deep level of detail into system and application performance.
This talk will also demonstrate the new open source tools that have been developed, which make use of kernel- and user-level dynamic tracing (kprobes and uprobes), and kernel- and user-level static tracing (tracepoints). These tools provide new insights for file system and storage performance, CPU scheduler performance, TCP performance, and a whole lot more. This is a major turning point for Linux systems engineering, as custom advanced performance instrumentation can be used safely in production environments, powering a new generation of tools and visualizations."
2011-12-08 Red Hat Enterprise Virtualization for Desktops (RHEV VDI) with Cis...Shawn Wells
Field call with Cisco on integrating Red Hat's Virtual Desktop capabilities with Cisco UCS hardware. Steps through underling virtualization technology, performance & scalability, and SPICE display protocol.
Caches are used in many layers of applications that we develop today, holding data inside or outside of your runtime environment, or even distributed across multiple platforms in data fabrics. However, considerable performance gains can often be realized by configuring the deployment platform/environment and coding your application to take advantage of the properties of CPU caches.
In this talk, we will explore what CPU caches are, how they work and how to measure your JVM-based application data usage to utilize them for maximum efficiency. We will discuss the future of CPU caches in a many-core world, as well as advancements that will soon arrive such as HP's Memristor.
The traditional topics of memory pressure, page life expectancy and memory grants have been covered to the point of saturation in the SQL community, in this deck I want to cover some topics relating to memory and SQL Server which might be considered "Off-piste" but are as equally relevant if not more so in terms of getting the best possible performance out of SQL Server.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1N4GN6z.
Brendan Gregg focuses on broken tools and metrics instead of the working ones. Metrics can be misleading, and counters can be counter-intuitive. Gregg includes advice and methodologies for verifying new performance tools, understanding how they work, and using them successfully. Filmed at qconsf.com.
Brendan Gregg is a senior performance architect at Netflix, where he does large scale computer performance design, analysis, and tuning. He is the author of multiple technical books including Systems Performance published by Prentice Hall, and received the USENIX LISA Award for Outstanding Achievement in System Administration.
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)Alex Rasmussen
We present TritonSort, a highly efficient, scalable sorting system. It is designed to process large datasets, and has been evaluated against as much as 100 TB of input data spread across 832 disks in 52 nodes at a rate of 0.916 TB/min. When evaluated against the annual Indy GraySort sorting benchmark, TritonSort is 60% better in absolute performance and has over six times the per-node efficiency of the previous record holder. In this paper, we describe the hardware and software architecture necessary to operate TritonSort at this level of efficiency. Through careful management of system resources to ensure cross-resource balance, we are able to sort data at approximately 80% of the disks' aggregate sequential write speed. We believe the work holds a number of lessons for balanced system design and for scale-out architectures in general. While many interesting systems are able to scale linearly with additional servers, per-server performance can lag behind per-server capacity by more than an order of magnitude. Bridging the gap between high scalability and high performance would enable either significantly cheaper systems that are able to do the same work or provide the ability to address significantly larger problem sets with the same infrastructure.
Analyzing OS X Systems Performance with the USE MethodBrendan Gregg
Talk for MacIT 2014. This talk is about systems performance on OS X, and introduces the USE Method to check for common performance bottlenecks and errors. This methodology can be used by beginners and experts alike, and begins by constructing a checklist of the questions we’d like to ask of the system, before reaching for tools to answer them. The focus is resources: CPUs, GPUs, memory capacity, network interfaces, storage devices, controllers, interconnects, as well as some software resources such as mutex locks. These areas are investigated by a wide variety of tools, including vm_stat, iostat, netstat, top, latency, the DTrace scripts in /usr/bin (which were written by Brendan), custom DTrace scripts, Instruments, and more. This is a tour of the tools needed to solve our performance needs, rather than understanding tools just because they exist. This talk will make you aware of many areas of OS X that you can investigate, which will be especially useful for the time when you need to get to the bottom of a performance issue.
OpenStack: Virtual Routers On Compute Nodesclayton_oneill
Learn the production pros and cons of operating Neutron legacy and HA routers on compute nodes in your production cloud. Not ready for DVR or third-party network overhauls? Virtual router network “hot spots” got you down? Large virtual router failure domains keeping you up late at night? Neutron reference architectures not providing a scalable routing solution? If you answered yes to any of these questions then this talk is for you.
Sharding Containers: Make Go Apps Computer-Friendly Again by Andrey Sibiryov Docker, Inc.
Go is, without doubt, a great language for writing massively concurrent programs. Nevertheless, our experience running Go under extreme load shows that there comes a point where assumptions and decisions made in Go runtime bite back on its users and lead to inferior performance, especially in high-throughput & high-load applications. This talk covers main reasons for this to happen and explores an interesting way to work around this issue: automatic local sharding with Docker.
Using Docker, local load-balancer and creativity, we can automatically shard & pin our apps in such a way so that the external observer (client, another microservice) would never see any difference. The result is that apps run faster, resource utilization is better, engineers are not frustrated when their Go suddenly breaks down and runs slow because they have a solution!
Talk @ APT Group, University of Manchester, 06 August 2014
Abstract:
Nowadays HPC systems, such as those in the Top500, are equipped with a range of different processors, from multi-core CPUs to GPUs. Programming them can be a tough job, specially if we want to squeeze every last FLOPs of performance out of them.
As a Phd Student, I am now doing a brief research visit in the APT group, working in topics related to the programmability and efficient use of GPUs and many-core coprocessors. In particular, I am implementing a large database operation using OpenCL in these state-of-the-art systems. In this talk I will summarize my work in Manchester and discuss the future work in this topic.
Slides from Strata+Hadoop Singapore 2016 presenting how Deep Learning can be scaled both vertically and horizontally, when to use CPUs and when to use GPUs.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
The Metaverse and AI: how can decision-makers harness the Metaverse for their...Jen Stirrup
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
2. Automatic NUMA Balancing Agenda
•What is NUMA, anyway?
•Automatic NUMA balancing internals
•Automatic NUMA balancing performance
•What workloads benefit from manual NUMA tuning
•Future developments
•Conclusions
4. What is NUMA, anyway?
•Non Uniform Memory Access
•Multiple physical CPUs in a system
•Each CPU has memory attached to it
•Local memory, fast
•Each CPU can access other CPU's memory, too
•Remote memory, slower
5. NUMA terminology
•Node
•A physical CPU and attached memory
•Could be multiple CPUs (with off-chip memory controller)
•Interconnect
•Bus connecting the various nodes together
•Generally faster than memory bandwidth of a single node
•Can get overwhelmed by traffic from many nodes
8. NUMA performance considerations
•NUMA performance penalties from two main sources
•Higher latency of accessing remote memory
•Interconnect contention
•Processor threads and cores share resources
•Execution units (between HT threads)
•Cache (between threads and cores)
9. Automatic NUMA balancing strategies
•CPU follows memory
•Try running tasks where their memory is
•Memory follows CPU
•Move memory to where it is accessed
•Both strategies are used by automatic NUMA balancing
•Various mechanisms involved
•Lots of interesting corner cases...
12. NUMA hinting page faults
•Periodically, each task's memory is unmapped
•Period based on run time, and NUMA locality
•Unmapped “a little bit” at a time (chunks of 256MB)
•Page table set to “no access permission” marked as NUMA pte
•Page faults generated as task tries to access memory
•Used to track the location of memory a task uses
• Task may also have unused memory “just sitting around”
•NUMA faults also drive NUMA page migration
13. NUMA page migration
•NUMA page faults are relatively cheap
•Page migration is much more expensive
•... but so is having task memory on the “wrong node”
•Quadratic filter: only migrate if page is accessed twice
•From same NUMA node, or
•By the same task
•CPU number & low bits of pid in page struct
•Page is migrated to where the task is running
14. Fault statistics
•Fault statistics are used to place tasks (cpu-follows-memory)
•Statistics kept per task, and per numa_group
•“Where is the memory this task (or group) is accessing?”
•“NUMA page faults” counter per NUMA node
•After a NUMA fault, account the page location
• If the page was migrated, account the new location
•Kept as a floating average
15. Types of NUMA faults
•Locality
•“Local fault” - memory on same node as CPU
•“Remote fault” - memory on different node than CPU
•Private vs shared
•“Private fault” - memory accessed by same task twice in a row
•“Shared fault” - memory accessed by different task than last time
18. Task placement
•Best place to run a task
•Where most of its memory accesses happen
•It is not that simple
•Tasks may share memory
• Some private accesses, some shared accesses
• 60% private, 40% shared is possible – group tasks together for best performance
•Tasks with memory on the node may have more threads than can run in
one node's CPU cores
•Load balancer may have spread threads across more physical CPUs
• Take advantage of more CPU cache
19. Task placement constraints
•NUMA task placement may not create a load imbalance
•The load balancer would move something else
•Conflict can lead to tasks “bouncing around the system”
• Bad locality
• Lots of NUMA page migrations
•NUMA task placement may
•Swap tasks between nodes
•Move a task to an idle CPU if no imbalance is created
20. Task placement algorithm
•For task A, check each NUMA node N
•Check whether node N is better than task A's current node (C)
• Task A has a larger fraction of memory accesses on node N, than on current node C
• Score is the difference of fractions
•If so, check all CPUs on node N
• Is the current task (T) on CPU better off on node C?
• Is the CPU idle, and can we move task A to the CPU?
• Is the benefit of moving task A to node N larger than the downside of moving task T
to node C?
•For the CPU with the best score, move task A (and task T, to node C).
21. Task placement examples
NODE CPU TASK
0 0 A
0 1 T
1 2 (idle)
1 3 (idle)
Fault
statistics
TASK A TASK T
NODE 0 30% (*) 60% (*)
NODE 1 70% 40%
•Moving task A to node 1: 40% improvement
•Moving a task to node 1 removes a load imbalance
•Moving task A to an idle CPU on node 1 is desirable
22. Task placement examples
NODE CPU TASK
0 0 A
0 1 (idle)
1 2 T
1 3 (idle)
Fault
statistics
TASK A TASK T
NODE 0 30% (*) 60%
NODE 1 70% 40% (*)
•Moving task A to node 1: 40% improvement
•Moving task T to node 0: 20% improvement
•Swapping tasks A & T is desirable
23. Task placement examples
NODE CPU TASK
0 0 A
0 1 (idle)
1 2 T
1 3 (idle)
Fault
statistics
TASK A TASK T
NODE 0 30% (*) 40%
NODE 1 70% 60% (*)
•Moving task A to node 1: 40% improvement
•Moving task T to node 0: 20% worse
•Swapping tasks A & T: overall a 20% improvement, do it
24. Task placement examples
NODE CPU TASK
0 0 A
0 1 (idle)
1 2 T
1 3 (idle)
Fault
statistics
TASK A TASK T
NODE 0 30% (*) 20%
NODE 1 70% 80% (*)
•Moving task A to node 1: 40% improvement
•Moving task T to node 0: 60% worse
•Swapping tasks A & T: overall 20% worse, leave things alone
25. Task grouping
•Multiple tasks can access the same memory
•Threads in a large multi-threaded process (JVM, virtual machine, ...)
•Processes using shared memory segment (eg. Database)
•Use CPU num & pid in struct page to detect shared memory
•At NUMA fault time, check CPU where page was last faulted
•Group tasks together in numa_group, if PID matches
•Grouping related tasks improves NUMA task placement
•Only group truly related tasks
•Only group on write faults, ignore shared libraries like libc.so
26. Task grouping & task placement
•Group stats are the sum of the NUMA fault stats for tasks in group
•Task placement code similar to before
•If a task belongs to a numa_group, use the numa_group stats for
comparison instead of the task stats
•Pulls groups together, for more efficient access to shared memory
•When both compared tasks belong to the same numa_group
•Use task stats, since group numbers are the same
•Efficient placement of tasks within a group
29. Pseudo-interleaving
•Sometimes one workload spans multiple nodes
•More threads running than one node has CPU cores
•Spread out by the load balancer
•Goals
•Maximize memory bandwidth available to workload
•Keep private memory local to tasks using it
•Minimize number of page migrations
31. Pseudo-interleaving algorithm
•Determine nodes where workload is actively running
•CPU time used & NUMA faults
•Always allow private faults (same task) to migrate pages
•Allow shared faults to migrate pages only from a more heavily
used node, to a less heavily used node
•Block NUMA page migration on shared faults from one node to
another node that is equally or more heavily used
35. Evaluation of Automatic NUMA balancing – Status update
• Goal : Study the impact on out-of-box performance with different workloads on
bare-metal & KVM guests.
• Workloads being used*:
• 2 Java workloads
• SPECjbb2005 used as a workload
• Multi-JVM server workload
• Database
• A synthetic DSS workload (using tmpfs)
• Other DB workloads (with I/O) are in the pipeline...
* Note: These sample workloads are being used for relative performance comparisons. This is not an official benchmarking exercise!
36. Experiments with bare-metal
• Platforms used :
• 4-socket Ivy Bridge EX server
• 8-socket Ivy Bridge EX prototype server.
• Misc. settings:
• Hyper-threading off, THP enabled & cstates set to 1 (i.e. intel_idle.max_cstate=1 processor.max_cstate=1)
• Configurations :
• Baseline :
• No manual pinning of the workload
• No Automatic NUMA balancing
• Pinned :
• Manual (numactl) pinning of the workload
• Automatic NUMA balancing :
• No manual pinning of the workload.
37. SPECjbb2005 - bare-metal
(4-socket IVY-EX server vs. 8-socket IVY-EX prototype server)
4-1s wide 2-2s wide 1-4s wide
baseline
pinned
Automatic NUMA bal.
# of instances - socket width of each instance
Averageoperationspersecond
8-1s wide 4-2s wide 2-4s wide 1-8s wide
baseline
pinned
Automatic NUMA bal.
# instances - socket width of each instance
Averageoperationspersecond
Delta between Automatic NUMA balancing case &
the Pinned case was as high as ~20+%
Automatic NUMA balancing case &
the Pinned case were pretty close (+/- 4%).
1s wide = 15 warehouse threads, 2s wide = 30 warehouse threads; 4s wide = 60 warehouse threads, 8s wide = 120 warehouse threads
38. Remote vs. local memory access (RMA/LMA samples)*
(Workload : Multiple 1 socket-wide instances of SPECjbb2005)
4-socket IVY-EX server 8-socket IVY-EX prototype server
Baseline Baseline
Pinned
Pinned
Pinned
Automatic NUMA balancing
Automatic NUMA balancing
* Courtesy numatop v1.0.2 Higher RMA/LMA
39. Multi-JVM server workload – bare-metal
(4-socket IVY-EX server vs. 8-socket IVY-EX prototype server)
1 group/1socket 2 groups/2sockets 4 groups/4sockets
baseline - max OPS
pinned - max-OPS
Automatic NUMA bal - max-OPS
baseline - critical OPS
pinned - critical OPS
Automatic NUMA bal. - critical OPS
# of groups/# of sockets
#ofoperationspersecond
1 group/1socket 2 groups/2sockets 4 groups/4sockets 8 groups/8sockets
baseline - max OPS
pinned - max OPS
Automatic NUMA bal. - max OPS
baseline - critical OPS
pinned - critical OPS
Automatic NUMA bal. - critical OPS
# of groups/# of sockets
#ofoperationspersecond
Entities within each of the multiple Groups communicate with a Controller (using IPC) within the same host &
the frequency of communication increases as the # of Groups increase
Two key metrics : Max throughput (max-OPS) with no constraints & Critical throughput (critical-OPS) under fixed SLA constraints
Some workloads will still need manual pinning !
40. Database workload - bare-metal
(4-socket IVY-EX server)
100 users 200 users 300 users 400 users
Synthetic DSS workload (using tmpfs)
10GB Database size
Automatic NUMA bal.
100 users 200 users 300 users 400 users
# of users
Avg.#oftransactionspersecond
Static 2M huge pages used for SGA
~9-18% improvement in Average transactions per second with Automatic NUMA balancing
41. KVM guests
• Size of the guests continue to increase
• Use cases include classic enterprise scale-up guests & guests in private cloud environments.
• [Manual] pinning/tuning of individual guests using libvirt/virsh (after taking into account host's NUMA
topology) are required to achieve low overhead & predictable performance for workloads running in
mid/large sized guests. This is especially true on larger scale-up hosts (e.g. >= 4 socket servers)
• … but, its harder to live migrate such pinned guest across hosts – as similar set of backing resources may
not be available on the target host (or) the target host may have a different NUMA topology !
<cputune>
<vcpupin vcpu='0' cpuset='0'/>
<vcpupin vcpu='1' cpuset='1'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='3'/>
…
…
…
<vcpupin vcpu='29'
cpuset='29'/>
</cputune>
<numatune>
<memory mode='preferred' nodeset='0-1'/>
</numatune>
42. KVM guests (contd.)
• Automatic NUMA balancing could help avoid the need for having to manually pin the guest resources
• Exceptions include cases where a guest's backing memory pages on the host can't be migrated :
• Guests relying on hugetlbfs (instead of THPs)
• Guests with direct device assignment (get_user_pages())
• Guests with real time needs (mlock_all()).
• As the guest size spans more than 1 socket it is highly recommended to enable Virtual NUMA nodes in
the guest => helps the guest OS instance to scale/perform.
• Virtual NUMA nodes in the guest OS => Automatic NUMA balancing enabled in the guest OS instance.
• Users can still choose to pin the workload to these Virtual NUMA nodes, if it helps their use case.
<cpu>
<topology sockets='2' cores='15' threads='1'/>
<numa>
<cell cpus='0-14' memory='134217728'/>
<cell cpus='15-29' memory='134217728'/>
</numa>
</cpu>
43. Experiments with KVM guests
• Platform used : 4-socket Ivy Bridge EX server
• Guest sizes:
• 1s-wide guest → 15VCPUs/128GB
• 2s-wide guest → 30VCPUs/256GB (2 virtual NUMA nodes)
• 4s-wide guest → 60VCPUs/512GB (4 virtual NUMA nodes)
• Configurations tested: (different possible permutations, but we chose to focus on the following three)
• Baseline guest (=> a typical public/private cloud guest today)
• No pinning of VCPUs or memory,
• No Virtual NUMA nodes enabled in the guest (even for >1s wide guest).
• No Automatic NUMA balancing in Host or Guest OS.
• Workload not pinned in the guest.
• Pinned guest (=> a typical enterprise scale-up guest today)
• VCPUs and memory pinned
• Virtual NUMA nodes enabled in the guest (for >1s wide guest),
• Workload is pinned in the guest (to the virtual NUMA nodes).
• Automatic NUMA bal. guest : (=> “out of box” for any type of guest)
• No manual pinning of VCPUs or memory,
• Virtual NUMA nodes enabled in the guest (for >1s wide guest),
• Automatic NUMA balancing in Host and Guest OS.
• Workload not pinned in the guest. [a user could choose to pin a workload to the virtual NUMA nodes...without tripping over live migration issues.]
44. SPECjbb2005 in KVM guests
(4 socket IVY-EX server)
4 guests/1s wide 2 guests/2s wide 1 guest/1s wide
baseline guest
pinned guest
Automatic NUMA balancing
#of guests / socket width of the guest
Averageoperationspersecond
Automatic NUMA balancing case &
the Pinned guest case were pretty close (+/- 3%).
45. Multi-JVM server workload – KVM guests
(4-socket IVY-EX server)
1 group / each of 4 - 1 socket-wide guests 2 groups /each of 2 - 2 socket wide guests 4 groups / 1 - 4 socket wide guest
Baseline guest -max OPS
pinned guest-max OPS
Automatic NUMA bal. guest - max OPS
Baseline guest - critical OPS
pinned guest - critical OPS
Automatic bal. guest - critical OPS
# of groups in each of the guests
Averageofmax/criticaloperationspersecond(OPS)
Delta between the Automatic NUMA
balancing guest case &
the Pinned case was much higher
(~14% max-OPS and ~24% of critical-OPS)
Pinning the workload to the virtual NUMA nodes does help
46. KVM – server consolidation example
(Two 2-socket wide (30VCPU/256GB) guests each running a different workload hosted on 4 Socket IVY-EX server)
100 200 400
Sythetic DSS workload (using tmpfs)
10GB Databse size
baseline guest
pinned guest
Automatic NUMA bal. guest
# of users
AverageTransactionspersecond
30 warehouses
SPECjbb2005
baseline guest
pinned guest
Automatic NUMA bal. guest
# of warehouse threads
OperationspersecondAutomatic NUMA balancing case & the Pinned case were pretty close (~ 1-2%).
48. NUMA balancing future considerations
•Complex NUMA topologies & pseudo-interleaving
•Unmovable memory
•KSM
•Interrupt locality
•Inter Process Communication
49. Complex NUMA topologies & pseudo-interleaving
•Differing distances between NUMA nodes
•Local node, nearby nodes, far away nodes
•Eg. 20% & 100% performance penalty for nearby vs. far away
•Workloads that are spread across multiple nodes work better
when those nodes are near each other
•Unclear how to implement this
•When is the second-best node no longer the second best?
50. NUMA balancing & unmovable memory
•Unmovable memory
•Mlock
•Hugetlbfs
•Pinning for KVM device assignment & RDMA
•Memory is not movable ...
•But the tasks are
•NUMA faults would help move the task near the memory
•Unclear if worthwhile, needs experimentation
51. KSM
•Kernel Samepage Merging
•De-duplicates identical content between KVM guests
•Also usable by other programs
•KSM has simple NUMA option
•“Only merge between tasks on the same NUMA node”
•Task can be moved after memory is merged
•May need NUMA faults on KSM pages, and re-locate memory if needed
•Unclear if worthwhile, needs experimentation
52. Interrupt locality
•Some tasks do a lot of IO
•Performance benefit to placing those tasks near the IO device
•Manual binding demonstrates that benefit
•Currently automatic NUMA balancing only does memory & CPU
•Would need to be enhanced to consider IRQ/device locality
•Unclear how to implement this
•When should IRQ affinity outweigh CPU/memory affinity?
53. Inter Process Communication
•Some tasks communicate a LOT with other tasks
•Benefit from NUMA placement near tasks they communicate with
•Do not necessarily share any memory
•Loopback TCP/UDP, named socket, pipe, ...
•Unclear how to implement this
•How to weigh against CPU/memory locality?
54. Conclusions
•NUMA systems are increasingly common
•Automatic NUMA balancing improves performance for many
workloads, on many kinds of systems
•Can often get within a few % of optimal manual tuning
•Manual tuning can still help with certain workloads
•Future improvements may expand the range of workloads and
systems where automatic NUMA placement works nearly optimal