High-Performance Networking Using eBPF, XDP, and io_uring - ScyllaDB
In the networking world there are a number of ways to increase performance over naive use of basic Berkeley sockets. These techniques range from polling blocking sockets, through non-blocking sockets driven by epoll, all the way to completely bypassing the Linux kernel for maximum performance by talking directly to the network interface card with something like DPDK or netmap. All these tools have their place, occupying different points on the spectrum from convenience to performance. But in recent years that landscape has changed massively. The tools available to the average Linux systems developer have improved, from the creation of io_uring to the expansion of BPF from a simple filtering language into a full programming environment embedded directly in the kernel. Along with that came XDP (eXpress Data Path), the Linux kernel's answer to kernel-bypass networking. AF_XDP is the new socket type created by this feature, and it generally works very similarly to something like DPDK. History lessons out of the way, this talk will discuss the merits of this technology, its place in the broader ecosystem, and how it can be used to attain the highest level of performance possible. It will dive into crucial details, such as how AF_XDP works and how it can be integrated into a larger system, and finally into more advanced topics such as request sharding and load balancing. There will be a detailed look at the design of AF_XDP, the eBPF code used, and the userspace code required to drive it all. It will also include performance numbers from this setup compared to regular kernel networking - and, most importantly, show how to put all this together to handle as much data as possible on a single modern multi-core system.
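The epoll-driven non-blocking model that the abstract uses as its baseline can be sketched in a few lines with Python's `selectors` module (which selects epoll on Linux). This is only a sketch of the readiness-based model that kernel-bypass approaches compete against; AF_XDP itself has no comparably small portable example, and none of this code comes from the talk.

```python
import selectors
import socket

def echo_once(sel, conn):
    # conn was reported readable, so recv() will not block here
    data = conn.recv(4096)
    if data:
        conn.sendall(data)

# One side stays blocking (the "client"); the other is non-blocking and
# driven by the event loop, as an epoll-managed server socket would be.
client, served = socket.socketpair()
served.setblocking(False)

sel = selectors.DefaultSelector()  # epoll-backed on Linux
sel.register(served, selectors.EVENT_READ, echo_once)

client.sendall(b"ping")
for key, _events in sel.select(timeout=1.0):
    key.data(sel, key.fileobj)     # dispatch to the registered callback

client.settimeout(1.0)
reply = client.recv(4096)
print(reply)
```

The point of contrast with AF_XDP/DPDK is that here every byte still traverses the kernel's socket buffers and protocol stack; bypass approaches hand raw frames to userspace instead.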
PostgreSQL is a very popular and feature-rich DBMS. At the same time, PostgreSQL has a set of annoying, wicked problems that haven't been resolved in decades. Miraculously, with just a small patch to the PostgreSQL core extending its extension API, it appears possible to solve these wicked problems in a new engine built as an extension.
The Linux Block Layer - Built for Fast Storage - Kernel TLV
The arrival of flash storage introduced a radical change in the performance profiles of direct-attached devices. At the time, it was obvious that the Linux I/O stack needed to be redesigned to support devices capable of millions of IOPS with extremely low latency.
In this talk we revisit the changes to the Linux block layer over the last decade or so that made it what it is today - a performant, scalable, robust, and NUMA-aware subsystem. In addition, we cover the new NVMe over Fabrics support in Linux.
Sagi Grimberg
Sagi is Principal Architect and co-founder at LightBits Labs.
Virtual File System in Linux Kernel
Note: When you view the slide deck via a web browser, the screenshots may be blurred. You can download and view them offline (the screenshots are clear).
Video: https://www.facebook.com/atscaleevents/videos/1693888610884236/ . Talk by Brendan Gregg from Facebook's Performance @Scale: "Linux performance analysis has been the domain of ancient tools and metrics, but that's now changing in the Linux 4.x series. A new tracer is available in the mainline kernel, built from dynamic tracing (kprobes, uprobes) and enhanced BPF (Berkeley Packet Filter), aka, eBPF. It allows us to measure latency distributions for file system I/O and run queue latency, print details of storage device I/O and TCP retransmits, investigate blocked stack traces and memory leaks, and a whole lot more. These lead to performance wins large and small, especially when instrumenting areas that previously had zero visibility. This talk will summarize this new technology and some long-standing issues that it can solve, and how we intend to use it at Netflix."
USENIX LISA2021 talk by Brendan Gregg (https://www.youtube.com/watch?v=_5Z2AU7QTH4). This talk is a deep dive that describes how BPF (eBPF) works internally on Linux, and dissects some modern performance observability tools. Details covered include the kernel BPF implementation: the verifier, JIT compilation, and the BPF execution environment; the BPF instruction set; different event sources; and how BPF is used by user space, using bpftrace programs as an example. This includes showing how bpftrace is compiled to LLVM IR and then BPF bytecode, and how per-event data and aggregated map data are fetched from the kernel.
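To make "BPF bytecode executed in an in-kernel VM" a little more concrete, here is a deliberately tiny register-machine interpreter. It is an illustration only, not part of the talk: real eBPF has eleven 64-bit registers, a much richer instruction set, maps, helper calls, JIT compilation, and a verifier, none of which are modeled here.

```python
# Toy interpreter for a BPF-flavored register machine: programs are
# (opcode, dst, src) triples; src may be a register name or an immediate.
def run(program, r0=0):
    regs = {"r0": r0, "r1": 0, "r2": 0, "r3": 0}
    for op, dst, src in program:
        val = regs[src] if isinstance(src, str) else src
        if op == "mov":
            regs[dst] = val
        elif op == "add":
            regs[dst] += val
        elif op == "mul":
            regs[dst] *= val
        elif op == "exit":
            break
        else:
            raise ValueError(f"unknown opcode {op}")
    return regs["r0"]  # as in BPF, the return value lives in r0

prog = [
    ("mov", "r1", 6),
    ("mov", "r2", 7),
    ("mov", "r0", "r1"),
    ("mul", "r0", "r2"),
    ("exit", None, None),
]
print(run(prog))  # 42
```

The real kernel adds a verification pass before anything executes, which is exactly the machinery the LISA2021 talk dissects.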
Memory Mapping Implementation (mmap) in Linux Kernel - Adrian Huang
Note: When you view the slide deck via a web browser, the screenshots may be blurred. You can download and view them offline (the screenshots are clear).
Netronome's half-day tutorial on host data plane acceleration at ACM SIGCOMM 2018 introduced attendees to models for host data plane acceleration and provided an in-depth understanding of SmartNIC deployment models at hyperscale cloud vendors and telecom service providers.
Presenter Bios
Jakub Kicinski is a long-term Linux kernel contributor who has been leading the kernel team at Netronome for the last two years. Jakub’s major contributions include the creation of the BPF hardware offload mechanisms in the kernel and the bpftool user space utility, as well as work on the Linux kernel side of OVS offload.
David Beckett is a Software Engineer at Netronome with a strong technical background in computer networks, including academic research on DDoS. David has expertise in Linux architecture and computer programming. He holds a Master’s degree in Electrical and Electronic Engineering from Queen’s University Belfast, where he continues as a PhD student studying emerging application-layer DDoS threats.
Vmlinux: anatomy of bzImage and how the x86_64 processor is booted - Adrian Huang
This slide deck describes the Linux booting flow for x86_64 processors.
Note: When you view the slide deck via a web browser, the screenshots may be blurred. You can download and view them offline (the screenshots are clear).
Velocity 2017: Performance analysis superpowers with Linux eBPF - Brendan Gregg
Talk for Velocity 2017 by Brendan Gregg: Performance analysis superpowers with Linux eBPF.
"Advanced performance observability and debugging have arrived built into the Linux 4.x series, thanks to enhancements to Berkeley Packet Filter (BPF, or eBPF) and the repurposing of its sandboxed virtual machine to provide programmatic capabilities to system tracing. Netflix has been investigating its use for new observability tools, monitoring, security uses, and more. This talk will investigate this new technology, which sooner or later will be available to everyone who uses Linux. The talk will dive deep on these new tracing, observability, and debugging capabilities. Whether you’re doing analysis over an ssh session, or via a monitoring GUI, BPF can be used to provide an efficient, custom, and deep level of detail into system and application performance.
This talk will also demonstrate the new open source tools that have been developed, which make use of kernel- and user-level dynamic tracing (kprobes and uprobes), and kernel- and user-level static tracing (tracepoints). These tools provide new insights for file system and storage performance, CPU scheduler performance, TCP performance, and a whole lot more. This is a major turning point for Linux systems engineering, as custom advanced performance instrumentation can be used safely in production environments, powering a new generation of tools and visualizations."
eBPF is an exciting new technology that is poised to transform Linux performance engineering. eBPF enables users to dynamically and programmatically trace any kernel or user space code path, safely and efficiently. However, understanding eBPF is not so simple. The goal of this talk is to give audiences a fundamental understanding of eBPF: how it interconnects existing Linux tracing technologies and provides a powerful platform to solve any Linux performance problem.
Meta/Facebook's database tier serving social workloads runs on top of MyRocks (MySQL on RocksDB). This means our performance and reliability depend a lot on RocksDB. And it is not just MyRocks: we also have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.
Decompressed vmlinux: linux kernel initialization from page table configurati... - Adrian Huang
A talk about how the Linux kernel initializes the page tables.
Note: When you view the slide deck via a web browser, the screenshots may be blurred. You can download and view them offline (the screenshots are clear).
Video: http://joyent.com/blog/linux-performance-analysis-and-tools-brendan-gregg-s-talk-at-scale-11x ; This talk for SCaLE11x covers system performance analysis methodologies and the Linux tools to support them, so that you can get the most out of your systems and solve performance issues quickly. This includes a wide variety of tools, including basics like top(1), advanced tools like perf, and new tools like the DTrace for Linux prototypes.
Caches are used in many layers of applications that we develop today, holding data inside or outside of your runtime environment, or even distributed across multiple platforms in data fabrics. However, considerable performance gains can often be realized by configuring the deployment platform/environment and coding your application to take advantage of the properties of CPU caches.
In this talk, we will explore what CPU caches are, how they work and how to measure your JVM-based application data usage to utilize them for maximum efficiency. We will discuss the future of CPU caches in a many-core world, as well as advancements that will soon arrive such as HP's Memristor.
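As a small companion to the topic, the sketch below decomposes a memory address into the tag, set index, and line offset a set-associative cache would use. The geometry (32 KiB, 8-way, 64-byte lines) is a typical L1d configuration chosen for illustration, not a claim about any particular CPU, and the code is not from the talk.

```python
# Hypothetical L1d geometry: 32 KiB, 8-way set-associative, 64 B lines.
LINE_SIZE = 64          # bytes per cache line
WAYS = 8
CACHE_SIZE = 32 * 1024  # bytes
NUM_SETS = CACHE_SIZE // (LINE_SIZE * WAYS)  # 64 sets

def split_address(addr: int):
    """Return (tag, set_index, line_offset) for an address."""
    offset = addr % LINE_SIZE
    set_index = (addr // LINE_SIZE) % NUM_SETS
    tag = addr // (LINE_SIZE * NUM_SETS)
    return tag, set_index, offset

# Addresses 4 KiB apart land in the same set, so they compete for the
# same 8 ways - one reason power-of-two strides can thrash a cache.
print(split_address(0x1000))  # (1, 0, 0)
print(split_address(0x2000))  # (2, 0, 0)
```

Reasoning about these three fields is the starting point for the data-layout tuning the talk describes.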
OSDC 2017 | Linux Performance Profiling and Monitoring by Werner Fischer - NETWAYS
Nowadays system administrators have great choices when it comes down to Linux performance profiling and monitoring. The challenge is to pick the appropriate tools and interpret their results correctly.
This talk is a chance to take a tour through various performance profiling and benchmarking tools, focusing on their benefit for every sysadmin.
More than 25 different tools are presented, ranging from well-known tools like strace, iostat, tcpdump, or vmstat to newer features like Linux tracepoints or perf_events. You will also learn which of these can be hooked into Icinga and which monitoring plugins are already available for that.
At the end the goal is to gather reference points to look at, whenever you are faced with performance problems.
Take the chance to close your knowledge gaps and learn how to get the most out of your system.
OSDC 2017 | Open POWER for the data center by Werner Fischer - NETWAYS
IBM's POWER (Performance Optimization With Enhanced RISC) architecture has been known to run mission-critical applications and to provide bank-style "RAS" (Reliability, Availability, Serviceability) features since 1990. Opening the architecture in 2013 enabled other vendors like Tyan or Rackspace to build servers based on the current POWER8 edition of this architecture. Current POWER8 CPUs provide up to 12 cores with 8x Simultaneous Multithreading - leading to 96 threads per CPU. Up to eight memory channels enable up to 230 GB/s memory bandwidth per CPU. Increased L1, L2, and L3 caches and a new L4 cache help boost the performance of memory-bound applications like databases by providing more than 1 TB/s of bandwidth. In this talk Werner will give an overview of the architecture and show the performance possibilities of POWER8, using the PostgreSQL database as an example. By comparing PostgreSQL 9.4, 9.5, and 9.6 benchmarking results he will visualize the increased efficiency thanks to PostgreSQL's optimizations for POWER over the last years. Finally, he will outline one other benefit of OpenPOWER systems: from the very beginning (the first instruction to initialize the first CPU core, long before DRAM, firmware management, or PCIe works) up to running your Linux OS and an application like a database, only open source code gets executed.
This chapter contains information on the memory compilers available in the STDL80 cell library. These are complete compilers consisting of various generators to satisfy the requirements of the circuit at hand. Each final building block, the physical layout, is implemented as a stand-alone, densely packed, pitch-matched array. Using this complex layout generator and adopting state-of-the-art logic and circuit design techniques, these memory cells can achieve extreme density and performance. Each layout generator includes an option that makes the aspect ratio of the physical layout selectable, so that ASIC designers can choose the aspect ratio that best suits the chip-level layout.
Kernel vulnerabilities were commonly used to obtain admin privileges, and the main rule was to stay in the kernel for as short a time as possible! But nowadays, even once you have admin/root, current operating systems are sometimes too restrictive - and that has made kernel exploitation a nice vector for installing into kernel mode!
In this talk we will examine the steps from CPL3 to CPL0, including some nice tricks, and we will end up developing kernel-mode drivers.
With multicore systems becoming the norm, every programmer is being forced to deal with multi-CPU memory atomicity bugs: data races. Data-race bugs are some of the hardest bugs to find and fix, sometimes taking weeks on end, even for experts. There are very few tools to help here (mostly just academic implementations). The authors of this presentation are at the forefront of multicore Java technology-based systems and have to debug data races daily. They have a lot of hard-won experience finding and fixing such bugs, which they share with you in this presentation.
The Java Memory Model describes how threads in the Java programming language interact through memory. Together with the description of single-threaded execution of code, the memory model provides the semantics of the Java programming language.
It is crucial for a programmer to know how, according to the Java Language Specification, to write correctly synchronized, race-free programs.
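The abstracts above concern Java, but the core discipline - make every read-modify-write of shared state atomic - looks much the same in any language. Here is a minimal Python sketch (not from the talks) of the synchronized counter that memory-model discussions usually start from; without the lock, the unsynchronized `counter += 1` is a data race and can lose updates.

```python
import threading

N_THREADS = 8
N_INCREMENTS = 10_000

counter = 0
lock = threading.Lock()

def worker():
    global counter
    for _ in range(N_INCREMENTS):
        # The lock makes the read-modify-write atomic; every thread
        # observes a consistent value and no increment is lost.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 80000: all increments survive
```

In Java the equivalent guarantees come from `synchronized`, `volatile`, or `java.util.concurrent` atomics, with semantics defined by the Java Memory Model.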
Similar to CPU Cache and Memory Ordering - Introduction to Concurrent Programming (20)
The MySQL 5.6 GA release has shipped with a large number of new features. Understanding them helps not only with database kernel development, but also with making better use of MySQL. This sharing session dissects the implementation details of the new MySQL 5.6 features in two parts: the InnoDB engine and the MySQL Server. This first part covers the performance optimizations and feature enhancements in the MySQL 5.6 InnoDB engine.
The complete slides for the first database-internals session, "Buffer Pool Implementation: InnoDB vs Oracle", describing in detail how the buffer pool is implemented in InnoDB and in Oracle, and where the two implementations differ - a great help for understanding how these two databases manage memory. Note: 彭立勋 (Peng Lixun) annotated parts of this version, making it easier to understand - thank you!
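The heart of any buffer pool is page caching with an eviction policy. The toy model below (my illustration, not the slides') uses a plain LRU over an `OrderedDict`; note that InnoDB's real policy is a midpoint-insertion LRU with young/old sublists, which this sketch deliberately omits.

```python
from collections import OrderedDict

class BufferPool:
    """Toy buffer pool: page_id -> page contents, evicting the least
    recently used page once capacity is exceeded."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page      # fallback to "disk"
        self.pages = OrderedDict()      # insertion order = recency order
        self.hits = self.misses = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)     # mark most recently used
        else:
            self.misses += 1
            self.pages[page_id] = self.read_page(page_id)
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict the LRU page
        return self.pages[page_id]

pool = BufferPool(capacity=2, read_page=lambda pid: f"page-{pid}")
for pid in [1, 2, 1, 3, 1]:
    pool.get(pid)
print(pool.hits, pool.misses)  # 2 3: repeated reads of page 1 hit
```

Comparing where a real engine departs from this sketch (midpoint insertion, dirty-page flushing, scan resistance) is essentially what the slides walk through for InnoDB and Oracle.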
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
The Metaverse and AI: how can decision-makers harness the Metaverse for their... - Jen Stirrup
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
Elevating Tactical DDD Patterns Through Object Calisthenics - Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Enhancing Performance with Globus and the Science DMZGlobus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
6. CPU Cache
• What is a cache?
– Small, fast storage used to improve average access time to slow memory.
• Cache Principles (Memory Access Patterns)
– Spatial Locality
– Temporal Locality
7. Cache Hierarchy
• Multiple Levels of Cache
– Nehalem (Three-Level)
• L1 (Per-Core): 32 KB D-Cache; 32 KB I-Cache
• L2 (Per-Core): 256 KB
• L3 (Shared): 8 MB
• How to Test Cache Size?
– Igor’s Blog (Example 3)
– Pentium(R) Dual-Core CPU E5800 (Two-Level)
• My own PC
• 32 KB L1 Data Cache; 32 KB L1 Instruction Cache
• 2 MB L2 Cache (Unified Cache)
9. Cache Line
• The minimum unit of data that can be transferred between the cache and memory
• X86 CPUs
– 64 Bytes
• ARM CPUs
– 32 Bytes
• Cache Line Size Testing
– Igor's Blog (Example 2)
12. Cache Structure
• Large caches are implemented as hardware hash tables with fixed-size hash
buckets (or “sets”) and no chaining.
• sets
– the number of hash buckets in the hardware hash table
• ways
– the number of entries each bucket can hold
• N-way set associative cache
– N = 1: direct-mapped cache
– N = 8: 8-way set associative cache
– N = cache size / cache line size: fully associative cache
18. Cache Coherence Problem
• Assumption: Write back scheme
• Problem:
– Processors see different values for u after event 3
[Diagram: three processors P1, P2, P3, each with a private cache ($), connected over a bus to shared memory (holding u:5) and I/O devices. Events: (1) P1 reads u=5; (2) P3 reads u=5; (3) P3 writes u=7, which under write-back updates only its own cache; (4) P1 reads u and still sees 5; (5) P2 reads u: which value does it get?]
19. Cache Write Policy
• Write Back vs Write Through
– Write Back
• Dirty data is written only to the cache (and written back to memory later)
• Write Miss
– Read the cache line first
– Write allocate
– Write Through
• Dirty data is written through to memory
• Write Hit
– Update the cache as well
• Write Miss
– Bypass the cache and write directly to memory
27. Atomic Operation
• An operation acting on shared memory is atomic if it completes in a single step relative to other threads. When an atomic store is performed on a shared variable, no other thread can observe the modification half-complete. When an atomic load is performed on a shared variable, it reads the entire value as it appeared at a single moment in time.
• Atomic Operation in CPU
– Intel CPU
– AMD CPU
36. Memory Ordering (Reordering)
• Reordering
– Reads and writes do not always happen in the order that you have written
them in your code.
• Why Reordering
– Performance
• Reordering Principle
– In a single-threaded program, from the programmer's point of view, all operations appear to have been executed in the order specified, with all inconsistencies hidden by hardware.
– That is, a program must produce the same single-threaded results before and after reordering.
37. Reordering
• Examples
– Example 1
• The assignments to A and B are reordered
– Example 2
• Assume X and Y are initialized to 0
• Question: if both threads then load X and Y, can the loads return X = Y = 0?
• Test Code & Test Result
38. Reordering Types
• Compiler Reordering
– Example 1: reordering that happens at compile time is called compiler reordering
• CPU Memory Ordering
– Example 2: reordering that happens at execution time is called CPU memory ordering
• A user program can be reordered both at compile time and at execution time
39. Compiler Reordering & Compiler Memory Barrier
• Compiler reordering improves performance, but sometimes (especially in parallel programming) we do not want the compiler to reorder our code. We therefore need a mechanism to tell the compiler not to reorder: the compiler memory barrier.
• Memory Barrier
– A memory barrier is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that certain operations are guaranteed to be performed before the barrier, and others after.
• Compiler Memory Barrier
– As the name suggests, a compiler memory barrier is a barrier that prevents the compiler from reordering memory operations.
42. CPU Memory Ordering
• Definition
– The term memory ordering refers to the order in which the processor issues reads (loads) and writes (stores) through the system bus to system memory. (From the Intel System Programming Guide, Section 8.2)
• Some Questions
– Why is reordering needed?
• 1: Latency: L1 4 clks; L2 10 clks; L3 20 clks; memory 200 clks. A huge latency gap.
• 2: The relative priority of reads vs writes during instruction execution (a key consideration in CPU design)
– Which kinds of reordering exist, and which does each CPU permit?
44. Extension: How Does the CPU Implement Memory Reordering?
• Buffer and Queue
– Load/store buffers; line-fill buffers / write-combining buffers; invalidate message queues; ...
– For details, see the references listed below
45. CPU Memory Models
• Definitions
– Memory consistency models describe how threads may interact through shared memory
consistently.
– There are many types of memory reordering, and not all types of reordering occur equally often. It all depends on the processor you're targeting and/or the tool chain you're using for development.
• Main CPU Memory Models
– Programming order preserved → stronger memory model
• Sequential Consistency
• Strict Consistency
– Only data-dependency order preserved → weaker memory model
– ...
47. Intel X86/64 Memory Model (1)
• In a single-processor system for memory regions defined as write-back
cacheable.
– Reads are not reordered with other reads.
– Writes are not reordered with older reads.
– Writes to memory are not reordered with other writes.
– Reads may be reordered with older writes to different locations but not with older
writes to the same location.
– Note: the items below are analyzed later
– Reads or writes cannot be reordered with I/O instructions, locked instructions, or
serializing instructions.
– Reads cannot pass earlier LFENCE and MFENCE instructions.
– Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
– LFENCE instructions cannot pass earlier reads.
– SFENCE instructions cannot pass earlier writes.
– MFENCE instructions cannot pass earlier reads or writes.
48. Intel X86/64 Memory Model (2)
• In a multiple-processor system
– Individual processors use the same ordering principles as in a single-processor
system.
– Writes by a single processor are observed in the same order by all processors.
– Writes from an individual processor are NOT ordered with respect to the
writes from other processors.
– Memory ordering obeys causality (memory ordering respects transitive
visibility).
– Any two stores are seen in a consistent order by processors other than those
performing the stores.
– Note: the item below is analyzed later
– Locked instructions have a total order.
49. Intel X86/64 Memory Model (3)
• Interpretation
– For ordinary memory operations, only StoreLoad reordering is possible
– LoadLoad, LoadStore, and StoreStore reordering cannot occur
– All processors observe a given processor's writes in the same order
– Writes from different processors have no ordering guarantee relative to each other
• StoreLoad Reordering Problem
– Failure of Dekker’s algorithm
– Test Code
51. What About Other CPUs?
• This is why x86 and AMD64 are called strongly ordered.
52. How to Prevent CPU Memory Reordering
• Think about Compiler Memory Barrier
– asm volatile("" ::: "memory");
– __asm__ __volatile__ ("" ::: "memory");
• Memory Barrier Definition
– A memory barrier is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that certain operations are guaranteed to be performed before the barrier, and others after.
• CPU Memory Barrier
– Just as a compiler memory barrier tells the compiler not to reorder instructions at compile time, a CPU memory barrier tells the CPU not to swap the order of two memory-accessing instructions at execution time.
– Note: because the CPU must be able to see it at execution time, a CPU memory barrier is a real instruction that appears in the compiled assembly code.
54. Memory Barrier Instructions in CPU
• x86, x86-64, amd64
– lfence: Load Barrier
– sfence: Store Barrier
– mfence: Full Barrier
• PowerPC
– sync: Full Barrier
• MIPS
– sync: Full Barrier
• Itanium
– mf: Full Barrier
• ARMv7
– dmb
– dsb
– isb
55. Use CPU Memory Barrier Instructions (x86)
• Only CPU Memory Barrier
– asm volatile("mfence");
• CPU + Compiler Memory Barrier
– asm volatile("mfence" ::: "memory");
• Use Memory Barrier in C/C++
56. Yes! We Need the Lock Instruction's Help!
• Question
– Besides the CPU's dedicated memory barrier instructions, is there any other way to implement a memory barrier?
• Yes! We Need Lock Instruction’s Help!
– Reads or writes cannot be reordered with I/O instructions, locked
instructions, or serializing instructions.
– Interpretation
• Since reads and writes cannot be reordered across locked instructions, every instruction with a lock prefix forms a natural full memory barrier.
57. Use a Lock Instruction to Implement a Memory Barrier
• lock addl
– asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
– addl $0,0(%%esp): does nothing
– the lock prefix serves as the CPU memory barrier
– the "memory" clobber serves as the compiler memory barrier
• xchg
– asm volatile("xchgl (%0),%0" ::: "memory")
– Question: why doesn't xchg need a lock prefix?
– Answer: The LOCK prefix is automatically assumed for XCHG instruction.
• lock cmpxchg
– Do it yourself
59. X86 Memory Ordering with Memory Barrier (1)
• In a single-processor system for memory regions defined as write-back
cacheable.
– Reads are not reordered with other reads.
– Writes are not reordered with older reads.
– Writes to memory are not reordered with other writes.
– Reads may be reordered with older writes to different locations but not with older
writes to the same location.
– Note: newly added items
– Reads or writes cannot be reordered with I/O instructions, locked instructions, or
serializing instructions.
– Reads cannot pass earlier LFENCE and MFENCE instructions.
– Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
– LFENCE instructions cannot pass earlier reads.
– SFENCE instructions cannot pass earlier writes.
– MFENCE instructions cannot pass earlier reads or writes.
60. X86 Memory Ordering with Memory Barrier (2)
• In a multiple-processor system
– Individual processors use the same ordering principles as in a single-processor
system.
– Writes by a single processor are observed in the same order by all processors.
– Writes from an individual processor are NOT ordered with respect to the
writes from other processors.
– Memory ordering obeys causality (memory ordering respects transitive
visibility).
– Any two stores are seen in a consistent order by processors other than those
performing the stores.
– Note: newly added item
– Locked instructions have a total order.
61. Read Acquire vs Write Release (1)
• Read Acquire and Write Release
– Two Special Memory Barriers.
– Definition
• A read-acquire executes before all reads and writes
by the same thread that follow it in program order.
• A write-release executes after all reads and writes
by the same thread that precede it in program order.
• Question
– What do read-acquire and write-release accomplish?
63. How to Implement Read Acquire/Write Release?
• Intel X86, X86-64
– Full Memory Barrier
• mfence
• locked instruction
• Compiler and OS
– Linux
• smp_mb()
– Windows
• Functions with Acquire/Release Semantics
• InterlockedIncrementAcquire(), ...
64. Extension: StoreLoad Reordering
• Question
– Among LoadLoad, LoadStore, StoreLoad, and StoreStore, why does the Intel CPU allow only StoreLoad reordering?
– Why are the LoadLoad/LoadStore/StoreStore barriers called lightweight barriers, while the StoreLoad barrier is called an expensive barrier?
• on PowerPC, the lwsync (short for “lightweight sync”) instruction acts as all
three #LoadLoad, #LoadStore and #StoreStore barriers at the same time, yet is less expensive
than the sync instruction, which includes a #StoreLoad barrier.
• Answer
– The store buffer
– Stores are asynchronous and do not stall instruction execution
– Loads can only be synchronous
• Note
– On Intel CPUs, loads carry acquire semantics and stores carry release semantics
71. Spinlock: Active vs Passive
• Spinning.
• Active
– Only pause, not release CPU
• pause(); _mm_pause();
• Passive
– Release CPU to System, but not Sleep
• pthread_yield(); SwitchToThread(); Sleep(0);
– Release CPU to System, and Sleep
• Sleep(n);
• Hybrid
– Active + Passive
– The mainstream implementation approach
72. Wrong Peterson’s Algorithm on X86
• Function
– Synchronizes two threads
– Only one thread can acquire the lock at a time
• Problem?
– lock()
• StoreLoad Reordering
• Store: _interested[me]
• Load: _interested[he]
– unlock()
• Compiler Reordering
76. Reference-General
• Intel 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes:1, 2A, 2B, 2C, 3A, 3B,
and 3C
• AMD64 Architecture Programmer's Manual Volume 1: Application Programming
• AMD64 Architecture Programmer's Manual Volume 2: System Programming
• MYTHBUSTING MODERN HARDWARE TO GAIN “MECHANICAL SYMPATHY”
• Performance Tuning for CPU(Marat Dukhan)
• Understanding The Linux Kernel 3rd Edition
• Working Draft, Standard for Programming Language C++
• The Art of Multiprocessor Programming
• Nehalem - Everything You Need to Know about Intel's New Architecture
• Intel Core i7 (Nehalem): Architecture By AMD?
77. Reference-CPU Cache
• Cache Coherence Protocols
• Cache Memory (高速缓存, article in Chinese)
• Cache(268 Pages)
• Cache: a place for concealment and safekeeping
• Gallery of Processor Cache Effects
• Getting Physical With Memory
• Intel’s Haswell CPU Microarchitecture
• Introduction of Cache Memory
• CPU Cache Flushing Fallacy
• Multiprocessor Cache Coherence
• Understanding the CPU Cache
• What Every Programmer Should Know About Memory - Akkadia.org
• What Programmer Should Know about Memory Consistence
78. Reference-Atomic
• An attempt to illustrate differences between memory ordering and atomic access
• Anatomy of Linux synchronization methods
• Atomic Builtins - Using the GNU Compiler Collection (GCC)
• Atomic vs. Non-Atomic Operations
• Understanding Atomic Operations
• Validating Memory Barriers and Atomic Instructions
79. Reference-Memory Ordering
• Acquire and Release Semantics
• An attempt to illustrate differences between memory ordering and atomic access
• what is a store buffer?
• Which is a better write barrier on x86: lock+addl or xchgl?
• Relative performance of swap vs compare-and-swap locks on x86
• difference in mfence and asm volatile (“” : : : “memory”)
• Inline Assembly
• Intel memory ordering, fence instructions, and atomic operations.
• Intel’s ‘cmpxchg’ instruction
• Lockless Programming Considerations for Xbox 360 and Microsoft Windows
• Write Combining
• Memory barriers: a hardware view for software hackers
• Memory Barriers Are Like Source Control Operations
• Memory Ordering at Compile Time
• Memory Reordering Caught in the Act
• Memory barriers - The Linux Kernel Archives
• Understanding Memory Ordering
• Weak vs. Strong Memory Models
• Who ordered memory fences on an x86?
80. Reference-Programming
• An Introduction to Lock-Free Programming
• Distributed Reader-Writer Mutex
• Effective Concurrency: Eliminate False Sharing
• False-sharing
• False Sharing
• x86 spinlock using cmpxchg
• SetThreadAffinityMask for unix systems
• Lock Free Algorithms - QCon London
• pause instruction in x86
• Per-processor Data
• Pointer Packing
• Spinlocks and Read-Write Locks
• Spinning
• What does “rep; nop;” mean in x86 assembly?