Presentation for the following paper:
Jung, Hyungsoo, Hyuck Han, and Sooyong Kang. "Scalable database logging for multicores." Proceedings of the VLDB Endowment 11.2 (2017): 135-148.
BPF: Next Generation of Programmable Datapath
Thomas Graf
This session covers lessons learned while exploring BPF to provide a programmable datapath and discusses options for OVS to leverage the technology.
In this talk we discuss the mechanisms of utilizing the eBPF language to perform hardware accelerated network packet manipulation and filtering. P4 programs can be compiled into eBPF scripts for offload in the Linux kernel using the Traffic Classifier (TC) subsystem. We demonstrate how, using eBPF as an intermediate language, it has been possible to extend the TC to either Just In Time (JIT) compile eBPF code to x86 assembler for software offload or to IXP byte code for execution in a trusted hardware environment within the Netronome Agilio intelligent server adapter. We finish by encouraging the audience to experiment with their own eBPF applications within the TC hardware accelerated system. The TC kernel patches are available on the Linux Kernel Networking mailing list as a Request For Comment (RFC) contribution.
Dinan Gunawardena, Director, Software Engineering, Netronome
Dinan Gunawardena is a Software Director focusing on running the driver team at Netronome. Previously, Dinan founded a software startup and was a Senior Research Engineer within the Operating Systems and Networking Group at Microsoft Research for 12 years, shipping technology in several versions of Microsoft Windows and the Bing Search Engine. Dinan has received over 20 patents and is a Chartered Software Engineer. Dinan has a Masters in Computer Science from University of Cambridge and a M.B.A. from WBS.
Jakub Kicinski, Software Engineering, Netronome
Jakub Kicinski is a Software Engineer specializing in the Linux Kernel drivers for Netronome SmartNICs. Jakub has previously worked as an intern for Intel Corporation. Jakub is also a researcher with expertise in Linux kernel. Experience in application development on complex multi-CPU and FPGA platforms. He is interested in high-performance software exploiting hardware capabilities and is passionate about networking. Jakub has a Masters in Computer Science from Gdansk University of Technology.
Cilium - Fast IPv6 Container Networking with BPF and XDP
Thomas Graf
We present a new open source project which provides IPv6 networking for Linux containers by generating a program for each individual container on the fly and running it as JITed BPF code in the kernel. Because the code is generated and compiled per container, each program is reduced to the minimally required feature set and then heavily optimised by the compiler, as parameters become plain variables. The upcoming addition of the eXpress Data Path (XDP) to the kernel will make this approach even more efficient, as the programs will be invoked directly from the network driver.
Aggregate Sharing for User-Defined Data Stream Windows
Paris Carbone
Aggregation queries on data streams are evaluated over evolving and often overlapping logical views called windows. While the aggregation of periodic windows was extensively studied in the past through aggregate sharing techniques such as Panes and Pairs, little to no work has gone into optimizing the aggregation of very common, non-periodic windows. Typical examples of non-periodic windows are punctuations and sessions, which can implement complex business logic and are often expressed as user-defined operators on platforms such as Google Dataflow or Apache Storm. The aggregation of such non-periodic or user-defined windows either falls back to expensive, best-effort aggregate sharing methods, or is not optimized at all.
In this paper we present a technique to perform efficient aggregate sharing for data stream windows which are declared as user-defined functions (UDFs) and can contain arbitrary business logic. To this end, we first introduce the concept of User-Defined Windows (UDWs), a simple, UDF-based programming abstraction that allows users to programmatically define custom windows. We then define semantics for UDWs, based on which we design Cutty, a low-cost aggregate sharing technique. Cutty improves on and outperforms the state of the art for aggregate sharing on single and multiple queries. Moreover, it enables aggregate sharing for a broad class of non-periodic UDWs. We implemented our techniques on Apache Flink, an open source stream processing system, and performed experiments demonstrating an orders-of-magnitude reduction in aggregation costs compared to the state of the art.
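As background, the Panes technique mentioned in the abstract can be sketched in a few lines of Python. This is an illustrative toy for a periodic sum window, not Cutty itself: each element is folded once into a non-overlapping pane, and overlapping windows then combine pane partials instead of rescanning raw elements.

```python
from math import gcd

def shared_window_sums(stream, window, slide):
    """Pane-based aggregate sharing (Panes-style) for a periodic sum window.
    Each element is aggregated once into its pane; overlapping windows
    then reuse the pane partials instead of rescanning raw elements."""
    pane = gcd(window, slide)
    # Phase 1: one partial aggregate per non-overlapping pane.
    panes = []
    for start in range(0, len(stream) - pane + 1, pane):
        panes.append(sum(stream[start:start + pane]))
    # Phase 2: each window combines window//pane consecutive pane partials.
    per_window = window // pane
    step = slide // pane
    results = []
    for w in range(0, len(panes) - per_window + 1, step):
        results.append(sum(panes[w:w + per_window]))
    return results

# Window of 4 elements sliding by 2: windows [1..4], [3..6], [5..8].
print(shared_window_sums([1, 2, 3, 4, 5, 6, 7, 8], window=4, slide=2))
# [10, 18, 26]
```

Note that this trick relies on the window being periodic (fixed range and slide); the paper's contribution is precisely extending sharing to windows where no such fixed pane size exists.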
QNIBTerminal: Understand your datacenter by overlaying multiple information l...
QNIB Solutions
Today's data center managers are burdened by a lack of aligned information across multiple layers. Workflow events like 'job starts', aligned with performance metrics and events extracted from log facilities, are low-hanging fruit that is on the verge of becoming usable thanks to open-source software such as Graphite, StatsD, Logstash and the like.
This talk aims to show the benefits of merging multiple layers of information within an InfiniBand cluster, using use cases for level 1/2/3 personnel.
The Next Generation Firewall for Red Hat Enterprise Linux 7 RC
Thomas Graf
The Linux packet filtering technology, iptables, has its roots in times when networking was relatively simple and network bandwidth was measured in mere megabits. Emerging technologies such as distributed NAT, overlay networks and containers require enhanced functionality and additional flexibility. In parallel, the next generation of network cards with speeds of 40Gbit/s and 100Gbit/s will put additional pressure on performance.
In the upcoming Red Hat Enterprise Linux 7, a new dynamic firewall service, FirewallD, is planned to provide greater flexibility than iptables by eliminating service disruptions during rule updates and by adding abstraction and support for different network trust zones. Additionally, a new virtual machine-based packet filtering technology, nftables, addresses the functionality and flexibility requirements of modern network workloads.
In this session you'll:
- Deep dive into the newly introduced packet filtering capabilities of Red Hat Enterprise Linux 7 beta.
- Learn best practices.
- See the new set of configuration utilities that allow new optimization possibilities.
netfilter is a framework provided by the Linux kernel that allows various networking-related operations to be implemented in the form of customized handlers.
iptables is a user-space application program that allows a system administrator to configure the tables provided by the Linux kernel firewall (implemented as different netfilter modules) and the chains and rules it stores.
Many systems, from home routers to sophisticated cloud network stacks, use iptables/netfilter, Linux's native packet filtering/mangling framework since Linux 2.4.
In this session, we will talk about the netfilter framework and its facilities, explain how basic filtering and mangling use-cases are implemented using iptables, and introduce some less common but powerful extensions of iptables.
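As a rough mental model of the chain traversal described above, here is an illustrative Python toy (hypothetical names throughout, not the kernel implementation; real rules are configured from the shell, e.g. `iptables -A INPUT -p tcp --dport 22 -j ACCEPT`):

```python
from dataclasses import dataclass

@dataclass
class Rule:
    match: callable   # predicate over a packet
    target: str       # ACCEPT, DROP, or the name of another chain

def traverse(chains, chain_name, packet, policy="DROP"):
    """Walk a chain's rules in order; the first matching terminal rule
    decides the verdict. A non-terminal target jumps into a user-defined
    chain, and falling off its end resumes the calling chain, as in netfilter."""
    for rule in chains.get(chain_name, []):
        if rule.match(packet):
            if rule.target in ("ACCEPT", "DROP"):
                return rule.target
            verdict = traverse(chains, rule.target, packet, policy=None)
            if verdict is not None:
                return verdict
    return policy  # no rule matched: the chain's default policy applies

chains = {
    "INPUT": [
        Rule(lambda p: p["proto"] == "tcp" and p["dport"] == 22, "ACCEPT"),
        Rule(lambda p: p["proto"] == "icmp", "ICMP_CHECKS"),
    ],
    "ICMP_CHECKS": [
        Rule(lambda p: p["type"] == "echo-request", "ACCEPT"),
    ],
}

print(traverse(chains, "INPUT", {"proto": "tcp", "dport": 22}))  # ACCEPT
print(traverse(chains, "INPUT", {"proto": "udp", "dport": 53}))  # DROP
```

The toy captures the two ideas the session builds on: first-match-wins evaluation within a chain, and user-defined chains as reusable sub-programs.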
Shmulik Ladkani, Chief Architect at Nsof Networks.
Long time network veteran and kernel geek.
Shmulik started his career at Jungo (acquired by NDS/Cisco) implementing residential gateway software, focusing on embedded Linux, Linux kernel, networking and hardware/software integration.
Some billions of forwarded packets later, Shmulik left his position as Jungo's lead architect and joined Ravello Systems (acquired by Oracle) as tech lead, developing a virtual data center as a cloud-based service, focusing around virtualization systems, network virtualization and SDN.
Recently he co-founded Nsof Networks, where he's been busy architecting network infrastructure as a cloud-based service, gazing at internet routes in astonishment, and playing the chkuku.
State Management in Apache Flink: Consistent Stateful Distributed Stream Pro...
Paris Carbone
An overview of state management techniques employed in Apache Flink, including pipelined consistent snapshots and their use for reconfiguration, as presented at VLDB 2017.
Tungsten Replicator is an innovative and reliable tool that can solve your most complex replication problems. In this webinar we will introduce Replicator installation and show you how to use key Replicator features effectively with MySQL.
Course Topics:
- Checking host and MySQL prerequisites
- Downloading code from http://code.google.com/p/tungsten-replicator/
- Installation using the tpm utility
- Transaction filtering using standard filters as well as customized filters you write yourself
- Enabling and managing parallel replication
- Configuring multi-master and fan-in using multiple replication services
- Backup and restore integration
- Troubleshooting replication problems
- Logging bugs and participating in the Tungsten Replicator community
Replication is a powerful technology that takes knowledge and planning to use effectively. This webinar gives you the background that makes replication easier to set up and allows you to take full advantage of the Tungsten Replicator benefits.
BPF, the Berkeley Packet Filter mechanism, was first introduced into Linux in 1997, in version 2.1.75. It has seen a number of extensions over the years. Recently, in versions 3.15 - 3.19, it received a major overhaul which drastically expanded its applicability. This talk will cover how the instruction set looks today and why: its architecture, capabilities, interface, and just-in-time compilers. We will also talk about how it is being used in different areas of the kernel, such as tracing and networking, and future plans.
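To make the instruction-set discussion concrete, here is a toy interpreter in the spirit of classic BPF (illustrative opcode names, not the real kernel encoding): a single accumulator, loads from packet bytes, a conditional jump, and a return value that acts as the filter verdict.

```python
def run_filter(program, packet):
    """Execute a tiny classic-BPF-like program over a packet buffer."""
    acc = 0
    pc = 0
    while True:
        op, arg = program[pc]
        if op == "ldb":      # load packet byte at offset arg into accumulator
            acc = packet[arg]
        elif op == "jeq":    # skip the next instruction when acc == arg
            if acc == arg:
                pc += 1
        elif op == "ret":    # verdict: nonzero accepts the packet
            return arg
        pc += 1

# Accept packets whose byte 9 (the IPv4 protocol field) equals 6 (TCP).
tcp_filter = [
    ("ldb", 9),
    ("jeq", 6),
    ("ret", 0),        # reached only when byte 9 != 6: reject
    ("ret", 0xFFFF),   # accept
]

ipv4_tcp = bytes(9) + bytes([6])
print(run_filter(tcp_filter, ipv4_tcp))  # 65535
```

The real VM of course differs (fixed-size encoded instructions, two jump offsets per branch, registers and maps in the eBPF extension), but the execution model — verified bytecode deciding a verdict per packet — is the same idea at kernel scale.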
Kernel Recipes 2017 - EBPF and XDP - Eric Leblond
Anne Nicolas
Berkeley Packet Filter is an old friend for most people who deal with networking under Linux. But its extended version, eBPF, is completely redefining the scope of usage and interaction with the kernel. It can indeed be used to instrument most parts of the kernel, from network tracing to process or I/O monitoring.
This talk will provide an overview of eBPF, from concept to tools like BCC. It will then focus on XDP, the eXpress Data Path, and the possible applications in terms of networking provided by this new framework.
Eric Leblond, Stamus Networks
LinuxCon 2015 Linux Kernel Networking Walkthrough
Thomas Graf
This presentation features a walk through the Linux kernel networking stack for users and developers. It will cover insights into both existing essential networking features and recent developments, and will show how to use them properly. Our starting point is the network card driver as it feeds a packet into the stack. We will follow the packet as it traverses various subsystems such as packet filtering, routing, protocol stacks, and the socket layer. We will pause here and there to look into concepts such as networking namespaces, segmentation offloading, TCP small queues, and low latency polling, and will discuss how to configure them.
Tungsten University: MySQL Multi-Master Operations Made Simple With Tungsten ...
Continuent
Deployment of MySQL multi-master topologies with Tungsten Replicator has been constantly improving. Earlier releases, however, involved some heavy operations and unfriendly commands. The latest version of Tungsten Replicator delivers all the topologies of its predecessors, with an improved installation tool that cuts deployment time in half for simple topologies, and to one tenth for complex ones. Now you can install master/slave, multi-master, fan-in, and star topologies in less than a minute.
But there is more. Thanks to a versatile Tungsten Replicator installation tool, you can define your own deployment on-the-fly, and get creative: you can have stars with satellites, all-masters with fan-in slaves, and other customized clusters.
We will also cover other enhancements in Tungsten Replicator 2.1.1, such as full integration with MySQL 5.6, enhanced output from administrative tools, and a few more goodies.
Imagine you're tackling one of those elusive performance issues in the field, and your go-to monitoring checklist doesn't seem to cut it. There are plenty of suspects, but they are moving around rapidly and you need more logs, more data, more in-depth information to make a diagnosis. Maybe you've heard about DTrace, or even used it, and are yearning for a similar toolkit which can plug dynamic tracing into a system that wasn't prepared or instrumented in any way.
Hopefully, you won't have to yearn for a lot longer. eBPF (extended Berkeley Packet Filters) is a kernel technology that enables a plethora of diagnostic scenarios by introducing dynamic, safe, low-overhead, efficient programs that run in the context of your live kernel. Sure, BPF programs can attach to sockets; but more interestingly, they can attach to kprobes and uprobes, static kernel tracepoints, and even user-mode static probes. And modern BPF programs have access to a wide set of instructions and data structures, which means you can collect valuable information and analyze it on-the-fly, without spilling it to huge files and reading them from user space.
In this talk, we will introduce BCC, the BPF Compiler Collection, which is an open set of tools and libraries for dynamic tracing on Linux. Some tools are easy and ready to use, such as execsnoop, fileslower, and memleak. Other tools such as trace and argdist require more sophistication and can be used as a Swiss Army knife for a variety of scenarios. We will spend most of the time demonstrating the power of modern dynamic tracing -- from memory leaks to static probes in Ruby, Node, and Java programs, from slow file I/O to monitoring network traffic. Finally, we will discuss building our own tools using the Python and Lua bindings to BCC, and its LLVM backend.
Replicate Oracle to Oracle, Oracle to MySQL, and Oracle to Analytics
Linas Virbalas
Oracle is the most powerful DBMS in the world. However, Oracle's expensive and complex replication makes it difficult to build highly available applications or move data in real-time to data warehouses and popular databases like MySQL. In this webinar you will learn how Continuent Tungsten solves problems with Oracle replication at a fraction of the cost of other solutions and with less management overhead too – think "Oracle GoldenGate without the price tag!" We will demo constructing a highly available site using Oracle-to-Oracle replication. We will then show you how to replicate data in real time from Oracle to MySQL as well as load a data warehouse.
Network Measurement with P4 and C on Netronome Agilio
Open-NFP
Network measurement has been playing a crucial role in network operations, since it can not only detect anomalies but also facilitate traffic engineering. With the recent development of the P4 language, network measurement is one of the data plane applications that can benefit from the programmability enabled by P4. However, P4 does not support general-purpose language structures such as for-loops, its if-statement can only be used in the control block, and it has only a limited set of primitive actions. Hence, current P4 has limitations in supporting complicated measurement functions. In this webinar, we implement and evaluate the Count-Min sketch (used for heavy hitter detection) using a combination of P4 and C on a Netronome NFP NIC. We demonstrate the flexibility and performance of the design and the C plug-in feature of the Netronome NFP.
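For reference, the Count-Min sketch mentioned in the abstract can be prototyped in plain Python. This is a host-side illustration of the data structure only; the webinar's actual implementation runs in P4 and C on the NFP.

```python
import hashlib

class CountMinSketch:
    """Count-Min sketch: depth rows of width counters. Each item hashes
    into one counter per row; the estimate is the row-wise minimum."""
    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive an independent hash per row by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum over rows is
        # an upper-bound estimate that never undercounts the true frequency.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

sketch = CountMinSketch()
for _ in range(1000):
    sketch.add("10.0.0.1")       # the "heavy hitter" flow
for i in range(50):
    sketch.add(f"10.0.1.{i}")    # background flows

print(sketch.estimate("10.0.0.1"))  # at least 1000, inflated only by collisions
```

The constant per-packet work (depth hash-and-increment operations over a fixed-size table) is what makes the structure a good fit for a data plane, where loops and dynamic memory are unavailable.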
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
Chester Chen
Uncovering performance regressions in the TCP SACKs vulnerability fixes
In early July 2019, Databricks noticed some Apache Spark workloads regressing by as much as 6x. In this talk, we'll discuss how we traced these regressions back to the Linux kernel and the fixes for the TCP SACKs vulnerabilities. We will explain the symptoms we were seeing, walk through how we debugged the TCP connections, and dive into the Linux source to uncover the root cause.
Speaker: Chris Stevens (Databricks)
Chris Stevens is a software engineer at Databricks where he works on the reliability, scalability, and security of Apache Spark clusters. His work focuses on auto-scaling compute, auto-scaling storage, node initialization performance, and node health monitoring. Prior to Databricks, Chris founded the Minoca OS project, where he built a POSIX-compliant, general-purpose OS - from scratch - to run on resource-constrained devices. He got his start at Microsoft working on the Windows kernel team, porting the Windows boot environment from BIOS to UEFI.
This presentation features a walk through the Linux kernel networking stack covering the essentials and recent developments a developer needs to know. Our starting point is the network card driver as it feeds a packet into the stack. We will follow the packet as it traverses through various subsystems such as packet filtering, routing, protocol stacks, and the socket layer. We will pause here and there to look into concepts such as segmentation offloading, TCP small queues, and low latency polling. We will cover APIs exposed by the kernel that go beyond use of write()/read() on sockets and will look into how they are implemented on the kernel side.
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
Thomas Graf
This talk will start with a deep dive and hands-on examples of BPF, possibly the most promising low level technology to address challenges in application and network security, tracing, and visibility. We will discuss how BPF evolved from a simple bytecode language to filter raw sockets for tcpdump to a JITable virtual machine capable of universally extending and instrumenting both the Linux kernel and user space applications. The introduction is followed by a concrete example of how the Cilium open source project applies BPF to solve networking, security, and load balancing for highly distributed applications. We will discuss and demonstrate how Cilium with the help of BPF can be combined with distributed system orchestration such as Docker to simplify security, operations, and troubleshooting of distributed applications.
A talk at the Open vSwitch 2018 Fall Conference. OVN control plane scalability is critical in production. While the distributed control plane architecture is a big advantage, the distributed controller on each hypervisor became the first bottleneck for scaling. This talk shares how we (eBay and the community) solved the problem with Incremental Processing - the idea, the challenges, and the performance improvement results.
Speaker: Chris Stevens (Databricks)
Chris Stevens is a software engineer at Databricks where he works on the reliability, scalability, and security of Apache Spark clusters. His work focuses on auto-scaling compute, auto-scaling storage, node initialization performance, and node health monitoring. Prior to Databricks, Chris founded the Minoca OS project, where he built a POSIX compliant, general purpose OS - from scratch - to run on resource constrained device. He got his start at Microsoft working on the Windows kernel team, porting the Windows boot environment from BIOS to UEFI.
This presentation features a walk through the Linux kernel networking stack covering the essentials and recent developments a developer needs to know. Our starting point is the network card driver as it feeds a packet into the stack. We will follow the packet as it traverses through various subsystems such as packet filtering, routing, protocol stacks, and the socket layer. We will pause here and there to look into concepts such as segmentation offloading, TCP small queues, and low latency polling. We will cover APIs exposed by the kernel that go beyond use of write()/read() on sockets and will look into how they are implemented on the kernel side.
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDPThomas Graf
This talk will start with a deep dive and hands on examples of BPF, possibly the most promising low level technology to address challenges in application and network security, tracing, and visibility. We will discuss how BPF evolved from a simple bytecode language to filter raw sockets for tcpdump to the a JITable virtual machine capable of universally extending and instrumenting both the Linux kernel and user space applications. The introduction is followed by a concrete example of how the Cilium open source project applies BPF to solve networking, security, and load balancing for highly distributed applications. We will discuss and demonstrate how Cilium with the help of BPF can be combined with distributed system orchestration such as Docker to simplify security, operations, and troubleshooting of distributed applications.
A talk at Open vSwitch 2018 Fall Conference. OVN control plane scalability is critical in production. While the distributed control plane architecture is a big advantage, the distributed controller on each hypervisor became the first bottle neck for scaling. This talk is to share how we (eBay and the community) solved the problem with Incremental Processing - the idea, challenges, and performance improvement results.
This chapter contains information for memory compilers available in STDL80 cell library. These are
complete compilers that consist of various generators to satisfy the requirements of the circuit at hand. Each
of the final building block, the physical layout, will be implemented as a stand-alone, densely packed,
pitch-matched array. Using this complex layout generator and adopting state-of-the-art logic and circuit
design technique, these memory cells can realize extreme density and performance. In each layout
generator, we added an option which makes the aspect ratio of the physical layout selectable so that the
ASIC designers can choose the aspect ratio according to the convenience of the chip level layout.
In this deck from ATPESC 2019, Jack Dongarra from UT Knoxville presents: Adaptive Linear Solvers and Eigensolvers.
"Success in large-scale scientific computations often depends on algorithm design. Even the fastest machine may prove to be inadequate if insufficient attention is paid to the way in which the computation is organized. We have used several problems from computational physics to illustrate the importance of good algorithms, and we offer some very general principles for designing algorithms. Two subthemes are, first, the strong connection between the algorithm and the architecture of the target machine; and second, the importance of non-numerical methods in scientific computations."
Watch the video: https://wp.me/p3RLHQ-lq3
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Migrating from InnoDB and HBase to MyRocks at FacebookMariaDB plc
Facebook created a new storage engine called MyRocks to optimize space and write performance, and recently migrated both UDB (a database for social activities, and our biggest in production) and Facebook Messenger to MyRocks. In this session, Yoshinori Matsunobu of Facebook talks about the challenges, benefits and lessons learned by migrating these applications from InnoDB to MyRocks.
In-memory processing has started to become the norm in large scale data handling. This is aclose to the metal analysis of highly important but often neglected aspects of memory accesstimes and how it impacts big data and NoSQL technologies.We cover aspects such as the TLB, the Transparent Huge Pages, the QPI Link, Hyperthreading and the impact of virtualization on high-memory footprint applications. We present benchmarks of various technologies ranging from Cloudera’s Impala to Couchbase and how they are impacted by the underlying hardware.The key takeaway is a better understanding of how to size a cluster, how to choose a cloud provider and an instance type for big data and NoSQL workloads and why not every core or GB of RAM is created equal.
This presentation introduces Data Plane Development Kit overview and basics. It is a part of a Network Programming Series.
First, the presentation focuses on the network performance challenges on the modern systems by comparing modern CPUs with modern 10 Gbps ethernet links. Then it touches memory hierarchy and kernel bottlenecks.
The following part explains the main DPDK techniques, like polling, bursts, hugepages and multicore processing.
DPDK overview explains how is the DPDK application is being initialized and run, touches lockless queues (rte_ring), memory pools (rte_mempool), memory buffers (rte_mbuf), hashes (rte_hash), cuckoo hashing, longest prefix match library (rte_lpm), poll mode drivers (PMDs) and kernel NIC interface (KNI).
At the end, there are few DPDK performance tips.
Tags: access time, burst, cache, dpdk, driver, ethernet, hub, hugepage, ip, kernel, lcore, linux, memory, pmd, polling, rss, softswitch, switch, userspace, xeon
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...Flink Forward
This talk shares experiences from deploying and tuning Flink steam processing applications for very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain what aspects currently render a job as particularly demanding, show how to configure and tune a large scale Flink job, and outline what the Flink community is working on to make the out-of-the-box for experience as smooth as possible. We will, for example, dive into - analyzing and tuning checkpointing - selecting and configuring state backends - understanding common bottlenecks - understanding and configuring network parameters
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...Flink Forward
Stateful stream processing with exactly-once guarantees is one of Apache Flink's distinctive features and we have observed that the scale of state that is managed by Flink in production is constantly growing. This development created new challenges for state management in Flink, in particular for state checkpointing, which is the core of Flink's fault tolerance mechanism. Two of the most important problems that we had to solve were the following: (i) how can we limit the duration and size of checkpoints to something that does not grow linearly in the size of the state and (ii) how can we take checkpoints without blocking the processing pipeline in the meantime? We have implemented incremental checkpoints to solve the first problem by checkpointing only the changes between checkpoints, instead of always recording the whole state. Asynchronous checkpoints address the second problem and enable Flink to continue processing concurrently to running checkpoints. In this talk, we will take a deep dive into the details of Flink's new checkpointing features. In particular, we will talk about the underlying datastructures, log-structured merge trees and copy-on-write hash tables, and how those building blocks are assembled and orchestrated to advance Flink's checkpointing.
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and BeyondScyllaDB
Beyond the immediate schema changes supported in Scylla Open Source 5.0, learn how the Raft consensus infrastructure will enable radical new capabilities. Discover how it will enable more dynamic topology changes, tablets, immediate consistency, better and faster elasticity, and simplification to repair operations.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...confluent
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...ScyllaDB
Scylla strives to deliver high throughput at low, consistent latencies under any scenario. But in the field things can and do get slower than one would like. Some of those issues come from bad data modelling and anti-patterns. Some others from lack of resources and bad system configuration, and in rare cases even product malfunction.
But how to tell them apart? And once you do, how to understand how to fix your application or reconfigure your system? Scylla has a rich ecosystem of tools available to answer those questions and in this talk we’ll discuss the proper use of some of them and how to take advantage of each tool’s strength. We will discuss real examples using tools like CQL tracing, nodetool commands, the Scylla monitor and others.
Seastore: Next Generation Backing Store for CephScyllaDB
Ceph is an open source distributed file system addressing file, block, and object storage use cases. Next generation storage devices require a change in strategy, so the community has been developing crimson-osd, an eventual replacement for ceph-osd intended to minimize cpu overhead and improve throughput and latency. Seastore is a new backing store for crimson-osd targeted at emerging storage technologies including persistent memory and ZNS devices.
Seastore: Next Generation Backing Store for CephScyllaDB
Ceph is an open source distributed file system addressing file, block, and object storage use cases. Next generation storage devices require a change in strategy, so the community has been developing crimson-osd, an eventual replacement for ceph-osd intended to minimize cpu overhead and improve throughput and latency. Seastore is a new backing store for crimson-osd targeted at emerging storage technologies including persistent memory and ZNS devices.
Accelerating HBase with NVMe and Bucket CacheNicolas Poggi
on-Volatile-Memory express (NVMe) standard promises and order of magnitude faster storage than regular SSDs, while at the same time being more economical than regular RAM on TB/$. This talk evaluates the use cases and benefits of NVMe drives for its use in Big Data clusters with HBase and Hadoop HDFS.
First, we benchmark the different drives using system level tools (FIO) to get maximum expected values for each different device type and set expectations. Second, we explore the different options and use cases of HBase storage and benchmark the different setups. And finally, we evaluate the speedups obtained by the NVMe technology for the different Big Data use cases from the YCSB benchmark.
In summary, while the NVMe drives show up to 8x speedup in best case scenarios, testing the cost-efficiency of new device technologies is not straightforward in Big Data, where we need to overcome system level caching to measure the maximum benefits.
Paper_An Efficient Garbage Collection in Java Virtual Machine via Swap I/O O...Hyo jeong Lee
This is a presentation for following paper:
Hyojeong Lee, et al. "An Efficient Garbage Collection in Java Virtual Machine via Swap I/O Optimization" (2019).
Paper_Design of Swap-aware Java Virtual Machine Garbage Collector PolicyHyo jeong Lee
This is a presentation for the following papers:
(1) Chen, Qichen. "SAGP: A Design of Swap Aware JVM GC Policy." Middleware’18 (2018).
(2) Lee Hyojeong, Heonyoung Yeom, and Yongseok Son. "Design of Swap-aware Java Virtual Machine Garbage Collector Policy." 한국정보과학회 학술발표논문집 (2018): 16-18.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
May Marketo Masterclass, London MUG May 22 2024.pdfAdele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfJay Das
With the advent of artificial intelligence or AI tools, project management processes are undergoing a transformative shift. By using tools like ChatGPT, and Bard organizations can empower their leaders and managers to plan, execute, and monitor projects more effectively.
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
Navigating the Metaverse: A Journey into Virtual Evolution"Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms."
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteGoogle
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
Les Buildpacks existent depuis plus de 10 ans ! D’abord, ils étaient utilisés pour détecter et construire une application avant de la déployer sur certains PaaS. Ensuite, nous avons pu créer des images Docker (OCI) avec leur dernière génération, les Cloud Native Buildpacks (CNCF en incubation). Sont-ils une bonne alternative au Dockerfile ? Que sont les buildpacks Paketo ? Quelles communautés les soutiennent et comment ?
Venez le découvrir lors de cette session ignite
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
12. Motivation: Architectural Issues (3)
▪ Synchronous I/O delay
  1. Buffer the log (in the central log buffer in DRAM).
  2. Flush the log to storage (HDD or NVM).
  3. Write the data.
(Figure: Thread 1 running Transaction A stalls while the central log buffer is flushed from DRAM to HDD/NVM.)
13. Summary
Motivation.
▪ The central log buffer limits the scalability of DB logging on multicores.
  → Parallel logging on multicores.
Contribution.
▪ ELEDA (Express Logging Ensuring Durable Atomicity)
  ▪ A fast, scalable logging method for high-performance transaction systems with guaranteed atomicity and durability.
  ▪ Built on concurrent data structures that remove the performance bottleneck of the central log buffer.
▪ Implementation
  ▪ Plug ELEDA into WiredTiger and Shore-MT and evaluate the performance improvements.
  ▪ (ex) Transaction throughput improves to more than ~3.9 million Txn/s.
14. Design: Parallel Logging on Multicore, Grasshopper (1)
▪ Issues in parallel logging on multicores
  ▪ Guarantee the sequentiality of the logs.
  ▪ Detect log holes.
  ▪ Concurrently:
    ▪ buffer logs,
    ▪ write logs to durable storage.
15. Design: Parallel Logging on Multicore, Grasshopper (1)
▪ Issues in parallel logging on multicores
  ▪ Guarantee the sequentiality of the logs.
(Figure: threads T1, T2, and T3 each obtain an LSN (1, 2, 3) through a Fetch_and_Add on the global counter.)
(cf) LSN: Log Sequence Number
16. Design: Parallel Logging on Multicore, Grasshopper (1)
▪ Issues in parallel logging on multicores
  ▪ Detect log holes.
(Figure: T1 and T3 have buffered their records L1 and L3, but T2's record L2 is still missing, leaving a hole; the SBL cannot advance past it.)
(cf) SBL: sequentially buffered LSN
17. Design: Parallel Logging on Multicore, Grasshopper (1)
▪ Issues in parallel logging on multicores
  ▪ Concurrently:
    ▪ buffer logs,
    ▪ write logs to durable storage.
(Figure: while T1–T3 keep buffering L1–L3, the flusher concurrently writes the contiguous prefix up to the SBL, stopping at the hole.)
18. Design: Parallel Logging on Multicore, Grasshopper (1)
▪ Issues in parallel logging on multicores
  ▪ Guarantee the sequentiality of the logs.
  ▪ Detect log holes.
  ▪ Concurrently:
    ▪ buffer logs,
    ▪ write logs to durable storage.
▪ So, design a concurrent data structure that provides:
  ▪ concurrent buffering and flushing of logs,
  ▪ fast log-hole detection.
19. Design: Parallel Logging on Multicore, Grasshopper (2)

Thread type    | ELEDA-worker      | ELEDA-flusher | Database (DB thread)
---------------|-------------------|---------------|------------------------
Data structure | Central log buffer (global, shared by all thread types)
  (others)     | Hopping index (R) | ·             | Hopping index (W)
               | C&H-list          |               | C&H-list
               | Min heap          |               |
Operation      | Tracking holes    | Flush         | Copy log to buffer
               |                   |               | Garbage collection
20. Design: Parallel Logging on Multicore, Grasshopper (2)
▪ ELEDA logging architecture
(Figure: the ELEDA logging architecture, showing how DB threads interact with the logging components.)
28. Design: Execution process of ELEDA-based system
(Figure: each thread keeps a Hopping list and a Crawling list, each with head and tail pointers, over its log pages. Thread 1 holds LSNs 1, 4, 7 on Pages 1–3; Thread 2 holds LSNs 2, 6 on Pages 1–2; Thread 3 holds LSNs 3, 5 on Pages 1–2. The per-thread minimums 1, 2, 3 feed the min heap.)
29. Design: Execution process of ELEDA-based system
▪ Tracking LSN holes (= log holes) and flushing the SBL
Flusher:
  1. Get the HB by scanning the Hopping index table (HB is 2 in this case).
  2. Remove the items related to page number 2 from the c-list and h-list.
  3. Rebuild the min heap.
  4. Pop the root (7) of the min heap.
  5. Then the SBL is 7.
  6. Flush LSNs 1–7 to storage.
(cf)
- HB: hopping boundary
- SBL: sequentially buffered LSN
(Figure: Worker and Flusher acting on the Hopping index with the HB marker; annotations: [1] 4096 = DB page size, [2] 4096 = DB page size, [3] 4096 / 3 < DB page size.)
30. Design: Execution process of ELEDA-based system
(Figure: after the cleanup, Thread 1's lists keep only LSN 7 on Page 3, Threads 2 and 3 are empty, and 7 is popped as the root of the min heap.)
31. Design: Execution process of ELEDA-based system
(Figure: after the flush completes, the Hopping and Crawling lists of all three threads are empty.)
Flusher:
  1. Get the HB by scanning the Hopping index table (HB is 2 in this case).
  2. Remove the items related to page number 2 from the c-list and h-list.
  3. Rebuild the min heap.
  4. Pop the root (7) of the min heap.
  5. Then the SBL is 7.
  6. Flush LSNs 1–7 to storage.
(cf)
- HB: hopping boundary
- SBL: sequentially buffered LSN
32. Implementation
▪ Applying to a kernel file system, such as ext4.
▪ Abstraction

Thread type    | ELEDA-worker      | ELEDA-flusher | Database (DB thread)
---------------|-------------------|---------------|------------------------
Data structure | Central log buffer (global, shared by all thread types)
  (others)     | Hopping index (R) | ·             | Hopping index (W)
               | C&H-list          |               | C&H-list
               | Min heap          |               |
Operation      | Tracking holes    | Flush         | Copy log to buffer
               |                   |               | Garbage collection
33. Implementation
▪ Shore-MT
  ▪ Implement ELEDA in Shore-MT with Aether.
    (cf) Aether: A Scalable Approach to Logging, R. Johnson et al.
  ▪ Details
    ▪ Replace its consolidation-array-based logging subsystem.
    ▪ Modify its flush-pipelining implementation for transaction switching.
34. Other issues (1)
▪ Flush
  ▪ The I/O unit for flushing is tailored experimentally.
  ▪ It depends on the characteristics of the application:
    ▪ average log size,
    ▪ maximum concurrency.
  (cf) 6.5.3 Effects of I/O unit size (64 KiB and 512 KiB)
▪ Garbage collection & callback
  ▪ The GC pointer is accessed exclusively by the owning DB thread.
35. Other issues (2)
▪ Partially sequential implementation
  ▪ Access of DB threads to the Hopping index.
▪ Evaluation
  ▪ Throughput and commit latency
  ▪ Workloads
    ▪ Key-value
    ▪ Online transaction processing
    ▪ with different settings of DB options
  ▪ CPU utilization and effects of I/O unit size
36. Summary
Motivation.
▪ The central log buffer limits the scalability of DB logging on multicores.
→ Parallel logging on multicores using Grasshopper
Contribution.
▪ ELEDA (Express Logging Ensuring Durable Atomicity)
▪ A fast, scalable logging method for high-performance transaction
systems, with guaranteed atomicity and durability.
▪ Built on concurrent data structures that resolve the performance
bottlenecks of the central log buffer.
▪ Implementation
▪ Plug ELEDA into WiredTiger and Shore-MT and evaluate the performance
improvements.
▪ (ex) In the best case, transaction throughput improves to up to ~3.9 million
txn/s.
37. TODO
▪ Analyze Shore-MT and Aether.
▪ Where can the logging and flusher modules be inserted?
▪ Design the logging subsystem and flusher modules.
▪ Implement ELEDA in Shore-MT.
▪ The starting point is the C&H-list.
39. Shore-MT and Aether
▪ Shore-MT
▪ Open-source multi-threaded storage manager.
▪ The authors use the EPFL branch of Shore-MT.
▪ Aether
▪ A scalable approach to logging.
▪ Details for implementation
▪ 4.1 Flush Pipelining → modified to ELEDA's design
▪ A.1 Log buffer design
▪ A.2 Consolidation array → replaced with ELEDA's design
▪ A.3 Modification to address potential delays caused by the
requirement that all threads release their buffers in order
(cf) Pseudocode for these exists.
40. Shore-MT
▪ Shore-MT and target for optimization
▪ Open-source multi-threaded storage manager.
▪ The authors use the EPFL branch of Shore-MT.
(cf) https://bitbucket.org/shoremt/shore-mt/src/e832a6a586048ad3f4cdefde30cf96131d4b4525?at=default
▪ Language
▪ C++
▪ Related code in src/sm/log.h & log.cpp
▪ Log manager class log_m
41. Aether
▪ Aether and TODO
▪ A scalable approach to logging.
▪ Details for implementation
▪ 4.1 Flush Pipelining → Modified to ELEDA's design
▪ Related code in src/sm/log_core.cpp
▪ Default flusher method:
rc_t log_core::flush(lsn_t lsn, bool block)
▪ A.1 Log buffer design
▪ A.2 Consolidation array → Replaced with ELEDA's design
▪ A.3 Modification to address potential delays caused by the
requirement that all threads release their buffers in order
▪ A.4 Difficulty of distributing the log
42. TODO
▪ Analyze Shore-MT and Aether.
▪ Shore-MT (default) → Aether → ELEDA
: Define which features (e.g. multi-logging by DB threads) are
implemented in each system.
▪ Find out which parts of Aether (flush pipelining, consolidation
array) can be replaced by ELEDA's modules.
▪ Design the logging subsystem and flusher modules.
▪ Implement ELEDA in Shore-MT.
▪ The starting point is the C&H-list.
43. Reference
▪ Johnson, Ryan, et al. "Aether: a scalable approach to logging." Proceedings of the VLDB Endowment 3.1-2 (2010): 681-692.
▪ Shore-MT (source code and docs), https://bitbucket.org/shoremt/
▪ Shore Storage Manager Modules, http://research.cs.wisc.edu/shore-mt/onlinedoc/html/index.html
▪ Implementation notes of Log manager, http://research.cs.wisc.edu/shore-mt/onlinedoc/html/implnotes.html#LOG_M
Editor's Notes
Broadly, I will cover the paper in the order of motivation, design, and implementation.
First, today's commercially available databases share the following two key characteristics.
Next, I point out the problem caused by the conventional WAL protocol; this is explained in detail later.
Among the ACID properties, this paper's idea especially emphasizes atomicity and durability.
In particular, the authors point out the trade-off between durability and performance.
As the table here shows, existing DBs either give up durability to fix slow performance,
or offer options that sacrifice some speed for durability.
(The idea presented here narrows that gap.)
Next, the WAL protocol mentioned earlier.
Conventionally, an architecture like the following figure is used for WAL.
The authors raise two issues here:
first, the scalability of the central log buffer itself; second, synchronous I/O delay.
First, the scalability problem of the central log buffer.
As this example shows, guarding the central log buffer, a shared resource, with a lock
hits a performance limit in multicore environments.
Next, the synchronous I/O delay problem.
Under the conventional WAL protocol, a thread must keep the order of writing the log
and flushing it to storage before writing the actual data.
The time a thread spends waiting for this flush, i.e. the disk I/O, is called the synchronous I/O delay.
Using NVM instead of an HDD as the storage device reduces the delay,
but the authors claim that the method proposed in this paper performs even better.
To recap so far: the motivation is that the central log buffer scheme of existing DBs
has a scalability limit
and performance problems such as synchronous I/O delay.
To solve this, the authors propose parallel logging on multicores.
They call it ELEDA:
by exploiting concurrent data structures that resolve the performance bottlenecks of the central log buffer,
it is a high-performance transaction system that still guarantees atomicity and durability.
Briefly on the results:
they applied it to the WiredTiger and Shore-MT DBs and observed performance improvements.
For example, in the best case, transaction throughput improved to 3.9 million transactions per second.
Now the design itself.
The authors named ELEDA's core design technique Grasshopper.
Implementing the parallel logging on multicores described above
raises roughly the following three issues.
First, the sequentiality of each log must be guaranteed.
Briefly, instead of a lock, ELEDA uses
a fetch-and-add operation to assign each transaction a unique number,
called the log sequence number (LSN).
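The lock-free reservation above can be sketched as follows. A minimal illustration with names of my own (`LogBuffer`, `reserve`), not the paper's code; in C/C++ the whole read-modify-write would be a single atomic fetch_add instruction, which Python has to emulate.

```python
import threading

class LogBuffer:
    """Sketch of fetch-and-add LSN assignment on the central log buffer.

    Each thread reserves a private region with one fetch-and-add on the
    tail; the old tail value doubles as the start offset/LSN of its log,
    so threads never block each other on a buffer lock.
    """
    def __init__(self):
        self.tail = 0
        self._fa = threading.Lock()  # stands in for the hardware atomic

    def reserve(self, size):
        with self._fa:
            lsn = self.tail   # old tail = this log's start offset
            self.tail += size
        return lsn

buf = LogBuffer()
print(buf.reserve(100))  # -> 0   (thread 1 copies into [0, 100))
print(buf.reserve(50))   # -> 100 (thread 2 copies into [100, 150))
```

Because each thread gets a disjoint region, copies proceed fully in parallel; the price is that a slow copier leaves a hole in front of faster ones, which is exactly what the hole-detection machinery below addresses.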
To flush the sequential log blocks from the front of the central log buffer,
the first hole must be located quickly.
Finally, logs must be buffered concurrently,
and the buffered logs must be flushed to storage.
In short, concurrent data structures and techniques must enable
concurrent buffering and flushing
as well as fast hole detection.
To achieve this, ELEDA proposes the following.
First, for concurrent operation, threads are defined in three types:
a worker that tracks holes;
a flusher that flushes using the SBL information the worker provides;
and the existing DB threads that copy logs into the buffer and perform work such as GC.
The three thread types share the central log buffer as a global data structure;
the worker uses three data structures: the hopping index table, the hopping and crawling lists, and a min heap.
Each DB thread maintains two per-thread lists, a crawling list and a hopping list.
What these data structures are for is explained later.
Now the parallel-logging implementation issues mentioned briefly before, in more detail.
When a DB thread copies a transaction's log into the buffer,
it is assigned an LSN (log sequence number)
so that each thread, independently and without interfering with the others (and without a lock),
finds its own position in the central buffer and copies the log there.
Previously, if a hole existed at flush time, the system waited for the log hole to be filled,
so even on multicores there was no benefit from flushing in parallel.
> algo (1): flush whenever a contiguous run of logs forms in the middle of the buffer.
A lock on the central buffer would work, but its overhead is large,
so instead each thread performs a fetch-and-add operation.
The algorithm that effectively detects the resulting holes is described next.
As mentioned, hole detection is performed by the worker thread.
This is where the grasshopper algorithm comes in.
The hopping index table, one of the worker's data structures, stores,
for each chunk of the central log buffer the size of one DB page (denoted 2^H),
the total size of the logs buffered in that chunk.
Scanning the table in order, if an entry is full at 2^H, the worker hops over the log page by page.
If the entry is smaller than 2^H, there is a hole,
so from the start of that page the worker crawls log by log.
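The hop-vs-crawl decision can be sketched like this. All names (`scan`, `logs_by_page`) and the per-page bookkeeping below are my own illustration of the idea, not the paper's implementation:

```python
PAGE = 4096  # 2**H, one DB page

def scan(hopping_index, logs_by_page):
    """Sketch of the grasshopper scan.

    hopping_index[i] = total bytes of logs already buffered in page i.
    A full page (== PAGE) is hopped over in a single step; at the first
    underfull page the scan drops to log granularity and crawls that
    page's buffered logs (logs_by_page[i]) to locate the hole.
    Returns (hopping boundary, LSNs crawled in the hole page).
    """
    hb = -1
    for i, filled in enumerate(hopping_index):
        if filled == PAGE:
            hb = i                      # hop: whole page is buffered
        else:
            return hb, logs_by_page[i]  # crawl inside page i
    return hb, []

# Pages 0-1 are full; page 2 holds only 1000 buffered bytes, so it
# contains a hole and is crawled log by log.
hb, crawled = scan([PAGE, PAGE, 1000], {2: [7, 9]})
print(hb, crawled)  # -> 1 [7, 9]
```

The page-granularity hops are what make the scan cheap: only the single page straddling the first hole is ever inspected at log granularity.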
Next, the remedy for the synchronous I/O delay.
> algo (2): asynchronous I/O that hides this latency.
If a thread needs to flush a log while executing transaction 1,
it does not wait but context-switches to transaction 2.
When the flush thread finishes its work and calls back,
the thread returns to transaction 1 and performs the actual data write.
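The switch-and-resume pattern can be sketched as below. This is my own minimal illustration of the idea (the names, the dict of parked transactions, and the queue-based callback are all assumptions), not the paper's flush-pipelining code:

```python
from queue import Queue

# Instead of blocking on the flush, the DB thread parks transaction 1 and
# switches to transaction 2; once the flusher's callback arrives,
# transaction 1 resumes and performs its actual data write.

parked = {}          # txn id -> action to run once its log is durable
callbacks = Queue()  # filled by the flusher thread

def commit_log(txn, on_durable):
    parked[txn] = on_durable  # don't wait for the flush
    # ... the DB thread now switches to the next transaction ...

def flusher_flushed(txn):
    callbacks.put(txn)        # flusher signals durability

def drain_callbacks(out):
    while not callbacks.empty():
        txn = callbacks.get()
        parked.pop(txn)(out)  # resume txn: do the actual data write

log = []
commit_log("T1", lambda o: o.append("T1 data written"))
flusher_flushed("T1")
drain_callbacks(log)
print(log)  # -> ['T1 data written']
```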
To summarize again:
the thread types are DB, worker, and flusher.
Each DB thread maintains one crawling-list/hopping-list pair,
abbreviated c-list and h-list.
The c-list is maintained at LSN granularity and the h-list at DB-page granularity.
The worker maintains the min heap and the hopping index,
where the min heap holds only the minimum LSN of each c-list.
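The per-thread bookkeeping just described can be sketched as follows; the class and function names are mine, and the list contents are hypothetical:

```python
import heapq
from collections import deque

class DBThreadLists:
    """Sketch of one DB thread's bookkeeping.

    c-list: LSNs of this thread's buffered logs (log granularity).
    h-list: DB page numbers those logs touch (page granularity).
    """
    def __init__(self):
        self.c_list = deque()  # LSNs, appended in increasing order
        self.h_list = deque()  # page numbers, deduplicated

    def add_log(self, lsn, page):
        self.c_list.append(lsn)
        if not self.h_list or self.h_list[-1] != page:
            self.h_list.append(page)

def build_min_heap(threads):
    """Worker side: a min heap over each thread's smallest c-list LSN."""
    heap = [t.c_list[0] for t in threads if t.c_list]
    heapq.heapify(heap)
    return heap

t1, t2 = DBThreadLists(), DBThreadLists()
t1.add_log(lsn=1, page=0); t1.add_log(lsn=4, page=1)
t2.add_log(lsn=2, page=0)
print(build_min_heap([t1, t2])[0])  # -> 1
```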
To show how these are actually used,
the technique above is now illustrated at the data-structure level.
As a simple example, assume the logs of threads 1, 2, and 3
are placed in the central buffer as shown.
Also, assume a DB page size of 4 KB.
Terms:
SDL: storage durable LSN
SBL: sequentially buffered LSN
LSN: log sequence number
LSN hole: partially buffered log
In this situation, the worker thread's hopping index table holds the following values.
Meanwhile, each thread's h-list and c-list look as follows.
As stated, the h-list manages pages and the c-list manages LSNs (logs).
The minimum LSN of each thread's c-list is kept in a min-heap structure.
Omitted here: each time a DB thread adds an LSN, it adds the log size to the corresponding hopping index table entry.
Concurrent writes by multiple DB threads to the hopping index are disallowed by the system.
Given this situation, here is how the worker thread tracks holes
and what information it hands to the flusher thread.
First, the worker scans the hopping index table and finds the hopping boundary,
one less than the index of the first entry that does not reach 4 KB;
that is, it obtains the index of the page just before the page range containing the hole.
Here, that is 2.
Next, all items related to page 2 are removed from the c-lists and h-lists.
Then the heap is rebuilt.
After finishing these first three steps, the picture looks like the next slide.
Like this.
Popping the min from the heap yields the SBL.
The flusher then flushes the sequential logs, LSN 1 through 7, to storage.
This concludes the main explanation of ELEDA and the grasshopper algorithm.
The implementation plan is as follows:
implement the data structures and methods according to the table summarized earlier.
According to the paper's table, lock-free concurrent logging is completed by
giving the three thread types different access rights to each data structure's pointers,
so that information should be consulted.
In particular, of WiredTiger and Shore-MT, the paper describes the Shore-MT implementation as follows:
ELEDA was implemented in Shore-MT using Aether.
The existing array-based logging subsystem was replaced with ELEDA's design,
and flush pipelining was implemented for transaction context switching.
Beyond this, the GC and flush implementations need further thought.
In particular, for the flusher thread, the authors leave the per-flush I/O unit open.
The paper evaluates the 64 and 512 KiB cases,
and notes that a latency/bandwidth trade-off remains, so it must be tuned carefully.
(They say they have left this problem open to the community.)
Also, aspects far from a concurrent design, such as the system allowing DB threads
to access the hopping index only sequentially, one at a time,
could be targets for future improvement.
The evaluation categories are as above; I will skip them for now.
The remaining work is roughly divided into three stages.
We need to anticipate which parts will require modification.
In fact, the ELEDA paper says that this version of Shore-MT already implements
several features that provide multicore scalability (DB locking, latching, logging);
accordingly, the authors optimized on top of the existing logging.
The optimization part is expected to be the underlined portions.
We need to understand flush pipelining and the consolidation array in Aether.
Especially the consolidation array! For flush pipelining, the communication between the flush thread and the worker thread must be implemented.
(Once the SBL is determined, the worker must pass its position to the flusher via a callback.)
We must define how much of each of Shore-MT (default) → Aether → ELEDA is already implemented.
(ex) multi-logging by DB threads, … figure out what functionality the consolidation array provides.
That is, we must define which parts can be replaced by which ELEDA modules.