In this deck from the Swiss HPC Conference, Torsten Hoefler from ETH Zurich presents: RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects.
"Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communications despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly. We demonstrate how to efficiently implement the specification on modern RDMA networks. Our protocols support scaling to millions of cores with negligible memory consumption while providing the highest performance and minimal overheads, comparable to or better than UPC and CAF in terms of latency, bandwidth, and message rate. After this, we recognize that network cards contain rather powerful processors optimized for data movement, and that limiting the functionality to remote direct memory access seems unnecessarily constraining. We develop sPIN, a portable programming model to offload simple packet processing functions to the network card. To demonstrate the potential of the model, we design a cycle-accurate simulation environment by combining the network simulator LogGOPSim and the CPU simulator gem5. We implement offloaded message matching, datatype processing, and collective communications and demonstrate transparent full-application speedups. Furthermore, we show how sPIN can be used to accelerate redundant in-memory filesystems and several other use cases. Our work investigates a portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL. We show how such network acceleration enables an ecosystem that can significantly speed up applications and system services."
Watch the video: https://wp.me/p3RLHQ-klC
Learn more: http://htor.inf.ethz.ch/
and
http://hpcadvisorycouncil.com/events/2019/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
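The abstract above centers on one-sided communication: the origin writes directly into memory the target has exposed, with no matching receive posted. A loose, self-contained Python sketch of that style, using `multiprocessing.shared_memory` as a stand-in for an RDMA-exposed window (an analogy only, not MPI-3 RMA):

```python
from multiprocessing import shared_memory

# Target side exposes a "window": a named memory region others may access.
win = shared_memory.SharedMemory(create=True, size=8)

# Origin side attaches to the same region by name and writes directly,
# with no matching receive on the target -- the essence of one-sided RMA.
origin = shared_memory.SharedMemory(name=win.name)
origin.buf[:5] = b"hello"

# Target simply observes the data that appeared in its window.
data = bytes(win.buf[:5])

origin.close()
win.close()
win.unlink()
print(data)  # b'hello'
```

In real MPI-3 RMA the window is created with `MPI_Win_allocate` and updated with `MPI_Put`/`MPI_Get` under explicit synchronization; the sketch only conveys the "no receive needed" property.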
The document provides an overview of the libfabric interface for high performance networking. It discusses the high-level architecture including interfaces, services, and the object model. It describes the control interface for querying capabilities and attributes. It also presents a simple ping-pong example using reliable datagram messaging endpoints to illustrate basic usage.
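The ping-pong exchange mentioned above can be sketched over a plain socketpair as a stand-in for libfabric endpoints; the real example uses libfabric send/receive calls on reliable datagram endpoints, so this is only the shape of the exchange:

```python
import socket

# Two connected "endpoints"; in the libfabric example these would be
# fabric endpoints created from a domain, not POSIX sockets.
a, b = socket.socketpair()

a.sendall(b"ping")      # client sends the ping
msg = b.recv(4)         # server receives it...
b.sendall(b"pong")      # ...and answers
reply = a.recv(4)       # client collects the pong

a.close()
b.close()
```

The libfabric version adds the pieces the summary lists: querying capabilities via the control interface, binding completion queues, and posting buffers before the exchange.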
The document provides an overview and agenda for the libfabric software interface. It discusses the design guidelines, user requirements, software development strategies, architecture, providers, and getting started with libfabric. The key points are that libfabric provides portable low-level networking interfaces, has an object-oriented design, supports multiple fabric hardware through providers, and gives applications high-level and low-level interfaces to optimize performance.
Application Performance & Flexibility on Exokernel Systems paper review - Vimukthi Wickramasinghe
This document provides an overview of operating system structures and summarizes a research paper discussing the performance of exokernel systems. It introduces layered and alternative OS structures like microkernels and exokernels. The paper discusses the principles of exokernels and evaluates the performance of the Xok exokernel and its components like the XN storage system. Performance tests show Xok can match or exceed UNIX performance for most applications and significantly outperform UNIX for I/O intensive servers like HTTP servers. While exokernels provide advantages like flexibility and performance, their interfaces can be complex and lead to code management issues.
Open vSwitch Offload: Conntrack and the Upstream Kernel - Netronome
Offloading all or part of the Open vSwitch datapath to SmartNICs has been shown not only to release CPU resources on the server, but also to improve traffic processing performance. Recently, steps have been taken to support such offloading in the upstream Linux kernel, focused on creating an OVS datapath using the TC flower filter and utilizing the offload hooks already present there. This presentation focuses on how Connection Tracking (Conntrack) may fit into this model. It describes current work being undertaken with the Netfilter community to allow offloading of Conntrack entries, and links this work to the offloading of Conntrack rules within OVS-TC.
Ricardo Mourato gave a presentation on hacking the QNX RTOS. Some key points:
- QNX is a real-time operating system used in embedded systems like medical devices, robots, and cars. It has a microkernel architecture for reliability.
- Potential vulnerabilities were demonstrated, like exploiting default services like Telnet and FTP to gain root access, or abusing the QCONN debugging protocol.
- The Qnet inter-process communication could allow accessing resources like files and processes remotely in a transparent way.
- A live demonstration showed exploiting these avenues to hack into a QNX system remotely or locally. Default configurations and services provide initial access points to attack the system.
LCU13: Deep Dive into ARM Trusted Firmware
Resource: LCU13
Name: Deep Dive into ARM Trusted Firmware
Date: 31-10-2013
Speaker: Dan Handley / Charles Garcia-Tobin
Talk for PerconaLive 2016 by Brendan Gregg. Video: https://www.youtube.com/watch?v=CbmEDXq7es0 . "Systems performance provides a different perspective for analysis and tuning, and can help you find performance wins for your databases, applications, and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes six important areas of Linux systems performance in 50 minutes: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events), static tracing (tracepoints), and dynamic tracing (kprobes, uprobes), and much advice about what is and isn't important to learn. This talk is aimed at everyone: DBAs, developers, operations, etc, and in any environment running Linux, bare-metal or the cloud."
High-Performance Networking Using eBPF, XDP, and io_uring - ScyllaDB
Bryan McCoid discusses using eBPF, XDP, and io_uring for high performance networking. XDP allows programs to process packets in the kernel without loading modules. AF_XDP sockets use eBPF to route packets between kernel and userspace via ring buffers. McCoid builds on Glommio, a Rust runtime, to interface with these techniques. The runtime integrates with io_uring and allows multiple design patterns for receiving packets from AF_XDP sockets.
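The kernel/userspace handoff described above happens through descriptor rings. A minimal single-producer/single-consumer ring in Python, loosely modelled on the fill/RX rings AF_XDP uses (an illustration of the mechanism, not the real UAPI; all names are invented):

```python
class Ring:
    """SPSC ring buffer: producer advances head, consumer advances tail.
    Indices grow monotonically and are masked into the slot array."""
    def __init__(self, size):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.slots = [None] * size
        self.mask = size - 1
        self.head = 0  # producer index
        self.tail = 0  # consumer index

    def push(self, desc):
        if self.head - self.tail == len(self.slots):
            return False  # ring full
        self.slots[self.head & self.mask] = desc
        self.head += 1
        return True

    def pop(self):
        if self.tail == self.head:
            return None  # ring empty
        desc = self.slots[self.tail & self.mask]
        self.tail += 1
        return desc

# Kernel-side producer fills the RX ring; userspace consumes descriptors.
rx = Ring(4)
for i in range(3):
    rx.push(("pkt", i))
print(rx.pop())  # ('pkt', 0)
```

In the real AF_XDP design the slots hold offsets into a shared UMEM region, and head/tail are shared atomically between kernel and userspace rather than being plain Python attributes.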
Do you think of cheetahs, not RabbitMQ, when you hear the word Swift? Think a Nova is just a giant exploding star, not a cloud compute engine? This deck (presented at the OpenStack Boston meetup) provides an introduction that will answer your questions. It covers the basic components, including Nova, Swift, Cinder, Keystone, Horizon, and Glance.
Linux Containers(LXC) allow running multiple isolated Linux instances (containers) on the same host.
Containers share the host's kernel with everything else running on it, but can be constrained to use only a defined amount of resources such as CPU, memory, or I/O.
A container is a way to isolate a group of processes from the others on a running Linux system.
Pushing Packets - How do the ML2 Mechanism Drivers Stack Up? - James Denton
Architecting a private cloud to meet the use cases of its users can be a daunting task. How do you determine which of the many L2/L3 Neutron plugins and drivers to implement? Does network performance outweigh reliability? Are overlay networks just as performant as VLAN networks? The answers to these questions will drive the appropriate technology choice.
In this presentation, we will look at many of the common drivers built around the ML2 framework, including LinuxBridge, OVS, OVS+DPDK, SR-IOV, and more, and will provide performance data to help drive decisions around selecting a technology that's right for the situation. We will discuss our experience with some of these technologies, and the pros and cons of one technology over another in a production environment.
Simplifying Your IT Workflow with Katello and Foreman - Nikhil Kathole
This document discusses how Foreman and Katello can simplify IT workflows by managing infrastructure lifecycles, contents, and configurations. Foreman allows provisioning of various environments and operating systems. Katello adds content management capabilities like repositories and package updates. Ansible roles can be deployed and scheduled through Foreman for configuration management. OpenSCAP is also integrated for security compliance and vulnerability assessments. The presenter takes questions at the end on using Foreman and its future integrations.
Cisco's journey from Verbs to Libfabric - Jeff Squyres
This document summarizes Cisco's transition from using the Verbs API to using the Libfabric API for their usNIC network interface card. The Verbs API has limitations that make it difficult to support Ethernet features. Libfabric addresses these issues and more closely matches Cisco's hardware. Performance tests show Libfabric outperforming Verbs. Open MPI was adapted to support Libfabric through new plugins. This allows Libfabric to be used for both provider-specific and portable communication, benefiting MPI implementations. Cisco believes Libfabric is the best path forward as it matches their hardware, has performance benefits, and features MPI implementations have wanted.
This document provides an overview of Open Fabric Interfaces (OFI), which are high-performance APIs that allow applications to access fabric hardware with low latency. OFI defines interfaces for control, communication, completion, and data transfer services. It has an object-oriented design with objects like fabrics, domains, endpoints, and memory regions. OFI supports both connected and connectionless communication models and is agnostic to the underlying networking protocols and hardware implementations.
Netronome's half-day tutorial on host data plane acceleration at ACM SIGCOMM 2018 introduced attendees to models for host data plane acceleration and provided an in-depth understanding of SmartNIC deployment models at hyperscale cloud vendors and telecom service providers.
Presenter Bios
Jakub Kicinski is a long-term Linux kernel contributor, who has been leading the kernel team at Netronome for the last two years. Jakub's major contributions include the creation of BPF hardware offload mechanisms in the kernel and the bpftool user space utility, as well as work on the Linux kernel side of OVS offload.
David Beckett is a Software Engineer at Netronome with a strong technical background in computer networks, including academic research on DDoS. David has expertise in Linux architecture and computer programming. He holds a Master's degree in Electrical and Electronic Engineering from Queen's University Belfast, where he continues as a PhD student studying emerging application-layer DDoS threats.
The document provides an overview of a presentation on developing the OpenFabrics Interfaces libfabric. It discusses design guidelines for libfabric, including optimizing the software path to hardware and being scalable, implementation agnostic, and open source. It then covers libfabric architecture, including modes, capabilities, the object model, endpoint types, address vectors, data transfer types, and API bootstrap. It provides examples of how GASNet usage maps to libfabric objects, covering initialization, address exchange, and the use of multi-receive buffers. The presentation aims to provide a lightning fast introduction to libfabric version 1.5.
This document summarizes and compares the function call flow of the Linux NVMe driver on legacy versions and newer versions that use blk-mq.
In legacy versions before 3.19, bios bypass the block layer and are directly submitted to the NVMe driver. In versions 3.19 and later, bios go through blk-mq and are converted to requests that are queued to the NVMe driver.
The document outlines the key steps in the function call flow for legacy versions, which include converting bios to iods and submitting them, as well as setting up callback information. It then describes the similar but modified process for blk-mq versions, which includes building requests from bios, flushing requests from software queues to hardware queues, and dispatching them to the driver.
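The blk-mq flow described above stages requests in per-CPU software queues before draining them into a hardware queue the driver consumes. A toy Python model of just that queue structure (names are illustrative, not the kernel's):

```python
from collections import deque

# Two queue levels, as in blk-mq: per-CPU software queues that stage
# requests, and a hardware queue the driver dispatches from.
sw_queues = {cpu: deque() for cpu in range(2)}   # per-CPU staging
hw_queue = deque()                                # driver-facing

def submit_bio(cpu, bio):
    # In real blk-mq the bio is first converted into a request structure.
    sw_queues[cpu].append({"req": bio})

def flush_to_hw():
    # blk-mq drains the software queues into the hardware queue.
    for q in sw_queues.values():
        while q:
            hw_queue.append(q.popleft())

def driver_dispatch():
    # The NVMe driver pops requests and issues them to the device.
    done = []
    while hw_queue:
        done.append(hw_queue.popleft()["req"])
    return done

submit_bio(0, "bio-a")
submit_bio(1, "bio-b")
flush_to_hw()
print(driver_dispatch())  # ['bio-a', 'bio-b']
```

The real code paths add merging, tagging, and completion callbacks on top of this basic staging-and-dispatch structure.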
This document discusses KVM virtualization and why it is considered the best platform. It states that KVM provides high performance, strong security through EAL4+ certification and SELinux, and can save customers up to 70% on costs compared to other solutions. It also supports various operating systems and works with Red Hat products like OpenStack and Red Hat Enterprise Virtualization for managing virtualization. Charts are included showing KVM outperforming VMware on benchmark tests using different CPU core counts.
Introduction to Linux kernel TCP/IP protocol stack - monad bobo
This document provides an introduction and overview of the networking code in the Linux kernel source tree. It discusses the different layers including link (L2), network (L3), and transport (L4) layers. It describes the input and output processing, device interfaces, traffic directions, and major developers for each layer. Config and benchmark tools are also mentioned. Resources for further learning about the Linux kernel networking code are provided at the end.
LCU13: An Introduction to ARM Trusted FirmwareLinaro
Resource: LCU13
Name: An Introduction to ARM Trusted Firmware
Date: 28-10-2013
Speaker: Andrew Thoelke
Video: http://www.youtube.com/watch?v=q32BEMMxmfw
Perf is a Linux profiler tool that uses performance monitoring hardware to count various events like CPU cycles, instructions, and cache misses. It can count events for a single thread, entire process, specific CPUs, or system-wide. Perf stat is used to count events during process execution, while perf record collects profiling data in a file for later analysis with perf report.
Kernel Recipes 2019 - Suricata and XDPAnne Nicolas
Suricata is a network threat detection engine that captures network packets to reconstruct traffic up to the application layer and finds threats using rules that define the behavior to detect. This task is very CPU intensive, and discarding uninteresting traffic is one way to scale Suricata to 40 Gbps and beyond.
This talk presents the latest evolution of Suricata, which now uses eBPF and XDP to bypass traffic. Suricata 5.0 supports hardware XDP to provide bypass with network cards such as Netronome's. It also takes advantage of pinned maps to make bypassed flows persistent. The talk covers the different uses of XDP and eBPF in Suricata and shows how they impact performance and usability. If development time permits, the talk will also cover AF_XDP and the impact of this new capture method on Suricata.
Eric Leblond
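The flow-bypass idea behind this talk is simple: once Suricata decides a flow is uninteresting, its 5-tuple is installed in a map that the fast path consults before sending packets up. A Python sketch of that decision logic (in real deployments the map is a pinned eBPF hash map checked at the XDP hook; all names here are invented):

```python
# Stand-in for a pinned eBPF map keyed by the flow 5-tuple.
bypass_map = set()

def flow_key(src, sport, dst, dport, proto):
    return (src, sport, dst, dport, proto)

def xdp_hook(pkt):
    """Return 'DROP' for bypassed flows, 'PASS' to hand the packet up."""
    key = flow_key(pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"], pkt["proto"])
    return "DROP" if key in bypass_map else "PASS"

pkt = {"src": "10.0.0.1", "sport": 4242,
       "dst": "10.0.0.2", "dport": 443, "proto": 6}

first = xdp_hook(pkt)            # flow not yet bypassed: packet goes to Suricata
bypass_map.add(flow_key(**pkt))  # Suricata decides to stop inspecting this flow
later = xdp_hook(pkt)            # subsequent packets are dropped at the hook
```

Pinning the map (as Suricata 5.0 does) means the bypass entries survive a restart of the userspace engine, which is exactly the persistence the abstract mentions.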
On demand recording: https://www.nginx.com/resources/webinars/nginx-http2-server-push-grpc/
We discuss new NGINX support for HTTP/2 server push and proxying gRPC traffic.
Check out this webinar to learn:
- About NGINX HTTP/2 support
- How to use HTTP/2 server push with NGINX
- How to proxy gRPC traffic using NGINX
- How to configure both features, with live demonstrations
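A minimal configuration sketch combining the two features; `http2_push` and `grpc_pass` are real NGINX directives, while the paths, upstream address, and location names are placeholders:

```nginx
server {
    listen 443 ssl http2;

    location = /index.html {
        # HTTP/2 server push: send the stylesheet alongside the page
        http2_push /assets/style.css;
    }

    location /helloworld.Greeter {
        # Proxy gRPC traffic to a backend gRPC server
        grpc_pass grpc://127.0.0.1:50051;
    }
}
```

A working server block additionally needs `ssl_certificate`/`ssl_certificate_key` (or plaintext gRPC on a separate listener); the webinar walks through the full setup.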
KFServing - Serverless Model Inferencing - Animesh Singh
Deep dive into KFServing, a serverless model inferencing platform built on top of Knative and Istio. It is part of the Kubeflow project and deployed in production across organizations.
The document discusses various data structures and functions related to network packet processing in the Linux kernel socket layer. It describes the sk_buff structure that is used to pass packets between layers. It also explains the net_device structure that represents a network interface in the kernel. When a packet is received, the interrupt handler will raise a soft IRQ for processing. The packet will then traverse various protocol layers like IP and TCP to be eventually delivered to a socket and read by a userspace application.
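The layer traversal described above can be pictured as each layer peeling its header off the front of the packet, much as an sk_buff's data pointer advances from L2 to L4. A toy Python walk using `struct` (illustrative only, not kernel code):

```python
import struct

def parse_eth(frame):
    """Peel the 14-byte Ethernet header; return (ethertype, L3 payload)."""
    ethertype = struct.unpack("!H", frame[12:14])[0]
    return ethertype, frame[14:]

def parse_ip(pkt):
    """Peel the IPv4 header; return (protocol, L4 payload)."""
    ihl = (pkt[0] & 0x0F) * 4   # header length in bytes
    proto = pkt[9]              # 6 = TCP
    return proto, pkt[ihl:]

# Hand-built frame: Ethernet header, minimal 20-byte IPv4 header
# (version/IHL byte 0x45, protocol byte set to TCP), then a fake segment.
frame = (b"\xaa" * 6 + b"\xbb" * 6 + struct.pack("!H", 0x0800)
         + bytes([0x45, 0] + [0] * 7 + [6] + [0] * 10)
         + b"tcp-segment")

ethertype, l3 = parse_eth(frame)   # what the IP layer receives
proto, l4 = parse_ip(l3)           # what the TCP layer receives
```

In the kernel the buffer is not copied at each layer; `skb_pull` just moves the data pointer forward, which is why sk_buff handling is central to the receive path.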
Tutorial: Using GoBGP as an IXP connecting router - Shu Sugimoto
- Show you how GoBGP can be used as a software router in conjunction with quagga
- (Tutorial) Walk through the setup of IXP connecting router using GoBGP
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...) - EUDAT
Stefano will give an introduction to the most common and used programming models for performing parallel I/O on supercomputers. He will first give a broad overview of parallel APIs for programming I/O on supercomputers. He will then introduce MPI I/O, one of the most used programming interfaces for parallel I/O, presenting its basic concepts, providing programming examples and guidelines for achieving high performance I/O on supercomputers.
Visit: https://www.eudat.eu/eudat-summer-school
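A common pattern taught alongside MPI I/O is block decomposition: each rank computes its own byte offset into a shared file and writes independently. A pure-Python stand-in for that offset arithmetic (no mpi4py; the addressing mirrors an `MPI_File_write_at`-style call, and the helper name is invented):

```python
RECORD = 8  # bytes per record in the shared file

def rank_view(rank, nranks, nrecords):
    """Contiguous block decomposition: return (byte_offset, record_count)
    for this rank. The last rank absorbs any remainder."""
    per_rank = nrecords // nranks
    first = rank * per_rank
    count = per_rank + (nrecords % nranks if rank == nranks - 1 else 0)
    return first * RECORD, count

# Rank 2 of 4 over 10 records: owns records 4-5, starting at byte 32.
offset, count = rank_view(rank=2, nranks=4, nrecords=10)
print(offset, count)  # 32 2
```

With MPI I/O, each rank would seek to its computed offset (or use an explicit-offset write) so all ranks write disjoint regions of one file concurrently.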
Advanced Scalable Decomposition Method with MPICH Environment for HPC - IJSRD
MPI (Message Passing Interface) has been used effectively in the high performance computing community for years and remains the main programming model. MPI is widely used to develop parallel programs on computing systems such as clusters, and as a major component of high performance computing (HPC) environments it is becoming increasingly prevalent. MPI implementations typically equate an MPI process with an OS process, resulting in a decomposition technique where MPI processes are bound to physical cores. The integrated approach described here makes it possible to add more concurrency than the available parallelism while minimizing the overheads related to context switches, scheduling, and synchronization: it uses fibers to support multiple MPI processes inside one operating system process. There are three widely used MPI libraries: Open MPI, MPICH2, and MVAPICH2. This paper works on these decomposition techniques and integrates the MPI environment using MPICH2.
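The fiber idea from this paper, several "MPI processes" living inside one OS process as cooperatively scheduled tasks, can be sketched with Python generators standing in for fibers (a conceptual sketch under that analogy; all names are invented):

```python
def rank_body(rank, log):
    """One 'MPI process' as a fiber: each yield is a switch point,
    e.g. where the rank would block waiting on communication."""
    log.append(f"rank {rank}: compute")
    yield
    log.append(f"rank {rank}: communicate")
    yield
    log.append(f"rank {rank}: done")

def run_fibers(nranks):
    """Round-robin scheduler: switching fibers costs a generator resume,
    not an OS context switch -- the overhead the paper aims to avoid."""
    log = []
    fibers = [rank_body(r, log) for r in range(nranks)]
    while fibers:
        alive = []
        for f in fibers:
            try:
                next(f)
                alive.append(f)
            except StopIteration:
                pass
        fibers = alive
    return log

log = run_fibers(2)
```

Real fiber-based MPI runtimes switch stacks in user space at communication calls; generators merely make the cooperative-scheduling structure visible.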
Do you think of cheetahs not RabbitMQ when you hear the word Swift? Think a Nova is just a giant exploding star, not a cloud compute engine. This deck (presented at the OpenStack Boston meetup) provides introduction will answer your many questions. It covers the basic components including: Nova, Swift, Cinder, Keystone, Horizon and Glance.
Linux Containers(LXC) allow running multiple isolated Linux instances (containers) on the same host.
Containers share the same kernel with anything else that is running on it, but can be constrained to only use a defined amount of resources such as CPU, memory or I/O.
A container is a way to isolate a group of processes from the others on a running Linux system.
Pushing Packets - How do the ML2 Mechanism Drivers Stack UpJames Denton
Architecting a private cloud to meet the use cases of its users can be a daunting task. How do you determine which of the many L2/L3 Neutron plugins and drivers to implement? Does network performance outweigh reliability? Are overlay networks just as performant as VLAN networks? The answers to these questions will drive the appropriate technology choice.
In this presentation, we will look at many of the common drivers built around the ML2 framework, including LinuxBridge, OVS, OVS+DPDK, SR-IOV, and more, and will provide performance data to help drive decisions around selecting a technology that's right for the situation. We will discuss our experience with some of these technologies, and the pros and cons of one technology over another in a production environment.
Simplifying Your IT Workflow with Katello and ForemanNikhil Kathole
This document discusses how Foreman and Katello can simplify IT workflows by managing infrastructure lifecycles, contents, and configurations. Foreman allows provisioning of various environments and operating systems. Katello adds content management capabilities like repositories and package updates. Ansible roles can be deployed and scheduled through Foreman for configuration management. OpenSCAP is also integrated for security compliance and vulnerability assessments. The presenter takes questions at the end on using Foreman and its future integrations.
Cisco's journey from Verbs to LibfabricJeff Squyres
This document summarizes Cisco's transition from using the Verbs API to using the Libfabric API for their usNIC network interface card. The Verbs API has limitations that make it difficult to support Ethernet features. Libfabric addresses these issues and more closely matches Cisco's hardware. Performance tests show Libfabric outperforming Verbs. Open MPI was adapted to support Libfabric through new plugins. This allows Libfabric to be used for both provider-specific and portable communication, benefiting MPI implementations. Cisco believes Libfabric is the best path forward as it matches their hardware, has performance benefits, and features MPI implementations have wanted.
This document provides an overview of Open Fabric Interfaces (OFI), which are high-performance APIs that allow applications to access fabric hardware with low latency. OFI defines interfaces for control, communication, completion, and data transfer services. It has an object-oriented design with objects like fabrics, domains, endpoints, and memory regions. OFI supports both connected and connectionless communication models and is agnostic to the underlying networking protocols and hardware implementations.
Netronome's half-day tutorial on host data plane acceleration at ACM SIGCOMM 2018 introduced attendees to models for host data plane acceleration and provided an in-depth understanding of SmartNIC deployment models at hyperscale cloud vendors and telecom service providers.
Presenter Bios
Jakub Kicinski is a long term Linux kernel contributor, who has been leading the kernel team at Netronome for the last two years. Jakub’s major contributions include the creation of BPF hardware offload mechanisms in the kernel and bpftool user space utility, as well as work on the Linux kernel side of OVS offload.
David Beckett is a Software Engineer at Netronome with a strong technical background of computer networks including academic research with DDoS. David has expertise in the areas of Linux architecture and computer programming. David has a Masters Degree in Electrical, Electronic Engineering at Queen’s University Belfast and continues as a PhD student studying Emerging Application Layer DDoS threats.
The document provides an overview of a presentation on developing the OpenFabrics Interfaces libfabric. It discusses design guidelines for libfabric, including optimizing the software path to hardware, being scalable, implementation agnostic, and open source. It then covers libfabric architecture, including modes, capabilities, object model, endpoint types, address vectors, data transfer types, and API bootstrap. It provides examples of how Gasnet usage maps libfabric objects and initialization, address exchange, and using multi-receive buffers. The presentation aims to provide a lightning fast introduction to libfabric version 1.5.
This document summarizes and compares the function call flow of the Linux NVMe driver on legacy versions and newer versions that use blk-mq.
In legacy versions before 3.19, bios bypass the block layer and are directly submitted to the NVMe driver. In versions 3.19 and later, bios go through blk-mq and are converted to requests that are queued to the NVMe driver.
The document outlines the key steps in the function call flow for legacy versions, which include converting bios to iods and submitting them, as well as setting up callback information. It then describes the similar but modified process for blk-mq versions, which includes building requests from bios, flushing requests between software queues, and
This document discusses KVM virtualization and why it is considered the best platform. It states that KVM provides high performance, strong security through EAL4+ certification and SE Linux, and can save customers up to 70% on costs compared to other solutions. It also supports various operating systems and works with Red Hat products like OpenStack and Red Hat Enterprise Virtualization for managing virtualization. Charts are included showing KVM outperforming VMware on benchmark tests using different CPU core counts.
introduction to linux kernel tcp/ip ptocotol stack monad bobo
This document provides an introduction and overview of the networking code in the Linux kernel source tree. It discusses the different layers including link (L2), network (L3), and transport (L4) layers. It describes the input and output processing, device interfaces, traffic directions, and major developers for each layer. Config and benchmark tools are also mentioned. Resources for further learning about the Linux kernel networking code are provided at the end.
LCU13: An Introduction to ARM Trusted FirmwareLinaro
Resource: LCU13
Name: An Introduction to ARM Trusted Firmware
Date: 28-10-2013
Speaker: Andrew Thoelke
Video: http://www.youtube.com/watch?v=q32BEMMxmfw
Perf is a Linux profiler tool that uses performance monitoring hardware to count various events like CPU cycles, instructions, and cache misses. It can count events for a single thread, entire process, specific CPUs, or system-wide. Perf stat is used to count events during process execution, while perf record collects profiling data in a file for later analysis with perf report.
Kernel Recipes 2019 - Suricata and XDPAnne Nicolas
Suricata is a network threat detection engine using network packets capture to reconstruct the traffic till the application layer and find threats on the network using rules that define behavior to detect. This task is really CPU intensive and discarding non interesting traffic is a solution to enable a scaling of Suricata to 40gbps and other.
This talk will present the latest evolution of Suricata that knows uses eBPF and XDP to bypass traffic. Suricata 5.0 is supporting the hardware XDP to provide ypass with network card such as Netronome. It also takes advantage of pinned maps to get persistance of the bypassed flows. This talk will cover the different usage of XDP and eBPF in Suricata and shows how it impact performance and usability. If development time permit, the talk will also cover AF_XDP and the impact on this new capture method on Suricata.
Eric Leblond
On demand recording: https://www.nginx.com/resources/webinars/nginx-http2-server-push-grpc/
We discuss new NGINX support for HTTP/2 server push and proxying gRPC traffic.
Check out this webinar to learn:
- About NGINX HTTP/2 support
- How to use HTTP/2 server push with NGINX
- How to proxy gRPC traffic using NGINX
- How to configure both features, with live demonstrations
KFServing - Serverless Model InferencingAnimesh Singh
Deep dive into KFServing: Serverless Model Inferencing Platform built on top of KNative and Istio. Part of the Kubeflow project, and deployed in production across organizations.
The document discusses various data structures and functions related to network packet processing in the Linux kernel socket layer. It describes the sk_buff structure that is used to pass packets between layers. It also explains the net_device structure that represents a network interface in the kernel. When a packet is received, the interrupt handler will raise a soft IRQ for processing. The packet will then traverse various protocol layers like IP and TCP to be eventually delivered to a socket and read by a userspace application.
Tutorial: Using GoBGP as an IXP connecting router (Shu Sugimoto)
- Shows how GoBGP can be used as a software router in conjunction with Quagga
- (Tutorial) Walks through the setup of an IXP connecting router using GoBGP
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi..., EUDAT)
Stefano will give an introduction to the most common and widely used programming models for performing parallel I/O on supercomputers. He will first give a broad overview of parallel APIs for programming I/O on supercomputers, then introduce MPI I/O, one of the most used programming interfaces for parallel I/O, presenting its basic concepts and providing programming examples and guidelines for achieving high-performance I/O on supercomputers.
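MPI I/O itself is a C/Fortran API, but its core pattern can be sketched in a language-neutral way: every process writes its own block of a shared file at an offset derived from its rank, with no coordination in the data path. In real MPI this is `MPI_File_open` / `MPI_File_write_at`; in the sketch below plain `seek`/`write` stands in, and a loop emulates the ranks that would each run in their own process:

```python
# Sketch of the basic MPI I/O pattern: independent writes at rank-derived
# offsets into one shared file (analogue of MPI_File_write_at).
import os
import tempfile

BLOCK = 8  # bytes written per rank

def write_at(path, rank, payload):
    """Analogue of MPI_File_write_at: independent write at the rank's offset."""
    assert len(payload) == BLOCK
    with open(path, "r+b") as f:
        f.seek(rank * BLOCK)
        f.write(payload)

path = os.path.join(tempfile.mkdtemp(), "shared.dat")
with open(path, "wb") as f:
    f.truncate(4 * BLOCK)  # pre-size the shared file for 4 "ranks"

# Emulate 4 ranks; with MPI each iteration would run in a separate process.
for rank in range(4):
    write_at(path, rank, bytes([rank]) * BLOCK)

with open(path, "rb") as f:
    data = f.read()
assert data == b"".join(bytes([r]) * BLOCK for r in range(4))
```

The point of the real API is that these writes can proceed concurrently and the MPI library can optimize them collectively; the sketch only shows the offset arithmetic.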
Visit: https://www.eudat.eu/eudat-summer-school
Advanced Scalable Decomposition Method with MPICH Environment for HPC (IJSRD)
MPI (Message Passing Interface) has been used effectively in the high performance computing community for years and remains its main programming model. MPI is widely used to develop parallel programs on computing systems such as clusters, and as a major component of the high performance computing (HPC) environment it is becoming increasingly prevalent. MPI implementations typically equate an MPI process with an OS process, resulting in a decomposition technique where MPI processes are bound to physical cores. An integrated approach makes it possible to expose more concurrency than the available parallelism while minimizing the overheads related to context switches, scheduling, and synchronization; it uses fibers to support multiple MPI processes inside a single operating system process. There are three widely used MPI libraries: Open MPI, MPICH2, and MVAPICH2. This paper works on decomposition techniques and integrates the MPI environment using MPICH2.
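The fiber idea mentioned above can be sketched with Python generators: several logical "MPI processes" live inside one OS process and yield at their communication points, so switching between them is a cheap user-space operation rather than an OS context switch. This is only a conceptual model, not how any MPI library is actually implemented:

```python
# Fibers as generators: a round-robin user-space scheduler multiplexes
# several logical processes inside one OS process.
from collections import deque

def logical_process(rank, log):
    """A fiber: runs, 'communicates' (yields), then resumes."""
    log.append(f"rank {rank}: compute")
    yield                      # would block in MPI_Recv; here we just yield
    log.append(f"rank {rank}: resume")

log = []
ready = deque(logical_process(r, log) for r in range(3))

# Scheduler loop: no OS threads, no kernel involvement on a switch.
while ready:
    fiber = ready.popleft()
    try:
        next(fiber)
        ready.append(fiber)    # still alive: requeue after its yield
    except StopIteration:
        pass                   # fiber finished

assert log == ["rank 0: compute", "rank 1: compute", "rank 2: compute",
               "rank 0: resume", "rank 1: resume", "rank 2: resume"]
```

All three "ranks" make progress interleaved in a single OS process, which is the property the paper's decomposition technique relies on.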
This is the material of a Burst Buffer training that was presented for the early users program. It covers an introduction to parallel I/O, Lustre, the Darshan tool, Burst Buffer, and optimization parameters for MPI I/O.
Full video: https://youtu.be/8zLcZmiTweg
The document proposes a scalable AI accelerator ASIC platform for edge AI processing. It describes a high-level architecture based on a scalable AI compute fabric that allows for fast learning and inference. The architecture is flexible and can scale from single-chip solutions to multi-chip solutions connected via high-speed interfaces. It also provides details on the AI compute fabric, processing elements, and how the platform could enable high-performance edge AI processing.
The document describes an IBM workshop on CAPI and OpenCAPI technologies. It provides an overview of FPGA acceleration using SNAP, including how SNAP simplifies FPGA programming using a C/C++ based approach. Examples of use cases for FPGA acceleration like video processing and machine learning inference are also presented.
This document summarizes the key components of a digital media production infrastructure, including the network, hard disks, and file systems.
The network must support high throughput without packet loss or delays to support real-time media workflows. Point-to-point switched Ethernet and layer 3 routing address limitations of shared Ethernet.
Hard disks are most efficient with large sequential access rather than small random access. Disk striping and intelligent file systems can reduce random access.
File systems must support non-blocking access across many clients and disks without oversubscription to ensure resources are available for media production needs.
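The disk-striping point above comes down to a simple address mapping: consecutive logical blocks land on different disks, so one large sequential read keeps every spindle busy at once. A toy model (stripe unit of one block, round-robin layout assumed):

```python
def stripe(logical_block, num_disks):
    """Map a logical block to (disk index, block offset on that disk)."""
    return logical_block % num_disks, logical_block // num_disks

# Four consecutive logical blocks on a 4-disk array hit four different
# disks, turning one disk's sequential throughput into four disks' worth.
assert [stripe(b, 4) for b in range(4)] == [(0, 0), (1, 0), (2, 0), (3, 0)]
assert stripe(5, 4) == (1, 1)  # next stripe: second block on disk 1
```

Real file systems layer larger stripe units and redundancy on top of this, but the load-spreading effect is the same.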
carrow - Go bindings to Apache Arrow via C++ API (Yoni Davidson)
Apache Arrow is a cross-language development platform for in-memory data that specifies a standardized columnar memory format. It provides libraries and messaging for moving data between languages and services without serialization. The presenter discusses their motivation for creating Go bindings for Apache Arrow via C++ to share data between Go and Python programs using the same memory format. They explain several challenges of this approach, such as different memory managers in Go and C++, and solutions like generating wrapper code and handling memory with finalizers.
pythonOCC presentation at the 11th NASA-ESA Workshop on Product Data Exchange (http://step.nasa.gov/pde2009/).
Abstract: The combination of open-source software and open standards is considered a key success factor for further technology development. Interest in the STEP standard has grown in recent years, since the release of Application Protocols makes it possible to foresee new software solutions able to ensure a consistent product data description covering the whole lifecycle. This presentation focuses on the features of a free 3D modeling/data exchange library in the context of a research work dealing with the STEP AP239 standard for PDM/ERP interoperability. The goal is to merge these two approaches and describe the product from design to manufacturing, with a granularity that allows keeping control over product configuration.
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We... (Flink Forward)
At Lyft we dynamically price our rides with a combination of various data sources, machine learning models, and streaming infrastructure for low latency, reliability and scalability. Dynamic pricing allows us to quickly adapt to real-world changes and be fair to drivers (for example, by raising rates when there is a lot of demand) and fair to passengers (for example, by offering a return 10 minutes later at a cheaper rate). To accomplish this, our system consumes a massive amount of events from different sources.
The streaming platform powers pricing by bringing together the best of two worlds using Apache Beam: ML algorithms in Python/TensorFlow and Apache Flink as the streaming engine. Enabling data science tools for machine learning and a process that allows for faster deployment is of growing importance for the business. Topics covered in this talk include:
* Examples of dynamic pricing based on real-time event streams, including driver location, ride requests, user session events, and machine learning models
* Comparison of the legacy system and the new streaming platform for dynamic pricing
* Processing live events in real time to generate features for machine learning models
* Overview of streaming platform architecture and technology stack
* The Apache Beam portability framework as a bridge to distributed execution on a JVM-based streaming engine without a code rewrite
* Lessons learned
Streaming your Lyft Ride Prices - Flink Forward SF 2019 (Thomas Weise)
At Lyft we dynamically price our rides with a combination of various data sources, machine learning models, and streaming infrastructure for low latency, reliability and scalability. Dynamic pricing allows us to quickly adapt to real-world changes and be fair to drivers (for example, by raising rates when there is a lot of demand) and fair to passengers (for example, by offering a return 10 minutes later at a cheaper rate). The streaming platform powers pricing by bringing together the best of two worlds using Apache Beam: ML algorithms in Python and Apache Flink as the streaming engine.
https://sf-2019.flink-forward.org/conference-program#streaming-your-lyft-ride-prices
Preparing to program Aurora at Exascale - Early experiences and future direct... (inside-BigData.com)
In this deck from IWOCL / SYCLcon 2020, Hal Finkel from Argonne National Laboratory presents: Preparing to program Aurora at Exascale - Early experiences and future directions.
"Argonne National Laboratory’s Leadership Computing Facility will be home to Aurora, our first exascale supercomputer. Aurora promises to take scientific computing to a whole new level, and scientists and engineers from many different fields will take advantage of Aurora’s unprecedented computational capabilities to push the boundaries of human knowledge. In addition, Aurora’s support for advanced machine-learning and big-data computations will enable scientific workflows incorporating these techniques along with traditional HPC algorithms. Programming the state-of-the-art hardware in Aurora will be accomplished using state-of-the-art programming models. Some of these models, such as OpenMP, are long-established in the HPC ecosystem. Other models, such as Intel’s oneAPI, based on SYCL, are relatively-new models constructed with the benefit of significant experience. Many applications will not use these models directly, but rather, will use C++ abstraction libraries such as Kokkos or RAJA. Python will also be a common entry point to high-performance capabilities. As we look toward the future, features in the C++ standard itself will become increasingly relevant for accessing the extreme parallelism of exascale platforms.
This presentation will summarize the experiences of our team as we prepare for Aurora, exploring how to port applications to Aurora’s architecture and programming models, and distilling the challenges and best practices we’ve developed to date. oneAPI/SYCL and OpenMP are both critical models in these efforts, and while the ecosystem for Aurora has yet to mature, we’ve already had a great deal of success. Importantly, we are not passive recipients of programming models developed by others. Our team works not only with vendor-provided compilers and tools, but also develops improved open-source LLVM-based technologies that feed both open-source and vendor-provided capabilities. In addition, we actively participate in the standardization of OpenMP, SYCL, and C++. To conclude, I’ll share our thoughts on how these models can best develop in the future to support exascale-class systems."
Watch the video: https://wp.me/p3RLHQ-lPT
Learn more: https://www.iwocl.org/iwocl-2020/conference-program/
and
https://www.anl.gov/topic/aurora
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77) (Igalia)
By Andy Wingo.
Snabb is an open-source toolkit for building fast, flexible network functions. Since its beginnings in 2012, Snabb has seen some modest deployment success ranging from simple one-off diagnosis tools to border routers that process all IPv4 traffic for entire countries. This talk will give an introduction to Snabb. After going over Snabb's fundamental components and how they combine, the talk will move on to examples of how network engineers are taking advantage of Snabb in practice, mentioning a few of the many open-source network functions built on Snabb.
(c) RIPE 77
15 - 19 October 2018
Amsterdam, Netherlands
https://ripe77.ripe.net
The goal of the project “An optic’s life” is to predict when an optical transceiver will reach its real end of life, based on the actual setup in the datacenter / colocation.
The road ahead for scientific computing with Python (Ralf Gommers)
1) The PyData ecosystem, including NumPy, SciPy, and scikit-learn, faces technical challenges related to fragmentation of array libraries, lack of parallelism, packaging constraints, and performance issues for non-vectorized algorithms.
2) There are also social challenges around sustainability of key projects due to limited funding and maintainers, tensions with proprietary involvement from large tech companies, and academia's role in supporting open-source scientific software.
3) NumPy is working to address these issues through efforts like the Array API standardization, improved extensibility and performance, and growing autonomous teams and diversity within the community.
Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink (DataWorks Summit)
The amount of information coming from a Cloud deployment that could be used to improve situational awareness and operate it efficiently is huge. Tools such as those provided by the Apache Software Foundation can be used to build a solution to that challenge.
Nowadays Cloud deployments are pervasive in businesses, with scalability and multi-tenancy as their core capabilities. This means that these deployments can easily grow beyond 1000 nodes, and efficient operation of these huge clusters requires real-time analysis of logs, metrics, events and configuration data. Performing correlation and finding patterns, not just to get to root causes but also to predict failures and reduce risk, requires tools that go beyond current solutions.
In the prototype developed by Red Hat and KEEDIO (keedio.com), we managed to address the above challenges with the use of Big Data tools like Apache NiFi, Apache Kafka and Apache Flink, that enabled us to process the constant stream of syslog messages (RFC5424) produced by the Infrastructure as a Service, provided by OpenStack services, and also detect common failure patterns that could arise and generate alerts as needed.
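As an illustration of the input side of such a pipeline, an RFC 5424 syslog message has a fixed header (priority, version, timestamp, hostname, app name, process id, message id) before the structured data and free-form message, so a first-stage parser can be little more than a split. The sketch below is deliberately simplified and assumes the structured-data element is the nil value `-`; a production parser must also handle `[...]` structured data and escaping:

```python
def parse_rfc5424(line):
    """Very simplified RFC 5424 parser: header fields only, '-' means nil.
    Assumes structured data is the single nil element '-'."""
    pri_end = line.index(">")
    pri = int(line[1:pri_end])                      # "<34>" -> 34
    rest = line[pri_end + 1:]
    version, timestamp, host, app, procid, msgid, sd, msg = rest.split(" ", 7)
    return {
        "facility": pri // 8, "severity": pri % 8,  # PRI = facility*8+severity
        "version": int(version), "timestamp": timestamp,
        "host": host, "app": app, "procid": procid, "msgid": msgid,
        "msg": msg,
    }

rec = parse_rfc5424(
    "<34>1 2020-01-15T22:14:15.003Z web01 sshd 4321 ID47 - authentication failure"
)
assert rec["facility"] == 4 and rec["severity"] == 2   # 34 = 4*8 + 2
assert rec["app"] == "sshd" and rec["msg"] == "authentication failure"
```

In the prototype described above this kind of record would be produced by NiFi and carried over Kafka before pattern detection in Flink.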
This session is an intermediate talk in our Apache NiFi and Data Science track. It focuses on Apache Flink, Apache NiFi, and Apache Kafka and is geared toward Architect, Data Scientist, Data Analyst, and Developer/Engineer audiences.
Speaker
Miguel Perez Colino, Senior Design Product Manager, Red Hat
Suneel Marthi, Senior Principal Engineer, Red Hat
The document discusses the top 5 technologies that all organizations must understand: digital transformation, quantum computing, IoT, 5G, and AI/HPC. It provides an overview of each technology including opportunities and threats to organizations. The document emphasizes that understanding these emerging technologies is mandatory as the information revolution changes many aspects of life and business.
In this deck, Greg Wahl from Advantech presents: Transforming Private 5G Networks.
Advantech Networks & Communications Group is driving innovation in next-generation network solutions with their High Performance Servers. We provide business critical hardware to the world's leading telecom and networking equipment manufacturers with both standard and customized products. Our High Performance Servers are highly configurable platforms designed to balance the best in x86 server-class processing performance with maximum I/O and offload density. The systems are cost effective, highly available and optimized to meet next generation networking and media processing needs.
“Advantech’s Networks and Communication Group has been both an innovator and trusted enabling partner in the telecommunications and network security markets for over a decade, designing and manufacturing products for OEMs that accelerate their network platform evolution and time to market,” said Advantech Vice President of Networks & Communications Group, Ween Niu. “In the new IP Infrastructure era, we will be expanding our expertise in Software Defined Networking (SDN) and Network Function Virtualization (NFV), two of the essential conduits to 5G infrastructure agility, making networks easier to install, secure, automate and manage in a cloud-based infrastructure.”
In addition to innovation in air interface technologies and architecture extensions, 5G will also need a new generation of network computing platforms to run the emerging software defined infrastructure, one that provides greater topology flexibility, essential to deliver on the promises of high availability, high coverage, low latency and high bandwidth connections. This will open up new parallel industry opportunities through dedicated 5G network slices reserved for specific industries dedicated to video traffic, augmented reality, IoT, connected cars etc. 5G unlocks many new doors and one of the keys to its enablement lies in the elasticity and flexibility of the underlying infrastructure.
Advantech’s corporate vision is to enable an intelligent planet. The company is a global leader in the fields of IoT intelligent systems and embedded platforms. To embrace the trends of IoT, big data, and artificial intelligence, Advantech promotes IoT hardware and software solutions with the Edge Intelligence WISE-PaaS core to assist business partners and clients in connecting their industrial chains. Advantech is also working with business partners to co-create business ecosystems that accelerate the goal of industrial intelligence.
Watch the video: https://wp.me/p3RLHQ-lPQ
* Company website: https://www.advantech.com/
* Solution page: https://www2.advantech.com/nc/newsletter/NCG/SKY/benefits.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The Incorporation of Machine Learning into Scientific Simulations at Lawrence... (inside-BigData.com)
In this deck from the Stanford HPC Conference, Katie Lewis from Lawrence Livermore National Laboratory presents: The Incorporation of Machine Learning into Scientific Simulations at Lawrence Livermore National Laboratory.
"Scientific simulations have driven computing at Lawrence Livermore National Laboratory (LLNL) for decades. During that time, we have seen significant changes in hardware, tools, and algorithms. Today, data science, including machine learning, is one of the fastest growing areas of computing, and LLNL is investing in hardware, applications, and algorithms in this space. While the use of simulations to focus and understand experiments is well accepted in our community, machine learning brings new challenges that need to be addressed. I will explore applications for machine learning in scientific simulations that are showing promising results and further investigation that is needed to better understand its usefulness."
Watch the video: https://youtu.be/NVwmvCWpZ6Y
Learn more: https://computing.llnl.gov/research-area/machine-learning
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod... (inside-BigData.com)
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: How to Achieve High-Performance, Scalable and Distributed DNN Training on Modern HPC Systems?
"This talk will start with an overview of challenges being faced by the AI community to achieve high-performance, scalable and distributed DNN training on Modern HPC systems with both scale-up and scale-out strategies. After that, the talk will focus on a range of solutions being carried out in my group to address these challenges. The solutions will include: 1) MPI-driven Deep Learning, 2) Co-designing Deep Learning Stacks with High-Performance MPI, 3) Out-of-core DNN training, and 4) Hybrid (Data and Model) parallelism. Case studies to accelerate DNN training with popular frameworks like TensorFlow, PyTorch, MXNet and Caffe on modern HPC systems will be presented."
Watch the video: https://youtu.be/LeUNoKZVuwQ
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ... (inside-BigData.com)
In this deck from the Stanford HPC Conference, Nick Nystrom and Paola Buitrago provide an update from the Pittsburgh Supercomputing Center.
Nick Nystrom is Chief Scientist at the Pittsburgh Supercomputing Center (PSC). Nick is architect and PI for Bridges, PSC's flagship system that successfully pioneered the convergence of HPC, AI, and Big Data. He is also PI for the NIH Human Biomolecular Atlas Program’s HIVE Infrastructure Component and co-PI for projects that bring emerging AI technologies to research (Open Compass), apply machine learning to biomedical data for breast and lung cancer (Big Data for Better Health), and identify causal relationships in biomedical big data (the Center for Causal Discovery, an NIH Big Data to Knowledge Center of Excellence). His current research interests include hardware and software architecture, applications of machine learning to multimodal data (particularly for the life sciences) and to enhance simulation, and graph analytics.
Watch the video: https://youtu.be/LWEU1L1o7yY
Learn more: https://www.psc.edu/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses using systems intelligence and artificial intelligence/neural networks to enhance semiconductor electronic design automation (EDA) workflows by collecting telemetry data from EDA jobs and infrastructure and analyzing it using complex event processing, machine learning models, and messaging substrates to provide insights that could optimize EDA pipelines and infrastructure. The approach aims to allow both internal and external augmentation of EDA processes and environments through unsupervised and incremental learning.
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring (inside-BigData.com)
In this deck from the Stanford HPC Conference, Nicole Xu from Stanford University describes how she transformed a common jellyfish into a bionic creature that is part animal and part machine.
"Animal locomotion and bioinspiration have the potential to expand the performance capabilities of robots, but current implementations are limited. Mechanical soft robots leverage engineered materials and are highly controllable, but these biomimetic robots consume more power than corresponding animal counterparts. Biological soft robots from a bottom-up approach offer advantages such as speed and controllability but are limited to survival in cell media. Instead, biohybrid robots that comprise live animals and self-contained microelectronic systems leverage the animals’ own metabolism to reduce power constraints and the body as a natural scaffold with damage tolerance. We demonstrate that by integrating onboard microelectronics into live jellyfish, we can enhance propulsion up to threefold, using only 10 mW of external power input to the microelectronics and at only a twofold increase in cost of transport to the animal. This robotic system uses 10 to 1000 times less external power per mass than existing swimming robots in the literature and can be used in future applications for ocean monitoring to track environmental changes."
Watch the video: https://youtu.be/HrmJFyvInj8
Learn more: https://sanfrancisco.cbslocal.com/2020/02/05/stanford-research-project-common-jellyfish-bionic-sea-creatures/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Stanford HPC Conference, Peter Dueben from the European Centre for Medium-Range Weather Forecasts (ECMWF) presents: Machine Learning for Weather Forecasts.
"I will present recent studies that use deep learning to learn the equations of motion of the atmosphere, to emulate model components of weather forecast models and to enhance usability of weather forecasts. I will then talk about the main challenges for the application of deep learning in cutting-edge weather forecasts and suggest approaches to improve usability in the future."
Peter is contributing to the development and optimization of weather and climate models for modern supercomputers. He is focusing on a better understanding of model error and model uncertainty, on the use of reduced numerical precision that is optimised for a given level of model error, on global cloud-resolving simulations with ECMWF's forecast model, and on the use of machine learning, in particular deep learning, to improve the workflow and predictions. Peter graduated in Physics and wrote his PhD thesis at the Max Planck Institute for Meteorology in Germany. He worked as a Postdoc with Tim Palmer at the University of Oxford and took up a position as University Research Fellow of the Royal Society at the European Centre for Medium-Range Weather Forecasts (ECMWF) in 2017.
Watch the video: https://youtu.be/ks3fkRj8Iqc
Learn more: https://www.ecmwf.int/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Gilad Shainer from the HPC AI Advisory Council describes how this organization fosters innovation in the high performance computing community.
"The HPC-AI Advisory Council’s mission is to bridge the gap between high-performance computing (HPC) and Artificial Intelligence (AI) use and its potential, bring the beneficial capabilities of HPC and AI to new users for better research, education, innovation and product manufacturing, bring users the expertise needed to operate HPC and AI systems, provide application designers with the tools needed to enable parallel computing, and to strengthen the qualification and integration of HPC and AI system products."
Watch the video: https://wp.me/p3RLHQ-lNz
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Today RIKEN in Japan announced that the Fugaku supercomputer will be made available for research projects aimed at combating COVID-19.
"Fugaku is currently being installed and is scheduled to be available to the public in 2021. However, faced with the devastating disaster unfolding before our eyes, RIKEN and MEXT decided to make a portion of the computational resources of Fugaku available for COVID-19-related projects ahead of schedule while continuing the installation process.
Fugaku is being developed not only for the progress in science, but also to help build the society dubbed as the “Society 5.0” by the Japanese government, where all people will live safe and comfortable lives. The current initiative to fight against the novel coronavirus is driven by the philosophy behind the development of Fugaku."
Initial Projects
Exploring new drug candidates for COVID-19 by "Fugaku"
Yasushi Okuno, RIKEN / Kyoto University
Prediction of conformational dynamics of proteins on the surface of SARS-Cov-2 using Fugaku
Yuji Sugita, RIKEN
Simulation analysis of pandemic phenomena
Nobuyasu Ito, RIKEN
Fragment molecular orbital calculations for COVID-19 proteins
Yuji Mochizuki, Rikkyo University
In this deck from the Performance Optimisation and Productivity group, Lubomir Riha from IT4Innovations presents: Energy Efficient Computing using Dynamic Tuning.
"We now live in a world of power-constrained architectures and systems and power consumption represents a significant cost factor in the overall HPC system economy. For these reasons, in recent years researchers, supercomputing centers and major vendors have developed new tools and methodologies to measure and optimize the energy consumption of large-scale high performance system installations. Due to the link between energy consumption, power consumption and execution time of an application executed by the final user, it is important for these tools and the methodology used to consider all these aspects, empowering the final user and the system administrator with the capability of finding the best configuration given different high level objectives.
This webinar focused on tools designed to improve the energy-efficiency of HPC applications using a methodology of dynamic tuning of HPC applications, developed under the H2020 READEX project. The READEX methodology has been designed for exploiting the dynamic behaviour of software. At design time, different runtime situations (RTS) are detected and optimized system configurations are determined. RTSs with the same configuration are grouped into scenarios, forming the tuning model. At runtime, the tuning model is used to switch system configurations dynamically.
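The two-phase READEX scheme described above can be modeled in a few lines: design time produces a table from runtime situation (RTS) to its best configuration, RTSs sharing a configuration are grouped into scenarios, and at runtime entering a region simply looks the configuration up and applies it. This is a deliberately simplified sketch; the region names and configuration values are invented, not taken from the project:

```python
# Design time: measured best configuration per runtime situation (RTS).
# Values are invented; in READEX they would be e.g. core/uncore frequencies.
tuning_model = {
    "dense_solver":  {"cpu_freq_ghz": 2.8},   # compute-bound: high clock
    "halo_exchange": {"cpu_freq_ghz": 1.6},   # communication-bound: save power
    "io_phase":      {"cpu_freq_ghz": 1.6},
}

# Scenarios: RTSs that share one configuration are grouped together.
scenarios = {}
for region, cfg in tuning_model.items():
    scenarios.setdefault(tuple(sorted(cfg.items())), []).append(region)

applied = []
def enter_region(name):
    """Runtime: switch the system configuration on region entry."""
    applied.append(tuning_model[name]["cpu_freq_ghz"])

for region in ["dense_solver", "halo_exchange", "dense_solver", "io_phase"]:
    enter_region(region)

assert applied == [2.8, 1.6, 2.8, 1.6]
assert len(scenarios) == 2  # halo_exchange and io_phase share a scenario
```

In the real system the "apply" step changes hardware and system parameters dynamically, which is where the energy savings come from.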
The MERIC tool, which implements the READEX methodology, is presented. It supports manual or binary instrumentation of the analysed applications to simplify the analysis. This instrumentation is used to identify and annotate the significant regions in the HPC application. Automatic binary instrumentation annotates regions with significant runtime. Manual instrumentation, which can be combined with automatic instrumentation, allows the code developer to annotate regions of particular interest.
Watch the video: https://wp.me/p3RLHQ-lJP
Learn more: https://pop-coe.eu/blog/14th-pop-webinar-energy-efficient-computing-using-dynamic-tuning
and
https://code.it4i.cz/vys0053/meric
Sign up for our insideHPC Newsletter: http://insidehpc.com/newslett
The document discusses how DDN A3I storage solutions and Nvidia's SuperPOD platform can enable HPC at scale. It provides details on DDN's A3I appliances that are optimized for AI and deep learning workloads and validated for Nvidia's DGX-2 SuperPOD reference architecture. The solutions are said to deliver the fastest performance, effortless scaling, reliability and flexibility for data-intensive workloads.
In this deck, Paul Isaacs from Linaro presents: State of ARM-based HPC. This talk provides an overview of applications and infrastructure services successfully ported to Aarch64 and benefiting from scale.
"With its debut on the TOP500, the 125,000-core Astra supercomputer at New Mexico’s Sandia Labs uses Cavium ThunderX2 chips to mark Arm’s entry into the petascale world. In Japan, the Fujitsu A64FX Arm-based CPU in the pending Fugaku supercomputer has been optimized to achieve high-level, real-world application performance, anticipating up to one hundred times the application execution performance of the K computer. K was the first computer to top 10 petaflops in 2011."
Watch the video: https://wp.me/p3RLHQ-lIT
Learn more: https://www.linaro.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Versal Premium ACAP for Network and Cloud Acceleration (inside-BigData.com)
Today Xilinx announced Versal Premium, the third series in the Versal ACAP portfolio. The Versal Premium series features highly integrated, networked and power-optimized cores and the industry’s highest bandwidth and compute density on an adaptable platform. Versal Premium is designed for the highest bandwidth networks operating in thermally and spatially constrained environments, as well as for cloud providers who need scalable, adaptable application acceleration.
Versal is the industry’s first adaptive compute acceleration platform (ACAP), a revolutionary new category of heterogeneous compute devices with capabilities that far exceed those of conventional silicon architectures. Developed on TSMC’s 7-nanometer process technology, Versal Premium combines software programmability with dynamically configurable hardware acceleration and pre-engineered connectivity and security features to enable a faster time-to-market. The Versal Premium series delivers up to 3X higher throughput compared to current generation FPGAs, with built-in Ethernet, Interlaken, and cryptographic engines that enable fast and secure networks. The series doubles the compute density of currently deployed mainstream FPGAs and provides the adaptability to keep pace with increasingly diverse and evolving cloud and networking workloads.
Learn more: https://insidehpc.com/2020/03/xilinx-announces-versal-premium-acap-for-network-and-cloud-acceleration/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
In this video from the Rice Oil & Gas Conference, Chin Fang from Zettar presents: Moving Massive Amounts of Data across Any Distance Efficiently.
The objective of this talk is to present two on-going projects aiming at improving and ensuring highly efficient bulk transferring or streaming of massive amounts of data over digital connections across any distance. It examines the current state of the art, a few very common misconceptions, the differences among the three major type of data movement solutions, a current initiative attempting to improve the data movement efficiency from the ground up, and another multi-stage project that shows how to conduct long distance large scale data movement at speed and scale internationally. Both projects have real world motivations, e.g. the ambitious data transfer requirements of Linac Coherent Light Source II (LCLS-II) [1], a premier preparation project of the U.S. DOE Exascale Computing Initiative (ECI) [2]. Their immediate goals are described and explained, together with the solution used for each. Findings and early results are reported. Possible future works are outlined.
Watch the video: https://wp.me/p3RLHQ-lBX
Learn more: https://www.zettar.com/
and
https://rice2020oghpc.rice.edu/program-2/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Rice Oil & Gas Conference, Bradley McCredie from AMD presents: Scaling TCO in a Post Moore's Law Era.
"While foundries bravely drive forward to overcome the technical and economic challenges posed by scaling to 5nm and beyond, Moore’s law alone can provide only a fraction of the performance / watt and performance / dollar gains needed to satisfy the demands of today’s high performance computing and artificial intelligence applications. To close the gap, multiple strategies are required. First, new levels of innovation and design efficiency will supplement technology gains to continue to deliver meaningful improvements in SoC performance. Second, heterogenous compute architectures will create x-factor increases of performance efficiency for the most critical applications. Finally, open software frameworks, APIs, and toolsets will enable broad ecosystems of application level innovation."
Watch the video:
Learn more: http://amd.com
and
https://rice2020oghpc.rice.edu/program-2/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
In this deck from the ECSS Symposium, Abe Stern from NVIDIA presents: CUDA-Python and RAPIDS for blazing fast scientific computing.
"We will introduce Numba and RAPIDS for GPU programming in Python. Numba allows us to write just-in-time compiled CUDA code in Python, giving us easy access to the power of GPUs from a powerful high-level language. RAPIDS is a suite of tools with a Python interface for machine learning and dataframe operations. Together, Numba and RAPIDS represent a potent set of tools for rapid prototyping, development, and analysis for scientific computing. We will cover the basics of each library and go over simple examples to get users started. Finally, we will briefly highlight several other relevant libraries for GPU programming."
Watch the video: https://wp.me/p3RLHQ-lvu
Learn more: https://developer.nvidia.com/rapids
and
https://www.xsede.org/for-users/ecss/ecss-symposium
Sign up for our insideHPC Newsletter: http://insidehp.com/newsletter
In this deck from FOSDEM 2020, Colin Sauze from Aberystwyth University describes the development of a RaspberryPi cluster for teaching an introduction to HPC.
"The motivation for this was to overcome four key problems faced by new HPC users:
* The availability of a real HPC system and the effect running training courses can have on the real system, conversely the availability of spare resources on the real system can cause problems for the training course.
* A fear of using a large and expensive HPC system for the first time and worries that doing something wrong might damage the system.
* That HPC systems are very abstract systems sitting in data centres that users never see, it is difficult for them to understand exactly what it is they are using.
* That new users fail to understand resource limitations, in part because of the vast resources in modern HPC systems a lot of mistakes can be made before running out of resources. A more resource constrained system makes it easier to understand this.
The talk will also discuss some of the technical challenges in deploying an HPC environment to a Raspberry Pi and attempts to keep that environment as close to a "real" HPC as possible. The issue to trying to automate the installation process will also be covered."
Learn more: https://github.com/colinsauze/pi_cluster
and
https://fosdem.org/2020/schedule/events/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from ATPESC 2019, Ken Raffenetti from Argonne presents an overview of HPC interconnects.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two-week training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-luc
Learn more: https://extremecomputingtraining.anl.gov/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...inside-BigData.com
In this deck from FOSDEM 2020, Frank McQuillan from Pivotal presents: Efficient Model Selection for Deep Neural Networks on Massively Parallel Processing Databases.
"In this session we will present an efficient way to train many deep learning model configurations at the same time with Greenplum, a free and open source massively parallel database based on PostgreSQL. The implementation involves distributing data to the workers that have GPUs available and hopping model state between those workers, without sacrificing reproducibility or accuracy. Then we apply optimization algorithms to generate and prune the set of model configurations to try.
Deep neural networks are revolutionizing many machine learning applications, but hundreds of trials may be needed to generate a good model architecture and associated hyperparameters. This is the challenge of model selection. It is time consuming and expensive, especially if you are only training one model at a time.
Massively parallel processing databases can have hundreds of workers, so can you use this parallel compute architecture to address the challenge of model selection for deep nets, in order to make it faster and cheaper?
It’s possible!
We will demonstrate results from this project using a version of Hyperband, which is a well known hyperparameter optimization algorithm, and the deep learning frameworks Keras and TensorFlow, all running on Greenplum database using Apache MADlib. Other topics will include architecture, scalability results and bright opportunities for the future."
Watch the video: https://wp.me/p3RLHQ-lsQ
Learn more: https://fosdem.org/2020/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects
1. spcl.inf.ethz.ch
@spcl_eth
T. HOEFLER
RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects
HPC Advisory Council Swiss Workshop, Lugano, Switzerland
WITH HELP OF ROBERT GERSTENBERGER, MACIEJ BESTA, S. DI GIROLAMO, K. TARANOV, R. E. GRANT, R. BRIGHTWELL AND ALL OF SPCL
https://eurompi19.inf.ethz.ch
Submit papers by April 15th!
2. The Development of High-Performance Networking Interfaces
[Timeline figure, 1980-2020: Ethernet+TCP/IP (sockets); Scalable Coherent Interface (coherent memory access); Fast Messages, Myrinet GM+MX, Quadrics QsNet ((active) message based); Virtual Interface Architecture, IB Verbs, OFED, libfabric, Portals 4, Cray Gemini (remote direct memory access, RDMA). Recurring features: OS bypass, protocol offload, zero copy, triggered operations.]
95 of the top-100 and >285 of the top-500 systems use RDMA (June 2017)
3. RDMA AS MPI-3.0 REMOTE MEMORY ACCESS TRANSPORT
▪ MPI-3.0 supports RMA ("MPI One Sided")
▪ Designed to react to hardware trends
▪ Majority of HPC networks support RDMA
[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
4. MPI-3.0 REMOTE MEMORY ACCESS
▪ MPI-3.0 supports RMA ("MPI One Sided")
▪ Designed to react to hardware trends
▪ Majority of HPC networks support RDMA
▪ Communication is "one sided" (no involvement of the destination)
▪ RMA decouples communication & synchronization, different from message passing
[Figure: two sided, Proc A send / Proc B recv (communication and synchronization coupled) vs. one sided, Proc A put to Proc B (communication only; synchronization is a separate sync step)]
[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
6. MPI-3 RMA COMMUNICATION OVERVIEW
[Figure: passive Process A and Process D expose memory through MPI windows; active Processes B and C access them with Put, Get, and atomic operations]
▪ Non-atomic communication calls (Put, Get)
▪ Atomic communication calls (Acc, Get & Acc, CAS, FAO)
Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June 2019
11. MPI-3.0 RMA SYNCHRONIZATION OVERVIEW
[Figure: synchronization and communication between an active process and a passive process]
▪ Passive target mode: Lock, Lock All
▪ Active target mode: Fence, Post/Start/Complete/Wait
Balaji, Hoefler: Using Advanced MPI Tutorial, ISC19, June 2019
16. SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
▪ Scalable & generic protocols
▪ Can be used on any RDMA network (e.g., OFED/IB)
▪ Cover window creation, communication, and synchronization
▪ foMPI, a fully functional MPI-3 RMA implementation, built on:
  ▪ DMAPP: lowest-level networking API for Cray Gemini/Aries systems
  ▪ XPMEM: a portable Linux kernel module
http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
Gerstenberger et al.: "Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided", CACM, Oct. 2018
22. PART 1: SCALABLE WINDOW CREATION
Traditional windows (backwards compatible with MPI-2)
▪ Time bound: O(p); memory bound: O(p), where p = total number of processes
[Figure: Processes A, B, C expose memory at different base addresses (0x111, 0x123, 0x120), so every process must store all p remote addresses]
Gerstenberger et al.: "Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided", CACM, Oct. 2018
23. PART 1: SCALABLE WINDOW CREATION
Allocated windows (allow MPI to allocate the memory)
▪ Time bound: O(log p) (with high probability); memory bound: O(1)
[Figure: Processes A, B, C all map the window at the same base address (0x123), so a single base address suffices]
Gerstenberger et al.: "Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided", CACM, Oct. 2018
24. PART 1: SCALABLE WINDOW CREATION
Dynamic windows (local attach/detach; most flexible)
▪ Time bound: O(p); memory bound: O(p)
[Figure: Processes A, B, C attach regions at differing addresses (0x111, 0x123, 0x120, 0x129)]
Gerstenberger et al.: "Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided", CACM, Oct. 2018
25. PART 2: COMMUNICATION
▪ Put and Get: direct DMAPP put and get operations, or local (blocking) memcpy (XPMEM)
▪ Accumulate: DMAPP atomic operations for 64-bit types, or fall back to a remote locking protocol
▪ MPI datatype handling with the MPITypes library [1]
▪ Fast path for contiguous data transfers of common intrinsic datatypes (e.g., MPI_DOUBLE)
[Figure: for contiguous memory, MPI_Put maps directly to dmapp_put_nbi and MPI_Compare_and_swap to dmapp_acswap_qw_nbi at the remote process]
[1] Ross, Latham, Gropp, Lusk, Thakur: Processing MPI datatypes outside MPI. EuroMPI/PVM'09
38. PSCW: SCALABLE POST/START MATCHING
▪ In general, there can be n posting and m starting processes; in this example there is one posting process i and 4 starting processes j1, ..., j4
▪ The posting process opens its window; the starting processes access the remote window
▪ Each starting process has a local list
▪ The posting process i adds its rank i to the list at each starting process j1, ..., j4
▪ Each starting process j waits until the rank of the posting process i is present in its local list
[Figure, animated: rank i appears one by one in the local lists of j1, ..., j4]
Gerstenberger et al.: "Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided", CACM, Oct. 2018
48. PSCW: SCALABLE COMPLETE/WAIT MATCHING
▪ Each completing process increments a counter stored at the posting process
▪ When the counter is equal to the number of starting processes, the posting process returns from wait
[Figure, animated: the counter at posting process i advances from 0 to 4 as j1, ..., j4 complete]
Gerstenberger et al.: "Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided", CACM, Oct. 2018
53. spcl.inf.ethz.ch
@spcl_eth
54
PSCW PERFORMANCE
Time bound: P_start = P_wait = O(1); P_post = P_complete = O(log p)
Memory bound: O(log p) (for scalable programs)
Ring Topology
Gerstenberger et al.: "Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided", CACM, Oct. 2018
55. spcl.inf.ethz.ch
@spcl_eth
64
FLUSH SYNCHRONIZATION
▪ Guarantees remote completion
▪ Issues a remote bulk synchronization and an x86 mfence
▪ One of the most performance-critical functions; we add only 78 x86 CPU instructions to the critical path
Time bound: O(1)
Memory bound: O(1)
[Figure: Process 0 issues inc(counter) operations on a counter at Process 1, then a flush; the counter reaches 3]
57. spcl.inf.ethz.ch
@spcl_eth
66
PERFORMANCE
▪ Evaluation on the Blue Waters system
▪ 22,640 Cray XE6 compute nodes
▪ 724,480 schedulable cores
▪ All microbenchmarks
▪ 4 applications
▪ One nearly full-scale run ☺
59. spcl.inf.ethz.ch
@spcl_eth
68
PERFORMANCE: APPLICATIONS
NAS 3D FFT [1] performance (scaling to 65k procs); MILC [2] application execution time (scaling to 512k procs)
Annotations represent the performance gain of foMPI over Cray MPI-1.
[1] Nishtala et al.: Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS'09
[2] Shan et al.: Accelerating applications at scale using one-sided communication. PGAS'12
60. spcl.inf.ethz.ch
@spcl_eth
69
The Development of High-Performance Networking Interfaces
[Timeline 1980–2020: sockets (Ethernet+TCP/IP); coherent memory access (Scalable Coherent Interface); (active) message based (Myrinet GM+MX, Fast Messages, Quadrics QsNet, Virtual Interface Architecture); remote direct memory access (RDMA) with OS bypass, protocol offload, zero copy (IB Verbs, OFED, libfabric, Cray Gemini); triggered operations (Portals 4)]
What next?
Fundamental development: CPU core speed cannot keep up with the growing network bandwidths! Already today, CPUs cannot process packets at line rate!
61. spcl.inf.ethz.ch
@spcl_eth
70
Data Processing in Modern RDMA Networks
[Figure: arriving packets enter the RDMA NIC (input buffer, RDMA processing, DMA unit) and cross the PCIe bus (~250 ns) into the main memory of a local Core i7 Haswell node; access latencies on the host: L1 4 cycles ~1.3 ns, L2 11 cycles ~3.6 ns, L3 34 cycles ~11.3 ns, main memory 125 cycles ~41.6 ns]
Mellanox ConnectX-5: 1 packet / 5 ns
Tomorrow (400G): 1 packet / 1.2 ns
62. spcl.inf.ethz.ch
@spcl_eth
71
The Future of High-Performance Networking Interfaces
[Timeline 1980–2020 repeating the interface history — sockets, coherent memory access, (active) message based, RDMA with OS bypass/protocol offload/zero copy, triggered operations — now extended with fully programmable NIC acceleration: sPIN, Streaming Processing In the Network]
95 / top-100 systems use RDMA
>285 / top-500 systems use RDMA (June 2017)
Established principles for compute acceleration: portability, ease-of-use, programmability, specialization, libraries, efficiency
67. spcl.inf.ethz.ch
@spcl_eth
76
Possible sPIN Implementations
▪ sPIN is a programming abstraction, similar to CUDA or OpenCL combined with OFED or Portals 4
▪ It enables a large variety of NIC implementations!
▪ For example, massively multithreaded HPUs, including warp-like scheduling strategies
▪ Main goal: sPIN must not obstruct line rate
▪ The programmer must limit the processing time per packet
Little's Law: 500 instructions per handler, 2.5 GHz, IPC=1, 1 Tb/s → 25 kiB memory
▪ Relies on fast shared memory (processing in packet buffers): scratchpad or registers
▪ Quick (single-cycle) handler invocation on packet arrival: pre-initialized memory & context
▪ Can be implemented in most RDMA NICs with a firmware update
▪ Or in software on programmable (Smart) NICs
Example targets: BCM58800 SoC (full Linux), Innova Flex (Kintex FPGA), Catapult (Virtex FPGA)
At 400G, a sPIN NIC must process more than 833 million messages/s
68. spcl.inf.ethz.ch
@spcl_eth
77
Simulating a sPIN NIC – Ping Pong
▪ LogGOPSim v2 [1]: combines LogGOPSim (packet-level network) with gem5 (cycle-accurate CPU simulation)
▪ Network (LogGOPSim):
▪ Supports Portals 4 and MPI
▪ Parametrized for a future InfiniBand: o=65ns (measured), g=6.7ns (150 MM/s), G=2.5ps (400 Gb/s), switch L=50ns (measured), wire L=33.4ns (10m cable)
▪ NIC HPU:
▪ 2.5 GHz ARM Cortex A15 OOO
▪ ARMv8-A 32-bit ISA
▪ Single-cycle access SRAM (no DRAM)
▪ Header matching m=30ns, 2ns per packet (in parallel with g!)
Results (sPIN streaming vs. RDMA): 35% lower latency, 32% higher BW
Handler cost: 18 instructions + 1 Put
Use-case categories: Data Layout Transformation, Network Group Communication, Distributed Data Management
[1] S. Di Girolamo, K. Taranov, T. Schneider, E. Stalder, T. Hoefler: LogGOPSim+gem5: Simulating Network Offload Engines Over Packet-Switched Networks. ExaMPI'17
69. spcl.inf.ethz.ch
@spcl_eth
78
Use Case 1: Broadcast Acceleration (Network Group Communication)
Message size: 8 bytes
RDMA baseline
Liu, J., et al.: High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming, 2004
70. spcl.inf.ethz.ch
@spcl_eth
79
Use Case 1: Broadcast Acceleration (Network Group Communication)
Message size: 8 bytes
Offloaded collectives (e.g., ConnectX-2, Portals 4) vs. RDMA
Underwood, K.D., et al.: Enabling flexible collective communication offload with triggered operations. HOTI'11
Liu, J., et al.: High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming, 2004
71. spcl.inf.ethz.ch
@spcl_eth
80
Use Case 1: Broadcast Acceleration (Network Group Communication)
Message size: 8 bytes
sPIN vs. offloaded collectives (e.g., ConnectX-2, Portals 4) vs. RDMA
Handler cost: 24 instructions + log P Puts
Underwood, K.D., et al.: Enabling flexible collective communication offload with triggered operations. HOTI'11
Liu, J., et al.: High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming, 2004
72. spcl.inf.ethz.ch
@spcl_eth
81
Use Case 2: RAID Acceleration (Distributed Data Management)
[Figure: a write arrives at the server node, which forwards a parity update to the parity node; both the write and the parity update are ACKed; RDMA vs. sPIN]
Results: 20% lower latency, 176% higher BW
Handler costs: server 58 instructions + 1 Put; parity 46 instructions + 1 Put
Shankar, D., et al.: High-performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads. ICDCS'17
73. spcl.inf.ethz.ch
@spcl_eth
82
Use Case 3: MPI Datatypes Acceleration (Data Layout Transformation)
4 MiB transfer with varying blocksize; stride = 2 x blocksize
RDMA: 11.44 GiB/s (input buffer → destination memory)
Gropp, W., et al.: Improving the performance of MPI derived datatypes. MPIDC'99
74. spcl.inf.ethz.ch
@spcl_eth
83
Use Case 3: MPI Datatypes Acceleration (Data Layout Transformation)
4 MiB transfer with varying blocksize; stride = 2 x blocksize
RDMA: 11.44 GiB/s; sPIN: 43.6 GiB/s (input buffer → destination memory)
3.8x speedup
Handler cost: 54 instructions
Gropp, W., et al.: Improving the performance of MPI derived datatypes. MPIDC'99
76. spcl.inf.ethz.ch
@spcl_eth
85
Further results and use-cases
Use Case 4: MPI Rendezvous Protocol
program      p    msgs   ovhd (base)  ovhd (sPIN)  red
MILC         64   5.7M   5.5%         1.9%         65%
POP          64   772M   3.1%         2.4%         22%
coMD         72   5.3M   6.1%         2.4%         60%
coMD         360  28.1M  6.5%         2.8%         58%
Cloverleaf   72   2.7M   5.2%         2.4%         53%
Cloverleaf   360  15.3M  5.6%         3.2%         42%
77. spcl.inf.ethz.ch
@spcl_eth
86
Further results and use-cases
Use Case 5: Distributed KV Store
[Figure: the network appends (K1, V) to the log of key K1; the tail pointer advances from 0x88 to 0x98]
41% lower latency
Kalia, A., et al.: Using RDMA efficiently for key-value services. ACM SIGCOMM Computer Communication Review, 2014
78. spcl.inf.ethz.ch
@spcl_eth
87
Further results and use-cases
Use Case 6: Conditional Read
[Chart: fraction of discarded data 20%, 40%, 60%, 80%]
Barthels, C., et al.: Designing Databases for Future High-Performance Networks. IEEE Data Eng. Bulletin, 2017
79. spcl.inf.ethz.ch
@spcl_eth
88
Further results and use-cases
Use Case 7: Distributed Transactions
[Chart: data packets vs. log packets over the network]
Dragojević, A., et al.: No compromises: distributed transactions with consistency, availability, and performance. SOSP'15
80. spcl.inf.ethz.ch
@spcl_eth
89
Further results and use-cases
Use Case 8: FT Broadcast
[Chart: bcast packets vs. redundant bcast packets over the network]
Bosilca, G., et al.: Failure Detection and Propagation in HPC systems. SC'16
81. spcl.inf.ethz.ch
@spcl_eth
90
Further results and use-cases
Use Case 9: Distributed Consensus
István, Z., et al.: Consensus in a Box: Inexpensive Coordination in Hardware. NSDI'16
The Next 700 sPIN use-cases
… just think about sPIN graph kernels …
82. spcl.inf.ethz.ch
@spcl_eth
sPIN: Streaming Processing in the Network for Network Acceleration
Try it out: https://spcl.inf.ethz.ch/Research/Parallel_Programming/sPIN/
Full specification: https://arxiv.org/abs/1709.05483
sPIN beyond RDMA
3,800+ reads in less than 4 hours
83. spcl.inf.ethz.ch
@spcl_eth
92
MPI – some new research challenges!
▪ More complex network interfaces
▪ sPIN – just discussed in detail
▪ How to use it in real applications?
▪ How to implement sPIN?
▪ Collaboration with Luca Benini: RISC-V SoC
▪ Full FPGA/simulation environment available as open source
▪ New application use-cases – "Deep Learning is HPC" [1]
▪ Different communication requirements and opportunities for optimization: mainly sparsity [2] and asynchrony!
▪ Need to be a bit careful, though, when entering a new field
T. Hoefler: "Twelve ways to fool the masses when reporting performance of deep learning workloads" (my humorous guide to floptimizing deep learning)
[1] Ben-Nun, Hoefler: "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis", arXiv:1802.09941
[2] Renggli et al.: "SparCML: High-Performance Sparse Communication for Machine Learning", arXiv:1802.08021
https://eurompi19.inf.ethz.ch – submit papers by April 15th!