This document summarizes research on LLVM optimizations for PGAS (Partitioned Global Address Space) languages such as Chapel. It discusses generating LLVM IR from Chapel to enable optimizations such as LICM (Loop-Invariant Code Motion). Evaluations show that the LLVM optimizations remove many communication operations and improve performance for some applications relative to C code generation. However, LLVM constraints and wide-pointer overhead hurt performance for other applications. Future work includes more applications, possibly-remote-to-definitely-local transformations, and parallel intermediate representations in LLVM.
LLVM-based Communication Optimizations for PGAS Programs – Akihiro Hayashi
The Second Workshop on the LLVM Compiler Infrastructure in HPC (Co-located with SC15)
While Partitioned Global Address Space (PGAS) programming languages such as UPC/UPC++, CAF, Chapel and X10 provide high-level programming models for facilitating large-scale distributed-memory parallel programming, it is widely recognized that compiler analysis and optimization for these languages has been very limited, unlike the optimization of SMP models such as OpenMP. One reason for this limitation is that current optimizers for PGAS programs are specialized to different languages. This is unfortunate since communication optimization is an important class of compiler optimizations for PGAS programs running on distributed-memory platforms, and these optimizations need to be performed more widely. Thus, a more effective approach would be to build a language-independent and runtime-independent compiler framework for optimizing PGAS programs so that new communication optimizations can be leveraged by different languages. To address this need, we introduce an LLVM-based (Low Level Virtual Machine) communication optimization framework. Our compilation system leverages existing optimization passes and introduces new PGAS language-aware runtime-dependent/independent passes to reduce communication overheads. Our experimental results show an average performance improvement of 3.5× and 3.4× on 64 nodes of a Cray XC30™ supercomputer and 32 nodes of a Westmere cluster respectively, for a set of benchmarks written in the Chapel language. Overall, we show that our new LLVM-based compiler optimization framework can effectively improve the performance of PGAS programs.
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP – Thomas Graf
This talk will start with a deep dive and hands-on examples of BPF, possibly the most promising low-level technology for addressing challenges in application and network security, tracing, and visibility. We will discuss how BPF evolved from a simple bytecode language for filtering raw sockets for tcpdump into a JITable virtual machine capable of universally extending and instrumenting both the Linux kernel and user-space applications. The introduction is followed by a concrete example of how the Cilium open source project applies BPF to solve networking, security, and load balancing for highly distributed applications. We will discuss and demonstrate how Cilium, with the help of BPF, can be combined with distributed-system orchestration such as Docker to simplify security, operations, and troubleshooting of distributed applications.
CETH for XDP [Linux Meetup Santa Clara | July 2016] – IO Visor Project
This document discusses CETH (Common Ethernet Driver Framework), which aims to improve kernel networking performance for virtualization. CETH simplifies NIC drivers by consolidating common functions. It supports various NICs and accelerators. CETH features efficient memory and buffer management, flexible TX/RX scheduling, and a customizable metadata structure. It is being simplified to work with XDP for even higher performance network I/O processing in the kernel. Next steps include further optimizations and measuring performance gains when using CETH with XDP and virtualized environments.
Cilk-M is a work-stealing runtime system that solves the cactus stack problem using thread-local memory mapping (TLMM). Each worker maintains its own deque of frames and manipulates the bottom of the deque like a stack. When a worker runs out of work, it steals frames from the top of a random victim's deque. This allows Cilk-M to achieve linear speedup and bounded stack space while maintaining serial-parallel reciprocity and interoperability with legacy code.
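The deque discipline described above can be sketched in a few lines: the owner pushes and pops at the bottom of its own deque, while an idle worker steals the oldest frame from the top of a random victim. This is a single-threaded simulation for clarity, with a fixed set of frames, no spawning, and none of Cilk-M's TLMM machinery; `Worker` and its methods are illustrative names, not Cilk-M's API.

```python
from collections import deque
import random

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.deque = deque()  # right end = bottom, left end = top

    def push(self, frame):
        self.deque.append(frame)          # owner works at the bottom

    def pop(self):
        return self.deque.pop() if self.deque else None

    def steal_from(self, victim):
        # Thieves take the oldest frame from the top of the victim's deque,
        # minimizing contention with the victim's own bottom-end accesses.
        return victim.deque.popleft() if victim.deque else None

workers = [Worker(i) for i in range(4)]
workers[0].deque.extend(range(8))         # all work starts on worker 0

done = []
while any(w.deque for w in workers):
    for w in workers:
        frame = w.pop()
        if frame is None:                 # out of work: steal from a random victim
            victim = random.choice([v for v in workers if v is not w])
            frame = w.steal_from(victim)
        if frame is not None:
            done.append(frame)            # "execute" the frame
```

Every frame is executed exactly once, regardless of how the steals interleave, which mirrors the load-balancing property the runtime relies on.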
FARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine – Yuuki Takano
http://ytakano.github.io/
http://ieeexplore.ieee.org/document/7740332/
Uniform resource locator (URL) filtering is a fundamental technology for intrusion detection, HTTP proxies, content distribution networks, content-centric networks, and many other application areas. Some applications adopt URL filtering to protect user privacy from malicious or insecure websites. Some web browser extensions, such as AdBlock Plus, provide a URL-filtering mechanism to block sites that attempt to steal sensitive information.
Unfortunately, these extensions are implemented inefficiently, resulting in slow applications that consume a great deal of memory. Although AdBlock Plus provides a domain-specific language (DSL) to represent URLs, it internally uses regular expressions and does not take advantage of the benefits of the DSL. In addition, the number of filter rules becomes large, which makes matters worse.
In this paper, we propose the fast uniform resource identifier-specific filter (FARIS), a domain-specific pseudo-machine for the DSL, to dramatically improve the performance of such browser extensions. Compared with a conventional implementation that internally adopts regular expressions, our proof-of-concept implementation is fast and has a small memory footprint.
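To see why matching the filter DSL directly can beat compiling each rule to a general regular expression, consider a toy matcher for AdBlock-style patterns with `*` wildcards. This is a simplification invented for illustration; FARIS's actual pseudo-machine and the full rule semantics (anchors, separators, options) are considerably more elaborate.

```python
def matches(pattern, url):
    """Check that the literal segments of an AdBlock-style pattern
    (split on '*' wildcards) occur in order inside the URL.
    Plain substring scans, no regex engine involved."""
    pos = 0
    for part in pattern.split('*'):
        if not part:
            continue                    # consecutive/leading '*' match anything
        idx = url.find(part, pos)
        if idx < 0:
            return False                # segment not found after previous match
        pos = idx + len(part)
    return True
```

Because each rule reduces to a handful of `find` calls over literal segments, there is no regex compilation step and no per-rule automaton to keep in memory, which is the kind of saving the DSL-aware approach exploits.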
Customization of a Deep Learning Accelerator, Based on NVDLA – Shien-Chun Luo
This document discusses customizing a deep learning accelerator. It begins with a demonstration of object detection using a Tiny YOLO v1 model on an FPGA-based prototype. It then discusses designing a high-efficiency accelerator with three steps: 1) increasing MAC processing elements and utilization, 2) increasing data supply, and 3) improving energy efficiency. Various neural network models are profiled to analyze memory bandwidth and computational power tradeoffs. The document proposes a customizable architecture and discusses solutions like layer fusion, quantization-aware training, and post-training quantization. Performance estimates using an equation-based profiler for sample models are provided to demonstrate the customized accelerator design.
We have updated the DLA system introduction here, covering the design, add-on functions, and applications. During 2018–2019, we developed the tools needed for IC simulation and verification, constructed a quantization-aware and hardware-aware training flow, and improved the automation of verification. We have verified this system on FPGA and in a solid-state SoC.
The document summarizes a tutorial on using the Score-P profiling tool to analyze performance of proxy applications. The tutorial covers Score-P profiling and tracing workflow, capabilities, and demonstrates its use through case studies on various proxy apps like AMG, Laghos, PICSARlite, and NEKbone. Attendees will learn how to instrument code, conduct profiling and tracing runs, and analyze results to find hot regions and scaling trends.
eBPF Debugging Infrastructure - Current Techniques – Netronome
eBPF (extended Berkeley Packet Filter), in particular with its driver-level hook XDP (eXpress Data Path), has increased in importance over the past few years. As a result, the ability to rapidly debug and diagnose problems is becoming more relevant. This talk will cover common issues faced and techniques to diagnose them, including the use of bpftool for map and program introspection, the use of disassembly to inspect generated assembly code and other methods such as using debug prints and how to apply these techniques when eBPF programs are offloaded to the hardware.
The talk will also explore where the current gaps in debugging infrastructure are and suggest some of the next steps to improve this, for example, integrations with tools such as strace, valgrind or even the LLDB debugger.
The document discusses Cilk and Cilk++, parallel programming languages that allow spawning concurrent tasks. It covers the key language features, spawn and sync, provides examples of Fibonacci implementations, and describes the work-stealing runtime system that dynamically schedules tasks across processors. The runtime uses a decentralized work-stealing approach in which idle processors steal tasks from other processors' task queues to balance the workload.
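The spawn/sync structure of the canonical Fibonacci example can be approximated with Python threads. This is only a sketch of the semantics: real Cilk uses lazy task creation and work stealing rather than one OS thread per spawn, and the `CUTOFF` threshold here is an artifact of the simulation, not a Cilk feature.

```python
import threading

CUTOFF = 10  # below this, recurse serially to avoid flooding the OS with threads

def fib(n):
    if n < 2:
        return n
    if n < CUTOFF:
        return fib(n - 1) + fib(n - 2)
    result = {}
    def child():
        result['x'] = fib(n - 1)        # "spawn": may run in parallel with fib(n-2)
    t = threading.Thread(target=child)
    t.start()
    y = fib(n - 2)                      # continuation runs in the parent
    t.join()                            # "sync": wait for the spawned child
    return result['x'] + y
```

Note the serial-elision property the languages are designed around: deleting the thread machinery (treating spawn as a plain call and sync as a no-op) leaves a correct sequential program.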
A short but packed course on TCP Dynamic Behavior. It starts by explaining TCP from scratch so the dynamic parts can be understood. Then it dives deep into how TCP behaves in real IP networks in the face of packet losses, delays and other phenomena.
The first version of eBPF hardware offload was merged into the Linux kernel in October 2016 and became part of Linux v4.9. For the last two years the project has been growing and evolving to integrate more closely with the core kernel infrastructure and enable more advanced use cases. This talk will explain the internals of the kernel architecture of the offload and how it allows seamless execution of unmodified eBPF datapaths in hardware.
Programming Languages & Tools for Higher Performance & Productivity – Linaro
By Hitoshi Murai, RIKEN AICS
For higher performance and productivity of HPC systems, it is important to provide users with a good programming environment, including languages, compilers, and tools. In this talk, the programming model of the post-K supercomputer will be shown.
Hitoshi Murai Bio
Hitoshi Murai received a master's degree in information science from Kyoto University in 1996. He worked as a software developer at NEC from 1996 to 2010. He received a Ph.D. degree in computer science from the University of Tsukuba in 2010. He is currently a research scientist in the programming environment research team and the Flagship 2020 project at the Advanced Institute for Computational Science, RIKEN. His research interests include compilers and parallel programming languages.
Email
h-murai@riken.jp
For more info on The Linaro High Performance Computing (HPC) visit https://www.linaro.org/sig/hpc/
This document discusses BPF (Berkeley Packet Filter), a mechanism for filtering network packets on Linux. BPF allows defining filters using an instruction set that is executed against packets to determine whether to accept or drop them. The document provides an overview of how BPF works, demonstrating simple BPF filters, and discusses using BPF for packet filtering and other applications like seccomp.
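The accept/drop instruction-set model can be made concrete with a toy interpreter. The encoding below is invented for illustration; real classic BPF instructions carry explicit jump offsets and a scratch memory store, and the kernel executes them natively or via JIT rather than in Python.

```python
# Toy interpreter for a cBPF-like filter. Each instruction is (op, arg):
#   ('ld', off)   load the packet byte at offset off into the accumulator
#   ('jeq', val)  skip the next instruction when the accumulator equals val
#   ('ret', v)    return the verdict v ('accept' or 'drop')

def run_filter(program, packet):
    acc = 0
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        if op == 'ld':
            acc = packet[arg]
        elif op == 'jeq':
            if acc == arg:
                pc += 1                # jump over the next instruction
        elif op == 'ret':
            return arg
        pc += 1
    return 'drop'                      # fall off the end: reject the packet

# Accept packets whose first byte is 0x45 (IPv4, 5-word header),
# drop everything else.
prog = [('ld', 0), ('jeq', 0x45), ('ret', 'drop'), ('ret', 'accept')]
```

The same verdict-per-packet structure underlies both tcpdump filters and seccomp policies mentioned above; only the input (packet bytes vs. syscall metadata) differs.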
NERSC is the production high-performance computing (HPC) center for the United States Department of Energy (DOE) Office of Science. The center supports over 6,000 users in 600 projects, using a variety of applications in materials science, chemistry, biology, astrophysics, high energy physics, climate science, fusion science, and more.
NERSC deployed the Cori system on over 9,000 Intel® Xeon Phi™ processors. This session describes the optimization strategy for porting codes that target traditional manycore architectures to the processors. We also discuss highlights and lessons learned from the optimization process on 20 applications associated with the NERSC Exascale Science Application Program (NESAP).
Porting and Optimization of Numerical Libraries for ARM SVE – Linaro
By Toshiyuki Imamura, RIKEN AICS
RIKEN and Fujitsu are developing ARM-based numerical libraries optimized using the new features of ARM SVE. We present the porting status of netlib+SSL-II for ARM-SVE and other OSS. We also demonstrate some optimization policies and techniques, especially for the basic numerical linear algebra kernels.
Toshiyuki Imamura Bio
Toshiyuki Imamura is currently a team leader of Large-scale Parallel Numerical Computing Technology at Advanced Institute for Computational Science (AICS), RIKEN. He is in charge of the development of numerical libraries for the post-K project. His research interests include high-performance computing, automatic-tuning technology, eigenvalue computation (algorithm/software/applications), etc. He and his colleagues (Japan Atomic Energy Agency (JAEA) team) were nominated as one of the finalists of Gordon Bell Prize in SC05 and SC06. He is a member of IPSJ, JSIAM, and SIAM.
Email
imamura.toshiyuki@riken.jp
For more info on The Linaro High Performance Computing (HPC) visit https://www.linaro.org/sig/hpc/
eBPF Tooling and Debugging Infrastructure – Netronome
eBPF, in particular with its driver-level hook XDP, has increased in importance over the past few years. As a result, the ability to rapidly debug and diagnose problems is becoming more relevant. This session will cover common issues faced and techniques to diagnose them, including the use of bpftool for map and program introspection, the disassembling of programs to inspect generated eBPF instructions and other methods such as using debug prints and how to apply these techniques when eBPF programs are offloaded to the hardware.
IBM XL Compilers Performance Tuning 2016-11-18 – Yaoqing Gao
This document provides an overview of performance tuning with IBM XL C/C++ and Fortran compilers and libraries. It discusses identifying application hot spots and bottlenecks using profiling tools like gprof and perf. It also covers compiler optimization techniques including basic optimizations like inlining and redundancy detection as well as advanced optimizations like interprocedural analysis and whole program optimization. Loop transformations are highlighted as important for improving performance of numerical applications.
Netronome's half-day tutorial on host data plane acceleration at ACM SIGCOMM 2018 introduced attendees to models for host data plane acceleration and provided an in-depth understanding of SmartNIC deployment models at hyperscale cloud vendors and telecom service providers.
Presenter Bios
Jakub Kicinski is a long term Linux kernel contributor, who has been leading the kernel team at Netronome for the last two years. Jakub’s major contributions include the creation of BPF hardware offload mechanisms in the kernel and bpftool user space utility, as well as work on the Linux kernel side of OVS offload.
David Beckett is a Software Engineer at Netronome with a strong technical background in computer networks, including academic research on DDoS. David has expertise in the areas of Linux architecture and computer programming. David holds a master's degree in Electrical and Electronic Engineering from Queen's University Belfast and continues as a PhD student studying emerging application-layer DDoS threats.
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors – Intel® Software
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
Arm tools and roadmap for SVE compiler support – Linaro
By Richard Sandiford and Florian Hahn, Arm
This presentation will give an overview of what Arm is doing to develop the HPC ecosystem, with a particular focus on SVE. It will include a brief synopsis of both the commercial and open-source tools and libraries that Arm is developing and a description of the various community initiatives that Arm is involved in. The bulk of the talk will describe the roadmap for SVE compiler support in both GCC and LLVM. It will cover the work that has already been done to support both hand-optimised and automatically-vectorised code, and the plans for future improvements.
For more info on The Linaro High Performance Computing (HPC) visit https://www.linaro.org/sig/hpc/
Performance evaluation with Arm HPC tools for SVE – Linaro
By Miwako Tsuji (RIKEN) and Yuetsu Kodama (RIKEN)
"Co-design" is a bi-directional approach in which the system is designed around the demands of applications, and the applications, in turn, are optimized for the system. Performance estimation and evaluation of applications are important for co-design. In this talk, we focus on performance evaluation with Arm HPC tools for SVE.
Miwako Tsuji received her master's and Ph.D. degrees in Information Science and Technology from Hokkaido University. From 2007 to 2013, she worked at Hokkaido University, the University of Tokyo, the University of Tsukuba, and the Universite de Versailles Saint-Quentin-en-Yvelines. She has been a research scientist at the RIKEN Advanced Institute for Computational Science since 2013. She has been a member of the architecture development team of the Flagship 2020 project, i.e. the post-K computer project, since the project started in 2014. She is a co-recipient of the 2011 ACM Gordon Bell Prize.
An evaluation of LLVM compiler for SVE with fairly complicated loops – Linaro
The document evaluates the Arm and Intel compilers in vectorizing loops from a particle-in-cell simulation code. While the Intel compiler can vectorize all of the loops, the Arm compiler can vectorize only one. Investigation found that the Arm compiler spilled too many loop-invariant variables to memory in two complex loops, preventing vectorization. Minor improvements to the Arm compiler's scalar loops were identified that could provide a good base for vectorization. With these obstacles removed and reasonable modifications, the Arm compiler's code could surpass the Intel compiler's performance.
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines – Intel® Software
Orbital representations that are based on B-splines are widely used in quantum Monte Carlo (QMC) simulations of solids, where they historically take as much as 50 percent of the total runtime. Random access to a large four-dimensional array makes it challenging to efficiently use caches and wide vector units in modern CPUs. So, we present node-level optimizations of B-spline evaluations on multicore and manycore shared-memory processors.
To increase single instruction, multiple data (SIMD) efficiency and bandwidth utilization, we first apply a data layout transformation from an array of structures (AoS) to a structure of arrays (SoA). Then, by blocking SoA objects, we optimize cache reuse and obtain sustained throughput for a range of problem sizes. We implement efficient nested threading in the B-spline orbital evaluation kernels, paving the way toward strong scaling of QMC simulations and resulting in performance enhancements. Finally, we employ roofline performance analysis to model the impact of our optimizations.
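The AoS-to-SoA step can be sketched with NumPy (assuming NumPy is available). This is an illustrative layout transformation on made-up particle fields, not the QMCPACK B-spline code itself.

```python
import numpy as np

n = 1024

# AoS: one record per particle. Fields x, y, z are interleaved in memory,
# so a vector load of consecutive x values has stride 12 bytes, not 4.
aos = np.zeros(n, dtype=[('x', 'f4'), ('y', 'f4'), ('z', 'f4')])
aos['x'] = np.arange(n, dtype='f4')

# SoA: each field becomes its own contiguous array, so SIMD units can
# issue unit-stride loads and use full cache lines.
soa = {name: np.ascontiguousarray(aos[name]) for name in ('x', 'y', 'z')}

# The same computation, now over unit-stride data.
r2 = soa['x'] ** 2 + soa['y'] ** 2 + soa['z'] ** 2
```

Blocking, as described above, would then tile these SoA arrays so each block fits in cache before moving to the next.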
This slide deck focuses on the eBPF JIT compilation infrastructure and the important role it plays in the eBPF life cycle inside the Linux kernel. First, the kernel performs a number of control-flow checks to reject vulnerable programs, and then JIT-compiles the eBPF program to either host or offload-target instructions, which boosts performance. However, there is little documentation on this topic, which this slide deck dives into.
Direct Code Execution - LinuxCon Japan 2014 – Hajime Tazaki
Direct Code Execution (DCE) is a userspace kernel network stack that allows running real network stack code in a single process. DCE provides a testing platform that enables reproducible testing, fine-grained parameter tuning, and a development framework for network protocols. It achieves this through a virtualization core layer that runs multiple network nodes within a single process, a kernel layer that replaces the kernel with a shared library, and a POSIX layer that redirects system calls to the kernel library. This allows full control and observability for testing and debugging the network stack.
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages – Akihiro Hayashi
With the shift to exascale computer systems, the importance of productive programming models for distributed systems is increasing. Partitioned Global Address Space (PGAS) programming models aim to reduce the complexity of writing distributed-memory parallel programs by introducing global operations on distributed arrays, distributed task parallelism, directed synchronization, and mutual exclusion. However, a key challenge in the application of PGAS programming models is the improvement of compilers and runtime systems. In particular, one open question is how runtime systems meet the requirement of exascale systems, where a large number of asynchronous tasks are executed.
While there are various tasking runtimes such as Qthreads, OCR, and HClib, there is no existing comparative study of PGAS tasking/threading runtime systems. To explore runtime systems for PGAS programming languages, we have implemented OCR-based and HClib-based Chapel runtimes and evaluated them with an initial focus on tasking and synchronization implementations. The results show that our OCR-based and HClib-based implementations can improve the performance of PGAS programs compared to the existing Qthreads backend of Chapel.
The document summarizes a tutorial on using the Score-P profiling tool to analyze performance of proxy applications. The tutorial covers Score-P profiling and tracing workflow, capabilities, and demonstrates its use through case studies on various proxy apps like AMG, Laghos, PICSARlite, and NEKbone. Attendees will learn how to instrument code, conduct profiling and tracing runs, and analyze results to find hot regions and scaling trends.
eBPF Debugging Infrastructure - Current TechniquesNetronome
eBPF (extended Berkeley Packet Filter), in particular with its driver-level hook XDP (eXpress Data Path), has increased in importance over the past few years. As a result, the ability to rapidly debug and diagnose problems is becoming more relevant. This talk will cover common issues faced and techniques to diagnose them, including the use of bpftool for map and program introspection, the use of disassembly to inspect generated assembly code and other methods such as using debug prints and how to apply these techniques when eBPF programs are offloaded to the hardware.
The talk will also explore where the current gaps in debugging infrastructure are and suggest some of the next steps to improve this, for example, integrations with tools such as strace, valgrind or even the LLDB debugger.
The document discusses CILK and CILK++, parallel programming languages that allow spawning concurrent tasks. It covers the key language features like spawn and sync, provides examples of Fibonacci implementations, and describes the work stealing runtime system that dynamically schedules tasks across processors. The runtime uses a decentralized work stealing approach where idle processors steal tasks from other processors' task queues to balance workload.
A short but packed course on TCP Dynamic Behavior. It starts by explaining TCP from scratch so the dynamic parts can be understood. Then it dives deep into how TCP behaves in real IP networks in the face of packet losses, delays and other phenomena.
The first version of eBPF hardware offload was merged into the Linux kernel in October 2016 and became part of Linux v4.9. For the last two years the project has been growing and evolving to integrate more closely with the core kernel infrastructure and enable more advanced use cases. This talk will explain the internals of the kernel architecture of the offload and how it allows seamless execution of unmodified eBPF datapaths in HW.
Programming Languages & Tools for Higher Performance & ProductivityLinaro
By Hitoshi Murai, RIKEN AICS
For higher performance and productivity of HPC systems, it is important to provide users with good programming environment including languages, compilers, and tools. In this talk, the programming model of the post-K supercomputer will be shown.
Hitoshi Murai Bio
Hitoshi Murai received a master's degree in information science from Kyoto University in 1996. He worked as a software developer in NEC from 1996 to 2010. He received a Ph.D degree in computer science from University of Tsukuba in 2010. He is currently a research scientist of the programming environment research team and the Flagship 2020 project in Advanced Institute for Computational Science, RIKEN. His research interests include compilers and parallel programming languages.
Email
h-murai@riken.jp
For more info on The Linaro High Performance Computing (HPC) visit https://www.linaro.org/sig/hpc/
This document discusses BPF (Berkeley Packet Filter), a mechanism for filtering network packets on Linux. BPF allows defining filters using an instruction set that is executed against packets to determine whether to accept or drop them. The document provides an overview of how BPF works, demonstrating simple BPF filters, and discusses using BPF for packet filtering and other applications like seccomp.
NERSC is the production high-performance computing (HPC) center for the United States Department of Energy (DOE) Office of Science. The center supports over 6,000 users in 600 projects, using a variety of applications in materials science, chemistry, biology, astrophysics, high energy physics, climate science, fusion science, and more.
NERSC deployed the Cori system on over 9,000 Intel® Xeon Phi™ processors. This session describes the optimization strategy for porting codes that target traditional manycore architectures to the processors. We also discuss highlights and lessons learned from the optimization process on 20 applications associated with the NERSC Exascale Science Application Program (NESAP).
Porting and Optimization of Numerical Libraries for ARM SVELinaro
By Toshiyuki Imamura, RIKEN AICS
RIKEN and Fujitsu are developing ARM-based numerical libraries optimized using the new features of ARM SVE. We present the porting status of netlib and SSL-II for ARM SVE, along with other OSS. We also demonstrate some optimization policies and techniques, especially for basic numerical linear algebra kernels.
Toshiyuki Imamura Bio
Toshiyuki Imamura is currently the team leader of Large-scale Parallel Numerical Computing Technology at the Advanced Institute for Computational Science (AICS), RIKEN. He is in charge of the development of numerical libraries for the post-K project. His research interests include high-performance computing, automatic-tuning technology, and eigenvalue computation (algorithms/software/applications). He and his colleagues (the Japan Atomic Energy Agency (JAEA) team) were nominated as finalists for the Gordon Bell Prize at SC05 and SC06. He is a member of IPSJ, JSIAM, and SIAM.
Email
imamura.toshiyuki@riken.jp
eBPF Tooling and Debugging Infrastructure (Netronome)
eBPF, in particular with its driver-level hook XDP, has increased in importance over the past few years. As a result, the ability to rapidly debug and diagnose problems is becoming more relevant. This session will cover common issues faced and techniques to diagnose them, including the use of bpftool for map and program introspection, the disassembling of programs to inspect generated eBPF instructions and other methods such as using debug prints and how to apply these techniques when eBPF programs are offloaded to the hardware.
IBM XL Compilers Performance Tuning 2016-11-18 (Yaoqing Gao)
This document provides an overview of performance tuning with IBM XL C/C++ and Fortran compilers and libraries. It discusses identifying application hot spots and bottlenecks using profiling tools like gprof and perf. It also covers compiler optimization techniques including basic optimizations like inlining and redundancy detection as well as advanced optimizations like interprocedural analysis and whole program optimization. Loop transformations are highlighted as important for improving performance of numerical applications.
Netronome's half-day tutorial on host data plane acceleration at ACM SIGCOMM 2018 introduced attendees to models for host data plane acceleration and provided an in-depth understanding of SmartNIC deployment models at hyperscale cloud vendors and telecom service providers.
Presenter Bios
Jakub Kicinski is a long-term Linux kernel contributor who has been leading the kernel team at Netronome for the last two years. Jakub's major contributions include the creation of the BPF hardware offload mechanisms in the kernel and the bpftool user space utility, as well as work on the Linux kernel side of OVS offload.
David Beckett is a Software Engineer at Netronome with a strong technical background in computer networks, including academic research on DDoS. David has expertise in Linux architecture and computer programming. He holds a Master's degree in Electrical and Electronic Engineering from Queen's University Belfast and continues as a PhD student studying emerging application-layer DDoS threats.
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors (Intel® Software)
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
Arm tools and roadmap for SVE compiler support (Linaro)
By Richard Sandiford, Florian Hahn (Arm), ARM
This presentation will give an overview of what Arm is doing to develop the HPC ecosystem, with a particular focus on SVE. It will include a brief synopsis of both the commercial and open-source tools and libraries that Arm is developing and a description of the various community initiatives that Arm is involved in. The bulk of the talk will describe the roadmap for SVE compiler support in both GCC and LLVM. It will cover the work that has already been done to support both hand-optimised and automatically-vectorised code, and the plans for future improvements.
Performance evaluation with Arm HPC tools for SVE (Linaro)
By Miwako Tsuji (RIKEN) and Yuetsu Kodama (RIKEN)
Co-design is a bi-directional approach in which the system is designed on demand from applications, and the applications in turn must be optimized for the system. Performance estimation and evaluation of applications are important for co-design. In this talk, we focus on performance evaluation with Arm HPC tools for SVE.
Miwako Tsuji received her master's and PhD degrees in information science and technology from Hokkaido University. From 2007 to 2013, she worked at Hokkaido University, the University of Tokyo, the University of Tsukuba, and the Université de Versailles Saint-Quentin-en-Yvelines. She has been a research scientist at the RIKEN Advanced Institute for Computational Science since 2013, and a member of the architecture development team of the Flagship 2020 (post-K computer) project since the project started in 2014. She is a co-recipient of the ACM Gordon Bell Prize in 2011.
An evaluation of LLVM compiler for SVE with fairly complicated loops (Linaro)
The document evaluates the ARM and Intel compilers on vectorizing loops from a particle-in-cell simulation code. While the Intel compiler can vectorize all of the loops, the ARM compiler can vectorize only one. Investigation found that the ARM compiler spilled too many loop-invariant variables to memory in two complex loops, preventing vectorization. Minor improvements to the ARM compiler's scalar loops were identified that could provide a good base for vectorization. With these obstacles removed and reasonable modifications made, the ARM-generated code could surpass the Intel compiler's performance.
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines (Intel® Software)
Orbital representations based on B-splines are widely used in quantum Monte Carlo (QMC) simulations of solids, where they historically take as much as 50 percent of the total runtime. Random access to a large four-dimensional array makes it challenging to use caches and the wide vector units of modern CPUs efficiently. We therefore present node-level optimizations of B-spline evaluations on multicore and manycore shared-memory processors.
To increase single instruction multiple data (SIMD) efficiency and bandwidth utilization, we first apply a data layout transformation from array of structures (AoS) to structure of arrays (SoA). Then, by blocking SoA objects, we optimize cache reuse and obtain sustained throughput for a range of problem sizes. We implement efficient nested threading in the B-spline orbital evaluation kernels, paving the way towards strong scaling of QMC simulations and yielding performance enhancements. Finally, we employ roofline performance analysis to model the impact of our optimizations.
This slide deck focuses on the eBPF JIT compilation infrastructure and the important role it plays in the eBPF life cycle inside the Linux kernel. First, the kernel performs a number of control-flow checks to reject vulnerable programs; it then JIT-compiles the eBPF program to either host or offload-target instructions, which boosts performance. There is little documentation on this topic, which this slide deck dives into.
Direct Code Execution - LinuxCon Japan 2014 (Hajime Tazaki)
Direct Code Execution (DCE) is a userspace kernel network stack that allows running real network stack code in a single process. DCE provides a testing platform that enables reproducible testing, fine-grained parameter tuning, and a development framework for network protocols. It achieves this through a virtualization core layer that runs multiple network nodes within a single process, a kernel layer that replaces the kernel with a shared library, and a POSIX layer that redirects system calls to the kernel library. This allows full control and observability for testing and debugging the network stack.
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages (Akihiro Hayashi)
With the shift to exascale computer systems, the importance of productive programming models for distributed systems is increasing. Partitioned Global Address Space (PGAS) programming models aim to reduce the complexity of writing distributed-memory parallel programs by introducing global operations on distributed arrays, distributed task parallelism, directed synchronization, and mutual exclusion. However, a key challenge in the application of PGAS programming models is the improvement of compilers and runtime systems. In particular, one open question is how runtime systems meet the requirement of exascale systems, where a large number of asynchronous tasks are executed.
While there are various tasking runtimes such as Qthreads, OCR, and HClib, there is no existing comparative study of PGAS tasking/threading runtime systems. To explore runtime systems for PGAS programming languages, we have implemented OCR-based and HClib-based Chapel runtimes and evaluated them with an initial focus on tasking and synchronization implementations. The results show that our OCR- and HClib-based implementations can improve the performance of PGAS programs compared to the existing Qthreads backend of Chapel.
Track A: Compilation guiding and adjusting - IBM (chiportal)
The document summarizes the Embedded Reconfigurable Architecture (ERA) project. The ERA project aims to develop an adaptive platform that can dynamically adjust hardware resources to meet changing performance and power needs. Key components include reconfigurable processing elements, memory hierarchies, and networks. The project involves 10 partners across academia and industry. Work focuses on compilers, operating systems, hardware scheduling, and exploiting tradeoffs between performance and power consumption.
This presentation introduces Data Plane Development Kit overview and basics. It is a part of a Network Programming Series.
First, the presentation focuses on the network performance challenges on the modern systems by comparing modern CPUs with modern 10 Gbps ethernet links. Then it touches memory hierarchy and kernel bottlenecks.
The following part explains the main DPDK techniques, like polling, bursts, hugepages and multicore processing.
The DPDK overview explains how a DPDK application is initialized and run, and touches on lockless queues (rte_ring), memory pools (rte_mempool), memory buffers (rte_mbuf), hashes (rte_hash), cuckoo hashing, the longest prefix match library (rte_lpm), poll mode drivers (PMDs), and the kernel NIC interface (KNI).
At the end, there are a few DPDK performance tips.
mTCP enables high-performance userspace TCP/IP stacks by bypassing the kernel and reducing system call overhead. It was shown to achieve up to 25x higher throughput than Linux for short flows. The document discusses porting the iperf benchmark to use mTCP, which required only minor changes. Performance tests found that the mTCP-ified iperf achieved throughput similar to Linux iperf across different packet sizes, demonstrating mTCP's ability to accelerate networking applications with minimal changes. The author concludes that mTCP is a simple and effective way to improve TCP performance, but notes that for full-featured stacks a system like NUSE may be preferable, as it can provide the high performance of a userspace stack while supporting the full functionality of the kernel.
Strata Singapore: Gearpump, Real-time DAG Processing with Akka at Scale (Sean Zhong)
Gearpump is an Akka-based real-time streaming engine that uses the Actor model for everything. It offers high performance and flexibility, achieving 18,000,000 messages/second with 8 ms latency on a cluster of 4 machines.
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin... (Intel® Software)
Integrated into Intel® Advisor, Cache-aware Roofline Modeling (CARM) provides insight into how an application behaves by helping to determine a) how optimally it works on a given hardware, b) the main factors that limit performance, c) if the workload is memory or compute-bound, and d) the right strategy to improve application performance.
Ed Warnicke's talk at the Open Networking Summit.
All open source networking projects depend on having access to a universal dataplane that is:
Able to support all deployment models: Bare Metal/Embedded/Cloud/Containers/NFVi/VNFs
High performance
Feature Rich
Open with Broad Community support/participation
FD.io provides all of this and more. Come learn more about FD.io and how you can begin using it.
Anton Moldovan: "Building an efficient replication system for thousands of ter..." (Fwdays)
For one of our projects, we needed to improve the current content delivery system for terminals. In this talk, I will share our experience in building an efficient data replication system for thousands of terminals. We will touch on architecture decisions and tradeoffs, technologies that we used, and a bit of load testing.
Spoiler: We didn't use Kafka.
LibOS as a regression test framework for Linux networking #netdev1.1 (Hajime Tazaki)
This document describes using the LibOS framework to build a regression testing system for Linux networking code. LibOS allows running the Linux network stack in a library, enabling deterministic network simulation. Tests can configure virtual networks and run network applications and utilities to identify bugs in networking code by detecting changes in behavior across kernel versions. Example tests check encapsulation protocols like IP-in-IP and detect past kernel bugs. Results are recorded in JUnit format for integration with continuous integration systems.
Hands-On Workshop on Performance Optimization for Intel Xeon Phi Processor Family x200 (formerly Knights Landing) from Colfax International. More information at http://colfaxresearch.com/knl-webinar/
A Library for Emerging High-Performance Computing Clusters (Intel® Software)
This document discusses the challenges of developing communication libraries for exascale systems using hybrid MPI+X programming models. It describes how current MPI+PGAS approaches use separate runtimes, which can lead to issues like deadlock. The document advocates for a unified runtime that can support multiple programming models simultaneously to avoid such issues and enable better performance. It also outlines MVAPICH2's work on designs like multi-endpoint that integrate MPI and OpenMP to efficiently support emerging highly threaded systems.
This document provides an overview of a hands-on workshop on the Constrained Application Protocol (CoAP). It outlines the agenda which includes introductions to CoAP, the Californium CoAP framework, and hands-on projects. Attendees will work through example CoAP client and server code using the Californium libraries and test their implementations. Advanced CoAP topics like security, proxies, and resource directories are also discussed.
This document provides an agenda and overview for an introduction to OpenCL course. The agenda includes lectures on understanding host programs, kernel programs, memory models, and optimization. Course materials include OpenCL reference cards, specifications, and exercises. An introduction to OpenCL explains that it is an open standard for parallel programming across heterogeneous systems like CPUs and GPUs. The OpenCL platform model includes devices like GPUs that are divided into compute units and processing elements. Kernels define work-items that execute problems in parallel over a domain.
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal) by Jakub Botwicz
Presentation about Cotopaxi toolkit from Black Hat Asia 2019 Arsenal session. Author: Jakub Botwicz
https://www.blackhat.com/asia-19/arsenal/schedule/index.html#cotopaxi-iot-protocols-security-testing-toolkit-14325
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019 (Thomas Weise)
Apache Beam is a unified programming model for batch and streaming data processing that provides portability across distributed processing backends. It aims to support multiple languages like Java, Python and Go. The Beam Python SDK allows writing pipelines in Python that can run on distributed backends like Apache Flink. Lyft developed a Python SDK runner for Flink that translates Python pipelines to native Flink APIs using the Beam Fn API for communication between the SDK and runner. Future work includes improving performance of Python pipelines on JVM runners and supporting multiple languages in a single pipeline.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/altera/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Deshanand Singh, Director of Software Engineering at Altera, presents the "Efficient Implementation of Convolutional Neural Networks using OpenCL on FPGAs" tutorial at the May 2015 Embedded Vision Summit.
Convolutional neural networks (CNN) are becoming increasingly popular in embedded applications such as vision processing and automotive driver assistance systems. The structure of CNN systems is characterized by cascades of FIR filters and transcendental functions. FPGA technology offers a very efficient way of implementing these structures by allowing designers to build custom hardware datapaths that implement the CNN structure. One challenge of using FPGAs revolves around the design flow that has been traditionally centered around tedious hardware description languages.
In this talk, Deshanand gives a detailed explanation of how CNN algorithms can be expressed in OpenCL and compiled directly to FPGA hardware. He gives detail on code optimizations and provides comparisons with the efficiency of hand-coded implementations.
Kubernetes @ Squarespace (SRE Portland Meetup October 2017) by Kevin Lynch
In this presentation I talk about our motivation to converting our microservices to run on Kubernetes. I discuss many of the technical challenges we encountered along the way, including networking issues, Java issues, monitoring and alerting, and managing all of our resources!
Similar to LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in Chapel- (20)
GPUIterator: Bridging the Gap between Chapel and GPU Platforms (Akihiro Hayashi)
The ACM SIGPLAN 6th Annual Chapel Implementers and Users Workshop (CHIUW2019) co-located with PLDI 2019 / ACM FCRC 2019.
PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node. In this paper, we address some of the key limitations of past approaches on mapping Chapel onto GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a big performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both a GPU and CPU cores, which is beneficial for specific platforms. Our preliminary performance evaluations show that the use of the GPUIterator is a promising approach for Chapel programmers to easily utilize a single or multiple CPU+GPU node(s) while maintaining portability.
Exploration of Supervised Machine Learning Techniques for Runtime Selection o... (Akihiro Hayashi)
Fourth Workshop on Accelerator Programming Using Directives (WACCPD2017, co-located with SC17)
While multi-core CPUs and many-core GPUs are both viable platforms for parallel computing, programming models for them can impose large burdens upon programmers due to their complex and low-level APIs. Since managed languages like Java are designed to be run on multiple platforms, parallel language constructs and APIs such as Java 8 Parallel Stream APIs can enable high-level parallel programming with the promise of performance portability for mainstream (“non-ninja”) programmers. To achieve this goal, it is important for the selection of the hardware device to be automated rather than be specified by the programmer, as is done in current programming models. Due to a variety of factors affecting performance, predicting a preferable device for faster performance of individual kernels remains a difficult problem. While a prior approach uses machine learning to address this challenge, there is no comparable study on good supervised machine learning algorithms and good program features to track. In this paper, we explore 1) program features to be extracted by a compiler and 2) various machine learning techniques that improve accuracy in prediction, thereby improving performance. The results show that an appropriate selection of program features and machine learning algorithm can further improve accuracy. In particular, support vector machines (SVMs), logistic regression, and J48 decision tree are found to be reliable techniques for building accurate prediction models from just two, three, or four program features, achieving accuracies of 99.66%, 98.63%, and 98.28% respectively from 5-fold-cross-validation.
Polyhedral compilation uses the polyhedral model to represent programs as systems of affine inequalities over iteration variables. This allows loop transformations like fusion, distribution, skewing and reversal to be expressed as affine mappings on the iteration space. The key aspects are representing the iteration domain, scheduling functions that determine the execution order of statements, and memory accesses in terms of iteration vectors. Loop transformations are specified by changing the scheduling functions to map iterations to new logical execution times while preserving semantics. This enables optimizations at the level of whole programs or subprograms.
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator... (Akihiro Hayashi)
Third Workshop on Accelerator Programming Using Directives (WACCPD2016, co-located with SC16)
While GPUs are increasingly popular for high-performance computing, optimizing the performance of GPU programs is a time-consuming and non-trivial process in general. This complexity stems from the low abstraction level of standard GPU programming models such as CUDA and OpenCL: programmers are required to orchestrate low-level operations in order to exploit the full capability of GPUs. In terms of software productivity and portability, a more attractive approach would be to facilitate GPU programming by providing high-level abstractions for expressing parallel algorithms.
OpenMP is a directive-based shared-memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran, without exposing too many details of GPU architectures. However, such high-level parallel programming strategies generally impose additional program optimizations on compilers, which could result in lower performance than fully hand-tuned code written with low-level programming models.
To study potential performance improvements from compiling and optimizing high-level GPU programs, in this paper we 1) evaluate a set of OpenMP 4.x benchmarks on an IBM POWER8 and NVIDIA Tesla GPU platform and 2) conduct a comparative performance analysis among hand-written CUDA programs and GPU programs automatically generated by the IBM XL and clang/LLVM compilers.
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i... (Akihiro Hayashi)
This document discusses using machine learning techniques to perform runtime selection of CPUs or GPUs for executing Java programs. It describes the challenges of supporting Java features like exceptions on GPUs and of accelerating Java programs. Features such as loop characteristics, instruction counts, and memory accesses are extracted from programs to train an SVM model that predicts the faster device. Evaluated on 11 applications, the model achieves 97.6-99% accuracy, using 5-fold cross-validation to avoid overfitting. This runtime selection approach can adapt to new hardware without needing to rebuild performance models.
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi... (Akihiro Hayashi)
This document discusses research on automatic parallelization for heterogeneous and homogeneous multicore processors. It presents Akihiro Hayashi's PhD defense at Waseda University on this topic. It motivates the need for automatic parallelization due to difficulties in programming multicore processors. It proposes a solution called OSCAR that uses a heterogeneous multicore compiler with APIs to enable automatic parallelization across different processor types. The methodology involves hint directives, parallelization of tasks, power reduction techniques, and generation of executables. It evaluates the approach on media applications using a Renesas multicore processor.
Speculative Execution of Parallel Programs with Precise Exception Semantics ... (Akihiro Hayashi)
Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. The 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC2013), September 25-27, 2013 Qualcomm Research Silicon Valley, Santa Clara, CA (co-located with CnC-2013).
Accelerating Habanero-Java Program with OpenCL Generation (Akihiro Hayashi)
Accelerating Habanero-Java Program with OpenCL Generation. Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. 10th International Conference on the Principles and Practice of Programming in Java (PPPJ), September 2013.
- Objective: Show how overly permissive IAM roles can lead to privilege escalation.
- Steps:
- Create role with administrative privileges.
- Allow user to assume the role.
- Perform administrative actions.
- Differentiation between PassRole vs AssumeRole
Try at [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...PriyankaKilaniya
Energy efficiency has been important since the latter part of the last century. The main object of this survey is to determine the energy efficiency knowledge among consumers. Two separate districts in Bangladesh are selected to conduct the survey on households and showrooms about the energy and seller also. The survey uses the data to find some regression equations from which it is easy to predict energy efficiency knowledge. The data is analyzed and calculated based on five important criteria. The initial target was to find some factors that help predict a person's energy efficiency knowledge. From the survey, it is found that the energy efficiency awareness among the people of our country is very low. Relationships between household energy use behaviors are estimated using a unique dataset of about 40 households and 20 showrooms in Bangladesh's Chapainawabganj and Bagerhat districts. Knowledge of energy consumption and energy efficiency technology options is found to be associated with household use of energy conservation practices. Household characteristics also influence household energy use behavior. Younger household cohorts are more likely to adopt energy-efficient technologies and energy conservation practices and place primary importance on energy saving for environmental reasons. Education also influences attitudes toward energy conservation in Bangladesh. Low-education households indicate they primarily save electricity for the environment while high-education households indicate they are motivated by environmental concerns.
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
LLVM Optimizations for PGAS Programs - Case Study: LLVM Wide Pointer Optimizations in Chapel -
1. LLVM Optimizations for PGAS Programs
- Case Study: LLVM Wide Pointer Optimizations in Chapel -
CHIUW2014 (co-located with IPDPS 2014), Phoenix, Arizona
Akihiro Hayashi, Rishi Surendran, Jisheng Zhao, Vivek Sarkar (Rice University),
Michael Ferguson (Laboratory for Telecommunication Sciences)
2. Background: Programming Model for Large-scale Systems
The Message Passing Interface (MPI) is a ubiquitous programming model, but it introduces non-trivial complexity due to message-passing semantics.
PGAS languages such as Chapel, X10, Habanero-C and Co-array Fortran provide high-productivity features:
Task parallelism
Data distribution
Synchronization
3. Motivation: Chapel Support for LLVM
Widely used and easy to extend
[Figure: LLVM compilation flow. Frontends - Clang for C/C++; dragonegg for C/C++, Fortran, Ada, Objective-C; the chpl Chapel compiler; a UPC compiler - emit LLVM Intermediate Representation (IR); shared analysis & optimizations run on the IR; target backends (x86, PowerPC, ARM, PTX) then produce x86, PPC, ARM and GPU binaries.]
5. Our ultimate goal: A compiler that can uniformly optimize PGAS programs
Extend LLVM IR to support parallel programs with PGAS and explicit task parallelism
Two parallel intermediate representations (PIR) as extensions to LLVM IR: Runtime-Independent and Runtime-Specific
[Figure: Parallel programs (Chapel, X10, CAF, HC, ...) pass through 1. RI-PIR generation, 2. analysis, 3. transformation (runtime-independent optimizations, e.g. on task-parallel constructs), then 1. RS-PIR generation, 2. analysis, 3. transformation (runtime-specific optimizations, e.g. GASNet API), and LLVM produces the binary.]
6. The first step: LLVM-based Chapel compiler
Pictures borrowed from 1) http://chapel.cray.com/logo.html 2) http://llvm.org/Logo.html
The Chapel compiler supports LLVM IR generation.
This talk discusses the pros and cons of LLVM-based communication optimizations for Chapel:
Wide pointer optimization
Preliminary performance evaluation & analysis using three regular applications
7. Chapel language
An object-oriented PGAS language developed by Cray Inc., part of the DARPA HPCS program
Key features:
Array operators: zip, replicate, remap, ...
Explicit task parallelism: begin, cobegin
Locality control: locales
Data distribution: domain maps
Synchronization: sync
9. The Pros and Cons of using LLVM for Chapel
Pro: Using the address space feature of LLVM offers more opportunities for communication optimization than C code generation.

// Chapel
x = remoteData;

// LLVM IR: the remote access stays a load during optimization
%x = load i64 addrspace(100)* %xptr

// C code generation: the remote access is lowered immediately
chpl_comm_get(&x, ...);

With C code generation, the backend compiler's optimizations (e.g. gcc -O3) have few chances to optimize communication, because remote accesses are already lowered to Chapel comm APIs.
With LLVM code generation:
1. the existing LLVM passes (e.g. LICM, scalar replacement) can be used for communication optimizations;
2. remote accesses are lowered to Chapel comm APIs only after those optimizations run.
10. Address Space 100 generation in Chapel
Address space 100 = possibly-remote (our convention)
Constructs which generate address space 100:
Array load/store (except local constructs, and except accesses removed by the remote value forwarding optimization)
Distributed array:
var d = {1..128} dmapped Block(boundingBox={1..128});
var A: [d] int;
Object and field load/store:
class circle { var radius: real; ... }
var c1 = new circle(radius=1.0);
On statement:
var loc0: int;
on Locales[1] { loc0 = ...; }
Ref intent:
proc habanero(ref v: int): void { v = ...; }
11. Motivating Example of address space 100

(Pseudo-code: Before LICM)
for i in 1..N {
  // REMOTE GET
  %x = load i64 addrspace(100)* %xptr
  A(i) = %x;
}

(Pseudo-code: After LICM)
// REMOTE GET
%x = load i64 addrspace(100)* %xptr
for i in 1..N {
  A(i) = %x;
}

LICM (Loop Invariant Code Motion) by LLVM hoists the loop-invariant remote load out of the loop.
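The effect of this transformation can be sketched in plain C. This is an illustrative model, not Chapel's actual runtime: `remote_get` is a hypothetical stand-in for a comm API such as chpl_comm_get, and the counter exists only to make the communication reduction visible.

```c
#include <stdint.h>

/* Hypothetical stand-in for a Chapel comm API; counts issued GETs. */
static long gets_issued = 0;

static int64_t remote_get(const int64_t *remote_ptr) {
    gets_issued++;              /* pretend this call crosses the network */
    return *remote_ptr;
}

/* Before LICM: one remote GET per iteration. */
static void fill_before_licm(int64_t *A, const int64_t *xptr, int n) {
    for (int i = 0; i < n; i++)
        A[i] = remote_get(xptr);
}

/* After LICM: xptr is loop-invariant, so the GET is hoisted. */
static void fill_after_licm(int64_t *A, const int64_t *xptr, int n) {
    int64_t x = remote_get(xptr);
    for (int i = 0; i < n; i++)
        A[i] = x;
}
```

Running both versions over a loop of N iterations issues N GETs in the first case and exactly one in the second, which is the same reduction the slide's pseudo-code shows.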
12. The Pros and Cons of using LLVM for Chapel (Cont'd)
Drawback: Using LLVM may lose opportunities for optimization and may add overhead at runtime.
In LLVM 3.3, many optimizations assume that the pointer size is the same across all address spaces.

For C code generation: 128-bit struct pointer (CHPL_WIDE_POINTERS=struct)

typedef struct wide_ptr_s {
  chpl_localeID_t locale;
  void* addr;
} wide_ptr_t;

Fields are read directly: wide.locale; wide.addr;

For LLVM code generation: 64-bit packed pointer (CHPL_WIDE_POINTERS=node16), with a 16-bit locale in the high bits and a 48-bit address in the low bits. Fields are extracted with shifts and masks: wide >> 48; wide & 48BITS_MASK;

This:
1. needs more instructions;
2. loses opportunities for alias analysis.
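A minimal C sketch of the node16 packing described above (the function and macro names are ours, not Chapel's; the field widths follow the slide: 16-bit locale in the high bits, 48-bit address below):

```c
#include <stdint.h>

#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

/* Pack a 16-bit locale ID and a 48-bit address into one 64-bit word. */
static uint64_t wide_pack(uint16_t locale, uint64_t addr) {
    return ((uint64_t)locale << ADDR_BITS) | (addr & ADDR_MASK);
}

/* Each field extraction costs an extra shift or mask instruction,
   unlike the struct representation's plain field loads. */
static uint16_t wide_locale(uint64_t wide) {
    return (uint16_t)(wide >> ADDR_BITS);
}

static uint64_t wide_addr(uint64_t wide) {
    return wide & ADDR_MASK;
}
```

The extra shift/mask work per access, and the fact that the address is no longer a plain pointer the optimizer can reason about, are the two costs the slide lists.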
13. Performance Evaluations: Experimental Methodologies
We tested execution in the following modes:
1. C-Struct (--fast): C code generation + struct pointer + gcc; the conventional code generation in Chapel
2. LLVM without wide optimization (--fast --llvm): LLVM IR generation + packed pointer; does not use the address space feature
3. LLVM with wide optimization (--fast --llvm --llvm-wide-opt): LLVM IR generation + packed pointer; uses the address space feature and applies the existing LLVM optimizations
15. Performance Evaluations: Details of Compiler & Runtime
Compiler:
Chapel version 1.9.0.23154 (Apr. 2014), built with
CHPL_LLVM=llvm
CHPL_WIDE_POINTERS=node16 or struct
CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv
CHPL_TASK=qthread
Backend compiler: gcc-4.4.7, LLVM 3.3
Runtime:
GASNet-1.22.0 (ibv-conduit, mpi-spawner)
qthreads-1.10 (2 shepherds, 6 workers per shepherd)
16. Stream-EP
From the HPCC benchmark
Array size: 2^30

coforall loc in Locales do on loc {
  // on each locale
  var A, B, C: [D] real(64);
  forall (a, b, c) in zip(A, B, C) do
    a = b + alpha * c;
}
17. Stream-EP Result
Execution time in seconds (lower is better):

Locales         1      2      4      8      16     32
C-Struct        2.56   1.33   0.72   0.41   0.24   0.11
LLVM w/o wopt   6.62   3.22   1.73   1.01   0.62   0.26
LLVM w/ wopt    2.45   1.28   0.72   0.40   0.25   0.10

C-Struct vs. LLVM w/o wopt: overhead of introducing LLVM + packed pointer (2.6x slower)
LLVM w/o wopt vs. LLVM w/ wopt: performance improvement by LLVM opt (2.7x faster)
C-Struct vs. LLVM w/ wopt: LLVM + wide opt is faster than the conventional C-Struct (1.1x)
18. Stream-EP Analysis
Dynamic number of Chapel PUT/GET APIs actually executed (16 locales):
C-Struct: 1.39E+11, LLVM w/o wopt: 1.40E+11, LLVM w/ wopt: 5.46E+10

// C-Struct, LLVM w/o wopt
forall (a, b, c) in zip(A, B, C) do
  // 8 GETs / 1 PUT per iteration

// LLVM w/ wopt: LICM by LLVM hoists 6 GETs (array head, offsets) out of the loop
forall (a, b, c) in zip(A, B, C) do
  // 2 GETs / 1 PUT per iteration
20. Cholesky Result
Execution time in seconds (lower is better):

Locales         8        16       32
C-Struct        2401.32  941.70   730.94
LLVM w/o wopt   2781.12  1105.38  902.86
LLVM w/ wopt    858.77   283.32   216.48

C-Struct vs. LLVM w/o wopt: overhead of introducing LLVM + packed pointer (1.2x slower)
LLVM w/o wopt vs. LLVM w/ wopt: performance improvement by LLVM opt (4.2x faster)
C-Struct vs. LLVM w/ wopt: LLVM + wide opt is faster than the conventional C-Struct (3.4x)
21. Cholesky Analysis
Dynamic number of Chapel PUT/GET APIs actually executed (2 locales),
obtained with 1,000 x 1,000 input (100x100 tile size):
C-Struct: 1.78E+09, LLVM w/o wopt: 1.97E+09, LLVM w/ wopt: 5.89E+08

// C-Struct, LLVM w/o wopt
for jB in zero..tileSize-1 do {
  for kB in zero..tileSize-1 do {
    // 4 GETs
    for iB in zero..tileSize-1 do {
      // 8 GETs (+1 GET w/ LLVM)
      // 1 PUT
}}}

// LLVM w/ wopt
for jB in zero..tileSize-1 do {
  // 1 GET
  for kB in zero..tileSize-1 do {
    // 3 GETs
    for iB in zero..tileSize-1 do {
      // 2 GETs
      // 1 PUT
}}}
23. Smithwaterman Result
Execution time in seconds (lower is better):

Locales         8        16
C-Struct        381.23   379.01
LLVM w/o wopt   1260.31  1263.76
LLVM w/ wopt    626.38   635.45

C-Struct vs. LLVM w/o wopt: overhead of introducing LLVM + packed pointer (3.3x slower)
LLVM w/o wopt vs. LLVM w/ wopt: performance improvement by LLVM opt (2.0x faster)
C-Struct vs. LLVM w/ wopt: LLVM + wide opt is slower than the conventional C-Struct (0.6x)
24. Smithwaterman Analysis
Dynamic number of Chapel PUT/GET APIs actually executed (1 locale),
obtained with 1,856 x 1,920 input (232x240 tile size):
C-Struct: 1.41E+08, LLVM w/o wopt: 1.41E+08, LLVM w/ wopt: 5.26E+07

// C-Struct, LLVM w/o wopt
for (ii, jj) in tile_1_2d_domain {
  // 33 GETs
  // 1 PUT
}

// LLVM w/ wopt
for (ii, jj) in tile_1_2d_domain {
  // 12 GETs
  // 1 PUT
}

No LICM is applied even though there are opportunities.
25. Key Insights
Using address space 100 offers finer-grain optimization opportunities (e.g. on Chapel arrays):

// Chapel source
for i in {1..N} {
  data = A(i);
}

// Lowered form: each access expands into several GETs
for i in 1..N {
  head = GET(pointer to array head)
  offset1 = GET(offset)
  data = GET(head + i*offset1)
}

The per-access GETs expose opportunities for 1. LICM and 2. aggregation.
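The finer-grain opportunity above can be sketched in C under a toy memory model (all names and the descriptor layout are ours, for illustration only): hoisting the loop-invariant array-head and offset GETs leaves a single GET per element.

```c
#include <stdint.h>

/* Toy model of one locale's memory plus a counting GET, standing in
   for a Chapel comm API. */
static int64_t heap[64];
static long gets_issued = 0;

static int64_t GET(int64_t idx) {
    gets_issued++;              /* pretend this crosses the network */
    return heap[idx];
}

/* Array descriptor stored in "remote" memory:
   heap[desc] = index of the first element, heap[desc + 1] = stride. */

/* Naive lowering: re-fetch head and offset every iteration (3 GETs each). */
static int64_t sum_naive(int64_t desc, int n) {
    int64_t s = 0;
    for (int i = 0; i < n; i++) {
        int64_t head   = GET(desc);
        int64_t stride = GET(desc + 1);
        s += GET(head + i * stride);
    }
    return s;
}

/* With the invariant descriptor GETs hoisted (LICM): 1 GET per iteration. */
static int64_t sum_hoisted(int64_t desc, int n) {
    int64_t head   = GET(desc);
    int64_t stride = GET(desc + 1);
    int64_t s = 0;
    for (int i = 0; i < n; i++)
        s += GET(head + i * stride);
    return s;
}
```

For a 10-element array the naive form issues 30 GETs and the hoisted form 12, matching the pattern of the Stream-EP analysis (8 GETs per iteration reduced to 2 plus a hoisted prologue).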
26. Conclusions
The first performance evaluation and analysis of the LLVM-based Chapel compiler:
Capable of utilizing the existing optimization passes even for remote data (e.g. LICM)
Removes a significant number of comm API calls
LLVM w/ wide opt is always better than LLVM w/o wide opt
Stream-EP, Cholesky: LLVM-based code generation is faster than C-based code generation (1.04x, 3.4x)
Smithwaterman: LLVM-based code generation is slower than C-based code generation due to constraints of the address space feature in LLVM:
No LICM though there are opportunities
Significant overhead of the packed wide pointer
27. Future Work
Evaluate other applications:
Regular applications
Irregular applications
Possibly-remote to definitely-local transformation by the compiler, e.g.:

local { A(i) = ... } // hint by programmer
... = A(i); // definitely local

on Locales[1] { // hint by programmer
  var A: [D] int; // definitely local
}

PIR in LLVM
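One way to picture the intended transformation, as a hedged C sketch (all names are ours): a possibly-remote access must branch on the locale at runtime, while an access the compiler has proven definitely-local compiles to a plain load.

```c
#include <stdint.h>

static int  this_locale = 0;
static long comm_calls  = 0;

/* Stand-in for a communication-layer fetch. */
static int64_t comm_get(int locale, const int64_t *addr) {
    (void)locale;
    comm_calls++;               /* pretend this crosses the network */
    return *addr;
}

/* Possibly-remote read: the generated code branches on the locale. */
static int64_t read_possibly_remote(int locale, const int64_t *addr) {
    if (locale == this_locale)
        return *addr;           /* local fast path */
    return comm_get(locale, addr);
}

/* Definitely-local read, after the compiler proves locality (e.g. from
   a `local` block or an enclosing `on` clause): no branch, no comm layer. */
static int64_t read_definitely_local(const int64_t *addr) {
    return *addr;
}
```

The payoff of the proposed transformation is turning every call of the first form into the second wherever locality can be proven, eliminating both the branch and any comm-layer involvement.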
30. Chapel Array Structure

// modules/internal/DefaultRectangular.chpl
class DefaultRectangularArr: BaseArr {
  ...
  var dom : DefaultRectangularDom(rank=rank, idxType=idxType,
      stridable=stridable); /* domain */
  var off: rank*idxType; /* per-dimension offset (n-based -> 0-based) */
  var blk: rank*idxType; /* per-dimension multiplier */
  var str: rank*chpl__signedType(idxType); /* per-dimension stride */
  var origin: idxType; /* used for optimization */
  var factoredOffs: idxType; /* used for calculating shiftedData */
  var data : _ddata(eltType); /* pointer to the actual data */
  var shiftedData : _ddata(eltType); /* shifted pointer to the actual data */
  var noinit: bool = false;
  ...

// chpl_module.bc (with LLVM code generation)
%chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object = type
{ %chpl_BaseArr_object, %chpl_DefaultRectangularDom_1_int64_t_F_object*,
  [1 x i64], [1 x i64], [1 x i64], i64, i64, i64*, i64*, i8 }
31. Example 1: Array Store (very simple)

proc habanero (A) {
  A(0) = 1;
}

Chapel version: 1.8.0.22047
Compiler option: --llvm --llvm-wide-opt --fast
A "noinline" attribute is added to the function to avoid dead code elimination.
Good afternoon, everyone. My name is Akihiro Hayashi. I'm a postdoc at Rice University.
Today, I'll be talking about LLVM-based optimizations for PGAS programs. In particular, I focus on the Chapel language and its optimization in this talk.
Let me first talk about programming models for large-scale systems.
The Message Passing Interface is a very common programming model for large-scale systems, but it is well known that using MPI introduces non-trivial complexity due to message-passing semantics.
PGAS languages such as Chapel, X10, Habanero-C and CAF are designed to facilitate programming for large-scale systems by providing high-productivity language features such as task parallelism, data distribution and synchronization.
When it comes to compiler optimization, LLVM is an emerging compiler infrastructure that aims to replace conventional compilers like GCC.
Here is an overview of LLVM.
LLVM defines a machine-independent intermediate representation, and it also provides a powerful analyzer and optimizer for LLVM IR.
If you prepare a frontend that generates LLVM IR, you can analyze and optimize code in a language-independent manner. The most famous frontend is Clang, which takes C/C++ and generates LLVM IR.
You finally get a target-specific binary by using a target-specific backend.
The most important thing in this slide is that the Chapel compiler is now capable of generating LLVM IR.
Here is the big picture.
We think it is feasible to build an LLVM-based compiler that can uniformly analyze and optimize PGAS languages, because PGAS languages share a similar philosophy and language design.
That means one sophisticated compiler can optimize several kinds of PGAS languages and generate binaries for several kinds of supercomputers.
This slide shows the details of the universal PGAS compiler. Our plan is to extend LLVM IR to support parallel programs with PGAS and explicit task parallelism.
We are thinking of defining two kinds of parallel intermediate representations as extensions to LLVM IR.
These are a runtime-independent IR and a runtime-specific IR. For example, you may want to detect task-parallel constructs and apply some sort of optimization with the runtime-independent IR.
In this talk, we focus on the LLVM-based Chapel compiler as the first step toward our ultimate goal.
Let's talk about the pros and cons of using LLVM for Chapel.
We believe the good thing about using LLVM is that we can use its address space feature.
This offers more opportunities for communication optimization than C code generation.
Here are examples of remote get code.
If you use the C code generator, a remote get is expressed as a chpl_comm_get API call, but there are few chances for optimization because remote accesses are already lowered to Chapel comm APIs.
On the other hand, if we use LLVM and its address space feature, we can express a remote get as one instruction that involves address space 100.
Suppose xptr is loop-invariant: we can remove the redundant comm API call by LICM.
But using LLVM has a drawback. Chapel uses a wide pointer to associate data with a node. A wide pointer is a C struct, and you can extract the node ID and address with the dot operator.