This talk covers the architecture and programming of the Epiphany processor on the Parallella board, discussing step by step how to improve and optimize software kernels on such distributed DSP systems. It was held at the "Softwarekonferenz für Parallel Programming, Concurrency und Multicore-Systeme" in Karlsruhe, Germany, in 2014.
The document summarizes a lecture on parallel computing with CUDA (Compute Unified Device Architecture). It introduces CUDA as a parallel programming model for GPUs, covering key concepts like memory architecture, host-GPU workload partitioning, programming paradigm, and programming examples. It then outlines the agenda, benefits of GPU computing, and provides details on CUDA programming interfaces, kernels, threads, blocks, and memory hierarchies. Finally, it lists some lab exercises on CUDA programming including HelloWorld, matrix multiplication, and parallel sorting algorithms.
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures, Dr. Fabio Baruffa
In the framework of the Intel Parallel Computing Centre at the Research Campus Garching in Munich, our group at LRZ presents recent results on performance optimization of Gadget-3, a widely used community code for computational astrophysics. We identify and isolate a sample code kernel, which is representative of a typical Smoothed Particle Hydrodynamics (SPH) algorithm and focus on threading parallelism optimization, change of the data layout into Structure of Arrays (SoA), compiler auto-vectorization and algorithmic improvements in the particle sorting. We measure lower execution time and improved threading scalability both on Intel Xeon (2.6× on Ivy Bridge) and Xeon Phi (13.7× on Knights Corner) systems. First tests on second generation Xeon Phi (Knights Landing) demonstrate the portability of the devised optimization solutions to upcoming architectures.
Application developers like to stick to the object-oriented style of programming by designing their application's logic as interaction between different object entities. In this process, every entity is modeled as a C++ class or structure. Array of structure (AoS) maintains a collection of those entities, which makes the code more readable and easier to maintain. But, this user-friendly code can potentially pose a challenge when it comes to vectorization efficiency. Often, the data needed for populating the vector register is gathered since the data is laid out in non-unit stride fashion in the main memory. To make the data layout more vector-friendly, developers often had to change their data structures manually from AoS to a structure of arrays (SoA). Single instruction multiple data (SIMD) layout templates from Intel help developers preserve an AoS interface while programming but, under the hood, the data structure is laid out in an SoA format. This is a win-win solution for both object-oriented and vector-friendly programming.
This presentation demonstrates how to analyze the memory access pattern in your performance-sensitive loops and how to enable the layout templates to make changes from constant- and variable-strided memory accesses to unit-strided memory access wherever possible.
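The AoS-versus-SoA contrast described above can be sketched in a few lines of C++. This is an illustrative example with hypothetical types, not Intel's SDLT API: the point is that the `mass` field is strided in the AoS layout (forcing gathers) but contiguous and unit-stride in the SoA layout (amenable to auto-vectorization).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle entity, as an application might model it (AoS).
struct ParticleAoS { float x, y, z, mass; };

// The same data as a structure of arrays (SoA): each field is contiguous,
// so a loop over one field is a unit-stride stream.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// AoS traversal: reading only `mass` touches memory with a stride of
// sizeof(ParticleAoS), so vector registers must be filled via gathers.
float total_mass_aos(const std::vector<ParticleAoS>& p) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < p.size(); ++i) sum += p[i].mass;
    return sum;
}

// SoA traversal: `mass` is contiguous, giving unit-stride loads that the
// compiler can auto-vectorize without gather instructions.
float total_mass_soa(const ParticlesSoA& p) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < p.mass.size(); ++i) sum += p.mass[i];
    return sum;
}
```

The layout templates discussed in the talk aim to give the readable AoS-style interface of `ParticleAoS` while storing the data as in `ParticlesSoA` under the hood.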
A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resources (ReConFig2014@Cancun, Mexico), Shinya Takamaeda-Y
flipSyrup, a new framework for rapid prototyping, is proposed.
The document summarizes the use of LLVM for code generation when recompiling Nintendo games as native games. LLVM provides a full compiler infrastructure that can be used to generate code for various platforms from a common intermediate representation (LLVM bitcode). The document discusses using LLVM for code generation from 6502 assembly to generate native code for emulation. Optimizations available through LLVM are also discussed.
Design and Implementation of Area Efficiency AES Algorithm with FPGA and ASIC, paperpublications3
Abstract: A public-domain encryption standard is subject to continuous, vigilant, expert cryptanalysis. AES is a symmetric encryption algorithm processing data in blocks of 128 bits. Under the influence of a key, a 128-bit block is encrypted by transforming it in a unique way into a new block of the same size. To implement the AES Rijndael algorithm on an FPGA using Verilog, with synthesis in Xilinx tools, 128-bit plain text is encrypted with the Rijndael algorithm under a key. This encryption method is versatile and is used for military applications. The same key is used for decryption to recover the original 128-bit plain text. For high-speed applications, a non-LUT-based implementation of the AES S-box and inverse S-box is preferred. The physical design of the 128-bit AES core is developed using Cadence SoC Encounter, and its performance is evaluated with respect to area, power, and timing. The core consumes 10.11 mW of power for a core area of 330100.742 μm².
Keywords: Encryption, Decryption, Rijndael algorithm, FPGA implementation, Physical Design.
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F..., Shinya Takamaeda-Y
This document describes PyCoRAM, a Python-based implementation of the CoRAM memory architecture for FPGA-based computing. PyCoRAM provides a high-level abstraction for memory management that decouples computing logic from memory access behaviors. It allows defining memory access patterns using Python control threads. PyCoRAM generates an IP core that integrates with standard IP cores on Xilinx FPGAs using the AMBA AXI4 interconnect. It supports parameterized RTL design and achieves high memory bandwidth utilization of over 84% on two FPGA boards in evaluations of an array summation application.
XDAQ is an open source ecosystem for controlling Arduino-based science experiments from a Linux environment. It includes development tools like Arduino IDE, Eclipse, and Python toolchains as well as libraries for data processing, analysis, visualization, and real-time streaming. XDAQ can be used as a virtual machine appliance or installed on Debian/Ubuntu systems. The ecosystem is released under the GPL license and source code is available on GitHub.
The document discusses research on improving OpenMP runtime support for multi-core platforms. The key contributions are:
1) Optimizing OpenMP tasking runtime for NUMA machines by maximizing local operations and minimizing remote data accesses.
2) Developing a fast work-stealing mechanism for task queues based on a combining synchronization technique.
3) Transforming nested parallel loops to tasks to improve efficiency over nested parallelism.
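The nested-loop-to-task transformation in point (3) can be sketched with plain OpenMP pragmas. This is a hypothetical example, not the authors' runtime: instead of nesting `parallel for` regions (spawning a new team per outer iteration), each outer iteration becomes a task that the existing thread team executes via work stealing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scale every row of a matrix, one task per row. The `single` region has one
// thread create the tasks; all threads in the enclosing `parallel` team pull
// tasks from the queues. Compiled without OpenMP, the pragmas are ignored and
// the code runs serially with identical results.
void scale_rows(std::vector<std::vector<double>>& m, double f) {
    #pragma omp parallel
    #pragma omp single
    for (std::size_t i = 0; i < m.size(); ++i) {
        #pragma omp task firstprivate(i)
        for (std::size_t j = 0; j < m[i].size(); ++j)
            m[i][j] *= f;
    }
}
```

Turning the outer loop body into tasks lets the runtime balance rows of unequal length across threads, which is where the per-thread queue and work-stealing optimizations above come into play.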
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire, RISC-V International
Join RISC-V BitManip industry leader Claire Xenia Wolf and Dr. James Cuff for an open and lively discussion with an interactive Q&A on RISC-V and BitManip, including trends and comparisons with the existing architecture landscape (x86 and ARM) and what specifically makes RISC-V unique.
Presentation for SSW11: "Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance"
Presenter: Hieu-Thi Luong
Preprint: https://arxiv.org/abs/2106.13479
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by..., AMD Developer Central
Keynote presentation, Is There Anything New in Heterogeneous Computing, by Mike Muller, Chief Technology Officer, ARM, at the AMD Developer Summit (APU13), Nov. 11-13, 2013.
Language-agnostic data analysis workflows and reproducible research, Andrew Lowe
This was a talk that I gave at CERN at the Inter-experimental Machine Learning (IML) Working Group Meeting in April 2017 about language-agnostic (or polyglot) analysis workflows. I show how it is possible to work in multiple languages and switch between them without leaving the workflow you started. Additionally, I demonstrate how an entire workflow can be encapsulated in a markdown file that is rendered to a publishable paper with cross-references and a bibliography (and with a raw LaTeX file produced as a by-product) in a simple process, making the whole analysis workflow reproducible. For experimental particle physics, ROOT is the ubiquitous data analysis tool, and has been for the last 20 years, so I also talk about how to exchange data to and from ROOT.
The document discusses parallel program design and parallel programming techniques. It introduces parallel algorithm design based on four steps: partitioning, communication, agglomeration, and mapping. It also covers parallel programming tools including pthreads, OpenMP, and MPI. Common parallel constructs like private, shared, barrier, and reduction are explained. Examples of parallel programs using pthreads and OpenMP are provided.
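The shared/private/reduction constructs mentioned above can be illustrated with a short OpenMP example in C++ (a minimal sketch with a hypothetical function name, not taken from the slides):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product with an OpenMP reduction: each thread accumulates a private
// partial sum, and the partials are combined at the implicit barrier at the
// end of the parallel loop. The input vectors are shared and read-only.
// Without OpenMP the pragma is ignored and the loop runs serially.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The `reduction(+:sum)` clause is what makes `sum` safe to update from many threads; marking it `shared` instead would introduce a data race.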
This document provides an introduction to shellcoding, which involves exploiting software vulnerabilities to insert and execute a custom payload. It discusses prerequisites like assembly language and memory structure. Key topics covered include the program stack, calling conventions, system calls, and writing shellcodes to reference data, print to stdout, and execute a program. The document concludes by outlining the steps to create a reverse shell shellcode that connects back to an attacker's server.
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics, npinto
1. GPUs have many more cores than CPUs and are very good at processing large blocks of data in parallel.
2. GPUs can provide a significant speedup over CPUs for applications that map well to a data-parallel programming model by harnessing the power of many cores.
3. The throughput-oriented nature of GPUs makes them well-suited for algorithms where the same operation can be performed on many data elements independently.
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns, npinto
This document outlines the topics that will be covered in the course on massively parallel computing, including computational thinking skills for parallel programming, hardware limitations and constraints on algorithms, and common parallel programming patterns. The topics include thinking in parallel, computer architecture, programming models, theoretical concepts, and parallel programming patterns. The goal is to provide students with the skills needed to design efficient parallel algorithms that maximize performance on modern parallel hardware.
In this deck from the Perth HPC Conference, Rob Farber from TechEnablement presents: AI is Impacting HPC Everywhere.
"The convergence of AI and HPC has created a fertile venue that is ripe for imaginative researchers — versed in AI technology — to make a big impact in a variety of scientific fields. From new hardware to new computational approaches, the true impact of deep- and machine learning on HPC is, in a word, “everywhere”. Just as technology changes in the personal computer market brought about a revolution in the design and implementation of the systems and algorithms used in high performance computing (HPC), so are recent technology changes in machine learning bringing about an AI revolution in the HPC community. Expect new HPC analytic techniques including the use of GANs (Generative Adversarial Networks) in physics-based modeling and simulation, as well as reduced precision math libraries such as NLAFET and HiCMA to revolutionize many fields of research. Other benefits of the convergence of AI and HPC include the physical instantiation of data flow architectures in FPGAs and ASICs, plus the development of powerful data analytic services."
Learn more: http://www.techenablement.com/
and
http://hpcadvisorycouncil.com/events/2019/australia-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Dissertation title and final project: Data source registration in the Virtual Laboratory. The subject of the thesis and related project was to integrate EGEE/WLCG data sources into GridSpace Virtual Laboratory (http://gs.cyfronet.pl/).
Poster presentation entitled Integrating EGEE Storage Services with the Virtual Laboratory:
http://www.plgrid.pl/en/pr_materials/posters
Dissertation available at http://virolab.cyfronet.pl/trac/vlvl#MasterofScienceThesesrelatedtoViroLab
XPDDS17: uniprof: Transparent Unikernel Performance Profiling and Debugging -..., The Linux Foundation
Unikernels are increasingly gaining traction as they provide lightweight, low-overhead, high-performance execution of applications, while keeping the high isolation guarantees of virtualization crucial to multi-tenant deployments. However, developers still lack tools to support their development, especially compared to the rich toolsets that full operating systems such as Linux or BSD provide.
In this talk, Florian will present uniprof, a unikernel profiler and performance debugger that gives developers insight into their unikernel behavior transparently, without having to instrument the unikernel itself. Uniprof works on both x86 and ARM, can profile even when frame pointers are unavailable, and can be used with visualization tools such as flame graphs. It incurs only minimal overhead (~0.1% at 100 samples/s) to the unikernel, making it ideal for profiling even on production systems.
This document discusses NNgen, a tool for generating neural network hardware implementations from TensorFlow models. NNgen takes a TensorFlow model as input, performs optimizations, and generates an FPGA implementation including a control unit, computing units, RAM blocks, and interconnects. It outputs RTL code and an IP-XACT description of the generated neural network hardware accelerator. Diagrams show an example convolutional layer implementation generated by NNgen, including weight and activation memory blocks, multiply-accumulate units, addition trees, and reuse of computation units via a substream pool.
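As a software analogue of the datapath just described (illustrative only, not NNgen output), one output element of a convolution reduces to multiply-accumulate over a weight window, with the partial products combined pairwise as in a hardware adder tree:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Multiply the activation window element-wise with the weights, then reduce
// pairwise. The pairwise reduction mirrors a hardware adder tree: log-depth
// levels of adders instead of a serial accumulation chain.
int32_t mac_adder_tree(std::vector<int32_t> act,
                       const std::vector<int32_t>& w) {
    for (std::size_t i = 0; i < act.size(); ++i)
        act[i] *= w[i];                       // partial products (multipliers)
    while (act.size() > 1) {                  // one pass per adder-tree level
        std::vector<int32_t> next;
        for (std::size_t i = 0; i + 1 < act.size(); i += 2)
            next.push_back(act[i] + act[i + 1]);
        if (act.size() % 2)                   // odd leftover passes through
            next.push_back(act.back());
        act = next;
    }
    return act.empty() ? 0 : act[0];
}
```

In the generated hardware, such a unit would be time-multiplexed across output pixels, which is what the substream pool's reuse of computation units refers to.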
The document discusses challenges in GPU compilers. It begins with introductions and abbreviations. It then outlines the topics to be covered: a brief history of GPUs, what makes GPUs special, how to program GPUs, writing a GPU compiler including front-end, middle-end, and back-end aspects, and a few words about graphics. Key points are that GPUs are massively data-parallel, execute instructions in lockstep, and require supporting new language features like OpenCL as well as optimizing for and mapping to the GPU hardware architecture.
A deeper talk on the Transformer architecture from the webinar at NTR
https://www.ntr.ai/webinar/transformery
Google slides version: https://docs.google.com/presentation/d/1dIadh_nIszxXG8-672vJmvFGT6jBp0mOqzNV4g3e2Lc/edit?usp=sharing
john-devkit: 100 Hash Types Later, Positive Hack Days
Presenter: Alexey Cherepanov
Hash-cracking speeds keep growing, and so does the number of hashing algorithms, which increases the effort needed to maintain a universal cracking tool. In response, john-devkit was developed: an improved code generator for the well-known password-cracking application John the Ripper. john-devkit supports more than 100 hash types. The presenter covers the key aspects of its use: separating algorithms, optimization and code output for different devices, a simple intermediate representation for hashing algorithms, the difficulties of optimizing for humans versus machines, bitslicing, and processing-speed comparisons.
Track A - Compilation guiding and adjusting - IBM, chiportal
The document summarizes the Embedded Reconfigurable Architecture (ERA) project. The ERA project aims to develop an adaptive platform that can dynamically adjust hardware resources to meet changing performance and power needs. Key components include reconfigurable processing elements, memory hierarchies, and networks. The project involves 10 partners across academia and industry. Work focuses on compilers, operating systems, hardware scheduling, and exploiting tradeoffs between performance and power consumption.
I think this presentation about Adapteva's Parallella is one of the most comprehensive to date. Feel free to use it. I gave this talk on 10 Dec 2014 at the Cloud Research Lab, Ericsson AB, Lund, Sweden.
DOUBLE PRECISION FLOATING POINT CORE IN VERILOG, IJCI JOURNAL
A floating-point unit (FPU) is a math coprocessor, a part of a computer system specially designed to carry out operations on floating-point numbers. The term floating point refers to the fact that the radix point can "float"; that is, it can be placed anywhere with respect to the significant digits of the number. Double-precision floating point, also known as double, is a commonly used format on PCs due to its wider range over single precision, in spite of its performance and bandwidth cost. This paper aims at developing a Verilog version of a double-precision floating-point core designed to meet the IEEE 754 standard. This standard defines a double as a sign bit, an exponent, and a mantissa. The aim is to build an efficient FPU that performs basic functions with reduced complexity of the logic used, and that also reduces the memory requirement as far as possible.
The document appears to be a block of random letters with no discernible meaning or purpose. It consists of a series of letters without any punctuation, formatting, or other signs of structure that would indicate it is meant to convey any information. The document does not provide any essential information that could be summarized.
OSDC 2017 - Werner Fischer - Open power for the data centerNETWAYS
IBM's POWER (Performance Optimization With Enhanced RISC) architecture is known to run mission-critical applications and to provide bank-style "RAS" (Reliability, Availability, Serviceability) features since 1990. Opening the architecture in 2013 enabled other vendors like Tyan or Rackspace to build servers based on the current POWER8 edition of this architecture. The current POWER8 CPUs provide up to 12 cores with 8x Simultaneous Multithreading - leading to 96 threads per CPU. Up to eight memory channels enable up to 230 GB/s memory bandwidth per CPU. Increased L1, L2, L3 and new L4 caches help to boost the performance of memory-bound applications like databeses, by providing more than 1 TB/s of bandwidth. In this talk Werner will give an overview of the architecture and show the performance possibilities of POWER8, using the PostgreSQL database as an example. By comparing PostgreSQL 9.4, 9.5 and 9.6 benchmarking results he will visualize the increased efficiency thanks to PowergreSQL's optimizations for POWER over the last years. Finally, he will outline one other benefit of OpenPOWER systems: from the very beginning (the first instruction to initialize the first CPU core, long before DRAM, firmware management or PCIe works) up to running your Linux OS and application like a database, only open source code gets executed.
OSDC 2017 | Linux Performance Profiling and Monitoring by Werner FischerNETWAYS
Nowadays system administrators have great choices when it comes down to Linux performance profiling and monitoring. The challenge is to pick the appropriate tools and interpret their results correctly.
This talk is a chance to take a tour through various performance profiling and benchmarking tools, focusing on their benefit for every sysadmin.
More than 25 different tools are presented. Ranging from well known tools like strace, iostat, tcpdump or vmstat to new features like Linux tracepoints or perf_events. You will also learn which tools can be monitored by Icinga and which monitoring plugins are already available for that.
At the end the goal is to gather reference points to look at, whenever you are faced with performance problems.
Take the chance to close your knowledge gaps and learn how to get the most out of your system.
OSDC 2017 | Open POWER for the data center by Werner FischerNETWAYS
The document discusses OpenPOWER for data centers. It begins with an overview of the POWER8 CPU architecture, including its multi-core design, simultaneous multi-threading capabilities, large caches, and cryptographic accelerators. It then covers the reliability, availability and serviceability features of OpenPOWER systems, such as error correction and retry mechanisms. The document concludes with a discussion of the open source firmware in OpenPOWER systems, including the multi-step boot process involving the Self Boot Engine, HostBoot, On-Chip Controller, Skiboot and Petitboot.
We all know how CPU hungry Ceph is. What if we could change our architecture using NVMeoF?
This talk explores a theoretical setup and was given originally at Ceph Day London 2019.
A Quick Introduction to Programmable LogicOmer Kilic
Slides from my talk on Programmable Logic at the Open Source Hardware Users Group Meeting #9 in London on the 21st of April 2011.
More details about the event at: http://oshug.org/event/9
An overview of the slides at:
http://omer.me/2011/05/a-quick-introduction-to-programmable-logic
This document provides an overview of embedded Linux for an embedded systems design course. It discusses various commercial and open source embedded Linux distributions and their characteristics. It also covers important topics for embedded Linux including tool chains, the Linux kernel, debugging, driver development, memory management, and synchronization techniques. Example code snippets are provided for common Linux system programming tasks like file I/O, processes, threads, IPC, signals, and sockets. Cross-compiling for embedded targets is also briefly explained.
The OpenCSD library for decoding CoreSight traces has reached the point where it is ready to be integrated into applications. This session will present an overview of the state of the library, its interfaces and explore and demonstrate a sample integration with perf.
This document provides an overview of part 2 of a course on specification languages. It discusses model based system design using SystemC. It introduces object oriented techniques for designing hardware systems and provides hands-on experience with SystemC. The material for part 2 includes slides, the SystemC language reference manual, and an exercise on building a functional model of a JPEG encoder/decoder in SystemC. It discusses key aspects of functional modeling in SystemC including modules, ports, processes, channels and the simulation engine.
Microkernel-based operating system developmentSenko Rašić
The document discusses microkernel-based operating system development. It describes how a microkernel has minimal functionality and moves drivers and services to user-level processes that communicate through inter-process communication calls. This can impact performance. Mainstream systems now take a hybrid approach. The document then describes the L4 microkernel and its implementation, Hasenpfeffer, which maximizes reuse of open source components. It lists components and features of the Hasenpfeffer system, including programming languages, drivers, and tools for development.
The document discusses bypassing address space layout randomization (ASLR) on Linux. It begins with a refresher on buffer overflows and modern protections like ASLR and DEP. It then explores finding fixed addresses in the .text section that are not subject to ASLR to redirect execution, such as calls and jumps to registers. The document shows searching binaries for these instruction sequences and checking register values to leverage them for exploiting a vulnerable program while ASLR is enabled.
GPU programing
The Brick Wall -- UC Berkeley's View
Power Wall: power expensive, transistors free
Memory Wall: Memory slow, multiplies fast ILP Wall: diminishing returns on more ILP HW
This document discusses bypassing address space layout randomization (ASLR) protections to execute shellcode on the stack. It begins with an overview of stack-based buffer overflows and modern protections like non-executable stacks. It then describes using return-oriented programming (ROP) techniques like ret2libc to hijack control flow and call library functions like system() to spawn a shell. Specifically, it outlines overwriting a return address to call mprotect() to make the stack executable, then jumping to shellcode on the stack. The document provides example exploit code and steps to find needed addresses in memory.
Dataplane programming with eBPF: architecture and toolsStefano Salsano
eBPF is definitely a complex technology. Developing complex systems based on eBPF is challenging due to the intrinsic limitations of the model and the known shortcomings of the tool chain.
The learning curve of this technology is very steep and needs continuous coaching from experts. This tutorial will investigate:
What is eBPF and why it has gained a prominent position among the solutions to improve the packet processing performance in Linux/x86 nodes. We will shortly present some important use case scenarios for eBPF, like Kubernetes’ Cilium
The architecture of eBPF and its programming toolchain (e.g. bcc
What are the frameworks for eBPF programming, such as Polycube and InKeV.
How to make eBPF programming easier, more flexible and modular with HIKe/eCLAT
How to implement a custom application logic in eBPF with eCLAT using a python-like script
How to extend the framework and develop new modules
Similar to Parallella: Embedded HPC For Everybody (20)
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsScyllaDB
ScyllaDB monitoring provides a lot of useful information. But sometimes it’s not easy to find the root of the problem if something is wrong or even estimate the remaining capacity by the load on the cluster. This talk shares our team's practical tips on: 1) How to find the root of the problem by metrics if ScyllaDB is slow 2) How to interpret the load and plan capacity for the future 3) Compaction strategies and how to choose the right one 4) Important metrics which aren’t available in the default monitoring setup.
The Microsoft 365 Migration Tutorial For Beginner.pptxoperationspcvita
This presentation will help you understand the power of Microsoft 365. However, we have mentioned every productivity app included in Office 365. Additionally, we have suggested the migration situation related to Office 365 and how we can help you.
You can also read: https://www.systoolsgroup.com/updates/office-365-tenant-to-tenant-migration-step-by-step-complete-guide/
Must Know Postgres Extension for DBA and Developer during MigrationMydbops
Mydbops Opensource Database Meetup 16
Topic: Must-Know PostgreSQL Extensions for Developers and DBAs During Migration
Speaker: Deepak Mahto, Founder of DataCloudGaze Consulting
Date & Time: 8th June | 10 AM - 1 PM IST
Venue: Bangalore International Centre, Bangalore
Abstract: Discover how PostgreSQL extensions can be your secret weapon! This talk explores how key extensions enhance database capabilities and streamline the migration process for users moving from other relational databases like Oracle.
Key Takeaways:
* Learn about crucial extensions like oracle_fdw, pgtt, and pg_audit that ease migration complexities.
* Gain valuable strategies for implementing these extensions in PostgreSQL to achieve license freedom.
* Discover how these key extensions can empower both developers and DBAs during the migration process.
* Don't miss this chance to gain practical knowledge from an industry expert and stay updated on the latest open-source database trends.
Mydbops Managed Services specializes in taking the pain out of database management while optimizing performance. Since 2015, we have been providing top-notch support and assistance for the top three open-source databases: MySQL, MongoDB, and PostgreSQL.
Our team offers a wide range of services, including assistance, support, consulting, 24/7 operations, and expertise in all relevant technologies. We help organizations improve their database's performance, scalability, efficiency, and availability.
Contact us: info@mydbops.com
Visit: https://www.mydbops.com/
Follow us on LinkedIn: https://in.linkedin.com/company/mydbops
For more details and updates, please follow up the below links.
Meetup Page : https://www.meetup.com/mydbops-databa...
Twitter: https://twitter.com/mydbopsofficial
Blogs: https://www.mydbops.com/blog/
Facebook(Meta): https://www.facebook.com/mydbops/
AI in the Workplace Reskilling, Upskilling, and Future Work.pptxSunil Jagani
Discover how AI is transforming the workplace and learn strategies for reskilling and upskilling employees to stay ahead. This comprehensive guide covers the impact of AI on jobs, essential skills for the future, and successful case studies from industry leaders. Embrace AI-driven changes, foster continuous learning, and build a future-ready workforce.
Read More - https://bit.ly/3VKly70
Essentials of Automations: Exploring Attributes & Automation ParametersSafe Software
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving
What began over 115 years ago as a supplier of precision gauges to the automotive industry has evolved into being an industry leader in the manufacture of product branding, automotive cockpit trim and decorative appliance trim. Value-added services include in-house Design, Engineering, Program Management, Test Lab and Tool Shops.
From Natural Language to Structured Solr Queries using LLMsSease
This talk draws on experimentation to enable AI applications with Solr. One important use case is to use AI for better accessibility and discoverability of the data: while User eXperience techniques, lexical search improvements, and data harmonization can take organizations to a good level of accessibility, a structural (or “cognitive” gap) remains between the data user needs and the data producer constraints.
That is where AI – and most importantly, Natural Language Processing and Large Language Model techniques – could make a difference. This natural language, conversational engine could facilitate access and usage of the data leveraging the semantics of any data source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal.
The key concept is to enable users to express their search queries in natural language, which the LLM then enriches, interprets, and translates into structured queries based on the Solr index’s metadata.
This approach leverages the LLM’s ability to understand the nuances of natural language and the structure of documents within Apache Solr.
The LLM acts as an intermediary agent, offering a transparent experience to users automatically and potentially uncovering relevant documents that conventional search methods might overlook. The presentation will include the results of this experimental work, lessons learned, best practices, and the scope of future work that should improve the approach and make it production-ready.
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillLizaNolte
HERE IS YOUR WEBINAR CONTENT! 'Mastering Customer Journey Management with Dr. Graham Hill'. We hope you find the webinar recording both insightful and enjoyable.
In this webinar, we explored essential aspects of Customer Journey Management and personalization. Here’s a summary of the key insights and topics discussed:
Key Takeaways:
Understanding the Customer Journey: Dr. Hill emphasized the importance of mapping and understanding the complete customer journey to identify touchpoints and opportunities for improvement.
Personalization Strategies: We discussed how to leverage data and insights to create personalized experiences that resonate with customers.
Technology Integration: Insights were shared on how inQuba’s advanced technology can streamline customer interactions and drive operational efficiency.
MySQL InnoDB Storage Engine: Deep Dive - MydbopsMydbops
This presentation, titled "MySQL - InnoDB" and delivered by Mayank Prasad at the Mydbops Open Source Database Meetup 16 on June 8th, 2024, covers dynamic configuration of REDO logs and instant ADD/DROP columns in InnoDB.
This presentation dives deep into the world of InnoDB, exploring two ground-breaking features introduced in MySQL 8.0:
• Dynamic Configuration of REDO Logs: Enhance your database's performance and flexibility with on-the-fly adjustments to REDO log capacity. Unleash the power of the snake metaphor to visualize how InnoDB manages REDO log files.
• Instant ADD/DROP Columns: Say goodbye to costly table rebuilds! This presentation unveils how InnoDB now enables seamless addition and removal of columns without compromising data integrity or incurring downtime.
Key Learnings:
• Grasp the concept of REDO logs and their significance in InnoDB's transaction management.
• Discover the advantages of dynamic REDO log configuration and how to leverage it for optimal performance.
• Understand the inner workings of instant ADD/DROP columns and their impact on database operations.
• Gain valuable insights into the row versioning mechanism that empowers instant column modifications.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
In our second session, we shall learn all about the main features and fundamentals of UiPath Studio that enable us to use the building blocks for any automation project.
📕 Detailed agenda:
Variables and Datatypes
Workflow Layouts
Arguments
Control Flows and Loops
Conditional Statements
💻 Extra training through UiPath Academy:
Variables, Constants, and Arguments in Studio
Control Flow in Studio
"Scaling RAG Applications to serve millions of users", Kevin GoedeckeFwdays
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from technical challenges around managing high load for LLMs, RAGs and Vector databases.
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdfleebarnesutopia
So… you want to become a Test Automation Engineer (or hire and develop one)? While there’s quite a bit of information available about important technical and tool skills to master, there’s not enough discussion around the path to becoming an effective Test Automation Engineer that knows how to add VALUE. In my experience this had led to a proliferation of engineers who are proficient with tools and building frameworks but have skill and knowledge gaps, especially in software testing, that reduce the value they deliver with test automation.
In this talk, Lee will share his lessons learned from over 30 years of working with, and mentoring, hundreds of Test Automation Engineers. Whether you’re looking to get started in test automation or just want to improve your trade, this talk will give you a solid foundation and roadmap for ensuring your test automation efforts continuously add value. This talk is equally valuable for both aspiring Test Automation Engineers and those managing them! All attendees will take away a set of key foundational knowledge and a high-level learning path for leveling up test automation skills and ensuring they add value to their organizations.
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Parallella: Embedded HPC For Everybody
Outline: Motivation · Intro · Memory · Network · Communication · Measurements · Parallella
Parallella: Embedded HPC For Everybody
Jacob Erlbeck
Sysmocom s.f.m.c. GmbH
Berlin
Softwarekonferenz für Parallel Programming, Concurrency und Multicore-Systeme, Karlsruhe, 5.-7. Mai 2014
© Jacob Erlbeck, 2014. Para//el 2014, Karlsruhe, 5.-7. Mai
The Parallella
The Parallella (2)
It’s cool!
Credit card size
Co-processors of multiple boards can be linked
Inexpensive
Software and design files are Open Source (github)
GCC / GDB / GNU tool chain
What did I want to know?
Suitable for ...
Audio processing?
Software defined radio?
Stream analysis?
Real performance values
How much of the peak performance rates do I get?
How does it compare to other platforms (Dual Cortex A9)?
What else?
Is the system easy or difficult to use or understand?
Are there helpful libraries or frameworks?
Which tools are available?
Example
The example problem
Input matrix I
· · · · · ·
· d n d · ·
· n c n · ·
· d n d · ·
· · · · · ·
Oi,j = dIi−1,j−1 + nIi,j−1 + dIi+1,j−1 +
nIi−1,j + cIi,j + nIi+1,j +
dIi−1,j+1 + nIi,j+1 + dIi+1,j+1
1 = c + 4d + 4n
Apply a 3 × 3 stencil filter to a 1000 × 1000 matrix
That is 998 × 998 × 9 ≈ 9 million multiplications and summations
On 16 cores at 600 MHz with fused multiply-add, this should take about 0.9 ms for the FPUs
This problem is described in Brown Deer Technology's STDCL documentation for the Parallella, see
www.browndeertechnology.com/docs/app_note_programming_parallella_using_stdcl.pdf
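The 0.9 ms figure can be reproduced with a few lines of C. This is a back-of-the-envelope sketch; one fused multiply-add per core and cycle is the slide's assumption, not a measured number:

```c
/* Ideal FPU time for the 3x3 stencil on a 1000x1000 matrix:
   998 x 998 interior points, 9 multiply-adds each. */
static double ideal_fpu_ms(void)
{
    double fmas     = 998.0 * 998.0 * 9.0;  /* about 9.0e6 operations         */
    double fma_rate = 16.0 * 600e6;         /* 16 cores, 600 MHz, 1 FMA/cycle */
    return fmas / fma_rate * 1e3;           /* time in milliseconds           */
}
```

Evaluating this gives roughly 0.93 ms, matching the slide's estimate.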
Software
Programming frameworks
Preinstalled
Epiphany-specific libraries
  e-lib    target library: access to registers and hardware units, context information, utilities
  e-hal    host library: access to the co-processor, loading and starting kernels
  newlib   port of libc/libm that runs on the co-processor
Generic frameworks
  libcoprthr  POSIX-like threading abstraction for co-processors
  OpenCL      compiler and libraries
  STDCL       simplified layer on top of the above (host side)
Software
Programming frameworks
Preinstalled (2)
Tools
  GNU Tools  GNU binutils and compilers: e-gcc/e-g++, e-nm, e-objdump, ...
  e-server   remote GDB debugging proxy for Epiphany cores
  e-run      single-core emulator, supports tracing & profiling
  e-gdb      GDB for the Epiphany, remote and emulation
  e-tools    load programs, read/write core data, reset cores
OS
  Linaro     Linaro 14.01 / Ubuntu 'Saucy' 13.10
The tools and libraries can be built and used on standard computers, e.g. for cross-compiling and emulation
Implementation
Example Implementation (STDCL)
Host Part
stencil2d_host.c (snippet, similar to the STDCL app note):

int w = 1000, h = 1000;
float d = 0.01/8, n = d, c = 0.99;
size_t size = sizeof(float) * w * h;
float *in  = clmalloc(stdacc, size, 0);
float *out = clmalloc(stdacc, size, 0);
/* initialize ndr, in, out, ctx */
clmsync(ctx, 0, in, ...);
clmsync(ctx, 0, out, ...);
clexec(ctx, 0, &ndr, stencil2d_kern, in, out, w, h, c, n, d);
clmsync(ctx, 0, out, ...);
clwait(ctx, 0, CL_ALL_EVENT);
Implementation
Example Implementation (OpenCL)
Co-processor Kernel
stencil2d_kern.cl (snippet, similar to the STDCL app note):

void stencil2d_kern(
    float *in, float *out, int w, int h,
    float c, float n, float d)
{
    /* initialize x1, x2, y1, y2 based on the core id */
    for (int y = y1; y < y2; y++)
        for (int x = x1; x < x2; x++) {
            int k = x + y*w;
            out[k] =
                d*in[k-w-1] + n*in[k-w] + d*in[k-w+1] +
                n*in[k-1]   + c*in[k]   + n*in[k+1] +
                d*in[k+w-1] + n*in[k+w] + d*in[k+w+1];
        }
}
Testing
Trying it out
Let’s see ...
linaro@linaro-nano: stencil/stdcl/
$ clcc -k -o kernel.o -c stencil2d_kern.cl
$ gcc -o stencil2d.x stencil2d_host.c kernel.o
$ sudo ./stencil2d.x
time used 1.184 s
$ ./stencil2d.x -stdcpu
time used 0.281 s
$
Oops!
99.9 % of the time is not spent on floating-point ops!
Using the ARM CPUs is 4 times faster than using the Epiphany!
Testing
What to do
What is happening here?
Questions
What is 99.9 % of the time being spent on?
How can we fix it?
Next steps
Do measurements
Look at the board’s architecture
Try to improve the test program accordingly
Iterate ...
Testing
Measurements (1)
Setup vs. computation
Modifications
Measure setup and computation time separately
Original kernel (times in ms):

              Host   Epiphany   T_E/T_H
  Set up       252        388      150%
  Computation   32        773     2410%
Parallella Architecture
Parallella Architecture Overview
Epiphany Architecture
Epiphany Architecture Overview
MIMD Architecture
On-chip 2D mesh network
One shared 4 GiB address space (except for the first 1 MiB, which always refers to the local node)
16 and 64 core versions available in silicon
256, 1024, and 4095 core versions offered as IP
Multiple devices can be linked together via 4 eLinks
Epiphany Architecture
Single Epiphany Node
Overview
Components
eCore processor
32 KiB SRAM memory
Mesh network interface
2 DMA controllers
2 event counters
Data buses: 64 bit wide
Address buses: 32 bit wide
Network bus: 104 bit wide
Epiphany Architecture
Single Epiphany node
Processor (eCore)
Processor features
RISC architecture
Load/Store of 8, 16, 32, and 64 bit words
64 general-purpose 32-bit registers
ALU/FPU: 32 bit only
No SIMD instructions
All registers are also memory mapped
Instruction pipeline (5 to 8 stages)
RISC: the ALU/FPU only operate on registers; memory access is only done via load/store instructions
Pipeline stalls until all register dependencies are resolved
Epiphany Architecture
Single Epiphany node
Local RAM
RAM features
32 kiB SRAM
Organized in 4 × 8 KiB banks that can be accessed in parallel
Used for code and data
Access in 1 clock cycle
External memory can be used for code and data
No cache for external memory
Epiphany Architecture
Memory Model
Memory address ranges:

  Node local         0x00000    – 0xFFFFF       1 MiB
    Local SRAM       0x00000    – 0x07FFF      32 KiB
    Local registers  0xF0000    – 0xFFFFF      64 KiB
  External DRAM      0x8E000000 – 0x8FFFFFFF   32 MiB

Global address layout: row (6 bit) | column (6 bit) | local address (20 bit)

Set the user interrupt on core (33, 11), core id 0x84B:
  *(unsigned *)((0x84B << 20) | 0xF042C) = 0x20;

Read external DRAM at offset 0x1234:
  val = *(unsigned *)(0x8E000000 + 0x1234);
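The row/column/local-address split can be written as a small helper. This is a sketch reproducing the two addresses used on this slide; the register offset 0xF042C and the DRAM base come straight from the examples above:

```c
#include <stdint.h>

/* Global address of a location inside the 1 MiB window of core (row, col):
   6-bit row, 6-bit column, 20-bit local address. */
static uint32_t global_addr(unsigned row, unsigned col, uint32_t local)
{
    uint32_t core_id = ((row & 0x3F) << 6) | (col & 0x3F);
    return (core_id << 20) | (local & 0xFFFFF);
}
```

For example, core (33, 11) has id 0x84B, so its register at local offset 0xF042C appears globally at 0x84BF042C, and the first on-chip core of the Parallella sits at mesh position (32, 8), base address 0x80800000.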
Epiphany Architecture
Memory Model
Default memory layout of the Parallella as seen from each node
(mesh rows 0-63 and columns 0-63; each cell is a 1 MiB window; the 16 on-chip cores occupy columns 8-11)

  Row 0, Col 0   Local   0x00000000  (the node's own window, also reachable via its global address)
  Rows < 32      external cores via the NORTH link
  Row 32         Core (0,0)  0x80800000 | Core (0,1)  0x80900000 | Core (0,2)  0x80A00000 | Core (0,3)  0x80B00000
  Row 33         Core (1,0)  0x84800000 | Core (1,1)  0x84900000 | Core (1,2)  0x84A00000 | Core (1,3)  0x84B00000
  Row 34         Core (2,0)  0x88800000 | Core (2,1)  0x88900000 | Core (2,2)  0x88A00000 | Core (2,3)  0x88B00000
  Row 35         Core (3,0)  0x8C800000 | Core (3,1)  0x8C900000 | Core (3,2)  0x8CA00000 | Core (3,3)  0x8CB00000
  Columns ≥ 32   Ext DRAM    0x8E000000
  Rows > 35      external cores via the SOUTH link
Example
Example Revisited
External memory accesses
void stencil2d_kern(
    float *in, float *out, int w, int h,
    float c, float n, float d)
{   /* in and out point to external DRAM: 9 reads + 1 write per output value */
    for (int y = y1; y < y2; y++)
        for (int x = x1; x < x2; x++) {
            int k = x + y*w;
            out[k] =
                d*in[k-w-1] + n*in[k-w] + d*in[k-w+1] +
                n*in[k-1]   + c*in[k]   + n*in[k+1] +
                d*in[k+w-1] + n*in[k+w] + d*in[k+w+1];
        }
}
Example
Example Revisited
Implementing caching
Modifications
Use local SRAM to cache 3 rows at a time; this reduces external float reads from 9 to 1 per output value
Use register variables

Measurements (times in ms):

              Host   Epiphany   T_E/T_H   T_E/T_H,min
  No caching    32        773     2410%         2410%
  Caching       78        152      194%          475%
  + Registers   65        129      198%          402%
Access to the external memory is the bottleneck
Is there still room for improvement?
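The row-caching idea can be sketched as follows. This is a simplified, hypothetical single-core version: `fetch_row` stands in for whatever mechanism (memcpy or DMA) copies one row from external DRAM into local SRAM, boundary rows are skipped, and the per-core x1/x2 split of the real kernel is omitted:

```c
#include <string.h>

#define W 1000  /* row width; 3 float rows use 12 KiB of the 32 KiB SRAM */

static float row_buf[3][W];  /* would live in fast local SRAM on the Epiphany */

/* Hypothetical helper: copy row y of the external input into local SRAM. */
static void fetch_row(float *dst, const float *ext_in, int y)
{
    memcpy(dst, &ext_in[y * W], W * sizeof(float));
}

/* Compute output rows y1..y2-1 (1 <= y1, y2 <= height-1), reading each
   external input value once instead of nine times. */
static void stencil_rows(const float *in, float *out, int y1, int y2,
                         float c, float n, float d)
{
    fetch_row(row_buf[0], in, y1 - 1);
    fetch_row(row_buf[1], in, y1);
    for (int y = y1; y < y2; y++) {
        /* rotate the three buffers instead of re-reading rows */
        float *up  = row_buf[(y - y1) % 3];
        float *mid = row_buf[(y - y1 + 1) % 3];
        float *dn  = row_buf[(y - y1 + 2) % 3];
        fetch_row(dn, in, y + 1);  /* only new row fetched per iteration */
        for (int x = 1; x < W - 1; x++)
            out[y * W + x] =
                d*up[x-1]  + n*up[x]  + d*up[x+1] +
                n*mid[x-1] + c*mid[x] + n*mid[x+1] +
                d*dn[x-1]  + n*dn[x]  + d*dn[x+1];
    }
}
```

Only one row per output row crosses the external memory interface; the other two stencil rows are already resident in SRAM from previous iterations.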
Architecture
On-Chip Network (eMesh)
Properties
Separate Networks
cMesh Write data on-chip (fast, async)
rMesh Read data (slow, high latency)
xMesh Write data from/to external devices (DRAM, async)
Transaction per clock cycle: 64 bit data
External transaction per clock cycle: 8 bit (16 bit peak) data
DRAM accesses do not disturb on-chip write transactions
Back-pressure (push-back) on congestion
Read-after-write can return the old value
Transactions
Messages
Write Indication to write data atomically (includes data
and destination address)
Read Request to create a write transaction (includes
source and destination address)
Testset Request for atomic TESTSET (includes data,
source and destination address)
Data size is 8, 16, 32, or 64 bit
Data is read/written atomically
Messages include control bits (routing mode, interrupt,
end-of-block)
Routing
Standard routing algorithm
1. If the column does not match: route horizontally
2. If the row does not match: route vertically
3. If both match: route to the attached core
Other routing methods can be selected at the sending core
External memory is accessed as if it were cores
Routing examples
Using 64 bit accesses
Modifications
Read and write 2 floats (à 32 bit) at a time
Measurements (times in ms)

               Host   Epiphany   T_E/T_H   T_E/T_H,min
No caching       32        773    2410 %        2410 %
Caching          78        152     194 %         475 %
+ Registers      65        129     198 %         402 %
+ 64 bit         50         76     152 %         237 %

Using 64-bit transactions is more efficient
Is there still room for improvements?
Read transactions
[Sequence chart "External load": the eCore issues a read_req that travels over the eLink and FPGA (FIFO/AXI) to the external DRAM; the memory read there is turned into a write_ind that carries the data back to the eCore]
'load' operations stall the eCore until the data has arrived
This adds latency
The read requests share bandwidth with write indications
This can reduce (write) throughput
Single Epiphany node
DMA controller
DMA controller features
2 independent DMA controllers
Configurations can be chained per controller
Basically implements ... (without chaining)

do_dma(*dst, dinc[2], *src, sinc[2], count[2])
{
    for (o = count[0]; o > 0; o--) {
        for (i = count[1]; ; i--) {
            *(item_t *)dst = *(item_t *)src;
            if (i == 1)
                break;
            dst += dinc[1]; src += sinc[1];
        }
        dst += dinc[0]; src += sinc[0];
    }
}
DMA Transactions
[Sequence chart "Using the DMA": after dma_start the eCore keeps running while the DMA engine issues read_req messages over the eLink and FPGA (FIFO/AXI) to the external DRAM; the resulting memory reads come back as write_ind messages, and further reads and writes (mem read, mem write, write_ind) overlap in a loop until the buffer is exhausted (buf_ex)]
Using the DMA controller
Modifications
Port the example to the e-lib
Use the DMA to read/write rows asynchronously
linaro@linaro-nano: stencil/elib/
$ gcc -Wall -o stencil.o -c stencil.c
$ gcc -Wall -o stencil stencil.o -le-hal
$ e-gcc -Wall -O3 -ffast-math -c kern.c -o kern.o
$ e-gcc -T fast.ldf kern.o -o kern.elf -le-lib
$ e-objcopy --srec-forceS3 --output-target srec kern.elf kern.srec
$ sudo ./stencil_host -K kern -R100
time used 5.130 s (0.219 s + 100 * 0.049 s)
$
Using the DMA controller (2)
Measurements (times in ms)

               Host   Epiphany   T_E/T_H   T_E/T_H,min
No caching       32        773    2410 %        2410 %
Caching          78        152     194 %         475 %
+ Registers      65        129     198 %         402 %
+ 64 bit         50         76     152 %         237 %
Set-up            —        220         —             —
+ DMA             —         49         —         153 %
Using the DMA avoids stalling the eCore
Is there still room for improvements?
External Memory
Read/Write Throughput From/To External Memory
By number of cores
[Plot: throughput (MB/s, up to about 250) over the number of cores (4-16); series: Write 4096 B ('slow' DMA), Write 256 B, Read 4096 B, Read + Write]
Reduce transaction rate
Modifications
Use DMA 'slow' mode (use the outer loop; increases the transaction interval)
Measurements (times in ms)

                  Host   Epiphany   T_E/T_H,min   FPU rate
DMA 'fast'           —         49        153 %         2 %
DMA 'slow'           —         44        137 %         2 %
DMA write-only       —         26       (81 %)         3 %
Further improvements?
Measurement overview (times in ms)

              Time (ms)   Clk/float   FPU rate
Set-up          220-388           —          —
No caching          773        7240          —
Caching             152        1459          —
Registers           129        1238          —
64 bit               50         480          —
DMA 'fast'           49         470        2 %
DMA 'slow'           44         422        2 %
Local mem           2.2          21       38 %

Stencil problem is too 'small' per float (uses 5 % only)
2.4 ms * 38 % FPU instruction rate ≈ estimated 0.9 ms
More Measurements
Measuring Throughput
Overview (16 cores, 600 MHz clock)

                 MB/s   Clock/float   Peak    Remarks
Read DRAM         139           275    46 %   sync, slow
Write DRAM        152           252    50 %   sync, slow
R/W DRAM       2 × 86           441    57 %   sync
Read (0,3)        519            74       —   sync, slow
Write (0,3)      4802             8   100 %   no sync, loop
Read next        5952           6.5       —   no sync
Write next      21796           1.7    28 %   no sync
Stencil DRAM   2 × 91           422    60 %
Stencil local    1774            21    43 %

Reached throughput of 91 MB/s ≈ measured max R/W rate
Raw Read/Write Throughput From/To External Memory
On continuous overload, throughput differs
The rows don’t affect each other
Results
Learned From Working With the Example
Set-up time Prefer long-running kernels (≫ 200 ms)
DRAM Avoid accessing the external memory twice for the
same data, cache locally instead
Write only Avoid reading remote data (latency, throughput)
DMA read Use the DMA asynchronously to read data
FPU Be computation intensive per external data value
Compiler Using register variables and optimization options
(-O3 -ffast-math) yields good results
Applications
Audio processing
≈ 200 channels à 24 bit at 96 kHz sampling rate (in & out)
Less if external DRAM is needed for delay lines
Video processing
≈ 1 stream 720p HD, 16 bit/pixel at 46 fps (in & out) or at
80 fps (out only)
Stream analysis
≈ 4 GBit/s data stream (in only)
Less if external DRAM is needed to store data
SW-Architecture Considerations
Throughput is limited
- Can be an issue with star topologies
Starvation on network overload
- Overall throughput is not compromised
- Can be an issue with work-stealing scheduling
- Barriers can help to ensure fairness
Reading is slower than writing
- Throughput is highly asymmetric (by design)
- Can be an issue with shared-memory synchronization and reference counting
Application To Other Architectures
Write vs. Read
Where reading is also slower
Main memory accesses in current mainstream CPUs
NUMA interconnects
Access to IO devices, e.g. via PCIe
Access to remote data in clusters
Considerations
Prefer sending copies over sharing local data
Use asynchronous messaging
Finally
Thank you
Many thanks to
Sylvain Munaut for lending me a board
Links
Parallella http://www.parallella.org
Epiphany Docs http://www.adapteva.com/all-documents/
http://www.adapteva.com/analyst-reports/
Specifications http://www.parallella.org/board/
STDCL/Coprthr http://www.browndeertechnology.com/resources.htm
STDCL App Note http://www.browndeertechnology.com/docs/app_note_programming_parallella_using_stdcl.pdf
MPR article http://www.adapteva.com/wp-content/uploads/2011/06/adapteva_mpr.pdf
Contact
Jacob Erlbeck, jacob.erlbeck@gmail.com
Copyright
© Jacob Erlbeck, 2014. Please contact the author if you wish to redistribute the work as a whole or in parts.
Write Throughput To External Memory
By number of cores and size
[3D plot: write throughput (MB/s, roughly 100-250) as a function of the number of cores (4-16) and the chunk size (up to about 4000 B)]
Throughput By Transfer Method
Overview (16 cores, 600 MHz clock)
                eLib   Loop    DMA    DMA    DMA    DMA          Peak
                                      slow   sync   slow, sync   (spec.)
Read DRAM         48     94    131     139    128    136         300/1200
Write DRAM        77    152    143     152    140    149         300/1200
Read (0,3)       270    536    471     501    485    519         —
Write (0,3)     2405   4802   4528    4757   3787   3508         4800
Read col 3      1069   2134   1873    1997   1796   1899         —
Write col 3     7779   7651  16522    9807   9624   7894         19200
Read next       1750   2834   5952    5414   5076   4616         —
Write next      8749   7651  21769    9807  14571   7895         (76800)
Read self       2275   3485   6530    5439   5602   4501         —
Write self      8785   7651  21758    9807  14628   7895         (76800)
According to the errata lists in the Epiphany III/IV data sheets (E16G301 and E64G401), the peak node→eMesh
rate is currently limited.
Measurement Summary
On-Chip Mesh Behaviour
Reading is much slower than writing
Overload can lead to core starvation
Overall throughput is maintained on overload
Cores can receive a constant data stream at peak rate
Cores can only send significantly below the peak rate
(errata)