Exploiting parallelism opportunities in non-parallel architectures to improve... - GreenLSI Team, LSI, UPM
This document discusses improving software implementations of non-linear feedback shift registers (NLFSRs) through parallelism. It presents two approaches: one based on lookup tables (LUTs) and one based on algebraic normal forms (ANFs). The goal is to automatically generate different implementations to introduce variability and improve resistance against side-channel attacks. Experimental results on a KeeLoq implementation for the MSP430 show that applying optimizations to the ANF-based approach improved performance by 2.45x in cycles compared to a baseline one-bit-at-a-time implementation, though code size grew by 2.27x.
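To make the LUT-vs-ANF distinction concrete, here is a minimal sketch of a toy NLFSR (not the paper's generator and not KeeLoq itself; the register width, taps, and feedback function are invented for illustration). The same nonlinear feedback function is evaluated two ways: directly from its algebraic normal form, and via a table precomputed over all tap combinations.

```python
# Toy 8-bit NLFSR: evaluate the nonlinear feedback either from its
# algebraic normal form (ANF) or from a precomputed lookup table (LUT).

def feedback_anf(x0, x1, x2):
    # ANF: f = x0 XOR x2 XOR (x0 AND x1)  -- an arbitrary nonlinear example
    return x0 ^ x2 ^ (x0 & x1)

# LUT approach: precompute f once for all 8 tap-bit combinations.
LUT = [feedback_anf(i & 1, (i >> 1) & 1, (i >> 2) & 1) for i in range(8)]

def step(state, taps=(0, 3, 7), width=8, use_lut=True):
    """Shift the register one bit, inserting the nonlinear feedback bit."""
    bits = [(state >> t) & 1 for t in taps]
    if use_lut:
        fb = LUT[bits[0] | (bits[1] << 1) | (bits[2] << 2)]
    else:
        fb = feedback_anf(*bits)
    out = state & 1                               # output bit is the LSB
    state = (state >> 1) | (fb << (width - 1))    # shift in the feedback
    return state, out

def keystream(seed, n, use_lut=True):
    state, out = seed, []
    for _ in range(n):
        state, bit = step(state, use_lut=use_lut)
        out.append(bit)
    return out
```

Both evaluation strategies produce the same keystream; the trade-off the document explores is which form maps better onto word-level parallelism on a given microcontroller.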
Librato's Joseph Ruscio at Heroku's Waza 2013: Instrumenting 12-Factor Apps - Heroku
Librato's CTO Joseph Ruscio took to the Waza 2013 stage to present "Instrumenting Twelve-Factor Apps". For more from Ruscio ping him at @josephruscio. For more on Waza visit http://waza.heroku.com/2013.
For Waza videos stay tuned at http://blog.heroku.com or visit http://vimeo.com/herokuwaza
Embedded Recipes 2017 - Reliable monitoring with systemd - Jérémy Rosen - Anne Nicolas
Embedded systems are autonomous. This simple fact is a driving force in the design of embedded systems, which cannot afford the luxury of an operator pressing a reset button or even a remote sysadmin checking what happened. Monitoring an application in an embedded system is a complex problem: the monitor must deal with the various ways an application can fail, detect them, and restart the application if need be.
Systemd provides a comprehensive toolbox for the embedded developer to diagnose, monitor and restart the main application of an embedded system, especially if that application is black-box software. This talk will review the tools provided by systemd for process monitoring and discuss how to easily deploy them in an embedded system.
Jérémy Rosen – Smile-Embedded and connected systems
The RTX kernel is a royalty-free real-time operating system designed for ARM and Cortex-M devices. It allows programs to perform multiple functions simultaneously using parallel tasks. The RTX kernel provides functions to create and manage concurrent tasks, prioritize tasks, and initialize the kernel. Some basic RTX kernel functions include os_sys_init to initialize the kernel, os_tsk_create to create tasks, and os_tsk_delete to terminate tasks.
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro... - Paris Carbone
An overview of state management techniques employed in Apache Flink, including pipelined consistent snapshots and intuitive usages for reconfiguration, presented at VLDB 2017.
BKK16-203 Irq prediction or how to better estimate idle time - Linaro
Review of the design. The current approach to predicting idle time duration is based on statistics over previous idle durations. The presentation will show the weaknesses of this approach and how, by tracking IRQ behavior, we can predict the next event and better estimate the idle duration.
The document discusses run-time environments and activation records. It explains that activation records are used to manage information for each procedure call and are allocated on the stack. Activation records contain fields for return values, parameters, local variables, and more. When a procedure is called, its activation record is pushed onto the stack and popped off when it returns. Activation records allow recursive calls by creating a new record each time a procedure is activated.
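The push/pop behavior described above can be sketched with an explicit stack of activation records. This is a hedged illustration, not any particular compiler's layout: each call pushes a record holding its parameter, locals, and a return-value slot, and pops it on return, so recursion naturally gets one fresh record per activation.

```python
# Model a runtime stack of activation records for a recursive call.
stack = []  # the simulated runtime stack

def call_factorial(n):
    # Push a fresh activation record for this call.
    frame = {"param_n": n, "locals": {}, "return_value": None}
    stack.append(frame)
    if n <= 1:
        frame["return_value"] = 1
    else:
        # The recursive call pushes (and later pops) its own record.
        frame["return_value"] = n * call_factorial(n - 1)
    # Pop this call's record on return; by now all nested records are gone.
    return stack.pop()["return_value"]
```

After `call_factorial(5)` returns 120, the simulated stack is empty again, mirroring how activation records unwind as calls return.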
TimTrack is software for tracking charged particles. It is written in C for speed and flexibility. Previous versions used the LAPACK or Intel IPP libraries for linear algebra operations. The newest version, TimTrack v2.0, uses LAPACK and is 23.6 seconds faster than earlier versions when tracking 1 million particles. Future plans include parallelizing with OpenMP and MPI, and implementing on GPUs using CUDA.
Higher-order finite-volume methods for solving conservation laws can achieve high arithmetic intensity (AI) and improved performance. Theoretical analysis showed that 6th and 8th order methods reach the target AI for modern machines with infinite cache. Measurements of AI using hardware counters on an IBM Blue Gene/Q supercomputer matched the theoretical predictions when using multi-dimensional cache blocking. However, 3D blocking requires too much cache space due to wide halos from higher-order stencils. Iterating rectangular blocks in columns reduces cache usage and allows 6th and 8th order methods to achieve high AI with realistic cache sizes.
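The AI argument can be made concrete with a back-of-the-envelope model (illustrative assumptions, not the paper's cost model): a 1D stencil of width 2k+1 over double-precision data does one multiply and one add per tap, and cache blocking determines whether each input value is loaded once or once per tap.

```python
# Arithmetic intensity AI = flops / bytes for a (2k+1)-point stencil
# on 8-byte doubles, under two caching assumptions.

def stencil_ai(k, reuse=True):
    flops_per_point = 2 * (2 * k + 1)  # one multiply + one add per tap
    if reuse:
        # Perfect blocking: each input loaded once, plus one store.
        bytes_per_point = 8 + 8
    else:
        # No reuse: every tap is a fresh 8-byte load, plus the store.
        bytes_per_point = 8 * (2 * k + 1) + 8
    return flops_per_point / bytes_per_point
```

Under reuse, AI grows with the stencil order (more flops over the same traffic), which is why higher-order methods can approach a machine's target AI, while losing reuse pins AI near the flop-to-tap ratio regardless of order.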
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans - Evention
This talk will start with a brief introduction to stream processing and Flink itself. Next, we will take a look at some of the most interesting recent improvements in Flink, such as incremental checkpointing,
the end-to-end exactly-once processing guarantee, and network latency optimizations. We'll discuss real problems that Flink's users were facing and how they were addressed by the community and data Artisans.
Aggregate Sharing for User-Defined Data Stream Windows - Paris Carbone
Aggregation queries on data streams are evaluated over evolving and often overlapping logical views called windows. While the aggregation of periodic windows was studied extensively in the past through aggregate sharing techniques such as Panes and Pairs, little to no work has gone into optimizing the aggregation of very common, non-periodic windows. Typical examples of non-periodic windows are punctuations and sessions, which can implement complex business logic and are often expressed as user-defined operators on platforms such as Google Dataflow or Apache Storm. The aggregation of such non-periodic or user-defined windows either falls back to expensive, best-effort aggregate sharing methods, or is not optimized at all.
In this paper we present a technique to perform efficient aggregate sharing for data stream windows, which are declared as user-defined functions (UDFs) and can contain arbitrary business logic. To this end, we first introduce the concept of User-Defined Windows (UDWs), a simple, UDF-based programming abstraction that allows users to programmatically define custom windows. We then define semantics for UDWs, based on which we design Cutty, a low-cost aggregate sharing technique. Cutty improves on and outperforms the state of the art for aggregate sharing on single and multiple queries. Moreover, it enables aggregate sharing for a broad class of non-periodic UDWs. We implemented our techniques on Apache Flink, an open source stream processing system, and performed experiments demonstrating orders of magnitude of reduction in aggregation costs compared to the state of the art.
This document provides an overview of using ClickHouse and Grafana for DNS analytics. Some key points:
- ClickHouse is a column-oriented database that is fast, scalable, and easy to use for analytics on large datasets like DNS logs.
- Grafana is used to visualize the DNS data by connecting it as a data source to ClickHouse.
- Examples show querying ClickHouse to analyze DNS data and identify top clients by ASN, response types, and flag combinations. Visualizations like histograms are also demonstrated.
- The installation process outlines adding the ClickHouse and Grafana repositories, installing the packages, and configuring the ClickHouse data source plugin for Grafana.
This document discusses using cReComp to develop ROS-compliant FPGA components. cReComp is a tool that takes specifications written in scrp and generates FPGA IP cores and C++ driver code. An example is presented where cReComp is used to generate a FIR filter component from a scrp specification. The component communicates with ROS using topics and processes data in real-time on the FPGA to provide latency of less than 1ms. Details are provided on the component architecture generated by cReComp and how it integrates FPGA hardware acceleration with the ROS framework.
Impatience is a Virtue: Revisiting Disorder in High-Performance Log Analytics - Badrish Chandramouli
There is a growing interest in processing real-time queries over out-of-order streams in this big data era. This paper presents a comprehensive solution to meet this requirement. Our solution is based on Impatience sort, an online sorting technique that is based on an old technique called Patience sort. Impatience sort is tailored for incrementally sorting streaming datasets that present themselves as almost sorted, usually due to network delays and machine failures. With several optimizations, our solution can adapt to both input streams and query logic. Further, we develop a new Impatience framework that leverages Impatience sort to reduce the latency and memory usage of query execution, and supports a range of user latency requirements, without compromising on query completeness and throughput, while leveraging existing efficient in-order streaming engines and operators. We evaluate our proposed solution in Trill, a high-performance streaming engine, and demonstrate that our techniques significantly improve sorting performance and reduce memory usage – in some cases, by over an order of magnitude.
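Since Impatience sort is presented as building on classic Patience sort, a minimal Patience sort sketch helps fix the intuition (this is the textbook algorithm, not the paper's optimized variant): deal each element onto an existing sorted run if its top allows it, else start a new run, then k-way merge the runs. Almost-sorted input produces very few runs, which is exactly the property that makes the approach attractive for streams disordered only by network delays.

```python
import heapq

def patience_sort(seq):
    """Sort by dealing elements into ascending runs, then merging them."""
    piles = []  # each pile is an ascending run
    for x in seq:
        # Append to the first run whose last element is <= x...
        for pile in piles:
            if pile[-1] <= x:
                pile.append(x)
                break
        else:
            # ...or start a new run. Nearly-sorted input makes few runs.
            piles.append([x])
    # k-way merge of the sorted runs.
    return list(heapq.merge(*piles))
```

For fully sorted input this creates a single run; each inversion in the input can at worst start one extra run, so the merge stays cheap when disorder is mild.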
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes... - Flink Forward
This document discusses providing an R dataframe abstraction for efficient distributed computation on Apache Flink. The goals are to provide a natural API for R and achieve performance comparable to Flink's native dataflow. The approach represents R dataframes as Flink data sets and compiles R functions into the native execution plan where possible. For user-defined R functions, they are evaluated within worker tasks using a just-in-time compiler. This allows executing R code within the same Java virtual machine as Flink for good performance, even on a single node. Results show it can achieve native Flink performance even for functions containing R code.
Registers can store multiple bits and are used for temporary storage in a processor. Flip-flops can only store one bit, so registers are needed for tasks like storing 32-bit integers. Registers are faster and more convenient than main memory. Having more registers can help speed up complex calculations. The document then discusses different types of shift registers and how a basic 4-bit register is implemented using D flip-flops.
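The D-flip-flop construction mentioned above can be simulated in a few lines. This is a hedged behavioral sketch (class and method names are invented for illustration): a D flip-flop holds one bit and updates it only on a clock edge, and chaining four of them Q-to-D with a shared clock yields a 4-bit serial-in shift register.

```python
class DFlipFlop:
    """One bit of storage: Q takes the value of D on the clock edge."""
    def __init__(self):
        self.q = 0
    def clock(self, d):
        self.q = d & 1
        return self.q

class ShiftRegister4:
    """Four D flip-flops chained Q -> D, clocked together."""
    def __init__(self):
        self.ffs = [DFlipFlop() for _ in range(4)]
    def clock(self, serial_in):
        # Sample all Q outputs first, then clock every flip-flop, so the
        # whole register updates simultaneously on one shared edge.
        prev = [ff.q for ff in self.ffs]
        self.ffs[0].clock(serial_in)
        for i in range(1, 4):
            self.ffs[i].clock(prev[i - 1])
        return [ff.q for ff in self.ffs]
```

Sampling before clocking models why real shift registers need edge-triggered storage: every stage must capture its neighbor's old output, not the freshly shifted one.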
Reintroducing the Stream Processor: A universal tool for continuous data anal... - Paris Carbone
The talk motivates the use of data stream processing technology for different aspects of continuous data computation, beyond "real-time" analysis, to incorporate historical data computation, reliable application logic and interactive analysis.
Experiments were conducted to evaluate the impact of virtualization on an RTOS by measuring its overheads and latencies when run natively and in a virtual machine. A real-time Linux system was used as the host OS with KVM/Qemu virtualization and Litmus^RT as the guest RTOS. Performance degradation in the virtualized RTOS was found due to the emulation of I/O interrupts by the virtual machine monitor and scheduling of virtual machine processes by the host OS.
We will explain the purpose of the PMWG farm and the current goals we have (e.g. collect power measurements, share reference platforms, monitor power trends of the kernel). We will also address the limitations of our farm and invite everyone to discuss which results should be displayed for further analysis.
This document describes the gate-level synthesis of a FIFO design using Synopsys Design Compiler. It discusses the FIFO description, introduces Design Compiler and the libraries used. It then outlines the steps to set up Design Compiler and synthesize the design, including specifying libraries, reading the HDL file, setting constraints, and compiling. Timing and reference reports are generated and the synthesized netlist is written out.
Combining Phase Identification and Statistic Modeling for Automated Parallel ... - Mingliang Liu
Parallel application benchmarks are indispensable for evaluating/optimizing HPC software and hardware. However, it is very challenging and costly to obtain high-fidelity benchmarks reflecting the scale and complexity of state-of-the-art parallel applications. Hand-extracted synthetic benchmarks are time- and labor-intensive to create. Real applications themselves, while offering the most accurate performance evaluation, are expensive to compile, port, reconfigure, and often plainly inaccessible due to security or ownership concerns. This work contributes APPrime, a novel tool for trace-based automatic parallel benchmark generation. Taking as input standard communication-I/O traces of an application's execution, it couples accurate automatic phase identification with statistical regeneration of event parameters to create compact, portable, and to some degree reconfigurable parallel application benchmarks. Experiments with four NAS Parallel Benchmarks (NPB) and three real scientific simulation codes confirm the fidelity of APPrime benchmarks. They retain the original applications' performance characteristics, in particular their relative performance across platforms. Also, the resulting benchmarks, already released online, are much more compact and easier to port than the original applications.
http://dl.acm.org/citation.cfm?id=2745876
An Introduction to Distributed Data Streaming - Paris Carbone
A lecture on distributed data streaming, introducing all basic abstractions such as windowing, synopses (state), partitioning and parallelism, and applying them to an example pipeline for detecting fires. It also offers a brief introduction and motivation on reliability guarantees and the need for repeatable sources and application-level fault tolerance and consistency.
Presentation given by Tim Walsh at Archivematica Camp Baltimore 2018 about his and the Canadian Centre for Architecture's experience with the Archivematica Automation Tools.
Streamlining pipeline execution for large scale RNA-Seq analysis - Deepak Purushotham
This document describes a pipeline for streamlining RNA-seq execution on high-performance clusters. The pipeline involves importing read files for each sample, performing rRNA filtering, Bowtie alignment, chromosome filtering, transcriptome mapping, mapping quality control, HTSeq counting, and read quality control. It achieves a 75% reduction in read file size through gzip compression and removes temporary files to greatly reduce I/O load for efficient cluster processing.
Talk @ APT Group, University of Manchester, 06 August 2014
Abstract:
Today's HPC systems, such as those in the Top500, are equipped with a range of different processors, from multi-core CPUs to GPUs. Programming them can be a tough job, especially if we want to squeeze every last FLOP of performance out of them.
As a PhD student, I am currently on a brief research visit in the APT group, working on topics related to the programmability and efficient use of GPUs and many-core coprocessors. In particular, I am implementing a large database operation using OpenCL on these state-of-the-art systems. In this talk I will summarize my work in Manchester and discuss future work on this topic.
The document discusses the basics of RISC instruction set architectures and pipelining in CPUs. It begins by describing properties of RISC ISAs, including that operations apply to full registers, only load/store instructions affect memory, and instructions are typically one size. It then describes different types of RISC instructions like ALU, load/store, and branches. The document goes on to explain the implementation of a RISC pipeline in 5 stages and the concept of pipelining to improve CPU performance by overlapping instruction execution. It also discusses potential hazards that can degrade pipeline performance like structural, data, and control hazards.
The document discusses RISC instruction set basics and pipelining concepts. It begins by describing properties of RISC architectures, including that operations apply to full registers and only load/store instructions affect memory. It then describes different types of RISC instructions like ALU, load/store, and branches. The document goes on to explain the implementation of instructions in a MIPS64 pipeline with 5 stages: instruction fetch, decode/register fetch, execute, memory access, and write-back. It concludes by defining pipelining and describing how it can increase throughput by overlapping instruction execution.
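The overlap-and-stall arithmetic these decks describe can be made concrete with a toy cycle counter. The model below is a deliberate simplification (one instruction per stage, a fixed two-cycle penalty for each un-forwarded RAW hazard between adjacent instructions), not MIPS-accurate timing:

```python
def pipeline_cycles(instructions, stages=5, stall_per_hazard=2):
    """Estimate cycles for a simple in-order 5-stage pipeline.

    `instructions` is a list of (dest, [sources]) tuples. Without
    forwarding, an instruction that reads a register written by the
    immediately preceding instruction stalls `stall_per_hazard` cycles.
    """
    # An ideal pipeline retires one instruction per cycle after the
    # initial fill of (stages - 1) cycles.
    cycles = stages - 1 + len(instructions)
    for (prev_dest, _), (_, curr_srcs) in zip(instructions, instructions[1:]):
        if prev_dest is not None and prev_dest in curr_srcs:
            cycles += stall_per_hazard
    return cycles

# Three dependent adds: r1 = r2+r3; r4 = r1+r5; r6 = r4+r7
prog = [("r1", ["r2", "r3"]), ("r4", ["r1", "r5"]), ("r6", ["r4", "r7"])]
```

With forwarding, `stall_per_hazard` would drop to 0 or 1 depending on the producing instruction, which is exactly the payoff the slides attribute to forwarding units.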
CSW2017 – Richard Johnson: Harnessing Intel Processor Trace on Windows for Vulner... – CanSecWest
This document discusses using Intel Processor Trace (Intel PT) for hardware-based tracing on Windows. It provides an overview of Intel PT capabilities and how it can be used for fuzzing and vulnerability discovery. Specifically, it describes the development of WinAFL IntelPT, which integrates Intel PT tracing with the WinAFL evolutionary fuzzer to enable high-performance, hardware-driven fuzzing on Windows.
Speculative aspects of high-speed processor design – ssuser7dcef0
• Highest total system speed – ex. TOP500 speed, application speed of supercomputers
• Highest processor chip performance – ex. SPEC CPU rate, NAS parallel benchmarks
• Highest single-core performance – ex. SPEC CPU int, SPEC CPU fp, Dhrystone
This document discusses instruction pipelining in computer processors. It begins by defining pipelining and explaining how it works like an assembly line to increase throughput. It then discusses different types of pipelines and introduces the MIPS instruction pipeline as an example. The document goes on to explain different types of pipeline hazards like structural hazards, control hazards, and data hazards. It provides examples of how to detect and resolve these hazards through techniques like forwarding, stalling, predicting, and delayed branching. Key concepts covered include pipeline registers, control signals, forwarding units, and branch prediction buffers.
1. The document discusses research activities related to reducing energy consumption by at least 30% through the development of core source technologies for universal operating systems.
2. It describes four papers being presented, including ones on system and device latency modeling, power management frameworks for embedded systems, and automatic selection of power policies for operating systems.
3. It also summarizes four research topics from the National University, including performance evaluation of parallel applications using a power-aware paging method on next-generation memory architectures.
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola... – Heechul Yun
This document describes MemGuard, an operating system mechanism for providing efficient per-core memory performance isolation on commercial off-the-shelf hardware. MemGuard uses memory bandwidth reservation to guarantee each core's minimum memory bandwidth. It then performs predictive bandwidth donation and on-demand reclaiming to redistribute excess bandwidth, improving overall utilization. Evaluation shows MemGuard isolates performance and eliminates over 50% slowdown of a foreground real-time task due to interference, while maximizing throughput via bandwidth sharing.
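MemGuard's reserve/donate/reclaim cycle can be illustrated with a toy accounting model. This sketch only mimics the budget bookkeeping described above; the real system works inside the kernel with per-core performance counters and periodic interrupts, none of which appear here:

```python
class BandwidthRegulator:
    """Toy per-core memory-bandwidth regulator in the spirit of MemGuard.

    Each core gets a reserved budget per regulation period; budget a
    core predicts it will not use is donated to a shared pool that
    other cores can reclaim on demand. (A simplified illustration,
    not MemGuard's kernel implementation.)
    """

    def __init__(self, reservations):
        self.reservations = dict(reservations)  # core -> budget per period
        self.new_period()

    def new_period(self):
        self.budget = dict(self.reservations)
        self.shared_pool = 0

    def donate(self, core, amount):
        # Predictive donation: give up bandwidth the core won't use.
        amount = min(amount, self.budget[core])
        self.budget[core] -= amount
        self.shared_pool += amount

    def consume(self, core, amount):
        """Account `amount` of traffic; False means the core is throttled."""
        if self.budget[core] >= amount:
            self.budget[core] -= amount
            return True
        # On-demand reclaiming from the shared pool before throttling.
        needed = amount - self.budget[core]
        if self.shared_pool >= needed:
            self.shared_pool -= needed
            self.budget[core] = 0
            return True
        return False  # caller would stall the core until the next period
```

The `consume` path shows why the scheme improves utilization: a core that exhausts its reservation is only throttled once the donated surplus is gone too.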
Operating Systems 1 (10/12) – Scheduling – Peter Tröger
The document discusses processes and scheduling in operating systems. It covers key concepts like processes, threads, scheduling criteria, and different scheduling algorithms like round robin and priority-based scheduling. It also discusses scheduling in multiprocessor systems and provides examples of scheduling in Windows.
In this video from the 2015 Stanford HPC Conference, Pavel Shamis from ORNL presents: Preparing OpenSHMEM for Exascale.
"OpenSHMEM is a partitioned global address space (PGAS) one-sided communications library that enables remote memory access (RMA) across processing elements (PEs). Its API allows data to be transferred from one PE memory space to another PE's symmetric memory space, decoupling the data transfers from synchronizations. OpenSHMEM is useful for applications that are latency driven or that have irregular communication patterns, because its one-sided API can be mapped very efficiently to hardware (e.g. RDMA interconnects), and its one-sided programming model helps the overlapping of communication with computation. Summit is Oak Ridge National Laboratory's next high performance supercomputer system that will be based on a many core/GPU hybrid architecture. In order to prepare OpenSHMEM for future systems, it is important to enhance its programming model to enable efficient utilization of the new hardware capabilities (e.g. massively multithreaded systems, access to different types of memory, next generation of interconnects, etc.). This session will present recent advances in the area of OpenSHMEM extensions, implementations, and tools."
Watch the video: http://insidehpc.com/2015/02/video-preparing-openshmem-for-exascale/
See more talks in the Stanford HPC Conference Video Gallery: http://wp.me/P3RLHQ-dOO
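The one-sided model described in the abstract can be mimicked with a tiny in-process simulation: every PE owns a symmetric region with the same layout, and any PE may put/get into another PE's region without the target participating. The class and method names only loosely echo the SHMEM API; this is a sketch of the semantics, not the real library:

```python
# Toy model of OpenSHMEM-style one-sided RMA over symmetric heaps.
class PE:
    def __init__(self, pe_id, heap_size):
        self.id = pe_id
        self.heap = [0] * heap_size  # symmetric heap: same layout on every PE

class Shmem:
    def __init__(self, n_pes, heap_size=16):
        self.pes = [PE(i, heap_size) for i in range(n_pes)]

    def put(self, target_pe, offset, values, source_pe):
        # One-sided write: the target PE does not synchronize or copy.
        # (source_pe is kept only to mirror the initiator-side call shape.)
        heap = self.pes[target_pe].heap
        heap[offset:offset + len(values)] = values

    def get(self, target_pe, offset, length, source_pe):
        # One-sided read from the target's symmetric heap.
        return self.pes[target_pe].heap[offset:offset + length]

world = Shmem(n_pes=4)
world.put(target_pe=2, offset=0, values=[7, 8, 9], source_pe=0)  # PE 0 writes into PE 2
```

Because the transfer never involves the target's control flow, synchronization (barriers, fences) is a separate concern, which is exactly the decoupling the abstract highlights.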
AN INTRODUCTION TO OPERATING SYSTEMS: CONCEPTS AND PRACTICE – PHI Learning Pvt. Ltd.
The book, now in its Fifth Edition, aims to provide a practical view of GNU/Linux and Windows 7, 8 and 10, covering different design considerations and patterns of use. The section on concepts covers fundamental principles, such as file systems, process management, memory management, input-output, resource sharing, interprocess communication (IPC), distributed computing, OS security, real-time and microkernel design. This thoroughly revised edition comes with a description of an instructional OS to support teaching of OS and also covers Android, currently the most popular OS for handheld systems. Basically, this text enables students to learn by practicing with the examples and doing exercises.
This document provides an overview of CPU scheduling concepts including multiprogramming, multitasking, process creation, the short-term scheduler, process control blocks, the long-term scheduler, and the medium-term scheduler. It also discusses classifications of processes as interactive, batch, or real-time processes and as I/O-bound or CPU-bound processes. Finally, it introduces common CPU scheduling algorithms like first-come first-served (FCFS), shortest job first (SJF), and round-robin (RR).
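The round-robin algorithm mentioned above is easy to pin down with a short simulation. For simplicity, all processes are assumed to arrive at time 0:

```python
from collections import deque

def round_robin(bursts, quantum):
    """Simulate round-robin scheduling; return completion time per process.

    `bursts` maps process name -> total CPU burst. Each process runs for
    at most `quantum` time units before being moved to the back of the
    ready queue.
    """
    remaining = dict(bursts)
    ready = deque(bursts)          # dict iteration preserves arrival order
    clock = 0
    finish = {}
    while ready:
        p = ready.popleft()
        run = min(quantum, remaining[p])
        clock += run
        remaining[p] -= run
        if remaining[p] == 0:
            finish[p] = clock
        else:
            ready.append(p)        # preempted: back of the queue
    return finish
```

Swapping the `deque` for a priority queue keyed on remaining burst would turn this into preemptive SJF, which makes the relationship between the algorithms easy to see.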
RTOS Material – adugnanegero
This document provides an overview of real-time operating systems and kernel concepts across 34 slides. The key topics covered include real-time kernels, tasks and processes, scheduling algorithms like priority-based and cyclic executives, intertask communication methods like mailboxes and semaphores, and synchronization techniques.
EKernel Thesis: an object-oriented micro-kernel – Murphy Chen
The document describes the design and implementation of EKernel, an object-oriented microkernel. It aims to address issues of portability, maintainability, extensibility, and efficiency in operating system design. The key aspects of EKernel's design include using processes and threads as core abstractions, implementing inter-process communication via messaging, and providing a modular architecture with well-defined interfaces. Performance tests show EKernel achieves lower overhead than other microkernels for operations like context switches and IPC. Future work plans to enhance EKernel's scheduler and implement a networking subsystem.
Provenance for Data Munging EnvironmentsPaul Groth
Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e. its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses problems of efficient and fine-grained capture. I also describe our work on scalable provenance tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from ad hoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
Presented at Information Sciences Institute - August 13, 2015
The document discusses CERN's use of Oracle's In-Memory Column Store to perform real-time analysis of physics experiment data from the Large Hadron Collider. Benchmark tests showed significant performance improvements over traditional row-based storage, with analytic queries running 10-100x faster. The columnar format also improved data compression rates. Additionally, OLTP workloads saw no negative impacts. CERN plans to consider the technology for future projects given its ability to enable real-time analysis that was previously not possible.
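The row-versus-column trade-off behind these results can be shown with a toy layout. The schema and data below are invented; the point is that an analytic aggregate only has to touch one contiguous array in the columnar form, while the row layout forces the scan to walk whole tuples:

```python
# Toy illustration of row-store vs column-store scans.
rows = [(i, f"evt{i}", i % 7, i * 0.5) for i in range(1000)]

def sum_row_store(rows):
    # The scan must visit every tuple to reach attribute 3.
    return sum(r[3] for r in rows)

# Column store: each attribute is its own contiguous array.
columns = {
    "id":     [r[0] for r in rows],
    "name":   [r[1] for r in rows],
    "bucket": [r[2] for r in rows],
    "value":  [r[3] for r in rows],
}

def sum_column_store(columns):
    # Only the "value" array is touched; other columns stay cold.
    return sum(columns["value"])
```

The same layout also explains the compression gains reported: low-cardinality columns like `bucket` above are runs of a few repeated values, which dictionary or run-length encoding handles far better than interleaved row bytes.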
This document discusses the slides for Unit 2 of the Operating Systems course. It includes an index of lecture topics that will be covered, such as process concepts and threads, scheduling criteria and algorithms, thread scheduling, case studies of UNIX/Linux and Windows operating systems, and revision. Key concepts that will be covered include processes and threads, process state diagrams, process control blocks, CPU scheduling queues, producer-consumer problem solutions, scheduling criteria and algorithms like FCFS, SJF, priority and round robin, and thread scheduling models.
This document discusses operating systems and their core abstractions like uninterrupted computation, infinite memory, and simple I/O. It describes how operating systems provide these abstractions using mechanisms like context switching, virtual memory, and system calls. It also covers different types of operating systems and characteristics of embedded operating systems like real-time capabilities.
Using the big guns: Advanced OS performance tools for troubleshooting databas... – Nikolay Savvinov
Using OS performance tools and basic alternatives to troubleshoot production database issues
The document discusses using Linux performance tools like pidstat, ps, and tracing tools like perf, systemtap, and dtrace to troubleshoot complex database problems that may involve issues at the operating system, hardware, or network level. It provides examples of using these tools to diagnose specific issues like memory fragmentation, I/O problems, and network congestion and presents a methodology around reproducing issues, analyzing tool output, identifying root causes, and developing solutions.
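A common pattern with these tools is to capture their output and post-process it to surface the culprit process. The sketch below parses a hand-made approximation of `pidstat -u` output; the column layout is an assumption, so check it against the sysstat version you actually run:

```python
# Pick the busiest processes out of a captured pidstat report.
SAMPLE = """\
12:00:01      UID       PID    %usr %system  %CPU   Command
12:00:01     1000      4242   55.00   12.00  67.00  oracle
12:00:01     1000      4243    3.00    1.00   4.00  sshd
12:00:01     1000      4244   20.00   30.00  50.00  rsync
"""

def busiest(report, top=2):
    """Return (command, %CPU) pairs sorted by CPU usage, highest first."""
    procs = []
    for line in report.splitlines()[1:]:   # skip the header row
        fields = line.split()
        # Command is the last field, %CPU the one before it (assumed layout).
        procs.append((fields[-1], float(fields[-2])))
    return sorted(procs, key=lambda p: p[1], reverse=True)[:top]
```

In a live session you would feed this the output of `pidstat -u 1 5` captured via `subprocess`, then drill into the top hits with perf or systemtap as the talk suggests.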
Similar to It Takes Two: Instrumenting the Interaction between In-Memory Databases and Solid-State Drives CIDR 2020 presentation (20)
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction – eXascale Infolab
1) The document presents HINGE, a new method for embedding hyper-relational knowledge graphs that aims to better capture information from facts containing multiple relations and entities.
2) HINGE uses a CNN to learn representations from base triplets and their associated key-value pairs to characterize the plausibility of facts.
3) An evaluation on link prediction tasks shows HINGE outperforms baselines and demonstrates that the triplet structure encodes essential information, while other representations discard important information.
Representation Learning on Graphs with Complex Structures
Invited talk, Deep Learning for Graphs and Structured Data Embedding Workshop
WWW2019, San Francisco, May 13, 2019
A force directed approach for offline GPS trajectory map – eXascale Infolab
SIGSPATIAL 2018 paper
A Force-Directed Approach for Offline GPS Trajectory Map Matching
Efstratios Rappos (University of Applied Sciences of Western Switzerland (HES-SO)),
Stephan Robert (University of Applied Sciences of Western Switzerland (HES-SO)),
Philippe Cudré-Mauroux (University of Fribourg)
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit... – eXascale Infolab
This document proposes HistoSketch, a method for sketching streaming histograms that preserves similarity and adapts to concept drift. It works by:
1) Generating weighted samples from histograms such that the probability two sketches match equals histogram similarity.
2) Incrementally updating sketches using a weight decay factor to forget older data and adapt to drift over time.
3) Evaluating HistoSketch on classification tasks involving synthetic and real-world streaming data, finding it approximates histogram similarity well using small, fixed-size sketches while adapting rapidly to drift.
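The similarity-preserving idea in step 1 can be illustrated with a toy stand-in for consistent weighted sampling: each sketch slot holds the winner of a weight-biased "race" among histogram elements, so similar histograms agree on many slots. This is not HistoSketch's exact construction, and the weight-decay update of step 2 is omitted:

```python
import math
import random

def histo_sketch(histogram, size=64, seed=0):
    """Fixed-size, similarity-preserving sketch of a histogram (toy version).

    For each slot, every element draws an exponential variate with rate
    equal to its weight; the slot records the minimizer. Deterministic
    string seeds make the sketch reproducible across histograms.
    """
    sketch = []
    for slot in range(size):
        best_key, best_val = None, math.inf
        for element, weight in histogram.items():
            if weight <= 0:
                continue
            rng = random.Random(f"{seed}:{slot}:{element}")
            draw = -math.log(rng.random()) / weight
            if draw < best_val:
                best_key, best_val = element, draw
        sketch.append(best_key)
    return sketch

def sketch_similarity(a, b):
    """Fraction of agreeing slots: an estimate of histogram similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

The key property, as in the paper, is that the sketch has small fixed size regardless of how many distinct elements the streaming histogram accumulates.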
This document presents SwissLink, a high-precision context-free entity linking system. It extracts unambiguous surface forms (labels) from knowledge bases like DBpedia and Wikipedia to link entity mentions without context. It catalogs the surface forms, removes ambiguous ones using ratio and percentile methods, and performs fast string matching to link mentions. Evaluation on 30 Wikipedia articles shows the percentile-ratio method achieves over 95% precision and 45% recall, balancing precision and recall.
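The ratio-based pruning step can be sketched as follows. The data structure and thresholds are invented for illustration; SwissLink's actual catalog construction and percentile method are more involved:

```python
def unambiguous_surface_forms(counts, ratio=0.9, min_count=2):
    """Filter a surface-form catalog down to unambiguous labels.

    `counts` maps surface form -> {entity: link count}. A form is kept
    (mapped to its dominant entity) only when the dominant entity
    accounts for at least `ratio` of the form's links, so context-free
    matching stays high-precision.
    """
    catalog = {}
    for form, entities in counts.items():
        total = sum(entities.values())
        entity, top = max(entities.items(), key=lambda e: e[1])
        if total >= min_count and top / total >= ratio:
            catalog[form] = entity
    return catalog
```

Linking then reduces to fast string matching of mentions against the surviving catalog, which is what lets the system skip context modeling entirely.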
The document proposes a novel crowdsourcing system architecture and scheduling algorithm to address job starvation in multi-tenant crowd-powered systems. The architecture introduces HIT-Bundles to group heterogeneous tasks and control task serving. The Worker Conscious Fair Scheduling algorithm balances fairness and priority while minimizing worker context switching between tasks. Experiments on Amazon Mechanical Turk show the approach increases throughput over baseline schedulers and adapts to varying workforce levels and job priorities.
This document presents SANAPHOR, an ontology-based coreference resolution system that improves upon existing approaches by leveraging semantic information. It first links entities in document clusters to semantic types and ontologies. It then splits or merges clusters based on these semantic relationships. The system was evaluated on the CoNLL-2012 dataset, where it improved coreference resolution performance over the baseline Stanford system, particularly for noun clusters. By utilizing semantic knowledge, SANAPHOR demonstrates the benefits of enhancing syntactic coreference resolution with an additional semantic layer.
Efficient, Scalable, and Provenance-Aware Management of Linked Data – eXascale Infolab
The proliferation of heterogeneous Linked Data on the Web requires data management systems to constantly improve their scalability and efficiency. Despite recent advances in distributed Linked Data management, efficiently processing large amounts of Linked Data in a scalable way is still very challenging. In spite of their seemingly simple data models, Linked Data actually encode rich and complex graphs mixing both instance and schema level data. At the same time, users are increasingly interested in investigating or visualizing large collections of online data by performing complex analytic queries. The heterogeneity of Linked Data on the Web also poses new challenges to database systems. The capacity to store, track, and query provenance data is becoming a pivotal feature of Linked Data Management Systems. In this thesis, we tackle issues revolving around processing queries on big, unstructured, and heterogeneous Linked Data graphs.
1) Entity-centric data management stores information at the entity level and integrates information by interlinking entities. This provides advantages over keyword-based and relational database approaches.
2) The XI Pipeline extracts mentions from text and performs named entity recognition, entity linking, and entity typing to associate entities with text.
3) Approaches like ZenCrowd and TRank leverage both algorithms and human computation through crowdsourcing to improve entity linking and fine-grained entity typing.
This document summarizes a presentation given at SSSW 2015 on making sense of semantic data. It discusses challenges in understanding semantic web data, including a "language gap" between semantic web languages like SPARQL and natural language. It presents an approach to bridging this gap through automatically verbalizing SPARQL queries in English. Evaluation results show this helps non-experts understand queries better and faster than the SPARQL format. It also discusses the "semantic gap" caused by mismatches between a question's semantics and a knowledge graph, and presents an approach using templates to generate SPARQL queries from natural language questions.
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data – eXascale Infolab
Uduvudu exploits the semantic and structured nature of Linked Data to generate the best possible representation for a human, based on a catalog of available Matchers and Templates. Matchers and Templates are designed so that they can be built through an intuitive editor interface.
Executing Provenance-Enabled Queries over Web Data – eXascale Infolab
The proliferation of heterogeneous Linked Data on the Web poses new challenges to database systems. In particular, because of this heterogeneity, the capacity to store, track, and query provenance data is becoming a pivotal feature of modern triple stores. In this paper, we tackle the problem of efficiently executing provenance-enabled queries over RDF data. We propose, implement and empirically evaluate five different query execution strategies for RDF queries that incorporate knowledge of provenance. The evaluation is conducted on Web Data obtained from two different Web crawls (The Billion Triple Challenge, and the Web Data Commons). Our evaluation shows that using an adaptive query materialization execution strategy performs best in our context. Interestingly, we find that because provenance is prevalent within Web Data and is highly selective, it can be used to improve query processing performance. This is a counterintuitive result as provenance is often associated with additional overhead.
Micro-task crowdsourcing is rapidly gaining popularity among research communities and businesses as a means to leverage Human Computation in their daily operations. Unlike any other service, a crowdsourcing platform is in fact a marketplace subject to human factors that affect its performance, both in terms of speed and quality. Indeed, such factors shape the dynamics of the crowdsourcing market. For example, a known behavior of such markets is that increasing the reward of a set of tasks would lead to faster results. However, it is still unclear how different dimensions interact with each other: reward, task type, market competition, requester reputation, etc.
In this paper, we adopt a data-driven approach to (A) perform a long-term analysis of a popular micro-task crowdsourcing platform and understand the evolution of its main actors (workers, requesters, and platform). (B) We leverage the main findings of our five year log analysis to propose features used in a predictive model aiming at determining the expected performance of any batch at a specific point in time. We show that the number of tasks left in a batch and how recent the batch is are two key features of the prediction. (C) Finally, we conduct an analysis of the demand (new tasks posted by the requesters) and supply (number of tasks completed by the workforce) and show how they affect task prices on the marketplace.
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu... – eXascale Infolab
This document proposes three methods - LEXT, REXT, and LERIXT - for disambiguating the domain and range of properties in linked data by using context information. LEXT uses the type of subject resources, REXT uses the type of object resources, and LERIXT uses both. The methods were evaluated against expert judgments and achieved up to 96.5% precision for LEXT and 91.4% for REXT. LERIXT generated too many new sub-properties.
CIKM14: Fixing grammatical errors by preposition ranking – eXascale Infolab
The detection and correction of grammatical errors still represent very hard problems for modern error-correction systems. As an example, the top-performing systems at the preposition correction challenge CoNLL-2013 only achieved an F1 score of 17%.
In this paper, we propose and extensively evaluate a series of approaches for correcting prepositions, analyzing a large body of high-quality textual content to capture language usage. Leveraging n-gram statistics, association measures, and machine learning techniques, our system is able to learn which words or phrases govern the usage of a specific preposition. Our approach makes heavy use of n-gram statistics generated from very large textual corpora. In particular, one of our key features is the use of n-gram association measures (e.g., Pointwise Mutual Information) between words and prepositions to generate better aggregated preposition rankings for the individual n-grams.
We evaluate the effectiveness of our approach using cross-validation with different feature combinations and on two test collections created from a set of English language exams and StackExchange forums. We also compare against state-of-the-art supervised methods. Experimental results from the CoNLL-2013 test collection show that our approach to preposition correction achieves ~30% in F1 score which results in 13% absolute improvement over the best performing approach at that challenge.
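The association measure named above, Pointwise Mutual Information, is straightforward to compute from corpus counts. The counts in the example are invented, and a production system would smooth them and combine PMI with the other features the paper describes:

```python
import math

def pmi(pair_count, word_count, prep_count, total):
    """PMI(w, p) = log( P(w, p) / (P(w) * P(p)) ), estimated from counts."""
    p_wp = pair_count / total
    p_w = word_count / total
    p_p = prep_count / total
    return math.log(p_wp / (p_w * p_p))

def rank_prepositions(word, stats, total):
    """Rank candidate prepositions for `word` by PMI, best first.

    `stats` maps preposition -> (pair_count, word_count, prep_count).
    """
    scored = [(p, pmi(*c, total)) for p, c in stats.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)

# Invented counts for the word "depend" in a 100k-token corpus.
stats = {"on": (50, 100, 1000), "in": (5, 100, 2000)}
```

Positive PMI means the word and preposition co-occur more than chance predicts, which is exactly the signal used to aggregate preposition rankings over the individual n-grams.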
OLTPBenchmark is a multi-threaded load generator. The framework is designed to be able to produce variable rate, variable mixture load against any JDBC-enabled relational database. The framework also provides data collection features, e.g., per-transaction-type latency and throughput logs.
Together with the framework we provide the following OLTP/Web benchmarks:
TPC-C
Wikipedia
Synthetic Resource Stresser
Twitter
Epinions.com
TATP
AuctionMark
SEATS
YCSB
JPAB (Hibernate)
CH-benCHmark
Voter (Japanese "American Idol")
SIBench (Snapshot Isolation)
SmallBank
LinkBench
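The framework's variable-rate, variable-mixture design can be mimicked in miniature. The sketch below drives plain Python callables instead of JDBC transactions and makes no claim about OLTPBenchmark's actual internals; the rate-spreading heuristic in particular is an invented simplification:

```python
import random
import threading
import time

def run_load(txn_mix, rate_per_sec, duration_sec, workers=4):
    """Tiny OLTPBench-style load generator (illustrative sketch only).

    `txn_mix` is a list of (weight, callable) pairs; calls are issued at
    roughly `rate_per_sec` across `workers` threads, and per-transaction
    latencies are collected for later aggregation.
    """
    interval = 1.0 / rate_per_sec
    weights = [w for w, _ in txn_mix]
    latencies = []
    lock = threading.Lock()
    stop_at = time.monotonic() + duration_sec

    def worker():
        rng = random.Random()
        while time.monotonic() < stop_at:
            _, txn = rng.choices(txn_mix, weights=weights)[0]  # pick by mixture
            start = time.monotonic()
            txn()
            with lock:
                latencies.append(time.monotonic() - start)
            time.sleep(interval * workers)  # spread the target rate over workers

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies
```

Replacing the callables with JDBC-style transaction functions and the latency list with per-transaction-type histograms gets you to the data-collection features the framework advertises.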
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series) – eXascale Infolab
Internet Infrastructures for Big Data
Talk given at Verisign's Distinguished Speaker Series, 2014
Prof. Philippe Cudre-Mauroux
eXascale Infolab
http://exascale.info/
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake – Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today's world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences. (3) They are context-aware, encoding a different set of transformations for different use cases. (4) They are portable; while the SQL logic is implemented in only one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... – Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
State of Artificial Intelligence Report 2023 – kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
The Ipsos AI Monitor 2024 Report – Social Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
It Takes Two: Instrumenting the Interaction between In-Memory Databases and Solid-State Drives CIDR 2020 presentation
1. It Takes Two: Instrumenting the Interaction between In-Memory Databases and Solid-State Drives
Alberto Lerner (1), Jaewook Kwak (2), Sangjin Lee (2), Kibin Park (2), Yong Ho Song (2,3), Philippe Cudré-Mauroux (1)
(1) XI Lab – University of Fribourg, Switzerland
(2) ENC Lab – Hanyang University, Korea
(3) Samsung Electronics, Korea
CIDR – January 2020 – Amsterdam
2. Motivation
• Where is time going?
• CPU/cache utilization → HW performance counters
• Per-instruction cost → pprof, Linux perf tool
• Operating System impact → systemtap, several others
• SSD performance → ?
3. Challenges in In-Memory Database Durability
• The log needs to be written as fast as possible
• Checkpointing competes with client requests for memory and disk access
• Can we understand the interference? Was the TX Log IO pattern efficient to begin with?
(Figure: users issue transactions to the host, where the Txn Log and the Checkpoint workers compete for access to storage)
4. Cosmos+ OpenSSD
• Idea: let's instrument an actual device!
• SSD rapid prototyping platform
• SoC-based
• Fully functional
• Open-source firmware
• Next generation is in the final stages of development
9. Performance Event Records (PEV)
• Currently four types of records:
• IO_TIMESTAMP – regular timestamp stations
• GC_TIMESTAMP – FTL timestamp stations
• PERFORMANCE_INDEX – aggregated counter
• PERFORMANCE_INDEX_PER_CH – per-channel counters
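A host-side decoder for performance event records of this kind might look like the following. The slides do not give the wire format, so the 16-byte layout sketched here (a 4-byte type tag, a 4-byte station/channel id, an 8-byte value or timestamp) is purely an assumption for illustration:

```python
import struct

# Record type tags matching the four PEV kinds listed above
# (the numeric values are hypothetical).
IO_TIMESTAMP, GC_TIMESTAMP = 0, 1
PERFORMANCE_INDEX, PERFORMANCE_INDEX_PER_CH = 2, 3

# Assumed little-endian layout: type, source (station/channel), value.
RECORD = struct.Struct("<IIQ")

def decode_pev(buf):
    """Yield (type, source, value) tuples from a buffer of PEV records."""
    for off in range(0, len(buf), RECORD.size):
        yield RECORD.unpack_from(buf, off)

def encode_pev(records):
    """Pack an iterable of (type, source, value) tuples into bytes."""
    return b"".join(RECORD.pack(*r) for r in records)
```

Whatever the real firmware emits, the host side reduces to exactly this kind of fixed-stride unpacking followed by grouping on the type tag.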
13. Research Agenda I – Instrumentation
• Functionality limitations
• Currently limited to 4 channels
• Further annotations to trace back valid copies
• Contextual triggers
• Signal generation
• Process instrumentation records on-the-fly
• Identify scenarios where a scheduling policy change is beneficial
14. Research Agenda II – SSD as a Platform
• Adaptive scheduling
• Respond instantaneously to signals generated by changing priorities
• In-storage checkpoint "derivation"
• Move the checkpoint process partially or entirely into the device
15. Conclusion
• SSDs don't have to be black boxes
• The instrumented Cosmos+ allows designers of both databases and FTLs to analyze and understand interference in workloads
• Opportunities to:
• Have SSDs interact with applications in richer ways
• Exploit new possibilities of Near-Data Computing for databases