The document discusses Cilk and Cilk++, parallel programming languages that allow spawning concurrent tasks. It covers key language features like spawn and sync, provides examples of Fibonacci implementations, and describes the work-stealing runtime system that dynamically schedules tasks across processors. The runtime uses a decentralized work-stealing approach in which idle processors steal tasks from other processors' task queues to balance the workload.
Cilk-M is a work-stealing runtime system that solves the cactus stack problem using thread-local memory mapping (TLMM). Each worker maintains its own deque of frames and manipulates the bottom of the deque like a stack. When a worker runs out of work, it steals frames from the top of a random victim's deque. This allows Cilk-M to achieve linear speedup and bounded stack space while maintaining serial-parallel reciprocity and interoperability with legacy code.
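The deque discipline described above can be caricatured in a few lines of Python. This is a single-threaded toy model, not the real runtime (actual Cilk workers use concurrent deques and run on separate threads); the `Worker` class and frame names are purely illustrative:

```python
import collections

class Worker:
    """Toy model of a work-stealing worker that owns a deque of frames."""
    def __init__(self):
        self.deque = collections.deque()

    def push(self, frame):
        self.deque.append(frame)            # owner works at the bottom (right end)

    def pop(self):
        return self.deque.pop() if self.deque else None

    def steal_from(self, victim):
        # Thieves take the oldest frame from the top (left end) of the victim.
        return victim.deque.popleft() if victim.deque else None

owner, thief = Worker(), Worker()
for frame in ["fib(10)", "fib(9)", "fib(8)"]:
    owner.push(frame)

stolen = thief.steal_from(owner)    # thief gets the oldest frame
local = owner.pop()                 # owner keeps working on the newest frame
```

The asymmetry is the point: the owner touches only the bottom of its deque and the thief only the top, which is what keeps contention between them rare.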
The document summarizes the passes of the GCC compiler from parsing to code generation. It begins with parsing the source code into an AST, followed by converting the AST to GIMPLE intermediate representation and performing optimizations like constant propagation, copy propagation and dead code elimination on the GIMPLE. Control flow graphs are constructed and optimizations are performed with SSA form. RTL is generated from GIMPLE and undergoes register allocation and instruction selection before assembly code generation. Example dumps of different passes are shown.
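Two of the GIMPLE-level optimizations mentioned above, constant propagation and dead-code elimination, can be sketched as a toy pass in Python (this is a deliberately tiny stand-in for illustration, not GCC's actual machinery; the IR encoding here is invented):

```python
def optimize(ir, live_out):
    """Constant-propagate, fold, and drop assignments nobody uses.

    ir: list of (target, expr), where expr is an int or an ("+", a, b)
    tuple over constants/variable names.  live_out: variables needed later.
    """
    consts, propagated = {}, []
    for target, expr in ir:
        if isinstance(expr, tuple):
            op, a, b = expr
            a, b = consts.get(a, a), consts.get(b, b)   # propagate constants
            if op == "+" and isinstance(a, int) and isinstance(b, int):
                expr = a + b                            # fold the expression
            else:
                expr = (op, a, b)
        if isinstance(expr, int):
            consts[target] = expr
        propagated.append((target, expr))
    # Dead-code elimination: walk backwards, keep only assignments that
    # feed a live variable.
    live, kept = set(live_out), []
    for target, expr in reversed(propagated):
        if target in live:
            kept.append((target, expr))
            if isinstance(expr, tuple):
                live.update(v for v in expr[1:] if isinstance(v, str))
    return list(reversed(kept))

ir = [("a", 2), ("b", 3), ("c", ("+", "a", "b")), ("d", ("+", "a", "a"))]
optimized_ir = optimize(ir, live_out=["c"])   # d is dead; c folds to a constant
```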
Instrumenting Go (Gopherconindia Lightning talk by Bhasker Kode)
Lightning talk by Bhasker Kode from Helpshift on instrumenting your Go code against a statsite-compatible server, with examples, screenshots, and getting-started tips.
This document discusses Return Oriented Programming (ROP), which is a technique for exploiting software vulnerabilities to execute malicious code without injecting new code. It can be done by manipulating return addresses on the program stack to divert execution flow to existing code snippets ("gadgets") that perform the desired task when executed in sequence. The document covers the anatomy of the x86 stack, common ROP attack approaches like stack smashing and return-to-libc, how gadgets work by chaining neutral instructions, and various defenses such as stack canaries, non-executable memory, address space layout randomization, and position-independent executables.
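The gadget-chaining idea can be simulated in a few lines of Python. This is a pure simulation for intuition only, nothing here exploits anything: each "gadget" is a tiny operation that ends by "returning" to whatever address the attacker placed next on the fake stack.

```python
# Each gadget does one small thing; the chain's effect comes entirely
# from the order of "return addresses" the attacker wrote to the stack.
def gadget_load_5(state):  state["eax"] = 5
def gadget_double(state):  state["eax"] *= 2
def gadget_store(state):   state["mem"] = state["eax"]

def run_rop_chain(fake_stack):
    """Pop 'return addresses' and run the gadget each one points to."""
    state = {"eax": 0, "mem": None}
    while fake_stack:
        gadget = fake_stack.pop(0)   # each 'ret' pops the next address
        gadget(state)
    return state

# The "payload": no injected code, just a sequence of existing snippets.
state = run_rop_chain([gadget_load_5, gadget_double, gadget_store])
```

The defense picture follows from the model: non-executable memory does not help (only existing code runs), while ASLR attacks the chain itself by making gadget addresses unpredictable.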
Taming OpenBSD Network Stack Dragons by Martin Pieuchot (EuroBSDCon)
Abstract
After more than 30 years of evolution, the network stack used in OpenBSD still carries a lot from its original architecture.
The ongoing work to make it process network packets on multiple cores led us to reconsider some parts of this architecture after understanding how its data structures and interfaces were really used.
This talk describes some of the non-obvious internals of OpenBSD's network stack, the dragons, and the work that has been done to tame them.
Speaker bio
Martin Pieuchot is an OpenBSD developer and an R&D engineer working for Compumatica secure networks, a Dutch/German networking appliance manufacturer.
This document discusses using ADS-B data from a home plane-spotting setup to send alerts when a specific plane's call sign is detected flying overhead. It considers technologies like the TICK Stack and Kafka for ingesting and analyzing the streaming ADS-B data in real time. The author tests ingesting the data into InfluxDB on AWS and viewing it in Grafana Cloud, but does not get alerts working in the limited time available. Lessons are learned about debugging time-series data and the challenges of free cloud tiers.
St Petersburg R user group meetup 2, Parallel R by Andrew Bzikadze
This document provides an overview of parallel computing techniques in R using various packages like snow, multicore, and parallel. It begins with motivation for parallelizing R given its limitations of being single-threaded and memory-bound. It then covers the snow package which enables explicit parallelism across computer clusters. The multicore package provides implicit parallelism using forking, but is deprecated. The parallel package acts as a wrapper for snow and multicore. It also discusses load balancing, random number generation, and provides examples of using snow and multicore for parallel k-means clustering and lapply.
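The parallel-lapply pattern those R packages provide looks roughly like the following Python analogue. This is only a shape-of-the-API sketch: snow distributes work over cluster nodes and multicore over forked processes, whereas a thread pool stands in here for simplicity, and `slow_square` is an invented placeholder for a real per-chunk computation:

```python
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    return x * x   # stand-in for per-chunk work, e.g. one k-means restart

xs = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    # Analogous to snow's parLapply / multicore's mclapply: apply a
    # function to each element, with the pool scheduling the pieces.
    results = list(pool.map(slow_square, xs))
```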
This document summarizes a system called Siphon that is used at Microsoft for streaming and analyzing large volumes of data. Some key points:
- Siphon ingests an average of 3.9 million events per second (800 TB per day) from various data sources and uses over 1,700 Kafka brokers for distribution.
- It provides low latency of 10 seconds for the 99th percentile and is used for real-time analytics and insights for services like Bing, Office 365, and internal tools.
- The document then describes the Siphon architecture with data centers around the world and how it handles streaming, batch processing, and auditing of data.
**Return-oriented programming** refers to a clever IT attack technique that is essentially a generalization of *return-to-libc* attacks, which in turn belong to the family of *stack buffer overflow exploits*.
If none of that means anything to you, don't worry: the talk first explains the basics of buffer overflows and their attack potential and walks through a few historical examples, before building the bridge to **ROP** step by step. To close, some countermeasures are briefly presented and assessed for practicality and effectiveness.
If the demo gods are willing, an example program will, among other things, be cracked live using **ROP** tools.
This talk was given at GTC16 by James Beyer and Jeff Larkin, both members of the OpenACC and OpenMP committees. It's intended to be an unbiased discussion of the differences between the two languages and the tradeoffs of each approach.
Inside LoLA - Experiences from building a state space tool for place transiti... (Universität Rostock)
LoLA is a state space tool for analyzing place/transition nets that was developed starting in 1998. It uses various reduction techniques like stubborn sets, symmetries, and linear algebra to combat state space explosion. LoLA has been applied to problems in areas like model checking, business process verification, and distributed systems. Its core data structures and algorithms keep processing costs low during operations like firing transitions and state space traversal.
A peek on numerical programming in Perl and Python, E. Christopher Dyken, 2005 (Jules Krdenas)
This document compares the numerical programming capabilities and performance of Perl and Python with and without numerical libraries like NumPy and PDL. It implements a trapezoidal quadrature rule to integrate three different functions in standard C, optimized C, Python, Python with NumPy, Python with numarray, Perl, and Perl with PDL. The results show that plain Python and Perl are much slower than C, but with numerical libraries their performance is comparable to optimized C for problems that can be formulated as element-by-element array operations. NumPy performs worse for simple functions, but the gap narrows for more complex functions that use trigonometric operations. So for numerical problems, Python and Perl with add-on libraries can be viable alternatives to C/C++.
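For reference, the composite trapezoidal rule the benchmark uses can be written in a few lines of plain Python (with NumPy or PDL the explicit loop becomes one vectorized array expression, which is where the speedups in the comparison come from):

```python
def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))     # endpoints get half weight
    for i in range(1, n):
        total += f(a + i * h)       # interior points get full weight
    return h * total

# Integrate x^2 over [0, 1]; the exact value is 1/3.
approx = trapezoid(lambda x: x * x, 0.0, 1.0, 1000)
```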
Computational Techniques for the Statistical Analysis of Big Data in R (herbps10)
The document describes techniques for improving the computational performance of statistical analysis of big data in R. It uses as a case study the rlme package for rank-based regression of nested effects models. The workflow involves identifying bottlenecks, rewriting algorithms, benchmarking versions, and testing. Examples include replacing sorting with a faster C++ selection algorithm for the Wilcoxon Tau estimator, vectorizing a pairwise function, and preallocating memory for a covariance matrix calculation. The document suggests future directions like parallelization using MPI and GPUs to further optimize R for big data applications.
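The sort-to-selection swap mentioned above (implemented in C++ in the package) rests on a standard idea that can be sketched in Python: when you need only the k-th order statistic, a full O(n log n) sort is wasted work, since expected-O(n) quickselect finds it directly. This sketch is illustrative, not the package's actual code:

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element (0-based) in expected O(n) time."""
    xs = list(xs)
    while True:
        pivot = xs[random.randrange(len(xs))]
        lo = [x for x in xs if x < pivot]
        eq = [x for x in xs if x == pivot]
        if k < len(lo):
            xs = lo                      # answer is among the smaller items
        elif k < len(lo) + len(eq):
            return pivot                 # pivot is exactly the k-th smallest
        else:
            k -= len(lo) + len(eq)       # discard lo and eq, keep searching
            xs = [x for x in xs if x > pivot]

data = [9, 1, 8, 2, 7, 3, 6, 4, 5]
median = quickselect(data, len(data) // 2)   # same answer as sorted(data)[4]
```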
This document summarizes a MATLAB workshop covering various topics:
- MATLAB Central, an official user community for asking and answering questions
- Cipher systems for encrypting text by shifting letters
- Least squares methods for fitting linear models to noisy data
- Dynamic code generation to display customized text
- Techniques for accelerating MATLAB code such as vectorization and memory preallocation
The workshop emphasized best practices like utilizing help functions, saving intermediate work, and having fun with programming.
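The letter-shifting cipher from the workshop's list can be sketched as follows (in Python rather than MATLAB, and with a function name of my own choosing):

```python
def shift_cipher(text, shift):
    """Encrypt by rotating each letter through the alphabet; other
    characters pass through unchanged. Decrypt with the negated shift."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

secret = shift_cipher("Hello, MATLAB!", 3)   # shift every letter forward by 3
plain = shift_cipher(secret, -3)             # shifting back round-trips
```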
Abstract: This research introduces a novel dataflow hardware description abstraction layer that covers the numerous synthesizable uses of RTL constructs and replaces them with higher-level abstractions. We also present DFiant, a Scala-embedded HDL that applies dataflow semantics to decouple design functionality from its constraints. DFiant provides a strong, bit-accurate, type-safe foundation for describing hardware in a very concise and portable fashion. The DFiant compiler can automatically pipeline designs to meet performance requirements as synthesizable RTL code.
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5 by Jeff Larkin
These slides are from an instructor-led tutorial at GTC16. The talk discusses using a pre-release version of Clang with support for OpenMP offloading directives to NVIDIA GPUs to experiment with OpenMP 4.5 target directives.
Runtime Code Generation and Data Management for Heterogeneous Computing in Java by Juan Fumero
This document discusses runtime and data management techniques for heterogeneous computing in Java. It presents an approach that uses three levels of abstraction: parallel skeletons API based on functional programming, a high-level optimizing library that rewrites operations to target specific hardware, and OpenCL code generation and runtime with data management for heterogeneous architectures. It describes how the runtime performs type inference, IR generation, optimizations, and kernel generation to compile Java code into OpenCL kernels. It also discusses how custom array types are used to reduce data marshaling overhead between the Java and OpenCL runtimes.
Recursion & Erlang, FunctionalConf 14, Bangalore by Bhasker Kode
The document discusses the history and design of the Erlang programming language. Some key points:
1) Erlang was designed in 1986 at Ericsson for writing concurrent programs that "run forever." It was created by Joe Armstrong to address the needs of building telephony systems.
2) Concurrency was the primary goal in designing Erlang. This influenced decisions like message passing between processes instead of shared memory, and copying data between processes for isolation.
3) Tail recursion and the actor model were incorporated due to their suitability for implementing concurrent processes and distributed systems. Tail recursion allows processes to be spawned efficiently while preserving state.
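The accumulator-passing style that point 3 alludes to looks like this (sketched in Python for readability; note that Erlang, unlike Python, guarantees that such tail calls run in constant stack space, which is what lets an Erlang process loop forever):

```python
def factorial(n, acc=1):
    """Accumulator-passing style: the recursive call is the very last
    action, so a tail-call-optimizing runtime (like Erlang's) can reuse
    the current frame instead of growing the stack."""
    if n <= 1:
        return acc
    return factorial(n - 1, acc * n)
```

An Erlang server's receive loop has the same shape: the process calls itself tail-recursively with its updated state as the accumulator.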
In these slides, I describe the basics of Python programming and its functions. If you have any doubts about the slides, contact me by mail or LinkedIn. My mail id is mdsathees@gmail.com
Pragmatic Optimization in Modern Programming - Demystifying the Compiler by Marina Kolpakova
This document discusses compiler optimizations. It begins with an outline of topics including compilation trajectory, intermediate languages, optimization levels, and optimization techniques. It then provides more details on each phase of compilation, how compilers use intermediate representations to perform optimizations, and specific optimizations like common subexpression elimination, constant propagation, and instruction scheduling.
The document summarizes key points from a presentation titled "Effective Python Programming - OSCON 2005". It discusses Python fundamentals like namespaces, duck typing and exceptions. It also covers structured programming techniques in Python like iterators, generators, for/else loops and try/finally blocks. The presentation emphasizes writing effective code that makes use of Python features and follows best practices.
GCC is a widely used open source compiler. It consists of frontends for languages like C and C++ and backends that generate code for different CPU architectures. The GCC Extensibility Made Easy (GEM) framework allows dynamically loading modules to extend GCC functionality. Examples include adding new language features, improving security, and facilitating operating system development.
GCC compilers use several stages to compile C/C++ code into executable programs:
1. The preprocessor handles #include, #define, and other preprocessor directives.
2. The front-end parses the code into an abstract syntax tree (AST) and performs type checking and semantic analysis.
3. The middle-end converts the AST into the GIMPLE intermediate representation and performs optimizations like dead code elimination and constant propagation before generating register transfer language (RTL).
4. The back-end selects target-specific instructions, allocates registers, schedules instructions, and outputs assembly code, which is then linked together with other object files by the linker into a final executable.
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations by Marina Kolpakova
This document discusses various compiler optimizations including constant folding, hoisting loop invariant code, scalarization, loop unswitching, peeling and sentinels, strength reduction, loop induction variable elimination, and auto-vectorization. It provides code examples and the generated assembly for each optimization. It explains that many optimizations are performed by compilers automatically at high optimization levels, while some more advanced optimizations like loop peeling and sentinels require manual intervention.
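Two of the listed transformations, loop-invariant hoisting and strength reduction, can be written out by hand to show they preserve results (Python here purely for illustration; an optimizing C compiler applies these automatically at higher optimization levels):

```python
def naive(xs, a, b):
    out = []
    for i, x in enumerate(xs):
        # a + b is recomputed every iteration; i * 8 is a multiply per step
        out.append(x * (a + b) + i * 8)
    return out

def optimized(xs, a, b):
    s = a + b          # loop-invariant code hoisted out of the loop
    offset = 0
    out = []
    for x in xs:
        out.append(x * s + offset)
        offset += 8    # strength reduction: multiply replaced by addition
    return out

xs = [3, 1, 4, 1, 5]
```

Both versions compute the same list; the optimized form simply does less work per iteration, which is exactly what the compiler's versions of these passes buy you.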
NVIDIA joined OpenMP in 2011 to contribute to discussions around parallel accelerators. In 2012, NVIDIA proposed the TEAMS construct for accelerators, which was included in OpenMP 4.0 released in 2013 with support for accelerator directives. NVIDIA supports OpenMP because it is the dominant standard for directive-based parallel programming and allowing applications to easily accelerate using OpenMP provides a way for NVIDIA to reach more developers across multiple domains.
Object Detection Methods using Deep Learning by Sungjoon Choi
The document discusses object detection techniques including R-CNN, SPPnet, Fast R-CNN, and Faster R-CNN. R-CNN uses region proposals and CNN features to classify each region. SPPnet improves efficiency by computing CNN features once for the whole image. Fast R-CNN further improves efficiency by sharing computation and using a RoI pooling layer. Faster R-CNN introduces a region proposal network to generate proposals, achieving end-to-end training. The techniques showed improved accuracy and processing speed over prior methods.
This presentation covers the concepts of data-link access, the BSD Packet Filter, DLPI, Linux SOCK_PACKET, libpcap (the packet capture library), and libnet (the packet creation and injection library).
Detecting Deadlock, Double-Free and Other Abuses in a Million Lines of Linux ... by Peter Breuer
Presentation at the 30th Annual IEEE/NASA Software Engineering Workshop (SEW-30), Loyola College Graduate Center, Columbia, MD, USA, April 25, 2006. The preprint of the paper is at http://www.academia.edu/1413564/Detecting_deadlock_double-free_and_other_abuses_in_a_million_lines_of_linux_kernel_source. DOI 10.1109/SEW.2006.1.
Make static instrumentation great again, High performance fuzzing for Windows... by Lucas Leong
This document discusses making static binary instrumentation great again for high performance fuzzing on Windows systems. It motivates static instrumentation as an alternative to dynamic approaches, describes the implementation of static instrumentation using IDA Pro and modifying PE files, benchmarks showing comparable performance to WinAFL, and case studies finding vulnerabilities through fuzzing kernel drivers and libraries.
Power Up Your Build - Omer van Kloeten @ Wix 2018-04
I was invited to give this talk at the Wix Backend Guild Day, an internal event which was broadcast live internationally on 2018-04-12.
Video: https://youtu.be/cQ7UvUybceA
These days sbt is the de-facto build tool for Scala, but most of us just write the minimum viable build.sbt file, import the libraries we need (and maybe throw in some sbt-assembly) and forget about it.
In this Good Practices session, you will learn how to make your build safer and more robust by making the Scala compiler work for you and by using some sbt plugins.
This talk will be quite high-level. There will be no need for prior knowledge of sbt and it should be beneficial for you even if you don’t use sbt.
With Anaconda (in particular Numba and Dask) you can scale up your NumPy and Pandas stack to many cpus and GPUs as well as scale-out to run on clusters of machines including Hadoop.
The document discusses linkers and loaders, describing their functions in combining object files into executable files. It covers the ELF format, static vs dynamic linking, and how executable files are run using static or dynamic linkers. Key points include how static linkers resolve symbols and perform relocation, while dynamic linkers use shared libraries and handle relocation at runtime via the dynamic linker.
The document summarizes a presentation given by Theo Jungeblut on the topic of clean code. It discusses why clean code is important for maintainability. It also provides an overview of tools like Resharper, FxCop, StyleCop, GhostDoc and Code Contracts that can help write clean code. Principles of clean code like KISS, DRY, SoC and patterns like dependency injection are explained. The presentation emphasizes that maintainability is key to preventing code from bringing a development organization to its knees.
(Costless) Software Abstractions for Parallel ArchitecturesJoel Falcou
Performing large, intensive or non-trivial computing on array like data structures is one of the most common task in scientific computing, video game development and other fields. This matter of fact is backed up by the large number of tools, languages and libraries to perform such tasks. If we restrict ourselves to C++ based solutions, more than a dozen such libraries exists from BLAS/LAPACK C++ binding to template meta-programming based Blitz++ or Eigen. If all of these libraries provide good performance or good abstraction, none of them seems to fit the need of so many different user types.
Moreover, as parallel system complexity grows, the need to maintain all those components quickly become unwieldy. This talk explores various software design techniques - like Generative Programming, MetaProgramming and Generic Programming - and their application to the implementation of a parallel computing librariy in such a way that:
- abstraction and expressiveness are maximized - cost over efficiency is minimized
We'll skim over various applications and see how they can benefit from such tools. We will conclude by discussing what lessons were learnt from this kind of implementation and how those lessons can translate into new directions for the language itself.
K-CAI NEURAL API is a Keras based neural network API for machine learning that will allow you to prototype with a lots of possibilities of Tensorflow! Python, Free Pascal and Delphi together in Google Colab, Git or the Community Edition.
Clean Code at Silicon Valley Code Camp 2011 (02/17/2012)Theo Jungeblut
This document provides an overview of clean code principles and practices. It discusses topics like why clean code matters, definitions of clean code, tools that help enable clean code like Resharper and FxCop, principles of clean code development such as KISS, DRY, SoC and SRP, and coding conventions. The presentation aims to demonstrate how writing clean code can improve code maintainability and efficiency. It also provides references to influential books on clean code by authors like Robert Martin.
This document discusses functional programming and its benefits. It begins with an overview of functional programming concepts like pure functions, referential transparency, and immutability. It then covers functional programming techniques like higher order functions, recursion, composition, and pattern matching. Examples are given comparing imperative and functional implementations for quicksort and optional types. The document argues that functional programming leads to cleaner code by improving modularity, testability and adherence to SOLID principles. It recommends starting with functional features in existing languages and learning Haskell to fully embrace the functional paradigm.
The document provides an overview of how Ruby programs are compiled and executed. It discusses how Ruby source code is tokenized and turned into an abstract syntax tree (AST) before being compiled into bytecode. It then describes how the Ruby interpreter implements a virtual machine that maps bytecode instructions to native operations. Key aspects covered include Ruby using a stack-based execution model, the interaction between the C stack, virtual machine stack, and Ruby call stack, and how garbage collection works through mark and sweep to reclaim unused memory.
Jordan Wiens & Peter LaFosse
Modern binary analysis, whether for discovering vulnerabilities or analyzing malware needs automation to deal with the volume of code under inspection. And yet, while Intermediate Languages (ILs) have been used for decades in compiler design and implementation, too few reverse engineers have any experience with them even though many reverse engineering tools (Binary Ninja, Ghidra, IDA) are built on top of ILs. Given that, it's time to demystify this space and make it accessible beyond just computer scientists and researchers. There's many potentially unfamiliar concepts related to ILs: single-static assignment, value-set analysis, three argument form versus tree-based designs, and others. But what matters is how these ILs can help you build better binary analysis tools. This talk not only gives you an overview of existing ILs used in reverse engineering, but more importantly, shows you how your tooling can benefit from them. From cross-platform analysis (follow a botnet from an x86-64 desktop to a mobile arm, to an embedded MIPS), to leveraging existing data-flow capabilities that brings some of the benefits both dynamic and static analysis together, this talk will demonstrate several examples of plugins that leverage ILs to improve your ability to automatically reason over compiled code.
Solving Cross-Cutting Concerns in PHP - DutchPHP Conference 2016 Alexander Lisachenko
Talk about solving cross-cutting concerns in PHP at DutchPHP Conference.
Discussed questions:
1) OOP features and limitations
2) OOP patterns for solving cross-cutting concerns
3) Aspect-Oriented approach for solving cross-cutting concerns
4) Examples of using AOP for real life application
Implementation and Comparison of Softcore Multiplier Architectures for FPGAsShahid Abbas
The document discusses the implementation and comparison of softcore multiplier architectures for FPGAs. It introduces multiplier architectures like LUT-based multipliers using 3x3, 3x2 and 1x4 LUT structures. It also discusses the FloPoCo library and bit heaps for performing additions. Target specific implementations are explored along with automated and manual methods. Simulation and synthesis results are presented to evaluate the architectures.
The document discusses the STL algorithms in C++. It begins by defining what algorithms and STL algorithms are. It then covers the different classes of STL algorithms including non-modifying sequence operations, mutating sequence operations, sorting operations, general C algorithms, and general numeric operations. Specific algorithms like for_each, transform, all_of, any_of and none_of are discussed in more detail through examples. The document aims to explain what STL algorithms are and how they can be used to operate on sequences and containers in C++.
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...HostedbyConfluent
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson | Current 2022
Imagine having access to metrics, events, and insights without code modification or application redeployment. Imagine visualizing delays and tracking down performance bottlenecks in your Kafka pipeline instantly with minimal performance overhead. In this session, we show all of this is possible with eBPF.
In a live demo, we will introduce an eBPF-based, always-on, CPU profiler to visualize what your Kafka applications are spending time on. We will analyze how much time the Kafka broker spends on handling different requests and responding to polling and how much time a Kafka consumer spends on polling the broker and processing the messages. Furthermore, we will see how to detect issues by measuring consumer lags in both offsets and seconds, and how to correlate the increasing consumer lag with the CPU flame graphs. We demonstrate how not only to detect issues quickly but also to pinpoint performance bottlenecks instantly in the Kafka pipeline: e.g. garbage collection and disk/network IO.
In addition, we will provide some unique insights with eBPF: e.g. topic-centric flow graphs, consumer rebalancing lags, and under-replicated partitions.
Collecting all the data with no instrumentation and low overhead is no easy task. we will conclude by revealing the magic of eBPF and discussing the design choices and technical challenges of our network traffic tracer and Java CPU profiler that empowered deep visibility into Kafka.
This document provides an introduction to static analysis techniques for malware analysis. It begins with an overview of static analysis and the information that can be gleaned without executing code, such as file structure, binary code, related modules, and suspicious strings. Common Linux tools for static analysis like strings, file, hexdump, and objdump are introduced. Disassembly, the process of converting binary machine code to assembly code, is explained. Reverse engineering disassembled code back into C code involves understanding variables, data movement, arithmetic, control flow, functions, and calling conventions. The document concludes by introducing IDA Pro as a popular disassembler and decompiler tool for static analysis.
JIT vs. AOT: Unity And Conflict of Dynamic and Static Compilers Nikita Lipsky
Java had been constantly criticized for poor performance ever since its inception, but not so much in recent years. Thanks to optimizing dynamic native code compilers, Java performance today is very close to the performance of low level languages such as C/C++, and is even better on some classes of applications. Along with dynamic compilers, static compilers for Java have been evolving as well, so there is still no clear winner among these two approaches. It should then come as no surprise that an AOT compiler is finally going to appear even in the HotSpot JVM and OpenJDK via JEP-295, which is officially included in Java 9.
Her, I would like to dispel common myths around the old dispute on whether dynamic or static compilation is better, show that both approaches have their strengths and weaknesses, and explain why the future is the hybrid approach.
Concepts of Functional Programming for Java Brains (2010)Peter Kofler
This document provides a summary of concepts in functional programming. It discusses topics like lambda calculus, pure functions, immutable data, recursion, higher order functions, lists, folding, mapping, filtering. It provides examples in languages like Ruby, Scala, JavaScript. It also mentions ideas like laziness, currying, monads but says they were skipped. The presentation aims to introduce functional programming concepts.
Seattle Cassandra Users: An OSS Java Abstraction Layer for CassandraJosh Turner
Project Casquatch is a database abstraction layer with code generation designed to streamline Cassandra development. Out of the box it comes pre-tuned with high available policies including load balancing, geo-redundancy, connection pooling, etc., sitting on top of the DataStax driver using native APIs. All of this is abstracted behind the ever prevalent POJO. Instead of writing CQL, we utilize generic programming that allows you to simply pass a generated POJO to a save() method or populate with a getById(). This is the same code reportedly used by T-Mobile for multiple national platforms including the activation of the Apple Watch and Galaxy Watch, T-Mobile Payments, Digits, and many others.
As presented at Seattle Cassandra Users Group on June . 26th, 2019.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
2. OUTLINE
• CILK and CILK++ Language Features and Usages
• Work stealing runtime
• CILK++ Reducers
• Conclusions
3. IDEALIZED SHARED MEMORY ARCHITECTURE
• Hardware model
  • Processors
  • Shared global memory
• Software model
  • Threads
  • Shared variables
  • Communication
  • Synchronization
Slide from Comp 422 Rice University Lecture 4
4. CILK AND CILK++ DESIGN GOALS
• Programmer friendly
• Dynamic tasking
• Parallel extension to C
• Scalable performance
• Efficient runtime system
• Minimum program overhead
5. CILK KEYWORDS
• cilk: declares a Cilk procedure
• spawn: the call may execute asynchronously in a concurrent thread
• sync: the current procedure waits for all locally spawned functions to return
6. CILK EXAMPLE
cilk int fib(int n) {
    if (n < 2)
        return n;
    else {
        int n1, n2;
        n1 = spawn fib(n-1);
        n2 = spawn fib(n-2);
        sync;
        return (n1 + n2);
    }
}
Borrowed from Comp 422 Rice University Lecture 4
7. CILK++ EXAMPLE
int fib(int n) {
    if (n < 2)
        return n;
    else {
        int n1, n2;
        n1 = cilk_spawn fib(n-1);
        n2 = fib(n-2);
        cilk_sync;
        return (n1 + n2);
    }
}
Borrowed from Comp 422 Rice University Lecture 4
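A property worth noting alongside these examples (the deck returns to it later when the fast clone is described as the "C elision"): deleting the Cilk keywords leaves a valid serial program that computes the same result. A minimal sketch of the elision of fib:

```cpp
#include <cassert>

// Serial elision of the fib example: spawn/sync removed, result unchanged.
int fib(int n) {
    if (n < 2)
        return n;
    int n1 = fib(n - 1);  // was: n1 = cilk_spawn fib(n-1);
    int n2 = fib(n - 2);
    // cilk_sync elides to nothing in the serial program
    return n1 + n2;
}
```

This serial-parallel reciprocity is what lets the runtime run the fast clone with almost no overhead when no steal occurs.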
8. CILK++ EXAMPLE WITH DAG
Pictures from "Reducers and Other Cilk++ Hyperobjects" talk by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), Stephen Lewin-Berlin (Intel)
9. OUTLINE
• CILK and CILK++ Language Features and Usages
• Work stealing runtime
• CILK++ Reducers
• Conclusions
10. WORK FIRST PRINCIPLE
• Work: T1
• Critical path length: T∞
• Number of processors: P
• Expected time: Tp = T1/P + O(T∞)
• Parallel slackness assumption: T1/P >> C∞T∞
11. WORK FIRST PRINCIPLE
• Minimize the scheduling overhead borne by the work, even at the expense of increasing the critical path
• Tp ≤ C1Ts/P + C∞T∞ ≈ C1Ts/P
• Minimize C1 even at the expense of a larger C∞
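To make the bound concrete, here it is evaluated with illustrative numbers (the values of Ts, P, C1, C∞ and T∞ below are assumptions for the sketch, not measurements from the slides). With ample parallel slackness the C∞T∞ term contributes under 1% of the total, which is why the runtime minimizes C1 even at the cost of a larger C∞:

```cpp
#include <cassert>

// Evaluates the slide's bound: Tp <= C1*Ts/P + Cinf*Tinf.
double tp_bound(double c1, double ts, double p, double cinf, double tinf) {
    return c1 * ts / p + cinf * tinf;
}
```

For example, with Ts = 10^6, P = 8, C1 = 1.1, C∞ = 10 and T∞ = 100, the work term is 137500 while the critical-path term adds only 1000, under 1% of the total.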
12. WORK STEALING DESIGN GOALS
• Minimize contention
  • Decentralized task deques
  • Doubly-linked deque
• Minimize communication
  • Steal work rather than push work
  • Load balance across cores
• Lazy task creation
• Steal from the top of the deque
13–26. CILK WORK STEALING SCHEDULER (animation; one scheduler step per slide)
Pictures from "Reducers and Other Cilk++ Hyperobjects" talk by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), Stephen Lewin-Berlin (Intel)
27. TWO CLONE STRATEGY
• Fast clone
  • Identical in most respects to the C elision of the Cilk program
  • Very little execution overhead
  • Sync statements compile to no-ops
  • Allocates a continuation (program variables and instruction pointer)
• Slow clone
  • A spawned procedure is converted to the slow clone only when it is stolen
  • Restores program state from an activation frame that contains the local variables, program counter and other parts of the procedure instance
29. SLOW CLONE
Slow_fib(frame *_cilk_frame) {
    /* restore the state of the program */
    switch (_cilk_frame->header.entry) {
        fast_fib(_cilk_frame->n - 1);
        case 1: goto _cilk_sync1;
        fast_fib(_cilk_frame->n - 2);
        case 2: goto _cilk_sync2;
        sync;  /* not a no-op in the slow clone */
        case 3: goto _cilk_sync3;
    }
}
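The entry-switch trick above can be modeled in plain C++. The sketch below is illustrative, not Cilk's real runtime API (Frame, entry and resume are invented names): the frame records how far the procedure got, and re-entering the function dispatches on that saved entry point, jumping past work that already ran, exactly the mechanism the slow clone uses to restart a stolen procedure.

```cpp
#include <cassert>
#include <vector>

// Hypothetical frame: state lives here, not on the C stack, so the
// procedure can be suspended and resumed by a different worker.
struct Frame {
    int entry = 0;  // resume point, recorded before each suspension
    int acc   = 0;  // partial results saved in the frame
};

// Runs one step per call; returns false when the procedure is finished.
bool resume(Frame &f, std::vector<int> &log) {
    switch (f.entry) {
    case 0:
        log.push_back(10); f.acc += 10;
        f.entry = 1; return true;   // suspend: state saved in the frame
    case 1:
        log.push_back(20); f.acc += 20;
        f.entry = 2; return true;
    case 2:
        log.push_back(30); f.acc += 30;
        f.entry = 3; return false;  // done
    }
    return false;
}
```

Each call resumes where the previous one left off, so no step runs twice even though the function is re-entered from the top.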
31. FRAMES
• C++ Main Frame
  • Local variables of the procedure instance
  • Temporary variables
  • Linkage information for return values
32. FRAMES
• CILK++ Stack Frame
  • Everything in a C++ Main Frame
  • Continuation
  • Parent pointer
  • Has exactly one child
  • Used by the fast clone
  • A worker can have multiple stack frames
33. FRAMES
• CILK++ Full Frame (used by the slow clone)
  • Everything in a CILK++ Stack Frame
  • Lock
  • Join counter
  • List of children (may have more than one child)
  • A worker has at most one Full Frame
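The frame hierarchy on slides 31–33 can be sketched as nested structs. These are illustrative declarations, not the real Cilk++ runtime types: a stack frame extends an ordinary activation record with a continuation and a parent link, and a full frame adds the synchronization state that only a stolen procedure needs.

```cpp
#include <cassert>
#include <vector>

struct Continuation { int entry = 0; };     // saved resume point

struct StackFrame {                          // fast-clone bookkeeping
    Continuation cont;
    StackFrame *parent = nullptr;            // exactly one child at a time
};

struct FullFrame : StackFrame {              // slow-clone bookkeeping
    int join_counter = 0;                    // outstanding spawned children
    std::vector<FullFrame *> children;       // may have several children
    // a lock would protect these fields in a real multi-worker runtime
};
```

Keeping the join counter and child list out of the common stack frame is what keeps the fast clone's overhead close to a plain function call.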
34. FUNCTION CALL
34
Stack frame
Full frame
Extended Deque (Before Function Call)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
35. FUNCTION CALL
35
Stack frame
Full frame
Extended Deque (After Function Call)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
New stack
frame
36. SPAWN
36
Stack frame
Full frame
Extended Deque (Before Spawn Call)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
37. SPAWN
37
Stack frame
Full frame
Extended Deque (After Spawn Call)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
Set
continuation
in last stack
frame
38. RESUME FULL FRAME
38
Stack frame
Full frame
Extended DequeFunction call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
Set the full frame to be the only frame in the
call stack, resume execution on the
continuation
39. RANDOMLY STEAL
39
Stack frame
Full frame
Extended DequeFunction call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
Steal this call stack
40. RANDOMLY STEAL
40
Stack frame
Full frame
Extended DequeFunction call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
Steal this call stack
1 1 1
41. RANDOMLY STEAL
41
Stack frame
Full frame
Extended Deque
Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
1
1 1
42. PROVABLY GOOD
STEAL
42
Stack frame
Full frame
Extended DequeFunction call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
0
44. FUNCTION CALL
RETURN
44
Stack frame
Full frame
Extended Deque (Before Return from a Call Case1)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
45. FUNCTION CALL
RETURN
45
Stack frame
Full frame
Extended Deque (Return from a Call Case 1)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
46. FUNCTION CALL
RETURN
46
Stack frame
Full frame
Extended Deque (Return from a Call Case2)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
Worker executes an
unconditional steal
47. SPAWN RETURN
47
Stack frame
Full frame
Extended Deque (Before Spawn return Case 1)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
48. SPAWN RETURN
48
Stack frame
Full frame
Extended Deque (After Spawn return Case 1)Function call
Spawn
Call return
Spawn return
Sync
Randomly steal
Provably good
steal
Unconditionally
steal
Resume full
frame
49. SPAWN RETURN
49
Extended deque (return from a spawn, Case 2)
Worker executes a provably good steal
50. SYNC
50
Extended deque (sync, Case 1)
Do nothing if it is a stack frame (no-op)
51. SYNC
51
Extended deque (sync, Case 2)
Pop the frame,
provably good steal
52. OUTLINE
• CILK and CILK++ Language Features and
Usages
• Work stealing runtime
• CILK++ Reducers
• Conclusions
52
53. PROBLEMS WITH
NON-LOCAL VARIABLES
bool has_property(Node *);
List<Node *> output_list;
void walk(Node *x)
{
if (x) {
if (has_property(x))
output_list.push_back(x);
cilk_spawn walk(x->left);
walk(x->right);
cilk_sync;
}
}
53
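A hedged Python model (not Cilk++ code) of why the non-local output_list above is a problem: with the spawned strand and the continuation both appending to one global list, the final order depends on how their appends interleave. The strand names and values here are illustrative.

```python
import itertools

# Appends performed by the two parallel strands (hypothetical values):
spawned = ["left1", "left2"]   # appends from the cilk_spawn'd walk(x->left)
continuation = ["right1"]      # appends from the continuation walk(x->right)

def is_subsequence(sub, seq):
    """True if sub appears in seq in order (each strand's own order is preserved)."""
    it = iter(seq)
    return all(x in it for x in sub)

# Enumerate every interleaving that preserves each strand's internal order.
orders = {
    merged
    for merged in itertools.permutations(spawned + continuation)
    if is_subsequence(spawned, merged) and is_subsequence(continuation, merged)
}

print(len(orders))  # 3 possible output orders: the result is nondeterministic
```

Even with only three appends there are multiple legal output orders, which is exactly the nondeterminism that reducers (next slides) remove.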
54. REDUCER
DESIGN GOALS
• Support parallelization of programs
containing global variables
• Enable efficient parallel scaling by
avoiding a single point of contention
• Provide a deterministic result for
associative reduce operations
• Operate independently of any control
constructs
54
55. REDUCER EXAMPLE
bool has_property(Node *);
List_append_reducer<Node *> output_list;
void walk(Node *x)
{
if (x) {
if (has_property(x))
output_list.push_back(x);
cilk_spawn walk(x->left);
walk(x->right);
cilk_sync;
}
}
55
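A hedged Python sketch of the reducer semantics the next slides describe (a model, not the Cilk++ library): the spawned child keeps the parent's view, the continuation gets a fresh identity view, and at the sync the views are combined with an associative REDUCE in serial order, so the output matches the serial execution.

```python
def identity():
    # Identity view for a list-append reducer: an empty list.
    return []

def reduce_views(left, right):
    # Associative combine for list append: concatenation, with the
    # left (earlier-in-program-order) view first.
    return left + right

# Parent appends 'a', then spawns a child that appends 'b' and 'c'.
parent_view = identity()
parent_view.append("a")

child_view = parent_view           # child strand owns the parent's view
child_view.append("b")
child_view.append("c")

continuation_view = identity()     # parent continuation gets a fresh view
continuation_view.append("d")

# At cilk_sync the views are reduced in serial (left-to-right) order:
result = reduce_views(child_view, continuation_view)
print(result)  # ['a', 'b', 'c', 'd'], exactly the serial execution's output
```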
56. HYPEROBJECTS
56
Pictures from "Reducers and Other Cilk++ Hyperobjects" talk by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), and Stephen Lewin-Berlin (Intel).
57. REDUCER
57
58. SEMANTICS OF
REDUCERS
• The child strand owns the view owned by the parent function before the cilk_spawn
• The parent strand owns a new view, initialized to the identity element e
• A special optimization elides the reduction when a view is combined with an identity view, since the view is unchanged
• The parent strand P owns the views from completed child strands
58
59. REDUCING OVER LIST
CONCATENATION
59
60. REDUCING OVER LIST
CONCATENATION
60
61. IMPLEMENTATION OF
REDUCER
• Each worker maintains a hypermap
• A hypermap maps reducers to views:
• User: the view of the current procedure
• Children: the views of completed child procedures
• Right: the view of the right sibling
• Identity: the default value of a view
61
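A hedged Python model of the hypermap bookkeeping above (the names are illustrative, not the runtime's actual API): each frame carries User, Children, and Right maps from reducer to view, and a lookup that misses lazily installs an identity view.

```python
class Frame:
    """A frame's three hypermaps, each mapping reducer name -> view."""
    def __init__(self):
        self.user = {}      # view of the current procedure
        self.children = {}  # accumulated views of completed children
        self.right = {}     # accumulated view of the right sibling

def lookup(frame, reducer, identity):
    # Lookup-failure rule: on a miss, lazily insert an identity view and
    # return it, so an untouched reducer costs nothing.
    if reducer not in frame.user:
        frame.user[reducer] = identity()
    return frame.user[reducer]

f = Frame()
view = lookup(f, "output_list", list)  # miss: creates and returns []
view.append("node1")
same = lookup(f, "output_list", list)  # hit: returns the same view
print(same)  # ['node1']
```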
63. HYPERMAP CREATION
64
64. HYPERMAP CREATION
65
65. HYPERMAP CREATION
66
66. HYPERMAP CREATION
67
67. HYPERMAP CREATION
68
68. LOOK UP FAILURE
• On a lookup failure, the runtime inserts a view containing an identity element for the reducer into the hypermap
• This follows the lazy principle
• The lookup then returns the newly inserted identity view
69
69. RANDOM WORK
STEALING
A random steal operation steals a full frame
P and replaces it with a new full frame C in
the victim.
USER_C ← USER_P
USER_P ← ∅
CHILDREN_P ← ∅
RIGHT_P ← ∅
70
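The steal-time hypermap update above can be sketched in the same Python model (illustrative names, not the runtime's API): the new full frame C inherits the victim frame P's User map, and P's three maps are emptied.

```python
class Frame:
    def __init__(self):
        self.user, self.children, self.right = {}, {}, {}

def random_steal_update(P):
    """Hypermap update when full frame P is stolen: create C with
    USER_C <- USER_P, then empty USER_P, CHILDREN_P, RIGHT_P."""
    C = Frame()
    C.user = P.user          # USER_C <- USER_P
    P.user = {}              # USER_P <- empty
    P.children = {}          # CHILDREN_P <- empty
    P.right = {}             # RIGHT_P <- empty
    return C

P = Frame()
P.user = {"output_list": ["a"]}
C = random_steal_update(P)
print(C.user, P.user)  # {'output_list': ['a']} {}
```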
70. RANDOM WORK
STEALING
71
71. RETURN FROM A CALL
Let C be a child frame of the parent frame P
that originally called C, and suppose that C
returns.
• If C is a stack frame, do nothing
• If C is a full frame, transfer ownership of C's view:
• USER_P ← USER_C
• CHILDREN_C and RIGHT_C are guaranteed to be empty
77
72. RETURN FROM A
SPAWN
Let C be a child frame of the parent frame P that
originally spawned C, and suppose that C returns.
• Always do USER_C ← REDUCE(USER_C, RIGHT_C)
• If C is a stack frame, do nothing further
• If C is a full frame:
• If C has a left sibling L:
• RIGHT_L ← REDUCE(RIGHT_L, USER_C)
• If C is the leftmost child:
• CHILDREN_P ← REDUCE(CHILDREN_P, USER_C)
78
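The spawn-return updates above can be sketched as follows (a hedged Python model; REDUCE here is list concatenation, and the parameter names are illustrative):

```python
def reduce_views(left, right):
    # Associative combine; the left operand is earlier in serial order.
    return left + right

def spawn_return(user_c, right_c, left_sibling=None, parent=None):
    """Model of the spawn-return update for a full frame C. Returns the
    updated (RIGHT of left sibling, CHILDREN of parent); exactly one of
    left_sibling / parent receives the child's view."""
    # Always fold the accumulated right-sibling views into the child first:
    user_c = reduce_views(user_c, right_c)   # USER_C <- REDUCE(USER_C, RIGHT_C)
    if left_sibling is not None:
        # C has a left sibling L: deposit into RIGHT_L.
        return reduce_views(left_sibling, user_c), parent
    # C is the leftmost child: deposit into CHILDREN_P.
    return left_sibling, reduce_views(parent, user_c)

# C (not leftmost) returns: its view lands in its left sibling's RIGHT map.
right_l, _ = spawn_return(["c1"], [], left_sibling=["l1"])
print(right_l)  # ['l1', 'c1']

# A leftmost child returns: its view lands in the parent's CHILDREN map.
_, children_p = spawn_return(["b1"], ["r1"], parent=["p1"])
print(children_p)  # ['p1', 'b1', 'r1']
```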
73. SYNC
A cilk_sync statement waits until all children have completed. When frame P executes a cilk_sync, one of the following two cases applies:
• If P is a stack frame, do nothing.
• If P is a full frame:
• USER_P ← REDUCE(CHILDREN_P, USER_P)
82
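The full-frame sync case can be sketched the same way (a hedged Python model with illustrative values): the accumulated children views are reduced into the parent's view with the children on the left, preserving serial order.

```python
def reduce_views(left, right):
    # Associative combine; left operand is earlier in serial order.
    return left + right

# At cilk_sync on a full frame P: USER_P <- REDUCE(CHILDREN_P, USER_P).
# CHILDREN_P holds the completed children's combined view, which precedes
# the parent's continuation view in serial order.
children_p = ["child1", "child2"]
user_p = ["after_sync_region"]
user_p = reduce_views(children_p, user_p)
print(user_p)  # ['child1', 'child2', 'after_sync_region']
```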
75. OUTLINE
• CILK and CILK++ Language Features and
Usages
• Work stealing runtime
• CILK++ Reducers
• Conclusions
84
76. CONCLUSIONS
• CILK and CILK++ provide a programmer-friendly programming model
• Extension to C
• Incremental parallelism
• Scaling on future machines
• Uncompromising performance
• Work stealing runtime
• Minimizing overheads
• Reducers
85
77. FINAL NOTES
• Designed for an idealized shared memory
model
• Today’s architectures are typically NUMA
• Task creation can be lazier
• http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6012915&tag=1
• cilk_for
• Divide and conquer parallelization
86
Editor's Notes
CILK and CILK++ adopt the shared memory model: a single uniform address space, not sockets, is the abstraction.
If you have taken Comp 322: spawn is very similar to the "async" keyword in Habanero Java, and the sync keyword is similar to the "finish" scope. Cilk++ extends C++.
An example of the Fibonacci sequence computation in Cilk. Spawn two threads at each invocation of the function; notice the cilk keyword is used to denote a Cilk function.
Cilk++ took away the cilk keyword and prefixed spawn and sync with cilk_.
Directed acyclic graph: spawn creates parallel executions B and C; they join together and recombine to execute D.
Work: the time needed to execute the program serially. Parallel slackness assumption: the number of processors is much smaller than the average degree of parallelism.
To support dynamic task creation
The Cilk runtime uses a special work-stealing scheduler. There are two kinds of schedulers. In work sharing, all the workers pull from a unified task queue; it is less efficient for a number of reasons: there is potentially a single lock on the task queue to deal with contention, and the queue can be empty while work still remains elsewhere. The work-stealing runtime solves the problem by building an extended deque for each worker; when a worker is out of work, it steals randomly from other workers. We will demonstrate the process in the next few slides. Decentralized: push work rather than pull work (when necessary). A loop containing a spawn packages the child task on the stack of a single processor (lazy task creation).
Steal from the top to reduce contention. Stealing from the top also gets a bigger subtree (divide and conquer) and larger task granularity, minimizing steals, and it increases the possible locality of the program (cache locality).
The reason
All sync statements compile to no-ops: a fast clone never has any children when it is executing, so we know at compile time that all previously spawned procedures have completed, and no operations are required for a sync statement before it recursively spawns.
Looks a lot like the original fib (highlight the original sequential code); the rest is bookkeeping. A little bit of bookkeeping: sig is the signature, which includes the pointer to the slow clone routine; fibsig represents the slow clone. The entry point is an instruction pointer. This comes back to the principle we described earlier.
Uses fast_fib locally
Set the continuation in the original procedure's stack frame; allocate a stack frame for B; push B's stack frame onto the tail of the deque.
Pick a random victim v, where v ≠ w. Repeat this step while the deque of v is empty. Remove the oldest call stack from the deque of v, and promote all stack frames to full frames. For every promoted frame, increment the join counter of the parent frame (full by Invariant 3). Make every newly created child the rightmost child of its parent. Let loot be the youngest frame that was stolen. Promote the oldest frame now in v's extended deque to a full frame and make it the rightmost child of loot. Increment loot's join counter. Execute a resume-full-frame action on loot.
Join counter: frames left in the heap (0). Assert that the frame A being stolen is a full frame and the extended deque is empty. Decrement the join counter of A. If the join counter is 0 and no worker is working on A, execute a resume-full-frame action on A. Otherwise, begin random work stealing.
Assert that the frame A being stolen is a full frame, the extended deque is empty, and A's join counter is positive. Decrement the join counter of A. Execute a resume-full-frame action on A.
Just removing a stack frame
In this case the full frame has finished execution.
Do nothing if it is a stack frame
Little modification; deterministic output as long as the reduce operation is associative.
Can be used to parallelize many programs containing global (or nonlocal) variables without locking, atomic updating, or the need to logically restructure the code. The programmer can count on a deterministic result as long as the reducer operator is associative; commutativity is not required. Reducers operate independently of any control constructs, such as parallel for loops, and of any data structures that contribute their values to the final result.
Fast clone uses identity view
Example of serial execution
Children of A would be {B, C}; the right sibling of B would be C; User would be the view in A.
We distinguish two cases: the "fast path" when C is a stack frame, and the "slow path" when C is a full frame. The fast path does nothing because both P and C share the view stored in the map at the head of the deque to which both P and C belong. The slow path transfers ownership of child views to the parent; the other two hypermaps of C are guaranteed to be empty and do not participate in the update.
Again we distinguish the “fast path” when C is a stack frame from the “slow path” when C is a full frame:
If proc B finishes first, the results would be in Children of A. If C finishes first, it would be the leftmost, and Children of A would be the combination of the current Children of A and USER_C. Both are leftmost cases.
When C finishes, it has a left sibling B, so the result of C is accumulated into RIGHT_B. When B finishes, Children of A receives USER_B.
1. Doing nothing is correct because all children of P, if any existed, were stack frames, and thus they transferred ownership of their views to P when they completed. Thus, no outstanding child views exist that must be reduced into P's. 2. After P passes the cilk_sync statement, but before executing any client code, we perform the update. This update reduces all reducers of completed children into the parent.
Comparing reducers against mutual exclusion
Future scaling with dynamic parallelism. Provides a simple way to add incremental parallelism (incremental parallelization of programs). Inspired many later works, such as Habanero Java, Habanero C, and X10.
Eagerly saving all the state; the states are gathered using an exception when workers make a steal.