Miller Lee discusses C++ Accelerated Massive Parallelism (C++ AMP) which provides a simpler programming model for GPU computing compared to CUDA and OpenCL. C++ AMP models GPU data as C++ containers and kernels as C++ lambdas. The MCW C++ AMP (CLAMP) compiler translates C++ AMP code to OpenCL, generating OpenCL C code for the device and host code for execution. While C++ AMP code is more concise than OpenCL, its performance depends on the compiler and runtime support.
The document discusses different approaches to implementing GPU-like programming on CPUs using C++AMP. It covers using setjmp/longjmp to implement coroutines for "fake threading", using ucontext for coroutine context switching, and how to pass lambda functions and non-integer arguments to makecontext. Implementing barriers on CPUs requires synchronizing threads with an atomic counter instead of GPU shared memory. Overall, the document shows it is possible to run GPU-like programming models on CPUs by simulating the GPU programming model using language features for coroutines and threading.
C++: How I learned to stop worrying and love metaprogramming (cppfrug)
Cette présentation parcours quelques applications directes de la méta-programmation en C++(11/14) avec comme objectif de démontrer son utilité dans un cadre applicatif.
This document provides an introduction and overview of parallel programming with CUDA and OpenCL. It outlines key topics like the CUDA and OpenCL architectures, programming languages, and terminology. Code examples are provided to demonstrate basic CUDA concepts like kernels, memory allocation, and using blocks and threads to perform array addition in parallel on the GPU. The document also briefly discusses matrix multiplication and optimizations like using shared memory.
HSA enables more efficient compilation of high-level programming interfaces like OpenACC and C++AMP. For OpenACC, HSA provides flexibility in implementing data transfers and optimizing nested parallel loops. For C++AMP, HSA allows efficient compilation from an even higher level interface where GPU data and kernels are modeled as C++ containers and lambdas, without needing to specify data transfers. Overall, HSA aims to reduce boilerplate code for heterogeneous programming and provide better portability across devices.
Bridge TensorFlow to run on Intel nGraph backends (v0.5) (Mr. Vengineer)
The document describes how the nGraph TensorFlow bridge works by rewriting TensorFlow graphs to run on Intel nGraph backends. It discusses how optimization passes are used to modify the graph in several phases: 1) Capturing TensorFlow variables as nGraph variables, 2) Marking/assigning/deassigning nodes to clusters, 3) Encapsulating clusters into nGraphEncapsulateOp nodes to run subgraphs on nGraph. Key classes and files involved are described like NGraphVariableCapturePass, NGraphEncapsulatePass, and how they implement the different rewriting phases to prepare the graph for nGraph execution.
VC4C: development of an OpenCL compiler for VideoCore IV (nomaddo)
This document discusses the development of an OpenCL compiler called VC4C for the VideoCore IV GPU found in the Raspberry Pi. It provides an overview of the VC4 architecture, including its quad processing units, texture and memory lookup unit, uniform cache, and vertex pipe memory. It then introduces VC4C as an open-source project that compiles OpenCL to optimized assembly for the VC4. Several challenges are discussed, such as limited registers, cache incoherency, and complex iteration patterns arising from OpenCL IDs. Optimization techniques explored include constant handling, vectorization, kernel fusion, and software pipelining. In conclusion, VC4C remains a work in progress, but it opens up opportunities for compiler optimization on a platform that has seen little of it.
Tiramisu is a code optimization and generation framework that can be integrated into custom compilers. It supports various backends including multi-CPU (using LLVM), GPU (using CUDA), distributed systems (using MPI), and FPGAs (using Xilinx Vivado HLS). Tiramisu uses polyhedral representations to support irregular domains beyond just rectangles. The document provides an overview of Tiramisu and discusses challenges related to supporting different platforms, memory dependencies, efficient code generation, and representations. It also mentions that Tiramisu uses Halide and ISL.
The document discusses histograms and image segmentation techniques in computer vision. It includes code to generate histograms from grayscale, RGB, and live camera images. For grayscale images, the histogram shows pixel intensity distribution. For RGB images, separate histograms are generated for the red, green, and blue channels. Analyses explain that brighter images have histograms skewed right while darker images are skewed left. Live camera histograms change in real-time based on captured scene lighting and colors.
TVM uses Verilator and DPI to connect Verilog/Chisel accelerator models written in SystemVerilog/Chisel to Python code. It initializes the hardware model and controls simulation using methods like SimLaunch, SimWait, SimResume. The Python code loads the accelerator module, allocates memory, runs the accelerator by calling driver functions that interface with the DPI to initialize, launch and wait for completion of the accelerator. This allows accelerators developed in Verilog/Chisel to be tested from Python.
Productive OpenCL Programming: An Introduction to OpenCL Libraries with Array... (AMD Developer Central)
This document provides an overview of OpenCL libraries for GPU programming. It discusses specialized GPU libraries like clFFT for fast Fourier transforms and Random123 for random number generation. It also covers general GPU libraries like Bolt, OpenCV, and ArrayFire. ArrayFire is highlighted as it provides a flexible array data structure and hundreds of parallel functions across domains like image processing, machine learning, and linear algebra. It supports JIT compilation and data-parallel constructs like GFOR to improve performance.
How to make a large C++ code base manageable (corehard_by)
My talk will cover how to work with a large C++ code base professionally. How to write code for debuggability, how to work effectively even due the long C++ compilation times, how and why to utilize the STL algorithms, how and why to keep interfaces clean. In addition, general convenience methods like making wrappers to make the code less error prone (for example ranged integers, listeners, concurrent values). Also a little bit about common architecture patterns to avoid (virtual classes), and patterns to encourage (pure functions), and how std::function/lambda functions can be used to make virtual classes copyable.
Evgeniy Krutko: Multithreaded Computing, a Modern Approach (Platonov Sergey)
The document discusses parallel computing in modern C++. It introduces native threads, standard threads in C++11, thread pools, std::async, and examples of parallelizing real applications. It also covers potential issues like data races and tools for detecting them like Valgrind and ThreadSanitizer. Finally, it recommends using std::async, std::future and boost::thread for flexibility and OpenMP for ease of use.
SIMD machines — machines capable of evaluating the same instruction on several elements of data in parallel — are nowadays commonplace and diverse, be it in supercomputers, desktop computers or even mobile ones. Numerous tools and libraries can make use of that technology to speed up their computations, yet it could be argued that there is no library that provides a satisfying minimalistic, high-level and platform-agnostic interface for the C++ developer.
Glow is a compiler and execution engine for neural networks created by Facebook. It takes a high-level graph representation of a neural network and compiles it into efficient machine code for different hardware backends like CPU and OpenCL. The key steps in Glow include loading a model, optimizing the graph, lowering it to a low-level IR, scheduling operations to minimize memory usage, generating instructions for the backend, and performing optimizations specific to the target. Glow aims to provide a portable way to deploy neural networks across different hardware platforms.
Recently the work of the WG21 standardization committee wrapped up, and the C++17 draft document was submitted to the International Organization for Standardization (ISO) for review. From that point on, we can technically say that we have a C++17 standard. If you have not yet familiarized yourself with the accepted changes, now is the time. The talk surveys the new features and reviews the current state of C++17 support in popular compilers.
Evgeniy Muralev, Mark Vince: Working with the compiler, not against it (Sergey Platonov)
The talk will look at limitations of compilers when creating fast code and how to make more effective use of both the underlying micro-architecture of modern CPU's and how algorithmic optimizations may have surprising effects on the generated code. We shall discuss several specific CPU architecture features and their pros and cons in relation to creating fast C++ code. We then expand with several algorithmic techniques, not usually well-documented, for making faster, compiler friendly, C++.
Note that we shall not discuss caching and related issues here as they are well documented elsewhere.
Optimizing the development process: more powerful development with internal tools, Stephen Kennedy (11:40, Room 103) (changehee lee)
The document discusses techniques for robust versioning of serialized data across changes to data structures and classes. It proposes a snapshot-based versioning system that records how to update data between different snapshots by comparing metadata and defining update functions. This allows chaining updates between multiple previous versions to reach the current one without constraining design changes or requiring conditional logic based on version numbers.
Multithreading with modern C++ is hard: undefined variables, deadlocks, livelocks, race conditions, spurious wakeups, the double-checked locking pattern, and so on. And underneath it all sits the new memory model, which does not make life any easier. The list of things that can go wrong is very long. In this talk I give you a tour of the things that can go wrong and show how you can avoid them.
The document discusses intra-machine parallelism and threaded programming. It introduces key concepts like threads, processes, synchronization constructs (locks and condition variables), and challenges like overhead and Amdahl's law. An example of domain decomposition for parallel rendering is presented to demonstrate how to divide a problem into independent tasks and assign them to threads.
This document provides an introduction to GDB (GNU Debugger) including what it is, why it is useful, basic GDB commands, and examples of using GDB to debug a C program. Key points include:
- GDB is an interactive debugger that allows debugging of C/C++ programs.
- It helps developers find bugs by allowing them to watch/modify variables, determine why programs fail, and change program flow.
- Basic GDB commands demonstrated include breakpoints, backtraces, printing variables, and stepping through code.
- An example program is debugged using GDB to step through functions and view variable values.
This document discusses embedded system development. It begins with definitions of embedded systems and some of their common characteristics like limited resources and real-time constraints. It then discusses specific issues like memory alignment, flash and RAM sizes, and performance optimizations. Examples are given of embedded projects like digital video recorders and how to address issues like file sorting, memory usage and stack overflows. The conclusion emphasizes that embedded systems involve knowledge from many technical fields and stresses the importance of experience, observation, and a positive problem-solving attitude.
Working with relational databases in C++ (corehard_by)
The document discusses various C++ libraries for working with relational databases, including native database clients, third-party libraries, and what may be on the horizon. It covers libraries for PostgreSQL, MySQL, Oracle, Microsoft SQL Server, and others. It provides code examples for connecting to a database and executing queries using libraries like QtSQL, Poco::Data, OTL, SOCI, and Sqlpp11. It also mentions a proposed new library called cppstddb that aims to provide a standardized C++ interface for databases.
Story of static code analyzer development (Andrey Karpov)
The document discusses the history and development of static code analyzers. It describes how early tools used regular expressions that were ineffective for complex code analysis. Modern static analyzers overcome these limitations through techniques like type inference, data flow analysis, symbolic execution, and pattern-based analysis. They also leverage method annotations and a mixture of analysis approaches. While machine learning is hyped, static analysis remains very challenging due to the complexity of code and rapid language evolution.
The document provides sample code examples for key Node.js concepts including prototype-based object-oriented programming, asynchronous programming with callbacks, promises, and async/await, automated testing with Mocha and Chai, and using TypeScript with Node.js. The examples cover topics such as object prototypes, classes, timers, promises, generator functions, generics, and writing automated tests. Useful links are also provided for further learning Node.js, asynchronous programming, testing, and TypeScript.
[Paper introduction] Relay: A New IR for Machine Learning Frameworks (Takeo Imai)
The document introduces Relay, a new intermediate representation (IR) for machine learning frameworks. Relay aims to provide both the static graph optimizations of frameworks like TensorFlow and the dynamic graph expressiveness of frameworks like PyTorch. It serves as a common IR that can be lowered to hardware backends like CUDA and OpenCL for model deployment.
Introduce Brainf*ck, another Turing complete programming language. Then, try to implement the following from scratch: Interpreter, Compiler [x86_64 and ARM], and JIT Compiler.
The document provides an overview of OpenCL, including:
- OpenCL allows programs to execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors.
- It defines a programming model for parallel computation along with a framework API for controlling devices and allocating memory.
- The OpenCL framework handles compiling programs for different devices and scheduling work across processors. It provides interfaces for querying platforms and devices, creating contexts, and managing memory and command queues.
- OpenCL aims to standardize parallel programming and overcome the need to learn separate APIs for each type of hardware as processors evolve with increasing core counts.
A Sensing Coverage Analysis of a Route Control Method for Vehicular Crowd Sen... (Osamu Masutani)
The document proposes and evaluates route control methods for vehicular crowd sensing to maximize sensing coverage of a city. It presents three key ideas: (1) modifying vehicle routes to pass through areas of high sensing demand, (2) reserving routes to avoid traffic concentration, and (3) using predictive reservations for longer routes. The methodology and evaluation show that these methods can enhance coverage without significantly increasing travel time, especially for static and uniform demands. Future work includes optimization techniques and more realistic simulations.
This document provides guidance for Linux administration practicals, including:
- An index of 17 practical topics ranging from basic Linux commands to configuring mail services.
- Detailed instructions for Practical 1 on basic commands like cat, mkdir, cp, and editors like vi. It provides an example directory and file structure to create.
- An overview of Practical 2 on installing Red Hat Linux, including selecting installation options and partitioning the hard drive to make space.
- Descriptions of changing file permissions using both binary and symbolic modes with chmod, and decoding permission codes from the ls command.
- An explanation of the different modes in the vi editor, like command, insert, and ex modes.
Linux System Administration Crash Course (Jason Cannon)
This document provides an overview of a Linux administration crash course that covers key system administration concepts. The course is intended for anyone interested in learning Linux administration, from beginners to experienced professionals. It covers topics such as the Linux boot process, system logging, disk management, user and group management, networking, processes and jobs, scheduling jobs with cron, and managing software. The document promotes enrolling in an in-depth Linux administration course for live instruction and assistance with shell scripting.
The document discusses various topics related to Linux administration. It covers Unix system architecture, the Linux command line, files and directories, running programs, wildcards, text editors, shells, command syntax, filenames, command history, paths, hidden files, home directories, making directories, copying and renaming files, and more. It provides an overview of key Linux concepts and commands for system administration.
The document discusses the Intel Open-source SYCL compiler project. It provides an overview of SYCL (Single-source heterogeneous programming using standard C++), the Intel Open-source SYCL project location and architecture, and its plugin interface and future directions. The project aims to contribute SYCL compiler support to the LLVM compiler infrastructure to promote SYCL as the open standard for heterogeneous programming and foster collaboration in the industry.
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
Kiran Lonikar proposes extending Project Tungsten in Spark SQL to enable parallel execution of DataFrame operations on GPUs. The proposal involves refactoring DataFrames to use a columnar layout and generating OpenCL code for batched execution across columns. Initial results show speedups from GPU execution. Future work includes supporting multi-GPU execution and adapting additional systems like Impala that may be better suited than Spark for GPU integration.
Introduction to cuda geek camp singapore 2011Raymond Tay
This document provides an introduction to CUDA (Compute Unified Device Architecture). It discusses that GPUs have advantages over CPUs for parallel computing due to their optimized architecture and large number of cores. It explains how CUDA works by offloading parts of a program to run on GPU memory and cores. An example of a block cipher encryption is provided to illustrate a CPU and GPU program for the same task. Additional CUDA concepts covered include debugging tools, adoption rates, and libraries.
2011.02.18 marco parenzan - modelli di programmazione per le gpuMarco Parenzan
This document discusses programming languages and compilers for GPUs. It begins by noting that Nvidia/CUDA is not the only option for GPU computing and that other platforms like Intel and AMD exist. It then explains why GPU computing is useful due to GPUs' parallel processing capabilities. The document outlines some popular GPU products and programming models like CUDA, OpenCL, and DirectCompute. It provides kernel code examples for tasks like matrix multiplication in languages like C for CUDA and OpenCL. Finally, it discusses related topics like GPU programming libraries, host languages, and metaprogramming.
Runtime Code Generation and Data Management for Heterogeneous Computing in JavaJuan Fumero
This document discusses runtime and data management techniques for heterogeneous computing in Java. It presents an approach that uses three levels of abstraction: parallel skeletons API based on functional programming, a high-level optimizing library that rewrites operations to target specific hardware, and OpenCL code generation and runtime with data management for heterogeneous architectures. It describes how the runtime performs type inference, IR generation, optimizations, and kernel generation to compile Java code into OpenCL kernels. It also discusses how custom array types are used to reduce data marshaling overhead between the Java and OpenCL runtimes.
The document provides an overview of new features in C++20, including small language and library improvements. Some key points summarized:
1. Aggregate initialization allows initialization of aggregates using designated initializers (e.g. {.a = 3, .c = 7}) and direct initialization syntax (e.g. Widget w(1,2)).
2. Structured bindings allow capturing initialized variables from aggregates into auto variables (e.g. auto [a,b] = getWidget()).
3. Lambdas can be used in more contexts like static/thread_local variables and allow capturing initialized variables. Templates see expanded use of generic lambdas, non-type template parameters, and class
CUDA is a parallel computing platform and programming model developed by Nvidia that allows software developers and researchers to utilize GPUs for general purpose processing. CUDA allows developers to achieve up to 100x performance gains over CPU-only applications. CUDA works by having the CPU copy input data to GPU memory, executing a kernel program on the GPU that runs in parallel across many threads, and copying the results back to CPU memory. Key GPU memories that can be used in CUDA programs include shared memory for thread cooperation, textures for cached reads, and constants for read-only data.
The document describes a presentation about data processing with CDK (Cloud Development Kit). It includes an agenda that covers CDK and Projen, serverless ETL with Glue, Databrew with continuous integration/continuous delivery (CICD), and using Amazon Comprehend with S3 object lambdas. Constructs are demonstrated for building architectures with CDK across multiple programming languages. Examples are provided of using CDK to implement Glue workflows, Databrew CICD pipelines, and combining Comprehend with S3 object lambdas for PII detection and redaction.
The document discusses how scripting languages like Python, R, and MATLAB can be used to script CUDA and leverage GPUs for parallel processing. It provides examples of libraries like pyCUDA, rGPU, and MATLAB's gpuArray that allow these scripting languages to interface with CUDA and run code on GPUs. The document also compares different parallelization approaches like SMP, MPI, and GPGPU and levels of parallelism from nodes to vectors that can be exploited.
7. What we need in GPU programming
1. put data-parallel code into a kernel for the GPU to execute
2. pass the arguments to the GPU
○ we cannot pass the arguments via the stack
3. an index to identify the current thread
4. move the data between GPU and CPU memory
9. Device code in OpenCL
__kernel void
matrixMul(__global float* C, __global float* A,
__global float* B, int wA, int wB)
{
int tx = get_global_id(0);
int ty = get_global_id(1);
float value = 0;
for (int k = 0; k < wA; ++k)
{
float elementA = A[ty * wA + k];
float elementB = B[k * wB + tx];
value += elementA * elementB;
}
C[ty * wB + tx] = value; // row stride of C is wB (C is hA x wB)
}
10. Host code in OpenCL 1.2
1. Allocate and initialize memory on the host side
2. Initialize OpenCL
3. Allocate device memory and move the data
4. Load and build the device code
5. Launch the kernel
a. Append the arguments
6. Move the data back from the device
13. What is C++ AMP
● C++ Accelerated Massive Parallelism
○ Designed for data-level parallelism
○ Extension of C++11, proposed by Microsoft
○ An open specification with multiple implementations, aiming at standardization
■ MS Visual Studio 2013
■ MCW CLAMP
● GPU data modeled as C++-style containers for multidimensional arrays
● GPU kernels modeled as C++11 lambdas
14. Comparisons
● C++AMP: simple and elegant, performance(?); proposed by Microsoft
● Thrust: library; proposed by NVIDIA (for CUDA)
● Bolt: library; proposed by AMD
● OpenACC: annotations and pragmas; proposed by PGI
● SYCL: wrapper for OpenCL; proposed by Codeplay
15. Matrix Multiplication in C++AMP
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int* productMatrix,
                     int ha, int hb, int hc) {
  array_view<int, 2> a(ha, hb, aMatrix);
  array_view<int, 2> b(hb, hc, bMatrix);
  array_view<int, 2> product(ha, hc, productMatrix);
  parallel_for_each(
    product.extent,
    [=](index<2> idx) restrict(amp) {
      int row = idx[0];
      int col = idx[1];
      for (int inner = 0; inner < hb; inner++) { // loop over the shared dimension
        product[idx] += a(row, inner) * b(inner, col);
      }
    }
  );
  product.synchronize();
}
16. Host and device code in OpenCL

// Host code
clGPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                       NULL, NULL, &errcode);
shrCheckError(errcode, CL_SUCCESS);
// get the list of GPU devices associated with context
errcode = clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES,
                           0, NULL, &dataBytes);
cl_device_id *clDevices = (cl_device_id *) malloc(dataBytes);
errcode |= clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES,
                            dataBytes, clDevices, NULL);
shrCheckError(errcode, CL_SUCCESS);
// Create a command-queue
clCommandQue = clCreateCommandQueue(clGPUContext,
                                    clDevices[0], 0, &errcode);
shrCheckError(errcode, CL_SUCCESS);

// Device code
__kernel void
matrixMul(__global float* C, __global float* A,
          __global float* B, int wA, int wB)
{
  int tx = get_global_id(0);
  int ty = get_global_id(1);
  float value = 0;
  for (int k = 0; k < wA; ++k)
  {
    float elementA = A[ty * wA + k];
    float elementB = B[k * wB + tx];
    value += elementA * elementB;
  }
  C[ty * wB + tx] = value; // row stride of C is wB
}
17. C++AMP programming model
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();
}
GPU data modeled as data containers
18. C++AMP programming model
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();
}
Execution interface;
marking an implicitly
parallel region for GPU
execution
19. C++AMP programming model
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();
}
Kernels modeled as
lambdas; arguments
are implicitly modeled
as captured variables
20. MCW C++AMP (CLAMP)
● Clang/LLVM-based
○ translates C++AMP code to OpenCL C code and generates an OpenCL SPIR file
○ comes with supporting template libraries
● Runtime support: gmac / OpenCL / HSA Okra
● An open-source project
○ one of only two C++ AMP implementations recognized by the HSA Foundation (the other is MSVC)
○ supported by Microsoft and the HSA Foundation
21. MCW C++ AMP Compiler
● Device path
○ generates OpenCL C code via the CBackend
○ emits the kernel function
● Host path
○ prepares the launch of the code
Pipeline: C++ AMP source code → Clang/LLVM 3.3 → Device Code + Host Code
22. Execution process
● Device path: C++ AMP source code → Clang/LLVM 3.3 → Device Code
● Host path: C++ AMP source code → Clang/LLVM 3.3 → Host Code → gmac → OpenCL
● Our work: the Clang/LLVM 3.3 compilation paths; gmac and OpenCL are existing components
23. gmac
● unified virtual address space in software
● can sometimes have high overhead
● in HSA (e.g. AMD Kaveri), gmac is no longer needed
24. Compiling C++AMP to OpenCL
● C++AMP → LLVM IR → subset of C
● argument passing (lambda capture vs. function calls)
● explicit vs. implicit memory transfer
● the heavy lifting is done by the compiler and the runtime
25. lambda capture
struct add {
int a;
add(int a) : a(a) {}
int operator()(int x) const {
return a + x;
}
};
int main(void)
{
int a = 3;
auto fn = [=] (int x) { return a + x; };
int c = fn(3);
return 0;
}
These captured variables must be placed on the argument list of the OpenCL kernel.
26. What do we need to do?
● Kernel function
○ emit the kernel function with the required arguments
● On the host side
○ a function that recursively traverses the object and appends the arguments to the OpenCL argument stack
● On the device side
○ reconstruct the object in the device code for later use
27. Example
struct A { int a; };
struct B : A { int b; };
struct C { B b; int c; };
struct C c;
c.c = 100;
auto fn = [=] () { int qq = c.c; };
30. Serialization constructor
struct C
{
B b;
int c;
void __cxxamp_serialize(Concurrency::Serialize& s) { // by reference, so appends accumulate
b.__cxxamp_serialize(s);
s.Append(sizeof(int), &c);
}
};
31. Translation
parallel_for_each(product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
__kernel void
matrixMul(__global float* C, __global float* A,
__global float* B, int wA, int wB)
{
int tx = get_global_id(0);
int ty = get_global_id(1);
float value = 0;
for (int k = 0; k < wA; ++k)
{
float elementA = A[ty * wA + k];
float elementB = B[k * wB + tx];
value += elementA * elementB;
}
C[ty * wB + tx] = value; // row stride of C is wB
}
● append the arguments
● set the index
● emit the kernel function
● manage memory implicitly
33. Future work
● Future work for us
○ restrict(auto)
○ HSA-related work
34. Future work for you
● Try it out!!
● Many of us have gotten spoiled and don’t want to go back to writing OpenCL directly anymore :-)
● related links
○ Driver
○ Clang
○ sandbox