The document describes Griffon, a GPU programming API for scientific and general-purpose applications. Griffon aims to combine a simple, OpenMP-like programming model with CUDA-level performance on GPUs. It uses a source-to-source compiler driven by directives to generate CUDA code, and it handles memory management automatically. The document outlines Griffon's objectives, software architecture, directives for parallel regions, kernel flow control, synchronization, and overlapping GPU/CPU computation.
Python has always had a close relationship with the C language, both through its syntactic affinity and through its own C API.
The presentation's goal is to describe and compare several ways of writing C/C++ bindings for Python, which extend Python either with speed improvements or with access to the large ecosystem of C/C++ (and other) libraries.
The following approaches are presented, along with their strengths and weak points: the Python C API, ctypes, SWIG, and Cython.
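As a minimal taste of the ctypes approach covered in the talk, a Python script can load the C math library and call into it without writing any wrapper code (a sketch; the library lookup is platform-dependent and assumes a libm is present):

```python
import ctypes
import ctypes.util

# Locate the C math library in a platform-independent way.
libm_path = ctypes.util.find_library("m")
libm = ctypes.CDLL(libm_path)

# Declare the C signature: double sqrt(double).
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(9.0))  # 3.0
```

Declaring `argtypes`/`restype` is the main chore ctypes leaves to the programmer; SWIG and Cython generate that glue from interface files or annotated source instead.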
Parallel Application Performance Prediction Using Analysis-Based Modeling, by Jason Liu
Parallel Application Performance Prediction Using Analysis Based Models and HPC Simulations, Mohammad Abu Obaida, Jason Liu, Gopinath Chennupati, Nandakishore Santhi, and Stephan Eidenbenz. 2018 SIGSIM Principles of Advanced Discrete Simulation (SIGSIM-PADS’18), May 2018.
Two-level Just-in-Time Compilation with One Interpreter and One Engine, by Yusuke Izawa
This document proposes a two-level just-in-time compilation approach using one interpreter and one engine. It finds that by providing different interpreter definitions to the RPython meta-tracing compiler, different kinds of compilers and compilations can be derived, such as tracing, method, and threaded code compilers. The key idea is an adaptive RPython system that performs multitier compilation by generating different interpreters from a generic interpreter and driving the RPython engine accordingly. This challenges the assumption in the JIT community that a meta-tracing compiler can only perform tracing compilation.
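To make the "threaded code" variant mentioned above concrete, here is an illustrative pure-Python sketch (not taken from the paper; all names are invented): instead of decoding opcodes in a dispatch loop on every run, the program is pre-resolved once into a list of callables, and execution simply walks that list.

```python
# A tiny stack machine in subroutine-threaded style: the program is
# resolved into a list of (handler, argument) pairs ahead of time, so
# running it involves no opcode decoding.

def compile_threaded(program, handlers):
    """Resolve each (opcode, arg) pair to its handler exactly once."""
    return [(handlers[op], arg) for op, arg in program]

def run(code):
    stack = []
    for handler, arg in code:
        handler(stack, arg)
    return stack[-1]

HANDLERS = {
    "push": lambda s, a: s.append(a),
    "add":  lambda s, a: s.append(s.pop() + s.pop()),
    "mul":  lambda s, a: s.append(s.pop() * s.pop()),
}

# (2 + 3) * 4
code = compile_threaded(
    [("push", 2), ("push", 3), ("add", None), ("push", 4), ("mul", None)],
    HANDLERS,
)
print(run(code))  # 20
```

The point of the paper is that this kind of execution strategy, like tracing and method compilation, can be derived from different interpreter definitions fed to the same RPython engine.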
A CGRA-based Approach for Accelerating Convolutional Neural Networks, by Shinya Takamaeda-Y
The document presents an approach for accelerating convolutional neural networks (CNNs) using a coarse-grained reconfigurable array (CGRA) called EMAX. EMAX features processing elements with local memory to improve data locality and memory bandwidth utilization. CNN computations like convolutions are mapped to EMAX by assigning weight matrices to constant registers and performing numerous small matrix multiplications in parallel. Evaluation shows EMAX achieves better performance per memory bandwidth and area than GPUs for CNN workloads due to its optimization for small matrix operations.
In this deck, Jem Davies (VP Engineering and ARM Fellow) gives a brief introduction to Machine Learning and explains how it is used in devices such as smartphones, autos, and drones. "I do think that machine learning altogether is probably going to be one of the biggest shifts in computing that we'll see in quite a few years. I'm reluctant to put a number on it like -- the biggest thing in 25 years or whatever," said Jem Davies in a recent investor call. "But this is going to be big. It is going to affect all of us. It affects quite a lot of ARM, in fact."
Watch the video presentation: http://insidehpc.com/2017/03/slidecast-arm-steps-machine-learning/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document provides information about using high-level programming languages to generate hardware implementations on FPGAs. It discusses how high-level synthesis (HLS) can be used to synthesize register transfer level (RTL) descriptions from C/C++ or Python code. This allows hardware to be programmed at a higher level of abstraction without having to manually write RTL code. Specific HLS tools mentioned include Xilinx Vivado HLS, Altera OpenCL, Veriloggen for Python, and synthesizing hardware from languages like C, C++, Java, and Python.
SX Aurora TSUBASA (Vector Engine) a Brand-new Vector Supercomputing power in..., by inside-BigData.com
In this deck from the HPC User Forum at Argonne, Deepak Pathania presents: SX Aurora TSUBASA (Vector Engine) a Brand-new Vector Supercomputing power in Server Chassis.
"The NEC Vector Engine Processor was developed using 16 nm FinFET process technology for extreme high performance and low power consumption. The Vector Engine Processor has the world's first implementation of one processor with six HBM2 memory modules using Chip-on-Wafer-on-Substrate technology, leading to the world-record memory bandwidth of 1.2 TB/s."
Watch the video: https://wp.me/p3RLHQ-kOK
Learn more: https://www.nec.com/en/global/solutio...
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This is a presentation for the MEMO conference that highlights what this MILI program is all about. MILI = Metronet Information Literacy Initiative. What is it and why is it important?
"Social Media for PGA Pros: Getting Started on Twitter, Facebook and LinkedIn" was presented March 21, 2011 at the Northern Ohio PGA Spring Meeting.
CrossRef Annual Meeting 2012: FundRef, by Fred Dylla (Crossref)
This document summarizes a progress report on the FundRef initiative presented at the CrossRef Annual General Meeting on November 14, 2012. FundRef is a pilot project led by CrossRef to develop a standard way for scholarly publications to report their funding sources. The progress report outlines the challenges of attributing publications to funders, describes the benefits for stakeholders, and provides updates on the pilot project's participants, timetable, and goals. Publishers have begun adding funder metadata to test publications and depositing them in CrossRef and CrossMark. The pilot aims to define requirements for a funder registry and demonstrate a methodology for connecting publications and funders.
This document provides tips for using Twitter to expand one's professional network and engage students. It recommends following experts in one's field, searching for top professors on specific topics, and participating in education-focused Twitter chats. Hashtags are suggested for capturing classroom discussions and following relevant conversations. Resources are listed to help new Twitter users understand basic functions like retweets and favorites.
This is a presentation I'll be doing for the Twin Cities Media Alliance. I'll be presenting at public libraries around the Twin Cities metro on apps you can use for your business or organization.
The document discusses graphics processing units (GPUs) and general-purpose GPU (GPGPU) computing. It explains that GPUs were originally designed for computer graphics but can now be used for general computations through GPGPU. The document outlines CUDA and MPI frameworks for programming GPGPU applications and discusses how GPGPU provides highly parallel processing that is much faster than traditional CPUs. Example applications mentioned include molecular dynamics, bioinformatics, and high performance computing.
This deck is from the opening session of the "Introduction to Programming Pascal (P100) with CUDA 8" workshop at CSCS in Lugano, Switzerland. The three-day course is intended to offer an introduction to Pascal computing using CUDA 8.
Watch the video: http://wp.me/p3RLHQ-gsQ
Learn more: http://www.cscs.ch/events/event_detail/index.html?tx_seminars_pi1%5BshowUid%5D=155
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses using GPUs for general purpose computing. It provides examples showing that GPUs can compute normal vectors for images significantly faster than CPUs, with times of 125 clock cycles for a 640x480 image on GPU vs 625 on CPU and 172 clock cycles for a 1280x1024 image on GPU vs 2500 on CPU. It also provides an overview of tools for GPGPU programming, such as CUDA and shader languages, and how GPUs are specialized for parallel processing which allows them to outperform CPUs for certain tasks.
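The deck's exact kernel is not shown, but a common formulation of the per-pixel normal-vector computation it benchmarks treats the image as a heightmap: the normal at each pixel is the normalized vector (-dz/dx, -dz/dy, 1), with the slopes taken by central differences. A hedged pure-Python sketch (function and variable names are invented):

```python
import math

def normal_at(height, x, y):
    """Approximate the surface normal of a heightmap at (x, y) using
    central differences. Each pixel is computed independently of its
    neighbours, which is exactly why this maps so well onto a GPU."""
    dzdx = (height[y][x + 1] - height[y][x - 1]) / 2.0
    dzdy = (height[y + 1][x] - height[y - 1][x]) / 2.0
    nx, ny, nz = -dzdx, -dzdy, 1.0
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length)

# A flat surface tilted along x: z = x, so the normal leans away from +x.
height = [[float(x) for x in range(4)] for _ in range(4)]
print(normal_at(height, 1, 1))
```

On a GPU this inner computation runs once per fragment or thread over the whole 640x480 or 1280x1024 grid, which is where the speedups quoted above come from.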
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~, by Kohei KaiGai
GPU processing provides significant performance gains for PostgreSQL according to benchmarks. PG-Strom is an open source project that allows PostgreSQL to leverage GPUs for processing queries. It generates CUDA code from SQL queries to accelerate operations like scans, joins, and aggregations by massive parallel processing on GPU cores. Performance tests show orders of magnitude faster response times for queries involving multiple joins and aggregations when using PG-Strom compared to the regular PostgreSQL query executor. Further development aims to support more data types and functions for GPU processing.
The document discusses deep learning applications design, development and deployment in IoT edge. It describes using a Power9 system to train artificial neural network models using the MNIST dataset. It also covers building inference engines for Android phones and deploying visual recognition models to IBM Watson Studio.
Graphics processing units (GPUs) are increasingly being used for general-purpose computing applications due to their highly parallel and programmable nature. GPU computing uses the GPU alongside the CPU in a heterogeneous model, with the sequential CPU portion handling control flow and passing data to the GPU for parallel, compute-intensive work. GPUs have evolved from fixed-function processors into fully programmable parallel processors. Many applications that require large amounts of parallelism and throughput can benefit from offloading work to the GPU. GPU architectures provide a high degree of parallelism through multiple stream processors that can execute the same instructions on different data sets. Software environments like CUDA and OpenCL allow general-purpose programming of GPUs for applications beyond graphics.
GPUs have evolved from graphics cards to platforms for general purpose high performance computing. CUDA is a programming model that allows GPUs to execute programs written in C for general computing tasks using a single-instruction multiple-thread model. A basic CUDA program involves allocating memory on the GPU, copying data to the GPU, launching a kernel function that executes in parallel across threads on the GPU, copying results back to the CPU, and freeing GPU memory.
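The five-step flow described above can be sketched in plain Python; this is an illustrative CPU analogy, not GPU code, and the comments name the CUDA runtime calls each step corresponds to:

```python
def vector_add(a, b):
    """Mimic the canonical CUDA host-program structure on the CPU.
    Each comment names the CUDA API call the step corresponds to."""
    n = len(a)

    # 1. Allocate device memory            (cudaMalloc)
    d_a, d_b, d_out = [0.0] * n, [0.0] * n, [0.0] * n

    # 2. Copy inputs host -> device        (cudaMemcpy, HostToDevice)
    d_a[:], d_b[:] = a, b

    # 3. Launch the kernel, one thread per element
    #    (kernel<<<blocks, threads>>>(d_a, d_b, d_out))
    for i in range(n):                  # on a GPU these iterations run in parallel
        d_out[i] = d_a[i] + d_b[i]

    # 4. Copy the result device -> host    (cudaMemcpy, DeviceToHost)
    out = list(d_out)

    # 5. Free device memory                (cudaFree) -- implicit here
    return out

print(vector_add([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```

Step 3 is where the single-instruction multiple-thread model shows up: every GPU thread executes the same kernel body, distinguished only by its thread index `i`.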
OpenGL Based Testing Tool Architecture for Exascale Computing, by CSCJournals
1) The document proposes an OpenGL based testing tool architecture for exascale computing to improve performance and accuracy of OpenGL programs.
2) It identifies common errors that occur when programming shaders in OpenGL Shading Language (GLSL) such as errors in file reading, compilation, linking, and rendering.
3) The proposed testing architecture divides the GLSL programming process into four stages - file reading, compilation, pre-linking/linking, and rendering - and validates each stage to detect errors and enforce error-free code.
Using GPUs to Handle Big Data with Java, by Tim Ellison
A copy of the slides presented at JavaOne conference 2014.
Learn how Java can exploit the power of graphics processing units (GPUs) to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
In this deck from the HPC User Forum at Argonne, Jean-Marc Denis presents: An Update on the European Processor Initiative.
"The EPI project aims to deliver a high-performance, low-power processor, implementing vector instructions and specific accelerators with high bandwidth memory access. The EPI processor will also meet high security and safety requirements. This will be achieved through intensive use of simulation, development of a complete software stack and tape-out in the most advanced semiconductor process node. SGA1 will provide a competitive chip that can effectively address the requirements of the HPC, AI, automotive and trusted IT infrastructure markets."
Watch the video: https://wp.me/p3RLHQ-kRB
Learn more: https://www.european-processor-initiative.eu/project/epi/
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document discusses using PostgreSQL and GPU acceleration to build a machine learning platform. It describes HeteroDB, which provides database and analytics acceleration using GPUs. It outlines how PostgreSQL's foreign data wrapper Gstore_fdw manages persistent GPU device memory, allowing data to remain on the GPU between queries for faster analytics. Gstore_fdw also enables inter-process data collaboration by allowing processes to share access to GPU memory using IPC handles. This facilitates integrating PostgreSQL with external analytics code in languages like Python.
This document discusses the implementation of a finite impulse response (FIR) filter on a graphics processing unit (GPU). It outlines how FIR filters can be represented using textures on the GPU and implemented using fragment programs. The performance of FIR filters and related transformations implemented on the GPU is evaluated. Texture upload and download between GPU and main memory accounts for up to 60% of the total processing time. While GPU computation is faster than CPU for these algorithms, optimization techniques from CPU programming do not always apply to the GPU.
The document discusses the evolution of GPU architecture and capabilities over time. It describes how GPUs have become massively parallel processors with programmable capabilities beyond just graphics. The document outlines the core components of a GPU including the graphics pipeline and programming model. It also discusses how GPUs are well suited for parallel, data-intensive applications and how their capabilities have expanded into general purpose computing through technologies like CUDA.
This document provides an update on PGI compilers and tools for heterogeneous supercomputing. It discusses PGI's support for OpenACC directives to accelerate applications on multicore CPUs and NVIDIA GPUs from a single source. It highlights new compiler features including support for Intel Skylake, AMD EPYC and IBM POWER9 CPUs as well as NVIDIA Volta GPUs. Benchmark results show strong performance of OpenACC applications on these platforms. The document also discusses the growing adoption of OpenACC in HPC applications and resources available to support OpenACC development.
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights covers the on-demand sessions from the OpenACC Summit 2020, upcoming GPU Hackathons and Bootcamps, an OpenACC-to-FPGA framework, the NERSC GPU Hackathon, new resources and more!
The document provides an overview of OpenCL, including:
- OpenCL allows programs to execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors.
- It defines a programming model for parallel computation, along with a framework API for controlling devices and allocating memory.
- The OpenCL framework handles compiling programs for different devices and scheduling work across processors. It provides interfaces for querying platforms and devices, creating contexts, and managing memory and command queues.
- OpenCL aims to standardize parallel programming and overcome the need to learn separate APIs for each type of hardware as processors evolve with increasing core counts.
Easy and High Performance GPU Programming for Java Programmers, by Kazuaki Ishizaki
IBM researchers presented techniques for executing Java programs on GPUs using IBM Java 8. Developers can write parallel programs using standard Java 8 stream APIs without annotations. The IBM Java runtime optimizes the programs for GPU execution by exploiting read-only caches, reducing data transfers between CPU and GPU, and eliminating redundant exception checks. Benchmark results showed the GPU version was 58.9x faster than single-threaded CPU code and 3.7x faster than 160-threaded CPU code on average, achieving good performance gains.
The document discusses pmux, a file-based MapReduce tool developed by IIJ that uses Unix standard input/output. Pmux can perform distributed tasks like grep across files on GlusterFS. It works by having a dispatcher assign map tasks to worker nodes, which perform the tasks and return results. For tasks with reduce phases, it produces intermediate files that are shuffled before reduce tasks are assigned. An example uses pmux to count word frequencies. Related tools include pmux-gw for a HTTP interface and pmux-logview for visualizing job progress. Performance testing showed pmux could finish a task 300 times faster using 60 nodes compared to a single node.
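The word-frequency example mentioned above follows the classic map/shuffle/reduce shape. A minimal in-process Python sketch of the three phases (pmux itself runs them as Unix processes communicating over standard input/output and intermediate files):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs, one line at a time."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group intermediate pairs by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["to be or not to be"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In the distributed setting, the dispatcher hands slices of the input files to map tasks on worker nodes, the shuffle writes intermediate files grouped by key, and reduce tasks consume those groups.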
Amruth Kumar Juturu is a computer science graduate with a Master's degree from Texas A&M University and a Bachelor's degree from Indian Institute of Information Technology. He has work experience as an intern at Nvidia and as a senior software developer at Citrix R&D India. Currently he is a graduate assistant at the Supercomputing Facility at TAMU where he provides technical support to users. He has strong skills in programming languages like C/C++, Java, Python and tools like MPI, OpenMP, CUDA. His academic and work projects include developing a virtual machine manager, transport drivers, proxy servers and device overlays.
MLBlock is updating with new features based on user feedback. It will become a standalone application written in Typescript with a NextJS frontend and backend bundled together as a Docker container. The update will focus on peer-to-peer connectivity, Google Colab integration, Home Assistant integration, and support for application updates. Developers can look forward to using MLBlock via the command line with a new MLBlock CLI.
The document discusses the internet of things (IoT) and how it connects physical devices to the internet through sensors, software and network connectivity. It provides examples of various smart IoT devices like a smart fork, water bottle, toothbrush and more. It then discusses some of the key concepts and technologies behind IoT like MQTT, a lightweight messaging protocol commonly used for IoT. It also outlines the history and growth of IoT from its origins in the late 2000s to the projected 50 billion connected devices by 2020.
The document discusses the structure of KidBright plugins and how to generate a new plugin using the KidBright plugin generator. It lists the key files that a plugin contains such as blocks.js, generators.js, routes.js, and code files like NTTRandom.h and NTTRandom.cpp. The generator can be installed and used to quickly create a new plugin project template called NTTMyTestPlugin.
The document discusses the structure of KidBright plugins and how to generate a new plugin using the KidBright plugin generator. It lists the key files that a plugin contains, such as blocks.js for the plugin blocks, generators.js for generators, and msg/en.js for localization strings. It also provides code snippets for a sample NTTRandom plugin implementation with C++ and header files.
ESP-Now is a protocol developed by Espressif that enables wireless devices to communicate in a peer-to-peer and connectionless manner without using Wi-Fi. It requires pairing between devices first, then allows for persistent and encrypted connections of up to 250 bytes. It is ideal for applications like smart lights and sensors. Some limitations include a lack of broadcast support and limited numbers of encrypted peers.
The document discusses the Internet of Things (IoT). It defines IoT as the network of physical devices embedded with sensors, software and network connectivity that enables the collection and exchange of data. It explains that IoT allows objects to be sensed and controlled remotely via existing network infrastructure. It estimates that IoT will consist of almost 50 billion connected objects by 2020. It also provides examples of common IoT applications and discusses some of the key enabling technologies and protocols used in IoT systems like MQTT.
This document discusses Internet of Things (IoT) technologies and protocols. It provides examples of smart IoT devices like a smart fork, toothbrush, air conditioner, bike, and hydroponic system. It then explains common IoT protocols like HTTP, MQTT, and CoAP. MQTT in particular is highlighted as a lightweight protocol suitable for IoT messaging. The document also discusses MQTT features like publishing, subscribing, retained messages, and presence. Finally, resources for learning more about MQTT are provided.
This document contains code snippets for connecting to a Netpie broker using an access token with an OAuth token and secret, appkey, and endpoint to subscribe to topics starting with "/HelloChiangMaiMakerClub/" using the mosquitto_sub client.
The document introduces IBM Bluemix and IoT Foundation, providing an overview of their architecture and how to get started. It explains that Bluemix allows users to publish and subscribe to topics without sign-up, and provides a quickstart demo for developing IoT applications. The recap section lists key concepts like usernames, passwords, client IDs and topics. It encourages readers to explore documentation and try out a quickstart demo.
The document describes the Chiang Mai Maker Club (CMMC) and its work connecting devices using internet of things technology like the ESP8266 and MQTT protocol. It discusses pains faced around connectivity, reliability, and dependencies. The club created a wrapper called "talking-things" to manage devices and their communication, as well as a dashboard for quick device information. Overall the CMMC works to connect devices using IoT standards for cooperative technology advancement in Chiang Mai.
This document provides configuration instructions for setting up an OpenWrt device to act as both a wireless station and access point. It includes using uci to configure the station and access point interfaces, restarting the network, and examples of setting up an SSH tunnel for port forwarding.
This document provides an overview of a WebRTC demonstration session. It introduces the ChiangMai Maker Club as the presenters and lists the front-end technologies used like Angular.js and Bootstrap 3. The back-end technologies listed are Sails.js, Express.js and Node.js. It then describes using real-time communication like Socket.io and WebRTC to draw to a canvas periodically, send the data to the server, have the server process and send back the data to paint the processed image on the client and control RGB LEDs.
LoveNotYet - The first Thailand sex education game.Nat Weerawan
The document discusses taboo topics in Thai schools and solutions to address them. It notes that Thailand has a high teen pregnancy rate compared to global averages. It proposes having two-way conversations between parents and children and teachers and students about taboo topics, as well as using games, to help reduce misunderstandings. An app was created and saw over 100,000 downloads, with the goal of improving education and outcomes.
Raspberry Pi presentation at Beercamp & Barcamp Chiangmai, Thailand.
In our presentation we talking about how to getting started with embedded system both Raspberry Pi & Arduino from zer0 to share our knowledges to another person who want to learning Raspberry Pi.
This document outlines several experiential projects focused on Raspberry Pi, Arduino, and sensors including building a data logger, real-time technology projects using Rabbit MQ and Angular.js, a demo using Meteor.js and Rabbit MQ with Mongo DB, and demos of Pubsub and PubNub. It also mentions building a home media center for home entertainment.
Booklat @ Social Innovation Camp Asia 2013 (SICA2013)Nat Weerawan
The document outlines the schedule and activities for a team working on a sex education game over a weekend. It includes meetings with the team on Monday evening, working on setup and design on Tuesday, and brainstorming and finalizing a presentation on Sunday to present the game. Key members included a Rails Guy, Django Guy, and someone who was originally not the designer but became the designer. They worked overnight multiple times to complete the project.
Booklat @ Social Innovation Camp Asia 2013 (SICA2013)
Griffon Topic2 Presentation (Tia)
1. Griffon: GPU Programming API for Scientific and General Purpose. Pisit Makpaisit 4909611727. Supervisor: Dr. Worawan Diaz Carballo, Department of Computer Science, Faculty of Science and Technology, Thammasat University
4. Motivation: GPU programming model complexity. 3/13/2010. Griffon - GPU Programming API for Scientific and General Purpose
5. GPU-CPU performance gap. Every PC has a graphics card, and the processing unit on the graphics card is called the "GPU"; therefore every PC has a GPU. GPU performance is now pulling away from that of traditional processors. http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf
6. GPGPU: General-Purpose computation on Graphics Processing Units. Very high computation and data throughput; scalability.
7. GPGPU Applications: Simulation, Finance, Fluid Dynamics, Medical Imaging, Visualization, Signal Processing, Image Processing, Optical Flow, Differential Equations, Linear Algebra, Finite Element, Fast Fourier Transform, etc.
8. Vector Addition: Vector A + Vector B = Vector C
9. Vector Addition (Sequential Code)
#include <stdio.h>
#define SIZE 500
void VecAdd(float *A, float *B, float *C){   /* declare function */
    int i;
    for(i=0;i<SIZE;i++)
        C[i] = A[i] + B[i];
}
void main(){
    int i, size = SIZE * sizeof(float);
    float *A, *B, *C;                        /* declare variables */
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);                /* allocate memory */
    VecAdd(A,B,C);                           /* function call */
    free(A); free(B); free(C);               /* de-allocate memory */
}
10. Vector Addition (Sequential Code): figure showing the element-wise sums, Vector A + Vector B = Vector C.
14. Vector Addition (OpenMP): figure showing the same element-wise sums, now computed by multiple CPU threads.
15. Speed Up (Amdahl's Law). Execution time (sequential): vector addition is ~80% of the program. Execution time (parallel on CPU): the parallel part's new execution time = execution time / cores = 80% / 2.
16. OpenMP: easy and automatic thread management; a few threads on the CPU.
17. Vector Addition (GPU - CUDA): figure showing Vector A and Vector B copied from CPU memory to GPU memory, added on the GPU, and Vector C copied back to CPU memory.
19. Parallel Vector Addition on GPU (CUDA)
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);   /* data transfer from CPU to GPU */
addVec<<<1, SIZE>>>(d_A, d_B, d_C);                   /* kernel call */
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);   /* data transfer from GPU to CPU */
free(h_A); free(h_B); free(h_C);                      /* CPU memory de-allocate */
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);          /* GPU memory de-allocate */
}
20. Speed Up (Amdahl's Law). Execution time (sequential): vector addition is ~80% of the program. Execution time (parallel on GPU): the parallel part's new execution time = execution time / cores = 80% / 16.
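The speed-up reasoning on this and the earlier CPU slide is the standard form of Amdahl's law: with parallel fraction $p$ and $n$ processing elements, the overall speed-up is

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}
```

For the slides' $p = 0.8$ this gives $S(2) = 1/(0.2 + 0.4) \approx 1.67$ on two CPU cores and $S(16) = 1/(0.2 + 0.05) = 4$ on sixteen GPU cores: the 20% sequential part quickly becomes the bottleneck.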
21. CUDA: speeds the program up, but costs more effort and development time; many threads on the GPU.
23. Local Memory: per-thread, faster than Global Memory.
25. Parallel Vector Addition on GPU (Griffon)
1. Sequential code:
#include <stdio.h>
#define SIZE 500
void VecAdd(float *A, float *B, float *C){
    int i;
    for(i=0;i<SIZE;i++)
        C[i] = A[i] + B[i];
}
void main(){
    int i, size = SIZE * sizeof(float);
    float *A, *B, *C;
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);
    VecAdd(A,B,C);
    free(A); free(B); free(C);
}
2. Add the compiler directive #pragma gfn parallel for above the loop.
3. Finish. So easy!
26. Griffon: compiler directives for the C language; source-to-source compiler; automatic data management; optimization.
28. Objectives (1/2). To develop a set of GPU programming APIs, called Griffon, to support the development of CUDA-based programs. Griffon comprises a) compiler directives and b) a source-to-source compiler. Simple: the number of compiler directives does not exceed 20, and the grammar of Griffon directives is similar to OpenMP, i.e. a standard shared-memory API. Thread safety: the code generated by Griffon gives correct behavior, i.e. equivalent to that of the sequential code.
29. Objectives (2/2). To demonstrate that Griffon-generated code can gain reasonable performance over the sequential code on two example applications: Pi calculation using numerical integration, and Pi calculation using the Monte Carlo method. Automatic: the GPU memory management of generated code is done automatically by Griffon. Efficient: generated code should achieve the actual speed-up predicted by Amdahl's law, or within a difference of less than 20%.
31. Project Constraints. Griffon is a C-language API that supports both Windows and Linux environments. The generated executable program can only run on NVIDIA graphics cards. Users can use Griffon in cooperation with OpenMP.
33. Brook+ & CUDA: general-purpose computation on the GPU; manual kernel management and manual data transfers across the various GPU memories; vendor dependent.
34. OpenCL (Open Computing Language): cross-platform and vendor neutral; an approachable language for accessing heterogeneous computational resources (CPU, GPU, other processors); data and task parallelism.
35. OpenMP to GPGPU: translates OpenMP applications into CUDA-based GPGPU applications; GPU optimization techniques (Parallel Loop Swap and Loop-collapsing) enhance inter-thread locality.
36. hiCUDA: a directive-based GPU programming language; a Computation Model to identify code regions executed on the GPU; a Data Model to allocate and de-allocate memory on the GPU and to transfer data.
41. Software Architecture. NVCC is one part of the Griffon toolchain. The Griffon source-to-source compiler comprises a Memory Allocator and an Optimizer. Pipeline: Griffon C Application → Griffon Compiler (compile-time Memory Allocator and Optimizer) → CUDA C Application → NVCC (NVIDIA CUDA Compiler), which splits the program into PTX code and C code; the PTX compiler produces GPU object code, GCC (Linux) or CL (MS Windows) produces CPU object code, and the two are linked into the executable.
43. Griffon Directives. Parallel Region: define a parallel region. Control Flow: specify kernel work flow. Synchronous: define synchronization points. GPU/CPU Overlap Compute: define a region where the CPU computes concurrently with the GPU.
44. Directives. General form: #pragma gfn directive-name [clause[ [,] clause]...] new-line. Parallel region: #pragma gfn parallel for [clause[ [,] clause]...] new-line for-loops. Clauses: kernelname(name), waitfor(kernelname-list), private(var-list), accurate([low,high]), reduction(operator:var-list).
45. Parallel Region. Before:
for(i=0;i<N;i++){ C[i] = A[i] + B[i]; }
After:
#pragma gfn parallel for
for(i=0;i<N;i++){ C[i] = A[i] + B[i]; }
49. #pragma gfn parallel for kernelname( D ) waitfor( B,C ). In the kernel flow graph A → {B, C} → D, kernels B and C can compute in parallel.
50. Synchronization (synchronization points between threads P0-P3). Barrier: #pragma gfn barrier new-line. Atomic: #pragma gfn atomic new-line assignment-statement. Parallel reduction: #pragma gfn parallel for reduction(operator:var-list).
51. Synchronization. Original:
for(i=1;i<N-1;i++){ B[i] = A[i-1] + A[i] + A[i+1]; }
for(i=1;i<N-1;i++){ A[i] = B[i]; if(A[i] > 7){ C[i] += x / 5; } }
Option 1 - two parallel loops:
#pragma gfn parallel for
for(i=1;i<N-1;i++){ B[i] = A[i-1] + A[i] + A[i+1]; }
#pragma gfn parallel for
for(i=1;i<N-1;i++){ A[i] = B[i]; if(A[i] > 7){ #pragma gfn atomic C[i] += x / 5; } }
Option 2 - one parallel loop with a barrier:
#pragma gfn parallel for
for(i=1;i<N-1;i++){ B[i] = A[i-1] + A[i] + A[i+1]; #pragma gfn barrier A[i] = B[i]; if(A[i] > 7){ #pragma gfn atomic C[i] += x / 5; } }
52. Synchronization. Sequential:
for (i = 1; i <= n-1; i++) { x = a + (i * h); integral = integral + f(x); }
With Griffon:
#pragma gfn parallel for private(x) reduction(+:integral)
for (i = 1; i <= n-1; i++) { x = a + (i * h); integral = integral + f(x); }
53. GPU/CPU Overlap Compute. #pragma gfn overlapcompute(kernelname) new-line structured-block. Many threads on the GPU run in parallel with a CPU function, then synchronize.
54. GPU/CPU Overlap Compute.
#pragma gfn parallel for kernelname( calA )
for(i=0;i<N;i++){ … }
#pragma gfn overlapcompute( calA )
independenceCpuFunction();
The CPU function runs concurrently with kernel calA, instead of sequentially after the loop:
for(i=0;i<N;i++){ … }
independenceCpuFunction();
55. Accurate Level. #pragma gfn parallel for accurate( [low, high] ). Use low when speed is important; use high when precision is important. The default is high.
57. Create Kernel. Source:
int main(){
    int sum = 0;
    int x, y;
    #pragma gfn parallel for private(x, y) reduction(+:sum)
    for(i=0;i<N;i++){ x = sin(A[i]); y = cos(B[i]); C[i] = x + y; }
    return 0;
}
Generated:
__global__ void __kernel_0(…, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 0;
    if(__tid < __N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}
int main(){
    int sum = 0;
    int x, y;
    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00, 512>>>(..., (N - 1 - 0) / 1 + 1);   // inserted kernel call
    return 0;
}
58. For-Loop Format and Thread Mapping. The for-loop must be in the format
for( index = min ; index <= max ; index += increment ){…}
for( index = max ; index >= min ; index -= increment ){…}   // this case is transformed to the first case
The number of threads is calculated as (max - min) / increment + 1. Iterative index and thread mapping:
__tid = blockIdx.x * blockDim.x + threadIdx.x;
index = __tid * increment + min;
59. Private and shared variable management. Shared variables must be passed to the kernel function; private variables must be declared inside the kernel function. A GPU device variable is declared for each shared variable. Allocation size: static, taken from the declaration (e.g. int A[500]); dynamic, taken from the allocation function (malloc, calloc, realloc).
60. Private and shared variable management. Source:
int main(){
    int sum = 0;
    int x, y;
    int A[N], B[N], C[N];
    #pragma gfn parallel for private(x, y) reduction(+:sum)
    for(i=0;i<N;i++){ x = sin(A[i]); y = cos(B[i]); C[i] = x + y; }
    return 0;
}
Generated:
__global__ void __kernel_0(int * A, int * B, int * C, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 0;
    int x, y;                      // private variables declared in the kernel
    if(__tid < __N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}
int main(){
    int sum = 0;
    int x, y;
    int A[N], B[N], C[N];
    int * __d_A, * __d_B, * __d_C; // device variables for shared arrays
    cudaMalloc((void**)&__d_C, sizeof(int) * N);
    cudaMalloc((void**)&__d_B, sizeof(int) * N);
    cudaMalloc((void**)&__d_A, sizeof(int) * N);
    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00, 512>>>(__d_A, __d_B, __d_C, (N - 1 - 0) / 1 + 1);
    cudaFree(__d_C); cudaFree(__d_B); cudaFree(__d_A);
    return 0;
}
70. Optimization Techniques: automatic cache with shared memory.
72. Defined variables. A and B transfer from CPU to GPU; C transfers from GPU to CPU; D transfers in both directions.
#pragma gfn parallel for
for(i=0;i<N;i++){ C[i] = A[i] + B[i] + D[i]; D[i] = C[i] * 0.5; }
73. Reduce data transfer with kernel control flow. Memcpy host-to-device for each variable that is used (read) in the kernel; memcpy device-to-host for each variable that is defined (written) in the kernel.
#pragma gfn parallel for
for(i=0;i<N;i++){ C[i] = A[i] + B[i]; }
becomes
cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(dB, B, size, cudaMemcpyHostToDevice);
Kernel <<< … , … >>> ( … )
cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);
Transfer graph: A, B → K1 → C
74. Reduce data transfer with kernel control flow. Use the graph defined by the kernelname and waitfor constructs.
#pragma gfn parallel for kernelname(k1)
for(i=0;i<N;i++){ C[i] = A[i] + B[i]; }
#pragma gfn parallel for kernelname(k2) waitfor(k1)
for(i=0;i<N;i++){ E[i] = A[i] * C[i] - D[i]; C[i] = E[i] / 3.0; }
Transfer graph: A, B → K1 → C; then A, C, D → K2 → C, E
75. Reduce data transfer with kernel control flow. If there is a path from k1 to k2: if an invar of k1 is the same as an invar of k2, delete the invar of k2; if an outvar of k1 is the same as an outvar of k2, delete the outvar of k1; if an outvar of k1 is the same as an invar of k2, delete the invar of k2.
76. Schedule Kernel and Memcpy for Maximum Overlap. Figure: the transfer-reduced dependency graph with kernels K1, K2, K3 and transfers A, B, C, D, E. How should it be scheduled?
77. Schedule for synchronous functions. With synchronous transfers, Total Time = T(K1) + T(B) + T(A) + T(K2) + T(D) + T(C) + T(K3) + T(E). Newer versions of the CUDA API provide asynchronous data-transfer functions.
78. Schedule Kernel and Memcpy for Maximum Overlap. Memcpy and kernel execution can be overlapped; the maximum is a 3-way overlap (MemcpyHostToDevice, kernel, MemcpyDeviceToHost), or a 4-way overlap if CPU compute via the overlapcompute directive is included. Figure: levels 1-4 with kernels K1-K3 and transfers A-E assigned to streams.
79. Scheduling algorithm. Set the queue to empty. Until all nodes are deleted:
1.1. Set level = 1 and stream_num = 1.
1.2. Find a kernel node with 0 incoming degree; delete the node and its links; create a transfer command with stream_num.
1.2.1. If found in 1.2, stream_num += 1.
1.3. Find a GPU-to-CPU node with 0 incoming degree; delete the node and its links; create a transfer command with stream_num.
1.3.1. If found in 1.3, stream_num += 1.
1.4. Find a CPU-to-GPU node with 0 incoming degree; delete the node and its links; create a transfer command with stream_num.
1.4.1. If found in 1.4, stream_num += 1.
1.5. If nothing is found in 1.2-1.4, find a kernel node with 0 incoming degree and create a transfer command for its CPU-to-GPU node.
1.6. Insert a synchronization function.
1.7. Collect the maximum stream_num.
1.8. level += 1.
80. Automatic cache with shared memory. When a "linear access" pattern is detected in a kernel, the automatic cache is applied: each thread block stages its slice of global memory into shared memory.
#pragma gfn parallel for
for(i=1;i<(MAX-1);i++){ B[i] = A[i-1] + A[i] + A[i+1]; }
81. Automatic cache with shared memory.
#pragma gfn parallel for
for(i=1;i<(MAX-1);i++){ B[i] = A[i-1] + A[i] + A[i+1]; }
generates
__global__ void __kernel_0 (int * B, int * A, int __N) {
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 1;
    __shared__ int sa[514];
    if(__tid < __N) {
        sa[threadIdx.x + 0] = A[i + 0 - 1];
        if(threadIdx.x + 512 < 514)
            sa[threadIdx.x + 512] = A[i + 512 - 1];
        __syncthreads();
        B[i] = sa[threadIdx.x + 1 - 1] + sa[threadIdx.x + 1] + sa[threadIdx.x + 1 + 1];
    }
}
82. DEMO
85. Compiler Directives (usability evaluation): 5 undergraduate students who had studied only the concepts of CUDA; just 1.5 hours of demonstration.
86. Compiler Directives (test programs): Calculation of Pi Using Numerical Integration, Calculation of Pi Using the Monte Carlo Method, Trapezoidal Rule, Vector Normalization, Calculate Sine of Vector's Elements.
87. Compiler Performance (test programs): Calculation of Pi Using Numerical Integration, Calculation of Pi Using the Monte Carlo Method, Trapezoidal Rule, Vector Normalization, Calculate Sine of Vector's Elements.
89. Griffon Instructions. Total number of instructions (directives + clauses): 9. A remaining problem is the performance of parallel programs with a high degree of communication. Future improvements: directives for describing the algorithm used in the program (divide and conquer, partial summation, etc.) and new optimization techniques, such as caching with shared memory and choosing an appropriate thread number.
90. Performance factors and speed-up. Computation density has the greatest effect on performance.
91. Building an S2S Compiler. Source-to-source compilers are not popular; an alternative is a compiler that transforms Griffon code directly to GPU object code (PTX). Although the programs generated by a PTX compiler could be very efficient, they cannot gain any benefit from manual optimization.
92. Future Work. Optimization techniques: data structures, loop transformation. Directives: more OpenMP support, CPU/GPU parallel regions, OpenCL support. Compiler: support for C++ and other languages, support for popular IDEs.
93. References.
Brook, http://graphics.stanford.edu/projects/brookgpu
Cameron Hughes and Tracey Hughes, Professional Multicore Programming, Wiley Publishing
CUDA Zone, http://www.nvidia.com/object/cuda_home.html
Dick Grune, Henri E. Bal, Ceriel J.H. Jacobs, and Koen G. Langendoen, Modern Compiler Design, John Wiley & Sons Ltd
General-Purpose Computation on Graphics Hardware, http://gpgpu.org
Ilias Leontiadis and George Tzoumas, OpenMP C Parser
Joe Stam, Maximizing GPU Efficiency in Extreme Throughput Applications, GPU Technology Conference
Mark Harris, Optimizing Parallel Reduction in CUDA
OpenCL, http://www.khronos.org/opencl
Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann, OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization, PPoPP '09
The OpenMP API specification for parallel programming, http://openmp.org/wp
Thomas Niemann, A Guide to Lex & Yacc
Tianyi David Han and Tarek S. Abdelrahman, hiCUDA: A High-level Directive-based Language for GPU Programming, GPGPU '09
Wolfe, M. (1996), High Performance Compilers for Parallel Computing, Addison-Wesley