Accelerating Habanero-Java Program with OpenCL Generation. Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. 10th International Conference on the Principles and Practice of Programming in Java (PPPJ), September 2013.
Accelerating Habanero-Java Program with OpenCL Generation
1. Accelerating Habanero-Java Programs with OpenCL Generation
Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar
Rice University, Houston, Texas, USA
2. Background: GPGPU and Java
The initial wave of programming models for GPGPU has provided low-level APIs:
  CUDA (NVIDIA)
  OpenCL (Khronos)
  → Often faster than a natively running application
High-level languages such as Java provide high-productivity features:
  Type safety
  Garbage collection
  Precise exception semantics
4. Related Work: RootBeer
    public class ArraySum {
        public static void main(String[] args) {
            int[][] arrays = new int[N][M];
            int[] result = new int[N];
            // ... arrays initialization ...

            // RootBeer API: wrap each unit of work in a Kernel and submit the batch
            List<Kernel> jobs = new ArrayList<Kernel>();
            for (int i = 0; i < N; i++) {
                jobs.add(new ArraySumKernel(arrays[i], result, i));
            }
            Rootbeer rootbeer = new Rootbeer();
            rootbeer.runAll(jobs);
        }
    }

    class ArraySumKernel implements Kernel {
        private int[] source;
        private int[] ret;
        private int index;

        public ArraySumKernel(int[] source, int[] ret, int i) {
            this.source = source;
            this.ret = ret;
            this.index = i;
        }

        // Computation body: executed on the GPU
        public void gpuMethod() {
            int sum = 0;
            for (int i = 0; i < source.length; i++) {
                sum += source[i];
            }
            ret[index] = sum;
        }
    }
Requires special API invocation in addition to the computation body
5. Our Approach: HJ-OpenCL Overview
Automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct, forall
Built on top of the Habanero-Java language (PPPJ’11)
OpenCL acceleration with precise exception semantics
  Our primary contribution
6. Overview of Habanero-Java (HJ) Language
New language and implementation developed at Rice since 2007
  Derived from the Java-based version of the X10 language (v1.5) in 2007
  HJ is currently an extension of Java 1.4
  All Java 5 & 6 libraries and classes can be called from HJ programs
HJ’s parallel extensions are focused on task parallelism
  1. Dynamic task creation & termination: async, finish, force, forall, foreach
  2. Collective and point-to-point synchronization: phaser, next
  3. Mutual exclusion and isolation: isolated
  4. Locality control (task and data distributions): places, here
Sequential HJ extensions added for convenience
  extern, point, region, pointwise for, complex data type, array views
Habanero-C and Habanero-Scala are also available with similar constructs
7. HJ-OpenCL Example
HJ OpenCL implementation:
    public class ArraySum {
        public static void main(String[] args) {
            int[] base = new int[N*M];
            int[] result = new int[N];
            int[.] arrays = new arrayView(base, 0, [0:N-1,0:M-1]);
            ... initialization ...
            boolean isSafe = ...;
            // safe construct for precise exception semantics
            safe(isSafe) {
                forall(point [i] : [0:N-1]) {
                    result[i] = 0;
                    for(int j=0; j<M; j++) {
                        result[i] += arrays[i,j];
                    }
                }
            }
        }
    }
→ Programmers can utilize OpenCL by just putting forall
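For readers without an HJ compiler, the forall loop in the HJ-OpenCL example can be approximated in plain Java. This is only a sketch under stated assumptions: it assumes the array view maps arrays[i,j] to base[i*M + j] (row-major), it uses a Java 8 parallel stream as a stand-in for forall, and the class and method names (ArraySumSketch, rowSums) are hypothetical. It is not the code the HJ compiler generates.

```java
import java.util.stream.IntStream;

public class ArraySumSketch {
    // Sums each row of an n-by-m matrix stored row-major in a flat array.
    static int[] rowSums(int[] base, int n, int m) {
        int[] result = new int[n];
        // Each iteration i writes only result[i], so iterations are independent
        // and can run in parallel, as in the forall loop.
        IntStream.range(0, n).parallel().forEach(i -> {
            int sum = 0;
            for (int j = 0; j < m; j++) {
                sum += base[i * m + j]; // assumed meaning of arrays[i,j]
            }
            result[i] = sum;
        });
        return result;
    }

    public static void main(String[] args) {
        int[] base = {1, 2, 3, 4, 5, 6}; // 2 rows, 3 columns
        int[] sums = rowSums(base, 2, 3);
        System.out.println(sums[0] + " " + sums[1]); // prints "6 15"
    }
}
```

Because every iteration touches a distinct element of result, no synchronization is needed inside the loop body, which is what makes this kind of loop a natural candidate for GPU offload.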
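8. The compilation flow
[Figure: compilation flow. The HJ compiler translates an HJ program into three files: .class files (bytecode, run on the JVM), OpenCL_hjstub.c (JNI glue code), and OpenCLKernel.class (bytecode). A C compiler builds the JNI glue code into a native library (.so, .dll, .dylib), and the APARAPI translator turns OpenCLKernel.class into an OpenCL kernel (Kernel.c). The JVM on the host reaches OpenCL on the device through JNI.]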
9. APARAPI
Open-source project for data-parallel Java
https://code.google.com/p/aparapi/
APARAPI converts Java bytecode to OpenCL at runtime
Kernel kernel = new Kernel() {
    @Override public void run() {
        int i = getGlobalId();
        result[i] = inA[i] + inB[i];
    }
};
Range range = Range.create(result.length);
kernel.execute(range);
→ We prepared a static version of APARAPI to reduce runtime overhead
11. Acceleration vs. Exception Semantics
              Safe?   High Performance?
Java          Yes     No
OpenCL/CUDA   No      Yes
Picture is borrowed from
http://www.boston.com/bigpicture/2008/09/the_singapore_grand_prix.html
12. For Precise Exception Semantics on GPUs
“safe” language construct: safe (cond) { … }
Programmers specify the safe condition
Can be useful for testing too
13. Safe construct for exception semantics
safe (cond) { … }
Asserts that no exception will be thrown inside the body
HJ Implementation
boolean no_excp = …;
safe (no_excp) {
    // mapped to GPU
    forall () {
        …
    }
}
Generated Code
boolean no_excp = …;
if (no_excp) {
    OpenCL_exec(); // JNI
} else {
    forall() {} // On JVM
}
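The generated if/else can be imitated in plain Java to see why the fallback keeps exception semantics precise. This is a minimal sketch, not the code HJ-OpenCL emits: the class `SafeDispatch` and its methods are hypothetical, `runOnDevice` only stands in for the JNI call (`OpenCL_exec()` on the slide), and `runOnJvm` is an ordinary bounds-checked loop.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class SafeDispatch {
    // Records which path ran; for illustration only.
    static String lastPath = "";

    // Stand-in for the generated JNI call. A real device kernel would not
    // perform Java bounds checks, which is why the condition must hold.
    static void runOnDevice(int[] src, int[] dst) {
        lastPath = "device";
        IntStream.range(0, dst.length).parallel().forEach(i -> dst[i] = src[i] * 2);
    }

    // Fallback on the JVM: a sequential loop, so any exception is raised
    // at the exact iteration that causes it.
    static void runOnJvm(int[] src, int[] dst) {
        lastPath = "jvm";
        for (int i = 0; i < dst.length; i++) {
            dst[i] = src[i] * 2;
        }
    }

    // Mirrors: if (no_excp) { OpenCL_exec(); } else { forall() {} }
    static void safe(boolean noExcp, int[] src, int[] dst) {
        if (noExcp) {
            runOnDevice(src, dst);
        } else {
            runOnJvm(src, dst);
        }
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3};
        int[] dst = new int[3];
        safe(src.length >= dst.length, src, dst);
        System.out.println(lastPath + " " + Arrays.toString(dst));
    }
}
```

When the condition is false and `dst` is longer than `src`, the JVM path throws `ArrayIndexOutOfBoundsException` after the earlier elements have already been written, which is the behaviour a Java programmer expects.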
15. Example of Safe Construct (Cont’d)
Example 2: indirect array access
// Exception checking
boolean isSafe = true;
for (int i = 0; i < N; i++) {
    if (index[i] >= result.length) isSafe = false;
}
safe(isSafe) {
    forall(point [i] : [0:N-1]) {
        for (int j = 0; j < M; j++) {
            result[index[i]] += A[j] * B[i, j]; // indirect accesses
        }
    }
}
The check sets isSafe to false if any element of index is greater than or equal to result.length
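The pre-check for the indirect access can be written as a small plain-Java helper. This is a sketch under assumptions: the class `IndirectAccessCheck` and its method names are hypothetical, the array view `B[i, j]` is replaced by a Java two-dimensional array, and the loop runs sequentially. The lower-bound test `idx < 0` is an addition that the slide does not show; a negative index would also raise an exception in Java.

```java
import java.util.Arrays;

final class IndirectAccessCheck {
    // Returns true only if every element of `index` is a valid subscript
    // for an array of length `resultLength`.
    static boolean isSafe(int[] index, int resultLength) {
        for (int idx : index) {
            // The slide tests only idx >= resultLength; idx < 0 is an added assumption.
            if (idx < 0 || idx >= resultLength) {
                return false;
            }
        }
        return true;
    }

    // Sequential reference for the loop body: result[index[i]] += a[j] * b[i][j].
    static void accumulate(int[] index, double[] a, double[][] b, double[] result) {
        for (int i = 0; i < index.length; i++) {
            for (int j = 0; j < a.length; j++) {
                result[index[i]] += a[j] * b[i][j];
            }
        }
    }

    public static void main(String[] args) {
        int[] index = {0, 2, 1};
        double[] a = {1, 2};
        double[][] b = {{1, 1}, {2, 2}, {3, 3}};
        double[] result = new double[3];
        if (isSafe(index, result.length)) {
            accumulate(index, a, b, result);
        }
        System.out.println(Arrays.toString(result));
    }
}
```

The check costs one pass over `index`, which is cheap compared with the nested loop it protects.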
16. “next” construct for global barrier synchronization on GPUs
Semantics
Wait until all threads reach the synchronization point
Note that OpenCL does not support an all-to-all barrier as a kernel language feature
The HJ compiler internally partitions the forall loop body into blocks separated by synchronization points
17. next construct (cont’d)
forall (point [i]:[0:n-1]) {
    method1(i);
    // synchronization point 1
    next;
    method2(i);
    // synchronization point 2
    next;
}
Thread0        Thread1
method1(0);    method1(1);
---------- WAIT ----------
method2(0);    method2(1);
---------- WAIT ----------
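The barrier behaviour pictured above can be imitated in plain Java by splitting the loop at each `next`. This is a sketch under assumptions, not the compiler's actual output: the class `PhasedForall` and the method `forallWithNext` are hypothetical, and each block between synchronization points is run as a separate parallel pass, so the end of a pass acts as the all-to-all barrier.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

final class PhasedForall {
    // Runs every phase for all i in [0, n) before starting the next phase.
    // The terminal forEach returns only when all iterations have finished,
    // which gives the effect of `next` between phases.
    static void forallWithNext(int n, IntConsumer... phases) {
        for (IntConsumer phase : phases) {
            IntStream.range(0, n).parallel().forEach(phase);
        }
    }

    public static void main(String[] args) {
        int n = 8;
        int[] a = new int[n];
        int[] b = new int[n];
        forallWithNext(n,
            i -> a[i] = i + 1,                   // plays the role of method1(i)
            i -> b[i] = a[i] + a[(i + 1) % n]);  // method2(i) reads a neighbour written in phase 1
        System.out.println(Arrays.toString(b));
    }
}
```

The second phase reads an element written by a different iteration of the first phase, which is only correct because no iteration starts phase two before all of phase one is done.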
18. “ArrayView” for Supporting Contiguous Multidimensional array
HJ ArrayView is backed by one-dimensional Java Array
Enables reduction of data transfer between host and device
[Figure: layout of a Java array A[i][j] compared with an HJ array view A[i, j] over a one-dimensional array; element A[0][1] corresponds to A[0,1]]
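The idea of a view over a flat array can be sketched in plain Java. This is not HJ's implementation: the class `ArrayView2D` and its methods are hypothetical, and a row-major layout is assumed. The point is that all elements live in one contiguous `int[]`, so a single buffer can be handed to the device instead of one buffer per row.

```java
final class ArrayView2D {
    private final int[] base; // contiguous backing storage
    private final int rows;
    private final int cols;

    ArrayView2D(int[] base, int rows, int cols) {
        if (base.length < rows * cols) {
            throw new IllegalArgumentException("backing array too small");
        }
        this.base = base;
        this.rows = rows;
        this.cols = cols;
    }

    // Maps the two-dimensional subscript [i, j] to a flat offset (row-major, assumed).
    int offset(int i, int j) {
        if (i < 0 || i >= rows || j < 0 || j >= cols) {
            throw new ArrayIndexOutOfBoundsException("[" + i + ", " + j + "]");
        }
        return i * cols + j;
    }

    int get(int i, int j) {
        return base[offset(i, j)];
    }

    void set(int i, int j, int value) {
        base[offset(i, j)] = value;
    }

    public static void main(String[] args) {
        int[] base = new int[2 * 3];
        ArrayView2D view = new ArrayView2D(base, 2, 3);
        view.set(1, 2, 42);
        System.out.println(base[5] + " " + view.get(1, 2));
    }
}
```

A Java `int[][]`, by contrast, is an array of separately allocated row arrays, so its rows are not guaranteed to be adjacent in memory.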
19. Benchmarks
Benchmark             Data Size                Next?
Blackscholes          16,777,216 options       No
Crypt (JGF)           N = 50,000,000           No
MatMult               1024x1024                No
Doitgen (Polybench)   128x128x128              No
MRIQ (Parboil)        64x64x64                 No
Syrk (Polybench)      2048x2048                No
Jacobi (Polybench)    T = 50, N = 134,217,728  No
SparseMatmult (JGF)   N = 500,000              No
Spectral-norm (CLBG)  N = 2,000                Yes
SOR (JGF)             N = 2,000                Yes
20. Platforms
               AMD A10-5800K                 Westmere
CPU            4-cores                       6-cores x 2 Xeon 5660
GPU            Radeon HD 7660D, 384-cores    NVIDIA Tesla M2050, 448-cores
Java Runtime   JRE (build 1.6.0_21-b06)      JRE (build 1.6.0_25-b06)
JVM            HotSpot 64-Bit Server VM      HotSpot 64-Bit Server VM
               (build 17.0-b11, mixed mode)  (build 20.0-b11, mixed mode)
21. Experimental Methodologies
We tested execution in the following modes:
Sequential Java
HJ (on JVM)
  Sequential HJ
  Parallel HJ
HJ-OpenCL with Safe Construct (on Device)
  OpenCL CPU
  OpenCL GPU
24. Slowdown for exception checking
On A10-5800K
Device  Blackscholes  Crypt  MatMult  Doitgen  MRIQ  Syrk  Jacobi  SparseMatmult  Spectral-Norm  SOR
CPU     0.99          0.99   1.00     1.04     1.03  0.99  1.00    0.94           0.98           0.98
GPU     1.02          0.99   1.00     1.00     1.00  1.00  0.97    0.91           1.00           1.00
On Westmere
Device  Blackscholes  Crypt  MatMult  Doitgen  MRIQ  Syrk  Jacobi  SparseMatmult  Spectral-Norm  SOR
CPU     0.98          0.98   0.98     0.99     1.00  1.00  1.00    0.97           1.00           1.02
GPU     0.95          0.94   0.99     1.00     0.98  1.00  0.99    0.68           0.99           1.00
Callout: Indirect access
25. Related Work:
High-level language to GPU code
Lime (PLDI’12)
JVM compatible language
RootBeer
Compiles Java bytecode to CUDA
X10 and Chapel
Provides programming model for CUDA
Sponge (ASPLOS’11)
Compiles StreamIt to CUDA
→ None of these approaches considers Java
Exception Semantics
26. Related Work:
Exception Semantics in Java
Artigas et al. (ICS’00) and Moreira et al. (ACM Trans. ’00)
Generate exception-safe and exception-unsafe regions of code.
Würthinger et al. (PPPJ’07)
Propose an algorithm on Static Single Assignment (SSA) form for the JIT compiler that eliminates unnecessary bounds checking.
ABCD (PLDI’00)
Provides an array bounds checking elimination algorithm based on graph traversal on an extended SSA form.
Jeffery et al. (Concurrency and Computation: Practice and Experience, ’09)
Propose a static annotation framework to reduce the overhead of dynamic checking in the JIT compiler.
27. Conclusions:
HJ-OpenCL
Programmers can utilize OpenCL just by using the “forall” construct
“safe” construct for precise exception semantics
“next” construct for barrier synchronization
Performance improvement
Up to 55x speedup on AMD APU
Up to 324x speedup on NVIDIA GPU
28. Future Work
Speculative Exception Checking
Speculative Execution of Parallel Programs
with Precise Exception Semantics. A.Hayashi
et al. (LCPC’13)
Automatic generation of exception checking
code
Editor's Notes
Hello, everyone. Welcome to my talk.
My name is Akihiro Hayashi and I’m a postdoc at Rice University.
Today I’ll be talking about accelerating Habanero-Java programs with OpenCL generation.
Let me first talk about the background.
Programming models for GPGPU, such as CUDA and OpenCL,
can enable significant performance and energy improvements for certain classes of applications.
These programming models provide a C-like kernel language and low-level APIs, which are usually accessible from C/C++.
On the other hand,
high-level languages such as Java provide high-productivity features including type safety, garbage collection and precise exception semantics.
But if you want to utilize a GPU from Java, programmers have to write a non-trivial amount of application code.
Here is an example.
There are two kinds of code here.
The code shown on your left will be executed on the host, and the code shown on your right will be executed on the device.
In the host code, you can see JNI calls which get and release the array pointer of a Java array, and several OpenCL API calls for memory allocation, data transfer, kernel compilation and kernel invocation.
In the kernel code, you can see C-like kernel code which is written in an SPMD manner.
As you can see, utilizing a GPU from Java adds a non-trivial amount of work.
In past work, Rootbeer provides a high-level programming model for GPUs.
We don’t need to write JNI and OpenCL code anymore, but it requires the programmer to write a special class and use a special API to invoke the GPU.
In our approach, we propose HJ-OpenCL, which performs automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct named forall.
HJ-OpenCL is built on top of the Habanero-Java language.
It also provides a way to maintain precise exception semantics