IBM researchers presented techniques for executing Java programs on GPUs with IBM Java 8. Developers write parallel code using the standard Java 8 parallel stream APIs, with no annotations required. The IBM Java runtime optimizes these programs for GPU execution by exploiting the GPU's read-only cache, reducing data transfers between CPU and GPU, and eliminating redundant exception checks. On average, benchmark results showed the GPU version was 58.9x faster than single-threaded CPU code and 3.7x faster than 160-thread CPU execution.
Easy and High Performance GPU Programming for Java Programmers
1. GTC 2016
Kazuaki Ishizaki (kiszk@acm.org) +, Gita Koblents -,
Alon Shalev Housfater -, Jimmy Kwa -, Marcel Mitran -,
Akihiro Hayashi *, Vivek Sarkar *
+ IBM Research – Tokyo
- IBM Canada
* Rice University
Easy and High Performance GPU Programming for Java Programmers
2. Java Program Runs on GPU with IBM Java 8
2 Easy and High Performance GPU Programming for Java Programmers
http://www-01.ibm.com/support/docview.wss?uid=swg21696670
https://devblogs.nvidia.com/parallelforall/next-wave-enterprise-performance-java-power-systems-nvidia-gpus/
3. Java Meets GPUs
4. What You Will Learn from this Talk
How to program GPUs in pure Java
– using standard parallel stream APIs
How the IBM Java 8 runtime executes the parallel program on GPUs
– with optimizations, without annotations
GPU read-only cache exploitation
data copy reductions between CPU and GPU
exception check eliminations for Java
Achieve good performance results using one K40 card with
– 58.9x over 1-CPU-thread sequential execution on POWER8
– 3.7x over 160-CPU-thread parallel execution on POWER8
5. Outline
Goal
Motivation
How to Write a Parallel Program in Java
Overview of IBM Java 8 Runtime
Performance Evaluation
Conclusion
6. Why We Want to Use Java for GPU Programming
High productivity
– Safety and flexibility
– Good program portability among different machines
“write once, run anywhere”
– Ease of writing a program
Hard to use CUDA and OpenCL for non-expert programmers
Many computation-intensive applications in non-HPC area
– Data analytics and data science (Hadoop, Spark, etc.)
– Security analysis (events in log files)
– Natural language processing (messages in social network system)
From https://www.flickr.com/photos/dlato/5530553658
7. Programmability of CUDA vs. Java for GPUs
CUDA requires programmers to explicitly write operations for
– managing device memories
– copying data between CPU and GPU
– expressing parallelism
Java 8 enables programmers to just focus on
– expressing parallelism
// code for GPU
__global__ void GPU(float* d_a, float* d_b, int n) {
  int i = threadIdx.x;
  if (n <= i) return;
  d_b[i] = d_a[i] * 2.0f;
}

void fooJava(float[] a, float[] b, int n) {
  // similar to for (i = 0; i < n; i++)
  IntStream.range(0, n).parallel().forEach(i -> {
    b[i] = a[i] * 2.0f;
  });
}

void fooCUDA(float *A, float *B, int N) {
  float *d_A, *d_B;
  int sizeN = N * sizeof(float);
  cudaMalloc(&d_A, sizeN); cudaMalloc(&d_B, sizeN);
  cudaMemcpy(d_A, A, sizeN, cudaMemcpyHostToDevice);
  GPU<<<N, 1>>>(d_A, d_B, N);
  cudaMemcpy(B, d_B, sizeN, cudaMemcpyDeviceToHost);
  cudaFree(d_B); cudaFree(d_A);
}
8. Safety and Flexibility in Java
Automatic memory management
– No memory leaks
Object-oriented
Exception checks
– No unsafe memory accesses
float[] a = new float[N], b = new float[N];
new Par().foo(a, b, N);
// unnecessary to explicitly free a[] and b[]
class Par {
  void foo(float[] a, float[] b, int n) {
    // similar to for (i = 0; i < n; i++)
    IntStream.range(0, n).parallel().forEach(i -> {
      // throws an exception if
      // a[] == null, b[] == null,
      // i < 0, a.length <= i, or b.length <= i
      b[i] = a[i] * 2.0f;
    });
  }
}
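As an illustrative sketch (not from the slides; the class name and array sizes are made up), the bounds checks described above can be observed on any standard Java 8 runtime: an out-of-range index in the parallel loop raises an exception instead of silently corrupting memory.

```java
import java.util.stream.IntStream;

public class BoundsCheckDemo {
    // Returns true if the parallel loop raises an
    // ArrayIndexOutOfBoundsException when b[] is shorter than n.
    static boolean triggersBoundsCheck(int n) {
        float[] a = new float[n];
        float[] b = new float[n - 1];   // deliberately too short
        try {
            IntStream.range(0, n).parallel().forEach(i -> {
                b[i] = a[i] * 2.0f;     // i == n-1 is out of bounds for b
            });
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;                // the unsafe access is caught
        }
    }

    public static void main(String[] args) {
        System.out.println(triggersBoundsCheck(4));
    }
}
```

It is exactly these per-access checks that the IBM JIT must either prove redundant or keep when generating GPU code, as discussed later in the deck.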
9. Portability among Different Hardware
How a Java program works
– ‘javac’ command creates machine-independent Java bytecode
– ‘java’ command launches Java runtime with Java bytecode
An interpreter executes a program by processing each Java bytecode
A just-in-time compiler generates native instructions for a target machine from Java bytecode of a hotspot method
[Diagram: Java program (.java) → Java bytecode (.class, .jar) → Java runtime (interpreter and just-in-time compiler) → target machine]
> javac Seq.java
> java Seq
10. Outline
Goal
Motivation
How to Write a Parallel Program in Java
Overview of IBM Java 8 Runtime
Performance Evaluation
Conclusion
11. How to Write a Parallel Loop in Java 8
Express parallelism by using parallel stream APIs among the iterations of a lambda expression (index variable: i)
Example
IntStream.range(0, 5).parallel()
         .forEach(i -> { System.out.println(i); });
// possible output: 0 3 2 4 1
The reference implementation of Java 8 can execute this on multiple CPU threads, for example (in time order):
println(0) on thread 0
println(3) on thread 1
println(2) on thread 2
println(4) on thread 3
println(1) on thread 0
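The loop shape above can be wrapped into a complete program. The sketch below is mine (the class name `ParallelLoop` is an assumption, not from the slides); it uses a deterministic array computation rather than printing, so the result is the same regardless of thread scheduling. On a stock JVM it runs on CPU threads; under IBM Java 8 with GPU support it is a candidate for offloading.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelLoop {
    // Doubles each element of a[] into b[] with a parallel stream,
    // mirroring the slide's loop shape.
    static void foo(float[] a, float[] b, int n) {
        IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
        });
    }

    public static void main(String[] args) {
        int n = 4;
        float[] a = new float[n], b = new float[n];
        for (int i = 0; i < n; i++) a[i] = i + 1;
        foo(a, b, n);
        System.out.println(Arrays.toString(b)); // [2.0, 4.0, 6.0, 8.0]
    }
}
```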
12. Outline
Goal
Motivation
How to Write and Execute a Parallel Program in Java
Overview of IBM Java 8 Runtime
Performance Evaluation
Conclusion
13. Portability among Different Hardware (including GPUs)
A just-in-time compiler in the IBM Java 8 runtime generates native instructions
– for a target machine, including GPUs, from Java bytecode
– for GPUs, exploiting device-specific capabilities more easily than OpenCL
[Diagram: Java program (.java) → Java bytecode (.class, .jar) → IBM Java 8 runtime (interpreter and just-in-time compiler) → target machine]
> javac Par.java
> java Par
Code compiled for GPU:
IntStream.range(0, n)
         .parallel().forEach(i -> {
           ...
         });
14. IBM Java 8 Can Execute the Code on CPU or GPU
Generate code for GPU execution from a parallel loop
– GPU instructions for the code in blue (the lambda body)
– CPU instructions for GPU memory management and data copy
Execute the loop on the CPU or the GPU based on a cost model
– e.g., execute on the CPU if ‘n’ is very small
class Par {
void foo(float[] a, float[] b, float[] c, int n) {
IntStream.range(0, n).parallel().forEach(i -> {
b[i] = a[i] * 2.0f;
c[i] = a[i] * 3.0f;
});
}
}
Note: GPU support in current version is limited to lambdas with one-dimensional arrays and primitive types
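For reference, a complete, compilable version of the loop above (the class name `ParDemo` and the `main` driver are ours): under IBM Java 8 the runtime may offload `foo` to the GPU, while any other Java 8 JDK runs the identical source on multiple CPU threads.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParDemo {
    // Multiply each element of a[] by 2 and 3 into b[] and c[].
    // The lambda uses only one-dimensional arrays and primitive types,
    // matching the restriction for GPU offload noted above.
    static void foo(float[] a, float[] b, float[] c, int n) {
        IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        });
    }

    public static void main(String[] args) {
        int n = 4;
        float[] a = {1f, 2f, 3f, 4f}, b = new float[n], c = new float[n];
        foo(a, b, c, n);
        System.out.println(Arrays.toString(b)); // [2.0, 4.0, 6.0, 8.0]
        System.out.println(Arrays.toString(c)); // [3.0, 6.0, 9.0, 12.0]
    }
}
```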
15. Optimizations for GPUs in IBM Just-In-Time Compiler
Using the read-only cache
– reduces the number of memory transactions to GPU global memory
Optimizing data copy between CPU and GPU
– reduces the amount of data copied
Eliminating redundant exception checks for Java on the GPU
– reduces the number of instructions in the GPU binary
16. Using Read-Only Cache
Automatically detect a read-only array and access it through the
read-only cache
– the read-only cache is faster than the other GPU memories
float[] A = new float[N], B = new float[N], C = new float[N];
foo(A, B, C, N);
void foo(float[] a, float[] b, float[] c, int n) {
IntStream.range(0, n).parallel().forEach(i -> {
b[i] = a[i] * 2.0f;
c[i] = a[i] * 3.0f;
});
}
Equivalent CUDA code:
__global__ void foo(float *a, float *b, float *c, int n) {
int i = ...;
b[i] = __ldg(&a[i]) * 2.0f;
c[i] = __ldg(&a[i]) * 3.0f;
}
17. Optimizing Data Copy between CPU and GPU
Eliminate data copy from GPU to CPU
– if an array (e.g., a[]) is not written on GPU
Eliminate data copy from CPU to GPU
– if an array (e.g., b[] and c[]) is not read on GPU
void foo(float[] a, float[] b, float[] c, int n) {
// Data copy for a[] from CPU to GPU
// No data copy for b[] and c[]
IntStream.range(0, n).parallel().forEach(i -> {
b[i] = a[i] * 2.0f;
c[i] = a[i] * 3.0f;
});
// Data copy for b[] and c[] from GPU to CPU
// No data copy for a[]
}
18. Optimizing Data Copy between CPU and GPU
Eliminate data copy between CPU and GPU
– if an array (e.g., a[] and b[]) that was accessed on the GPU is not
accessed on the CPU between the kernels
// Data copy for a[] from CPU to GPU
for (int t = 0; t < T; t++) {
IntStream.range(0, N*N).parallel().forEach(idx -> {
b[idx] = a[...];
});
// No data copy for b[] between GPU and CPU
IntStream.range(0, N*N).parallel().forEach(idx -> {
a[idx] = b[...];
});
// No data copy for a[] between GPU and CPU
}
// Data copy for a[] and b[] from GPU to CPU
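As a concrete (hypothetical) instance of this ping-pong pattern, the sketch below alternates reading a[] / writing b[] and reading b[] / writing a[]; because neither array is touched on the CPU between kernels, an optimizing runtime can keep both resident in GPU memory and copy each array across the bus only once per direction. The class name, method, and update rule are ours, chosen only to make the data-flow visible.

```java
import java.util.stream.IntStream;

public class PingPong {
    // T iterations that alternate b <- a/2 and a <- b/2.
    // Between the two inner loops neither array is read or written on
    // the CPU, so no intermediate CPU<->GPU copies are needed.
    static void relax(float[] a, float[] b, int n, int T) {
        for (int t = 0; t < T; t++) {
            IntStream.range(0, n).parallel()
                     .forEach(i -> b[i] = a[i] * 0.5f);
            IntStream.range(0, n).parallel()
                     .forEach(i -> a[i] = b[i] * 0.5f);
        }
    }
}
```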
19. How to Support Exception Checks on GPUs
The IBM just-in-time compiler inserts exception checks into the GPU kernel
// code for CPU
{
...
launch GPUkernel(...)
if (exception) {
goto handle_exception;
}
...
}
__device__ GPUkernel(…) {
int i = ...;
if ((a == NULL) || i < 0 || a.length <= i) {
exception = true; return; }
if ((b == NULL) || b.length <= i) {
exception = true; return; }
b[i] = a[i] * 2.0;
if ((c == NULL) || c.length <= i) {
exception = true; return; }
c[i] = a[i] * 3.0;
}
// Java program
IntStream.range(0,n).parallel().
forEach(i -> {
b[i] = a[i] * 2.0f;
c[i] = a[i] * 3.0f;
});
20. Eliminating Redundant Exception Checks
Speculatively perform exception checks on the CPU if each array index
has a simple affine form (x*i + y)
// code for CPU
if (
// check conditions for null pointers
a != null && b != null && c != null &&
// check that every index 0..n-1 is within the array bounds
n <= a.length &&
n <= b.length &&
n <= c.length) {
...
launch GPUkernel(...)
...
} else {
// execute this loop on CPU to produce an exception
}
__device__ GPUkernel(…) {
// no exception check is
// required
i = ...;
b[i] = a[i] * 2.0;
c[i] = a[i] * 3.0;
}
IntStream.range(0,n).parallel().
forEach(i -> {
b[i] = a[i] * 2.0f;
c[i] = a[i] * 3.0f;
});
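The compiler-inserted guard corresponds roughly to the following hand-written Java (a sketch in our own names, not the JIT's actual generated code): if the speculative check succeeds, the loop body can run with no per-access checks, e.g., as a GPU kernel; otherwise a sequential CPU loop raises the exception at the correct iteration, preserving Java semantics.

```java
import java.util.stream.IntStream;

public class GuardedLoop {
    static void foo(float[] a, float[] b, float[] c, int n) {
        // Speculative check: no array is null and every index 0..n-1
        // is provably in bounds, so the body cannot throw.
        if (a != null && b != null && c != null
                && n <= a.length && n <= b.length && n <= c.length) {
            IntStream.range(0, n).parallel().forEach(i -> {
                b[i] = a[i] * 2.0f;   // no exception checks needed here
                c[i] = a[i] * 3.0f;
            });
        } else {
            // Fall back to a sequential loop so the NullPointerException or
            // ArrayIndexOutOfBoundsException occurs at the right iteration.
            for (int i = 0; i < n; i++) {
                b[i] = a[i] * 2.0f;
                c[i] = a[i] * 3.0f;
            }
        }
    }
}
```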
21. Outline
Goal
Motivation
How to Write and Execute a Parallel Program in Java
Overview of IBM Java 8 Runtime
Performance Evaluation
Conclusion
22. Performance Evaluation Methodology
Measured the GPU performance improvement of four programs (next slide) over
– 1-CPU-thread sequential execution
– 160-CPU-thread parallel execution
Experimental environment used
– IBM Java 8 Service Release 2 for PowerPC Little Endian
Download for free at http://www.ibm.com/java/jdk/
– Two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256GB
memory (160 hardware threads in total)
With one NVIDIA Kepler K40m GPU (2880 CUDA cores in total) at 876 MHz
with 12GB global memory (ECC off)
– Ubuntu 14.10, CUDA 5.5
23. Benchmark Programs
Prepare sequential and parallel stream API versions in Java
Name      Summary                                       Data size          Type
MM        Dense matrix multiplication: C = A.B          1,024 × 1,024      double
SpMM      Sparse matrix multiplication: C = A.B         500,000 × 500,000  double
Jacobi2D  Solve an equation using the Jacobi method     8,192 × 8,192      double
LifeGame  Conway's Game of Life, iterated 10,000 times  512 × 512          byte
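For illustration, the MM kernel can be written with the same stream pattern used throughout the talk (our own sketch of a dense matrix multiplication over flattened 1-D arrays, matching the runtime's one-dimensional-array restriction; the class and method names are ours):

```java
import java.util.stream.IntStream;

public class DenseMM {
    // C = A.B for n x n matrices stored row-major in 1-D double arrays,
    // parallelized over the n*n elements of C.
    static void multiply(double[] A, double[] B, double[] C, int n) {
        IntStream.range(0, n * n).parallel().forEach(idx -> {
            int row = idx / n, col = idx % n;
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += A[row * n + k] * B[k * n + col];
            }
            C[idx] = sum;
        });
    }
}
```

A[] and B[] are only read and C[] is only written inside the lambda, so the data-copy optimizations described earlier apply directly: A and B are copied to the GPU, C is copied back.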
24. Performance Improvements of GPU Version over
Sequential and Parallel CPU Versions
Achieve 58.9x on geomean and 317.0x for Jacobi2D over 1 CPU thread
Achieve 3.7x on geomean and 14.8x for Jacobi2D over 160 CPU threads
Performance degrades for SpMM relative to 160 CPU threads
25. Conclusion
Program GPUs using pure Java with standard parallel stream
APIs
The IBM Java 8 runtime compiles a Java program for GPUs, without
annotations, applying several optimizations
– read-only cache exploitation
– data copy optimizations between CPU and GPU
– exception check eliminations
Offer performance improvements using GPUs of
– 58.9x (geometric mean) over sequential execution
– 3.7x (geometric mean) over 160-CPU-thread parallel execution
Details are in our paper “Compiling and Optimizing Java 8 Programs for GPU Execution” (PACT2015)