Java and GPU: where are we
now?
And why?
2
Dmitry Alexandrov
T-Systems | @bercut2000
3
4
5
What is a video card?
A video card (also called a display card, graphics card, display
adapter or graphics adapter) is an expansion card which generates a
feed of output images to a display (such as a computer monitor).
Frequently, these are advertised as discrete or dedicated graphics
cards, emphasizing the distinction between these and integrated
graphics.
6
What is a video card?
But as for today:
Video cards are no longer limited to simple image output; they have a built-in
graphics processor that can perform additional processing, offloading
this task from the central processor of the computer.
7
So what does it do?
8
9
What is a GPU?
• Graphics Processing Unit
10
What is a GPU?
• Graphics Processing Unit
• First used by Nvidia in 1999
11
What is a GPU?
• Graphics Processing Unit
• First used by Nvidia in 1999
• GeForce 256 was marketed as «The world’s first GPU»
12
What is a GPU?
• Defined as a “single-chip processor with integrated transform, lighting,
triangle setup/clipping, and rendering engines capable of processing
a minimum of 10 million polygons per second”
13
What is a GPU?
• Defined as a “single-chip processor with integrated transform, lighting,
triangle setup/clipping, and rendering engines capable of processing
a minimum of 10 million polygons per second”
• ATI called them VPUs..
14
Conceptually, it looks like this
15
GPGPU
• General-purpose computing on graphics processing units
16
GPGPU
• General-purpose computing on graphics processing units
• Performs not only graphics calculations..
17
GPGPU
• General-purpose computing on graphics processing units
• Performs not only graphics calculations..
• … but also those usually performed on the CPU
18
So much cool! We have to use
them!
19
Let’s look at the hardware!
20
Based on
“From Shader Code to a Teraflop: How GPU Shader Cores Work”, By Kayvon Fatahalian, Stanford University
The CPU in general looks like this
21
How to convert?
22
Let’s simplify!
23
Then let’s just clone them
24
To make a lot of them!
25
But we are doing the same
calculation just with different
data
26
So we come to the SIMD paradigm
27
So we use this paradigm
28
And here we start to talk about vectors..
29
… and in the end we are here:
30
Nice! But how on earth can we
code here?!
31
It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
32
It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
• But most of the algorithms were just “hardcoded”
33
It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
• But most of the algorithms were just “hardcoded”
• They were considered “standard”
34
It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
• But most of the algorithms were just “hardcoded”
• They were considered “standard”
• Developers were only able to call them
35
It all started with a shader
• But it’s obvious that not everything can be done with “hardcoded” algorithms
36
It all started with a shader
• But it’s obvious that not everything can be done with “hardcoded” algorithms
• That’s why some vendors “opened access” so developers could use their own
algorithms in their own programs
37
It all started with a shader
• But it’s obvious that not everything can be done with “hardcoded” algorithms
• That’s why some vendors “opened access” so developers could use their own
algorithms in their own programs
• These programs are called Shaders
38
It all started with a shader
• But it’s obvious that not everything can be done with “hardcoded” algorithms
• That’s why some vendors “opened access” so developers could use their own
algorithms in their own programs
• These programs are called Shaders
• From that moment the video card could process transformations, geometry and
textures the way developers wanted!
39
It all started with a shader
• The first shaders came in different kinds:
• Vertex
• Geometry
• Pixel
• Later they were unified into the Common Shader Architecture
40
There are several shader languages
• RenderMan
• OSL
• GLSL
• Cg
• DirectX ASM
• HLSL
• …
41
As an example:
42
With or without them
43
But they are so low level..
44
Keeping in mind that it all started with
gaming…
45
Several abstractions were created:
• OpenGL
• is a cross-language, cross-platform application programming interface (API) for
rendering 2D and 3D vector graphics. The API is typically used to interact with
a graphics processing unit (GPU), to achieve hardware-accelerated rendering.
• Silicon Graphics Inc., (SGI) started developing OpenGL in 1991 and released it in
January 1992;
• DirectX
• is a collection of application programming interfaces (APIs) for handling tasks related
to multimedia, especially game programming and video, on Microsoft platforms.
Originally, the names of these APIs all began with Direct, such
as Direct3D, DirectDraw, DirectMusic, DirectPlay, DirectSound, and so forth. The
name DirectX was coined as a shorthand term for all of these APIs (the X standing in
for the particular API names) and soon became the name of the collection.
46
By the way, what about Java?
47
OpenGL in Java
• JSR – 231
• Started in 2003
• Latest release in 2008
• Supports OpenGL 2.0
48
OpenGL
• Now an independent project, JOGL
• Supports OpenGL up to 4.5
• Provides support for GLU and GLUT
• Access to the low-level C API via JNI
49
50
But somewhere around 2005 it was
finally realized that this could be used
for general computations as well
51
BrookGPU
• Early efforts to use GPGPU
• Own subset of ANSI C
• Brook Streaming Language
• Developed at Stanford University
52
GPGPU
• CUDA — Nvidia C subset proprietary platform.
• DirectCompute — Microsoft proprietary compute shader API, part of
Direct3D, starting from DirectX 10.
• AMD FireStream — ATI proprietary technology.
• OpenACC – multivendor consortium
• C++ AMP – Microsoft proprietary language
• OpenCL – common standard controlled by the Khronos Group.
53
Why should we ever use GPU on Java
• Why Java
• Safe and secure
• Portability (“write once, run everywhere”)
• Used on 3 000 000 000 devices
54
Why should we ever use GPU on Java
• Why Java
• Safe and secure
• Portability (“write once, run everywhere”)
• Used on 3 000 000 000 devices
• Where can we apply GPU
• Data Analytics and Data Science (Hadoop, Spark …)
• Security analytics (log processing)
• Finance/Banking
55
For this we have:
56
But Java runs on the JVM.. and
down there things get low level..
57
For low level we use:
• JNI (Java Native Interface)
• JNA (Java Native Access)
58
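To make the JNA idea concrete, here is a minimal sketch (not from the slides): it binds the native C runtime library and calls its printf straight from Java. The interface name is my own illustration; Native.load is the JNA 5.x entry point (older versions use Native.loadLibrary).

import com.sun.jna.Library;
import com.sun.jna.Native;

// Hypothetical example: map the C runtime and call a native function from Java.
public interface CRuntime extends Library {
    // "c" is the C runtime on Linux/macOS; on Windows it would be "msvcrt".
    CRuntime INSTANCE = Native.load("c", CRuntime.class);

    // Direct mapping of the native printf(const char *format, ...)
    int printf(String format, Object... args);
}

// Usage:
// CRuntime.INSTANCE.printf("Hello from native code: %d\n", 42);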
But we can go crazy there..
59
Someone actually did this…
60
But maybe there is something
done already?
61
For OpenCL:
• JOCL
• JogAmp
• JavaCL (not supported anymore)
62
.. and for CUDA
• JCuda
• Cublas
• JCufft
• JCurand
• JCusparse
• JCusolver
• Jnvgraph
• Jcudpp
• JNpp
• JCudnn
63
Disclaimer: it’s hard to work with a GPU!
• It’s not just running a program
• You need to know your hardware!
• It’s low level..
64
Let’s start with:
65
What’s that?
• Short for Open Computing Language
• Consortium of Apple, nVidia, AMD, IBM, Intel, ARM, Motorola and
many more
• Very abstract model
• Works both on GPU and CPU
66
Should work on everything
67
All in all it works like this:
HOST DEVICE
Data
Program/Kernel
68
All in all it works like this:
HOST
69
All in all it works like this:
HOST DEVICE
Result
70
Typical lifecycle of an OpenCL app
• Create context
• Create command queue
• Create memory buffers/fill with data
• Create program from sources/load binaries
• Compile (if required)
• Create kernel from the program
• Supply kernel arguments
• Define ND range
• Execute
• Return resulting data
• Release resources
71
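The slides show the kernels but not the Java host side, so here is a hedged sketch of the whole lifecycle above using the JOCL bindings, for a trivial vector addition. The kernel source, names and sizes are illustrative, not from the talk.

import static org.jocl.CL.*;
import org.jocl.*;

public class VectorAdd {
    // Device code, kept as a plain Java string
    private static final String KERNEL_SOURCE =
        "__kernel void add(__global const float *a, __global const float *b, __global float *c) {" +
        "  int gid = get_global_id(0);" +
        "  c[gid] = a[gid] + b[gid];" +
        "}";

    public static void main(String[] args) {
        int n = 1024;
        float[] srcA = new float[n], srcB = new float[n], dst = new float[n];
        for (int i = 0; i < n; i++) { srcA[i] = i; srcB[i] = 2 * i; }
        CL.setExceptionsEnabled(true);

        // Pick the first platform and its default device
        cl_platform_id[] platforms = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);
        cl_device_id[] devices = new cl_device_id[1];
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_DEFAULT, 1, devices, null);

        // Create context and command queue
        cl_context context = clCreateContext(null, 1, devices, null, null, null);
        cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, null);

        // Create memory buffers / fill with data
        cl_mem memA = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                Sizeof.cl_float * n, Pointer.to(srcA), null);
        cl_mem memB = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                Sizeof.cl_float * n, Pointer.to(srcB), null);
        cl_mem memC = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                Sizeof.cl_float * n, null, null);

        // Create program from source, compile, create kernel, supply arguments
        cl_program program = clCreateProgramWithSource(context, 1,
                new String[]{ KERNEL_SOURCE }, null, null);
        clBuildProgram(program, 0, null, null, null, null);
        cl_kernel kernel = clCreateKernel(program, "add", null);
        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(memA));
        clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(memB));
        clSetKernelArg(kernel, 2, Sizeof.cl_mem, Pointer.to(memC));

        // Define the ND range and execute
        long[] globalWorkSize = new long[]{ n };
        clEnqueueNDRangeKernel(queue, kernel, 1, null, globalWorkSize, null, 0, null, null);

        // Return resulting data
        clEnqueueReadBuffer(queue, memC, CL_TRUE, 0, Sizeof.cl_float * n,
                Pointer.to(dst), 0, null, null);

        // Release resources
        clReleaseMemObject(memA); clReleaseMemObject(memB); clReleaseMemObject(memC);
        clReleaseKernel(kernel); clReleaseProgram(program);
        clReleaseCommandQueue(queue); clReleaseContext(context);
    }
}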
Better take a look
72
73
1. There is the host code. It’s in
Java.
74
2. There is the device code: a
specific subset of C.
75
3. Communication between the
host and the device is done via
memory buffers.
76
So what can we actually transfer?
77
The data is not quite the same..
78
Datatypes: scalars
79
Datatypes: vectors
80
Datatypes: vectors
float f = 4.0f;
float3 f3 = (float3)(1.0f, 2.0f, 3.0f);
float4 f4 = (float4)(f3, f);
//f4.x = 1.0f,
//f4.y = 2.0f,
//f4.z = 3.0f,
//f4.w = 4.0f
81
So how are they saved there?
82
So how are they saved there?
In a complicated way..
83
Memory Model
• __global
• __constant
• __local
• __private
84
Memory Model
85
But that’s not all
86
Remember SIMD?
87
Execution model
• We’ve got a lot of data
• We need to perform the same computations over them
• So we can just shard them
• OpenCL is here to help us
88
Execution model
89
ND Range – what is that?
90
For example: matrix multiplication
• We would write it like this:
void MatrixMul_sequential(int dim, float *A, float *B, float *C) {
  for (int iRow = 0; iRow < dim; ++iRow) {
    for (int iCol = 0; iCol < dim; ++iCol) {
      float result = 0.f;
      for (int i = 0; i < dim; ++i) {
        result += A[iRow*dim + i] * B[i*dim + iCol];
      }
      C[iRow*dim + iCol] = result;
    }
  }
}
91
For example: matrix multiplication
92
For example: matrix multiplication
• So on the GPU:
__kernel void MatrixMul_kernel_basic(int dim,
    __global float *A, __global float *B, __global float *C) {
  // Get the index of the work-item
  int iCol = get_global_id(0);
  int iRow = get_global_id(1);
  float result = 0.0f;
  for (int i = 0; i < dim; ++i) {
    result += A[iRow*dim + i] * B[i*dim + iCol];
  }
  C[iRow*dim + iCol] = result;
}
93
For example: matrix multiplication
• So on the GPU:
__kernel void MatrixMul_kernel_basic(int dim,
    __global float *A, __global float *B, __global float *C) {
  // Get the index of the work-item
  int iCol = get_global_id(0);
  int iRow = get_global_id(1);
  float result = 0.0f;
  for (int i = 0; i < dim; ++i) {
    result += A[iRow*dim + i] * B[i*dim + iCol];
  }
  C[iRow*dim + iCol] = result;
}
94
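For completeness, a sketch of how the host side could launch this kernel from JOCL with a two-dimensional ND range. It assumes the kernel source above is in a String called matrixMulSource, and that context, queue and the dim*dim float buffers memA, memB, memC (plus a host array C for the result) were created as in the earlier lifecycle sketch; all names are illustrative.

// Build the program and kernel for MatrixMul_kernel_basic
cl_program program = clCreateProgramWithSource(context, 1,
        new String[]{ matrixMulSource }, null, null);
clBuildProgram(program, 0, null, null, null, null);
cl_kernel kernel = clCreateKernel(program, "MatrixMul_kernel_basic", null);

clSetKernelArg(kernel, 0, Sizeof.cl_int, Pointer.to(new int[]{ dim }));
clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(memA));
clSetKernelArg(kernel, 2, Sizeof.cl_mem, Pointer.to(memB));
clSetKernelArg(kernel, 3, Sizeof.cl_mem, Pointer.to(memC));

// Two-dimensional ND range: one work-item per output element.
// get_global_id(0) becomes iCol, get_global_id(1) becomes iRow.
long[] globalWorkSize = new long[]{ dim, dim };
clEnqueueNDRangeKernel(queue, kernel, 2, null, globalWorkSize, null, 0, null, null);

// Read the result matrix back to the host array C
clEnqueueReadBuffer(queue, memC, CL_TRUE, 0, Sizeof.cl_float * dim * dim,
        Pointer.to(C), 0, null, null);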
Typical GPU
--- Info for device GeForce GT 650M: ---
CL_DEVICE_NAME: GeForce GT 650M
CL_DEVICE_VENDOR: NVIDIA
CL_DRIVER_VERSION: 10.14.20 355.10.05.15f03
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 900 MHz
CL_DEVICE_ADDRESS_BITS: 64
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 256 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 256
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 16
CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF
CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT
CL_DEVICE_2D_MAX_WIDTH 16384
CL_DEVICE_2D_MAX_HEIGHT 16384
CL_DEVICE_3D_MAX_WIDTH 2048
CL_DEVICE_3D_MAX_HEIGHT 2048
CL_DEVICE_3D_MAX_DEPTH 2048
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
95
Typical CPU
--- Info for device Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz: ---
CL_DEVICE_NAME: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CL_DEVICE_VENDOR: Intel
CL_DRIVER_VERSION: 1.1
CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS: 8
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1 / 1
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 2600 MHz
CL_DEVICE_ADDRESS_BITS: 64
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 2048 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 8192 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: global
CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF
CL_FP_FMA CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT
CL_DEVICE_2D_MAX_WIDTH 8192
CL_DEVICE_2D_MAX_HEIGHT 8192
CL_DEVICE_3D_MAX_WIDTH 2048
CL_DEVICE_3D_MAX_HEIGHT 2048
CL_DEVICE_3D_MAX_DEPTH 2048
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
96
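Dumps like the two above can be produced from Java as well. A hedged JOCL sketch that queries a few of the same CL_DEVICE_* properties, assuming a cl_device_id called device obtained as in the lifecycle sketch:

// String property: first ask for the size, then for the value
long[] size = new long[1];
clGetDeviceInfo(device, CL_DEVICE_NAME, 0, null, size);
byte[] buffer = new byte[(int) size[0]];
clGetDeviceInfo(device, CL_DEVICE_NAME, buffer.length, Pointer.to(buffer), null);
System.out.println("CL_DEVICE_NAME: " + new String(buffer, 0, buffer.length - 1));

// Numeric properties go straight into primitive arrays
// (cl_uint is 4 bytes, cl_ulong is 8 bytes, so cl_int/cl_long sizes fit)
int[] computeUnits = new int[1];
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, Sizeof.cl_int, Pointer.to(computeUnits), null);
long[] globalMem = new long[1];
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, Sizeof.cl_long, Pointer.to(globalMem), null);
System.out.println("CL_DEVICE_MAX_COMPUTE_UNITS: " + computeUnits[0]);
System.out.println("CL_DEVICE_GLOBAL_MEM_SIZE: " + (globalMem[0] / (1024 * 1024)) + " MByte");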
And what about CUDA?
97
And what about CUDA?
Well.. It looks to be easier
98
And what about CUDA?
Well.. It looks to be easier
for C developers…
99
CUDA kernel
#define N 10
__global__ void add( int *a, int *b, int *c ) {
int tid = blockIdx.x; // this thread handles the data at its thread id
if (tid < N)
c[tid] = a[tid] + b[tid];
}
100
CUDA setup
int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c;
// allocate the memory on the GPU
cudaMalloc( (void**)&dev_a, N * sizeof(int) );
cudaMalloc( (void**)&dev_b, N * sizeof(int) );
cudaMalloc( (void**)&dev_c, N * sizeof(int) );
// fill the arrays 'a' and 'b' on the CPU
for (int i=0; i<N; i++) {
a[i] = -i;
b[i] = i * i;
}
101
CUDA copy to memory and run
// copy the arrays 'a' and 'b' to the GPU
cudaMemcpy(dev_a, a, N *sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(dev_b,b,N*sizeof(int),
cudaMemcpyHostToDevice);
add<<<N,1>>>(dev_a,dev_b,dev_c);
// copy the array 'c' back from the GPU to the CPU
cudaMemcpy(c,dev_c,N*sizeof(int),
cudaMemcpyDeviceToHost);
102
CUDA get results
// display the results
for (int i=0; i<N; i++) {
printf( "%d + %d = %d\n", a[i], b[i], c[i] );
}
// free the memory allocated on the GPU
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
103
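From Java, the same host sequence can be written with JCuda’s driver API. A hedged sketch, assuming the add kernel above has been compiled to PTX (e.g. nvcc -ptx add.cu -o add.ptx); the file name and class name are illustrative.

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.*;
import static jcuda.driver.JCudaDriver.*;

public class JCudaAdd {
    public static void main(String[] args) {
        final int N = 10;
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load the compiled kernel and look up the "add" function
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "add.ptx");
        CUfunction add = new CUfunction();
        cuModuleGetFunction(add, module, "add");

        // Fill host arrays, allocate device memory, copy host -> device
        int[] a = new int[N], b = new int[N], c = new int[N];
        for (int i = 0; i < N; i++) { a[i] = -i; b[i] = i * i; }
        CUdeviceptr devA = new CUdeviceptr(), devB = new CUdeviceptr(), devC = new CUdeviceptr();
        cuMemAlloc(devA, N * Sizeof.INT);
        cuMemAlloc(devB, N * Sizeof.INT);
        cuMemAlloc(devC, N * Sizeof.INT);
        cuMemcpyHtoD(devA, Pointer.to(a), N * Sizeof.INT);
        cuMemcpyHtoD(devB, Pointer.to(b), N * Sizeof.INT);

        // Equivalent of add<<<N,1>>>(dev_a, dev_b, dev_c)
        Pointer kernelParams = Pointer.to(Pointer.to(devA), Pointer.to(devB), Pointer.to(devC));
        cuLaunchKernel(add, N, 1, 1,   // grid dimensions
                1, 1, 1,               // block dimensions
                0, null, kernelParams, null);
        cuCtxSynchronize();

        // Copy the result back and clean up
        cuMemcpyDtoH(Pointer.to(c), devC, N * Sizeof.INT);
        cuMemFree(devA); cuMemFree(devB); cuMemFree(devC);
        for (int i = 0; i < N; i++) {
            System.out.printf("%d + %d = %d%n", a[i], b[i], c[i]);
        }
    }
}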
But CUDA has some other superpowers
• Cublas – all about matrices
• JCufft – Fast Fourier Transform
• Jcurand – all about random
• JCusparse – sparse matrices
• Jcusolver – factorization and some other crazy stuff
• Jnvgraph – all about graphs
• Jcudpp – CUDA Data Parallel Primitives Library, and some sorting
• JNpp – image processing on GPU
• Jcudnn – Deep Neural Network library (that’s scary)
104
For example we need a good rand
int n = 100;
curandGenerator generator = new curandGenerator();
float hostData[] = new float[n];
Pointer deviceData = new Pointer();
cudaMalloc(deviceData, n * Sizeof.FLOAT);
curandCreateGenerator(generator, CURAND_RNG_PSEUDO_DEFAULT);
curandSetPseudoRandomGeneratorSeed(generator, 1234);
curandGenerateUniform(generator, deviceData, n);
cudaMemcpy(Pointer.to(hostData), deviceData,
n * Sizeof.FLOAT, cudaMemcpyDeviceToHost);
System.out.println(Arrays.toString(hostData));
curandDestroyGenerator(generator);
cudaFree(deviceData);
105
For example we need a good rand
• With a strong theory underneath
• Developed by the Russian mathematician Ilya Sobol back in 1967
• https://en.wikipedia.org/wiki/Sobol_sequence
106
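To actually get a Sobol sequence from the JCurand example above, only the generator type changes, plus setting the number of dimensions for quasi-random generators. A small sketch under the assumption that the same generator/deviceData setup as before is in scope:

// Quasi-random Sobol generator instead of the default pseudo-random one
curandCreateGenerator(generator, CURAND_RNG_QUASI_SOBOL32);
curandSetQuasiRandomGeneratorDimensions(generator, 1);
curandGenerateUniform(generator, deviceData, n);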
nVidia memory looks like this
107
Btw.. Talking about memory
108
©Wikipedia
Optimizations…
__kernel void MatrixMul_kernel_basic(int dim,
__global float *A,
__global float *B,
__global float *C){
int iCol = get_global_id(0);
int iRow = get_global_id(1);
float result = 0.0;
for(int i=0;i< dim;++i)
{
result +=
A[iRow*dim + i]*B[i*dim + iCol];
}
C[iRow*dim + iCol] = result;
}
109
<—Optimizations
#define VECTOR_SIZE 4
__kernel void MatrixMul_kernel_basic_vector4(int dim,
__global float4 *A,
__global float4 *B,
__global float *C) {
int localIdx = get_global_id(0);
int localIdy = get_global_id(1);
float result = 0.0;
float4 Bvector[4];
float4 Avector, temp;
float4 resultVector[4] = {0,0,0,0};
int rowElements = dim/VECTOR_SIZE;
for(int i=0; i<rowElements; ++i){
Avector = A[localIdy*rowElements + i];
Bvector[0] = B[dim*i + localIdx];
Bvector[1] = B[dim*i + rowElements + localIdx];
Bvector[2] = B[dim*i + 2*rowElements + localIdx];
Bvector[3] = B[dim*i + 3*rowElements + localIdx];
temp = (float4)(Bvector[0].x, Bvector[1].x, Bvector[2].x, Bvector[3].x);
resultVector[0] += Avector * temp;
temp = (float4)(Bvector[0].y, Bvector[1].y, Bvector[2].y, Bvector[3].y);
resultVector[1] += Avector * temp;
temp = (float4)(Bvector[0].z, Bvector[1].z, Bvector[2].z, Bvector[3].z);
resultVector[2] += Avector * temp;
temp = (float4)(Bvector[0].w, Bvector[1].w, Bvector[2].w, Bvector[3].w);
resultVector[3] += Avector * temp;
}
C[localIdy*dim + localIdx*VECTOR_SIZE] = resultVector[0].x + resultVector[0].y + resultVector[0].z + resultVector[0].w;
C[localIdy*dim + localIdx*VECTOR_SIZE + 1] = resultVector[1].x + resultVector[1].y + resultVector[1].z + resultVector[1].w;
C[localIdy*dim + localIdx*VECTOR_SIZE + 2] = resultVector[2].x + resultVector[2].y + resultVector[2].z + resultVector[2].w;
C[localIdy*dim + localIdx*VECTOR_SIZE + 3] = resultVector[3].x + resultVector[3].y + resultVector[3].z + resultVector[3].w;
}
110
<—Optimizations
#define VECTOR_SIZE 4
__kernel void MatrixMul_kernel_basic_vector4(int dim,
__global float4 *A,
__global float4 *B,
__global float *C) {
int localIdx = get_global_id(0);
int localIdy = get_global_id(1);
float result = 0.0;
float4 Bvector[4];
float4 Avector, temp;
float4 resultVector[4] = {0,0,0,0};
int rowElements = dim/VECTOR_SIZE;
for(int i=0; i<rowElements; ++i){
Avector = A[localIdy*rowElements + i];
Bvector[0] = B[dim*i + localIdx];
Bvector[1] = B[dim*i + rowElements + localIdx];
Bvector[2] = B[dim*i + 2*rowElements + localIdx];
Bvector[3] = B[dim*i + 3*rowElements + localIdx];
temp = (float4)(Bvector[0].x, Bvector[1].x, Bvector[2].x, Bvector[3].x);
resultVector[0] += Avector * temp;
temp = (float4)(Bvector[0].y, Bvector[1].y, Bvector[2].y, Bvector[3].y);
resultVector[1] += Avector * temp;
temp = (float4)(Bvector[0].z, Bvector[1].z, Bvector[2].z, Bvector[3].z);
resultVector[2] += Avector * temp;
temp = (float4)(Bvector[0].w, Bvector[1].w, Bvector[2].w, Bvector[3].w);
resultVector[3] += Avector * temp;
}
C[localIdy*dim + localIdx*VECTOR_SIZE] = resultVector[0].x + resultVector[0].y + resultVector[0].z + resultVector[0].w;
C[localIdy*dim + localIdx*VECTOR_SIZE + 1] = resultVector[1].x + resultVector[1].y + resultVector[1].z + resultVector[1].w;
C[localIdy*dim + localIdx*VECTOR_SIZE + 2] = resultVector[2].x + resultVector[2].y + resultVector[2].z + resultVector[2].w;
C[localIdy*dim + localIdx*VECTOR_SIZE + 3] = resultVector[3].x + resultVector[3].y + resultVector[3].z + resultVector[3].w;
}
111
But we don’t want to have C at
all…
112
We don’t want to think about
those hosts and devices…
113
We can use GPU partially..
114
Project Sumatra
• Research project
115
Project Sumatra
• Research project
• Focused on Java 8
116
Project Sumatra
• Research project
• Focused on Java 8
• … to be more precise on streams
117
Project Sumatra
• Research project
• Focused on Java 8
• … to be more precise on streams
• … and to be even more precise, on lambdas and .forEach()
118
AMD HSAIL
119
AMD HSAIL
120
AMD HSAIL
• Detects the forEach() block
• Generates HSAIL code with Graal
• At the low level, supplies the
kernel generated from the lambda
to the GPU
121
AMD APU tries to solve the main issue..
122
©Wikipedia
But if we want some more
general solution..
123
IBM patched JVM for GPU
• Focused on CUDA (for now)
• Focused on Stream API
• Created their own .parallel()
124
IBM patched JVM for GPU
Imagine:
void fooJava(float[] a, float[] b, int n) {
  // similar to: for (int i = 0; i < n; i++)
  IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0f; });
}
125
IBM patched JVM for GPU
Imagine:
void fooJava(float[] a, float[] b, int n) {
  // similar to: for (int i = 0; i < n; i++)
  IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0f; });
}
… we would like the lambda to be automatically converted to GPU code…
126
IBM patched JVM for GPU
When n is big, the lambda code is executed on the GPU:
class Par {
  void foo(float[] a, float[] b, float[] c, int n) {
    IntStream.range(0, n).parallel()
      .forEach(i -> {
        b[i] = a[i] * 2.0f;
        c[i] = a[i] * 3.0f;
      });
  }
}
*only lambdas over primitive types in one-dimensional arrays.
127
IBM patched JVM for GPU
Optimized IBM JIT compiler:
• Use read-only cache
• Fewer writes to global GPU memory
• Optimized host-to-device data copy rate
• Less data to be copied
• Eliminate exceptions as much as possible
• In the GPU Kernel
128
IBM patched JVM for GPU
• Success story:
129
IBM patched JVM for GPU
• Officially:
130
IBM patched JVM for GPU
• More info:
https://github.com/IBMSparkGPU/GPUEnabler
131
But can we just write Java
and have it automatically converted to
OpenCL/CUDA?
132
Yes, you can!
133
Aparapi is there for you!
134
Aparapi
• Short for «A PARallel API»
135
Aparapi
• Short for «A PARallel API»
• Works like Hibernate for databases
136
Aparapi
• Short for «A PARallel API»
• Works like Hibernate for databases
• Dynamically converts JVM Bytecode to code for Host and Device
137
Aparapi
• Short for «A PARallel API»
• Works like Hibernate for databases
• Dynamically converts JVM Bytecode to code for Host and Device
• OpenCL under the cover
138
Aparapi
• Started by AMD
139
Aparapi
• Started by AMD
• Then abandoned…
140
Aparapi
• Started by AMD
• Then abandoned…
• Five years later, open-sourced under the Apache 2.0 license
141
Aparapi
• Started by AMD
• Then abandoned…
• Five years later, open-sourced under the Apache 2.0 license
• Back to life!!!
142
Aparapi – now it’s so simple!
public static void main(String[] _args) {
  final int size = 512;
  final float[] a = new float[size];
  final float[] b = new float[size];
  for (int i = 0; i < size; i++) {
    a[i] = (float) (Math.random() * 100);
    b[i] = (float) (Math.random() * 100);
  }
  final float[] sum = new float[size];
  Kernel kernel = new Kernel(){
    @Override public void run() {
      int gid = getGlobalId();
      sum[gid] = a[gid] + b[gid];
    }
  };
  kernel.execute(Range.create(size));
  for (int i = 0; i < size; i++) {
    System.out.printf("%6.2f + %6.2f = %8.2f\n", a[i], b[i], sum[i]);
  }
  kernel.dispose();
}
143
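To tie this back to the matrix multiplication from the OpenCL part: the same kernel can be expressed in pure Java with Aparapi and a two-dimensional Range. A sketch under the assumption that dim and the flattened dim*dim arrays A, B, C are final and in scope; names and the size are illustrative.

final int dim = 1024;                       // illustrative size
final float[] A = new float[dim * dim];
final float[] B = new float[dim * dim];
final float[] C = new float[dim * dim];
// ... fill A and B ...

Kernel matMul = new Kernel() {
    @Override public void run() {
        // getGlobalId(0)/(1) play the role of get_global_id(0)/(1) in the OpenCL kernel
        int iCol = getGlobalId(0);
        int iRow = getGlobalId(1);
        float result = 0f;
        for (int i = 0; i < dim; i++) {
            result += A[iRow * dim + i] * B[i * dim + iCol];
        }
        C[iRow * dim + iCol] = result;
    }
};
matMul.execute(Range.create2D(dim, dim));   // 2D ND range: one work-item per C element
matMul.dispose();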
But what about the clouds?
144
We can’t sell our product if it’s
not cloud native!
145
nVidia is your friend!
146
nVidia GRID
• Announced in 2012
• Already in production
• Works on most of the
hypervisors
• .. And in the clouds!
147
nVidia GRID
148
nVidia GRID
149
… AMD is a bit behind…
150
Anyway, it’s here!
151
It’s here: Nvidia GPU
152
It’s here: ATI Radeon
153
It’s here: AMD APU
154
It’s here: Intel Skylake
155
It’s here: Nvidia Tegra Parker
156
Intel with VEGA??
157
But first read:
158
So use it!
159
So use it!
If the task is suitable
160
…it’s hard,
but worth it!
161
You will rule ’em all!
162
Thanks!
Dank je!
Merci beaucoup!
163
164
