
Java on the GPU: Where are we now?

Slides from Devoxx Belgium 2017


Java on the GPU: Where are we now?

  1. 1. Java and GPU: where are we now? And why? 2
  2. 2. Dmitry Alexandrov T-Systems | @bercut2000 3
  3. 3. 4
  4. 4. 5
  5. 5. What is a video card? A video card (also called a display card, graphics card, display adapter or graphics adapter) is an expansion card which generates a feed of output images to a display (such as a computer monitor). Frequently, these are advertised as discrete or dedicated graphics cards, emphasizing the distinction between these and integrated graphics. 6
  6. 6. What is a video card? But as of today, video cards are not limited to simple image output: they have a built-in graphics processor that can perform additional processing, offloading this task from the computer’s central processor. 7
  7. 7. So what does it do? 8
  8. 8. 9
  9. 9. What is a GPU? • Graphics Processing Unit 10
  10. 10. What is a GPU? • Graphics Processing Unit • First used by Nvidia in 1999 11
  11. 11. What is a GPU? • Graphics Processing Unit • First used by Nvidia in 1999 • GeForce 256 was marketed as «The world’s first GPU» 12
  12. 12. What is a GPU? • Defined as “a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines capable of processing a minimum of 10 million polygons per second” 13
  13. 13. What is a GPU? • Defined as “a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines capable of processing a minimum of 10 million polygons per second” • ATI called theirs VPUs… 14
  14. 14. Conceptually, it looks like this 15
  15. 15. GPGPU • General-purpose computing on graphics processing units 16
  16. 16. GPGPU • General-purpose computing on graphics processing units • Performs not only graphic calculations.. 17
  17. 17. GPGPU • General-purpose computing on graphics processing units • Performs not only graphic calculations.. • … but also those usually performed on CPU 18
  18. 18. So much cool! We have to use them! 19
  19. 19. Let’s look at the hardware! 20 Based on “From Shader Code to a Teraflop: How GPU Shader Cores Work”, By Kayvon Fatahalian, Stanford University
  20. 20. The CPU in general looks like this 21
  21. 21. How to convert? 22
  22. 22. Let’s simplify! 23
  23. 23. Then let’s just clone them 24
  24. 24. To make a lot of them! 25
  25. 25. But we are doing the same calculation just with different data 26
  26. 26. So we arrive at the SIMD paradigm 27
  27. 27. So we use this paradigm 28
  28. 28. And here we start to talk about vectors.. 29
  29. 29. … and in the end we are here: 30
  30. 30. Nice! But how on earth can we code here?! 31
  31. 31. It all started with a shader • Cool video cards were able to offload some of the tasks from the CPU 32
  32. 32. It all started with a shader • Cool video cards were able to offload some of the tasks from the CPU • But most of the algorithms were just “hardcoded” 33
  33. 33. It all started with a shader • Cool video cards were able to offload some of the tasks from the CPU • But most of the algorithms were just “hardcoded” • They were considered “standard” 34
  34. 34. It all started with a shader • Cool video cards were able to offload some of the tasks from the CPU • But most of the algorithms were just “hardcoded” • They were considered “standard” • Developers could just call them 35
  35. 35. It all started with a shader • But it’s obvious that not everything can be done with “hardcoded” algorithms 36
  36. 36. It all started with a shader • But it’s obvious that not everything can be done with “hardcoded” algorithms • That’s why some of the vendors “opened access” for developers to run their own algorithms as their own programs 37
  37. 37. It all started with a shader • But it’s obvious that not everything can be done with “hardcoded” algorithms • That’s why some of the vendors “opened access” for developers to run their own algorithms as their own programs • These programs are called Shaders 38
  38. 38. It all started with a shader • But it’s obvious that not everything can be done with “hardcoded” algorithms • That’s why some of the vendors “opened access” for developers to run their own algorithms as their own programs • These programs are called Shaders • From this moment the video card could process transformations, geometry and textures the way the developers wanted! 39
  39. 39. It all started with a shader • The first shaders came in different kinds: • Vertex • Geometry • Pixel • Then they were unified into the Common Shader Architecture 40
  40. 40. There are several shader languages • RenderMan • OSL • GLSL • Cg • DirectX ASM • HLSL • … 41
  41. 41. As an example: 42
  42. 42. With or without them 43
  43. 43. But they are so low level.. 44
  44. 44. Keeping in mind that it all started with gaming… 45
  45. 45. Several abstractions were created: • OpenGL • is a cross-language, cross-platform application programming interface (API) for rendering 2D and 3D vector graphics. The API is typically used to interact with a graphics processing unit (GPU), to achieve hardware-accelerated rendering. • Silicon Graphics, Inc. (SGI) started developing OpenGL in 1991 and released it in January 1992; • DirectX • is a collection of application programming interfaces (APIs) for handling tasks related to multimedia, especially game programming and video, on Microsoft platforms. Originally, the names of these APIs all began with Direct, such as Direct3D, DirectDraw, DirectMusic, DirectPlay, DirectSound, and so forth. The name DirectX was coined as a shorthand term for all of these APIs (the X standing in for the particular API names) and soon became the name of the collection. 46
  46. 46. By the way, what about Java? 47
  47. 47. OpenGL in Java • JSR – 231 • Started in 2003 • Latest release in 2008 • Supports OpenGL 2.0 48
  48. 48. OpenGL • Now an independent project, JOGL • Supports OpenGL up to 4.5 • Provides support for GLU and GLUT • Access to the low-level C API via JNI 49
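To give a feel for what JOGL code looks like, here is a minimal sketch of the classic fixed-function triangle. It assumes JOGL 2.x (JogAmp) on the classpath; the class name and window details are invented for the example.

```java
import com.jogamp.opengl.*;
import com.jogamp.opengl.awt.GLCanvas;
import javax.swing.JFrame;

public class HelloJogl {
    public static void main(String[] args) {
        // An AWT canvas bound to a GL2 (fixed-function) profile
        GLCanvas canvas = new GLCanvas(new GLCapabilities(GLProfile.get(GLProfile.GL2)));
        canvas.addGLEventListener(new GLEventListener() {
            public void init(GLAutoDrawable d) { }
            public void dispose(GLAutoDrawable d) { }
            public void reshape(GLAutoDrawable d, int x, int y, int w, int h) { }
            public void display(GLAutoDrawable d) {
                GL2 gl = d.getGL().getGL2();
                gl.glClear(GL.GL_COLOR_BUFFER_BIT);
                gl.glBegin(GL2.GL_TRIANGLES);              // immediate-mode triangle
                gl.glColor3f(1f, 0f, 0f); gl.glVertex2f(-0.5f, -0.5f);
                gl.glColor3f(0f, 1f, 0f); gl.glVertex2f( 0.5f, -0.5f);
                gl.glColor3f(0f, 0f, 1f); gl.glVertex2f( 0.0f,  0.5f);
                gl.glEnd();
            }
        });
        JFrame frame = new JFrame("JOGL triangle");
        frame.add(canvas);
        frame.setSize(400, 400);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);
    }
}
```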
  49. 49. 50
  50. 50. But somewhere around 2005 it was finally realized that GPUs can be used for general computations as well 51
  51. 51. BrookGPU • An early effort at GPGPU • Its own subset of ANSI C • The Brook streaming language • Made at Stanford University 52
  52. 52. GPGPU • CUDA — Nvidia’s proprietary platform with its own C subset. • DirectCompute — Microsoft’s proprietary shader language, part of Direct3D, starting from DirectX 10. • AMD FireStream — ATI’s proprietary technology. • OpenACC – a multivendor consortium standard • C++ AMP – a Microsoft proprietary language • OpenCL – a common standard controlled by the Khronos Group. 53
  53. 53. Why should we ever use GPU on Java • Why Java • Safe and secure • Portability (“write once, run everywhere”) • Used on 3 000 000 000 devices 54
  54. 54. Why should we ever use GPU on Java • Why Java • Safe and secure • Portability (“write once, run everywhere”) • Used on 3 000 000 000 devices • Where can we apply GPU • Data Analytics and Data Science (Hadoop, Spark …) • Security analytics (log processing) • Finance/Banking 55
  55. 55. For this we have: 56
  56. 56. But Java runs on the JVM… and down there we have to deal with some low level… 57
  57. 57. For low level we use: • JNI (Java Native Interface) • JNA (Java Native Access) 58
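As a taste of the difference: with JNI you write and compile C glue code against generated headers, while JNA maps a native library onto a plain Java interface at runtime. A minimal sketch, assuming JNA 5.x, binding printf from the standard C library ("c" assumes a Unix-like libc; on Windows it would be "msvcrt"):

```java
import com.sun.jna.Library;
import com.sun.jna.Native;

public class JnaHello {
    // Declare only the native functions we need; JNA builds the binding at
    // runtime, no hand-written JNI stubs or extra .c files required.
    public interface CLibrary extends Library {
        CLibrary INSTANCE = Native.load("c", CLibrary.class);
        int printf(String format, Object... args);
    }

    public static void main(String[] args) {
        CLibrary.INSTANCE.printf("Hello from native printf: %d\n", 42);
    }
}
```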
  58. 58. But we can go crazy there.. 59
  59. 59. Someone actually did this… 60
  60. 60. But maybe something has already been done? 61
  61. 61. For OpenCL: • JOCL • JogAmp • JavaCL (not supported anymore) 62
  62. 62. … and for CUDA • JCuda • JCublas • JCufft • JCurand • JCusparse • JCusolver • JNvgraph • JCudpp • JNpp • JCudnn 63
  63. 63. Disclaimer: it’s hard to work with GPUs! • It’s not just running a program • You need to know your hardware! • It’s low level… 64
  64. 64. Let’s start with: 65
  65. 65. What’s that? • Short for Open Computing Language • A consortium of Apple, nVidia, AMD, IBM, Intel, ARM, Motorola and many more • Very abstract model • Works both on GPU and CPU 66
  66. 66. Should work on everything 67
  67. 67. All in all it works like this: HOST DEVICE Data Program/Kernel 68
  68. 68. All in all it works like this: HOST 69
  69. 69. All in all it works like this: HOST DEVICE Result 70
  70. 70. Typical lifecycle of an OpenCL app • Create context • Create command queue • Create memory buffers/fill with data • Create program from sources/load binaries • Compile (if required) • Create kernel from the program • Supply kernel arguments • Define ND range • Execute • Return resulting data • Release resources 71
  71. 71. Better take a look 72
  72. 72. 73
  73. 73. 1. There is the host code. It’s in Java. 74
  74. 74. 2. There is the device code, written in a specific subset of C. 75
  75. 75. 3. Communication between the host and the device is done via memory buffers. 76
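Putting the lifecycle listed above together, a host-side sketch in Java with JOCL might look as follows. The kernel, array sizes and class name are invented for the example, and error handling is left to JOCL's exception mode; treat it as a sketch, not the speaker's exact demo.

```java
import static org.jocl.CL.*;
import org.jocl.*;

public class VectorAddJocl {
    // Device code: a trivial OpenCL C kernel kept as a Java string
    private static final String KERNEL_SOURCE =
        "__kernel void add(__global const float *a," +
        "                  __global const float *b," +
        "                  __global float *c) {" +
        "    int gid = get_global_id(0);" +
        "    c[gid] = a[gid] + b[gid];" +
        "}";

    public static void main(String[] args) {
        final int n = 1024;
        float[] a = new float[n], b = new float[n], c = new float[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

        CL.setExceptionsEnabled(true);

        // Pick the first platform and its first GPU device
        cl_platform_id[] platforms = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);
        cl_device_id[] devices = new cl_device_id[1];
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 1, devices, null);

        // Create context and command queue
        cl_context context = clCreateContext(null, 1, devices, null, null, null);
        cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, null);

        // Create memory buffers and fill them with data
        cl_mem memA = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                Sizeof.cl_float * n, Pointer.to(a), null);
        cl_mem memB = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                Sizeof.cl_float * n, Pointer.to(b), null);
        cl_mem memC = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                Sizeof.cl_float * n, null, null);

        // Create the program from source and compile it
        cl_program program = clCreateProgramWithSource(context, 1,
                new String[]{ KERNEL_SOURCE }, null, null);
        clBuildProgram(program, 0, null, null, null, null);

        // Create the kernel and supply its arguments
        cl_kernel kernel = clCreateKernel(program, "add", null);
        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(memA));
        clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(memB));
        clSetKernelArg(kernel, 2, Sizeof.cl_mem, Pointer.to(memC));

        // Define the ND range and execute
        long[] globalWorkSize = { n };
        clEnqueueNDRangeKernel(queue, kernel, 1, null, globalWorkSize, null, 0, null, null);

        // Return the resulting data
        clEnqueueReadBuffer(queue, memC, CL_TRUE, 0, Sizeof.cl_float * n,
                Pointer.to(c), 0, null, null);

        // Release resources
        clReleaseMemObject(memA); clReleaseMemObject(memB); clReleaseMemObject(memC);
        clReleaseKernel(kernel); clReleaseProgram(program);
        clReleaseCommandQueue(queue); clReleaseContext(context);
    }
}
```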
  76. 76. So what can we actually transfer? 77
  77. 77. The data is not quite the same.. 78
  78. 78. Datatypes: scalars 79
  79. 79. Datatypes: vectors 80
  80. 80. Datatypes: vectors float f = 4.0f; float3 f3 = (float3)(1.0f, 2.0f, 3.0f); float4 f4 = (float4)(f3, f); //f4.x = 1.0f, //f4.y = 2.0f, //f4.z = 3.0f, //f4.w = 4.0f 81
  81. 81. So how are they saved there? 82
  82. 82. So how are they saved there? In a hard way.. 83
  83. 83. Memory Model • __global • __constant • __local • __private 84
  84. 84. Memory Model 85
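To make the qualifiers concrete, here is a hypothetical reduction kernel (written as a Java string, in the spirit of the JOCL sketch above) that touches all four address spaces, plus the host-side detail that is easy to miss: a __local buffer is only sized by the host, never filled by it. The kernel, argument index and work-group size are assumptions for the example.

```java
// Hypothetical kernel: sums each work-group's inputs using local scratch memory
String kernelSrc =
    "__kernel void sumGroups(__global  const float *in,    // large, device DRAM\n" +
    "                        __constant      float *coeff, // small, read-only, cached\n" +
    "                        __local         float *scratch, // shared per work-group\n" +
    "                        __global        float *out) {\n" +
    "    int lid = get_local_id(0);            // lid itself is __private by default\n" +
    "    scratch[lid] = in[get_global_id(0)] * coeff[0];\n" +
    "    barrier(CLK_LOCAL_MEM_FENCE);\n" +
    "    if (lid == 0) {\n" +
    "        float acc = 0.0f;\n" +
    "        for (int i = 0; i < get_local_size(0); i++) acc += scratch[i];\n" +
    "        out[get_group_id(0)] = acc;\n" +
    "    }\n" +
    "}";

// Host side (reusing the JOCL names from the lifecycle sketch): for a __local
// argument you pass only a size and a null pointer, the device allocates it.
int groupSize = 256;  // assumed work-group size
clSetKernelArg(kernel, 2, Sizeof.cl_float * groupSize, null);
```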
  85. 85. But that’s not all 86
  86. 86. Remember SIMD? 87
  87. 87. Execution model • We’ve got a lot of data • We need to perform the same computations over them • So we can just shard them • OpenCL is here to help us 88
  88. 88. Execution model 89
  89. 89. ND Range – what is that? 90
  90. 90. For example: matrix multiplication • We would write it like this: void MatrixMul_sequential(int dim, float *A, float *B, float *C) { for(int iRow=0; iRow<dim;++iRow) { for(int iCol=0; iCol<dim;++iCol) { float result = 0.f; for(int i=0; i<dim;++i) { result += A[iRow*dim + i]*B[i*dim + iCol]; } C[iRow*dim + iCol] = result; } } } 91
  91. 91. For example: matrix multiplication 92
  92. 92. For example: matrix multiplication • So on GPU: __kernel void MatrixMul_kernel_basic(int dim, __global float *A, __global float *B, __global float *C) { //Get the index of the work-item int iCol = get_global_id(0); int iRow = get_global_id(1); float result = 0.0f; for(int i=0;i< dim;++i) { result += A[iRow*dim + i]*B[i*dim + iCol]; } C[iRow*dim + iCol] = result; } 93
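Launching this kernel from Java is mostly a matter of choosing the ND range. A fragment reusing the JOCL names from the lifecycle sketch above; the matrix size and the 16×16 work-group shape are assumptions, and the local size must divide `dim` evenly here.

```java
// One work-item per output element C[iRow][iCol]: a 2-dimensional ND range.
int dim = 1024;  // assumed matrix dimension
clSetKernelArg(kernel, 0, Sizeof.cl_int, Pointer.to(new int[]{ dim }));
// ...arguments 1..3 are the buffers for A, B and C, set as in the lifecycle sketch

long[] globalWorkSize = { dim, dim };  // get_global_id(0) -> iCol, get_global_id(1) -> iRow
long[] localWorkSize  = { 16, 16 };    // assumed work-group shape
clEnqueueNDRangeKernel(queue, kernel, 2, null, globalWorkSize, localWorkSize, 0, null, null);
```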
  94. 94. Typical GPU --- Info for device GeForce GT 650M: --- CL_DEVICE_NAME: GeForce GT 650M CL_DEVICE_VENDOR: NVIDIA CL_DRIVER_VERSION: 10.14.20 355.10.05.15f03 CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU CL_DEVICE_MAX_COMPUTE_UNITS: 2 CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3 CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64 CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024 CL_DEVICE_MAX_CLOCK_FREQUENCY: 900 MHz CL_DEVICE_ADDRESS_BITS: 64 CL_DEVICE_MAX_MEM_ALLOC_SIZE: 256 MByte CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte CL_DEVICE_ERROR_CORRECTION_SUPPORT: no CL_DEVICE_LOCAL_MEM_TYPE: local CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE CL_DEVICE_IMAGE_SUPPORT: 1 CL_DEVICE_MAX_READ_IMAGE_ARGS: 256 CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 16 CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT CL_DEVICE_2D_MAX_WIDTH 16384 CL_DEVICE_2D_MAX_HEIGHT 16384 CL_DEVICE_3D_MAX_WIDTH 2048 CL_DEVICE_3D_MAX_HEIGHT 2048 CL_DEVICE_3D_MAX_DEPTH 2048 CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1 95
  95. 95. Typical CPU --- Info for device Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz: --- CL_DEVICE_NAME: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CL_DEVICE_VENDOR: Intel CL_DRIVER_VERSION: 1.1 CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU CL_DEVICE_MAX_COMPUTE_UNITS: 8 CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3 CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1 / 1 CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024 CL_DEVICE_MAX_CLOCK_FREQUENCY: 2600 MHz CL_DEVICE_ADDRESS_BITS: 64 CL_DEVICE_MAX_MEM_ALLOC_SIZE: 2048 MByte CL_DEVICE_GLOBAL_MEM_SIZE: 8192 MByte CL_DEVICE_ERROR_CORRECTION_SUPPORT: no CL_DEVICE_LOCAL_MEM_TYPE: global CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE CL_DEVICE_IMAGE_SUPPORT: 1 CL_DEVICE_MAX_READ_IMAGE_ARGS: 128 CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8 CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT CL_DEVICE_2D_MAX_WIDTH 8192 CL_DEVICE_2D_MAX_HEIGHT 8192 CL_DEVICE_3D_MAX_WIDTH 2048 CL_DEVICE_3D_MAX_HEIGHT 2048 CL_DEVICE_3D_MAX_DEPTH 2048 CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2 96
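These limits can also be queried from Java at runtime, before deciding on work-group sizes and buffer layouts. A small JOCL fragment, reusing the `devices[0]` handle from the lifecycle sketch and assuming a 64-bit size_t:

```java
cl_device_id device = devices[0];

int[] computeUnits = new int[1];
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, Sizeof.cl_uint,
        Pointer.to(computeUnits), null);

long[] maxWorkGroup = new long[1];
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, Sizeof.size_t,
        Pointer.to(maxWorkGroup), null);

long[] globalMem = new long[1];
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, Sizeof.cl_ulong,
        Pointer.to(globalMem), null);

System.out.printf("CUs: %d, max work-group: %d, global mem: %d MByte%n",
        computeUnits[0], maxWorkGroup[0], globalMem[0] / (1024 * 1024));
```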
  96. 96. And what about CUDA? 97
  97. 97. And what about CUDA? Well.. It looks to be easier 98
  98. 98. And what about CUDA? Well.. It looks to be easier for C developers… 99
  99. 99. CUDA kernel #define N 10 __global__ void add( int *a, int *b, int *c ) { int tid = blockIdx.x; // this thread handles the data at its thread id if (tid < N) c[tid] = a[tid] + b[tid]; } 100
  100. 100. CUDA setup int a[N], b[N], c[N]; int *dev_a, *dev_b, *dev_c; // allocate the memory on the GPU cudaMalloc( (void**)&dev_a, N * sizeof(int) ); cudaMalloc( (void**)&dev_b, N * sizeof(int) ); cudaMalloc( (void**)&dev_c, N * sizeof(int) ); // fill the arrays 'a' and 'b' on the CPU for (int i=0; i<N; i++) { a[i] = -i; b[i] = i * i; } 101
  101. 101. CUDA copy to memory and run // copy the arrays 'a' and 'b' to the GPU cudaMemcpy(dev_a, a, N *sizeof(int), cudaMemcpyHostToDevice); cudaMemcpy(dev_b,b,N*sizeof(int), cudaMemcpyHostToDevice); add<<<N,1>>>(dev_a,dev_b,dev_c); // copy the array 'c' back from the GPU to the CPU cudaMemcpy(c,dev_c,N*sizeof(int), cudaMemcpyDeviceToHost); 102
  102. 102. CUDA get results // display the results for (int i=0; i<N; i++) { printf( "%d + %d = %d\n", a[i], b[i], c[i] ); } // free the memory allocated on the GPU cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c ); 103
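From Java, the same vector addition goes through JCuda's driver API: there is no `<<<N,1>>>` syntax, so the kernel from the slide is compiled to PTX ahead of time (for example with `nvcc -ptx add.cu`) and launched explicitly. A sketch under those assumptions; the file name, class name and grid configuration are illustrative.

```java
import static jcuda.driver.JCudaDriver.*;
import jcuda.*;
import jcuda.driver.*;

public class JCudaVectorAdd {
    public static void main(String[] args) {
        final int n = 10;
        int[] a = new int[n], b = new int[n], c = new int[n];
        for (int i = 0; i < n; i++) { a[i] = -i; b[i] = i * i; }

        // Initialize the driver API, grab device 0 and create a context
        setExceptionsEnabled(true);
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load the precompiled kernel and look up the "add" function
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "add.ptx");
        CUfunction add = new CUfunction();
        cuModuleGetFunction(add, module, "add");

        // Allocate device memory and copy the inputs over
        CUdeviceptr devA = new CUdeviceptr(), devB = new CUdeviceptr(), devC = new CUdeviceptr();
        cuMemAlloc(devA, n * Sizeof.INT);
        cuMemAlloc(devB, n * Sizeof.INT);
        cuMemAlloc(devC, n * Sizeof.INT);
        cuMemcpyHtoD(devA, Pointer.to(a), n * Sizeof.INT);
        cuMemcpyHtoD(devB, Pointer.to(b), n * Sizeof.INT);

        // Kernel parameters: a pointer to an array of pointers, driver-API style
        Pointer kernelParams = Pointer.to(Pointer.to(devA), Pointer.to(devB), Pointer.to(devC));

        // <<<N,1>>> from the slide becomes an explicit grid/block configuration
        cuLaunchKernel(add, n, 1, 1,   // grid dimensions
                            1, 1, 1,   // block dimensions
                            0, null, kernelParams, null);
        cuCtxSynchronize();

        // Copy the result back and clean up
        cuMemcpyDtoH(Pointer.to(c), devC, n * Sizeof.INT);
        for (int i = 0; i < n; i++) System.out.printf("%d + %d = %d%n", a[i], b[i], c[i]);
        cuMemFree(devA); cuMemFree(devB); cuMemFree(devC);
        cuCtxDestroy(context);
    }
}
```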
  103. 103. But CUDA has some other superpowers • JCublas – all about matrices • JCufft – Fast Fourier Transform • JCurand – all about random numbers • JCusparse – sparse matrices • JCusolver – factorization and some other crazy stuff • JNvgraph – all about graphs • JCudpp – CUDA Data Parallel Primitives Library, and some sorting • JNpp – image processing on GPU • JCudnn – Deep Neural Network library (that’s scary) 104
  104. 104. For example we need a good rand int n = 100; curandGenerator generator = new curandGenerator(); float hostData[] = new float[n]; Pointer deviceData = new Pointer(); cudaMalloc(deviceData, n * Sizeof.FLOAT); curandCreateGenerator(generator, CURAND_RNG_PSEUDO_DEFAULT); curandSetPseudoRandomGeneratorSeed(generator, 1234); curandGenerateUniform(generator, deviceData, n); cudaMemcpy(Pointer.to(hostData), deviceData, n * Sizeof.FLOAT, cudaMemcpyDeviceToHost); System.out.println(Arrays.toString(hostData)); curandDestroyGenerator(generator); cudaFree(deviceData); 105
  105. 105. For example we need a good rand • With a strong theory underneath • Developed by the Russian mathematician Ilya M. Sobol' back in 1967 • https://en.wikipedia.org/wiki/Sobol_sequence 106
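JCurand exposes that Sobol generator directly. A small variation of the previous slide's snippet, switching the generator type to a quasi-random Sobol sequence; the single dimension is an assumption for the example, and `deviceData`/`n` are the same as in the snippet above.

```java
// Quasi-random Sobol sequence instead of the default pseudo-random generator
curandGenerator generator = new curandGenerator();
curandCreateGenerator(generator, curandRngType.CURAND_RNG_QUASI_SOBOL32);
curandSetQuasiRandomGeneratorDimensions(generator, 1);   // assumed: 1-dimensional sequence
curandGenerateUniform(generator, deviceData, n);
curandDestroyGenerator(generator);
```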
  106. 106. nVidia memory looks like this 107
  107. 107. Btw.. Talking about memory 108 ©Wikipedia
  108. 108. Optimizations… __kernel void MatrixMul_kernel_basic(int dim, __global float *A, __global float *B, __global float *C){ int iCol = get_global_id(0); int iRow = get_global_id(1); float result = 0.0; for(int i=0;i< dim;++i) { result += A[iRow*dim + i]*B[i*dim + iCol]; } C[iRow*dim + iCol] = result; } 109
  109. 109. <—Optimizations #define VECTOR_SIZE 4 __kernel void MatrixMul_kernel_basic_vector4(int dim, __global float4 *A, __global float4 *B, __global float *C) { int localIdx = get_global_id(0); int localIdy = get_global_id(1); float4 Bvector[4]; float4 Avector, temp; float4 resultVector[4] = {0,0,0,0}; int rowElements = dim/VECTOR_SIZE; for(int i=0; i<rowElements; ++i){ Avector = A[localIdy*rowElements + i]; Bvector[0] = B[dim*i + localIdx]; Bvector[1] = B[dim*i + rowElements + localIdx]; Bvector[2] = B[dim*i + 2*rowElements + localIdx]; Bvector[3] = B[dim*i + 3*rowElements + localIdx]; temp = (float4)(Bvector[0].x, Bvector[1].x, Bvector[2].x, Bvector[3].x); resultVector[0] += Avector * temp; temp = (float4)(Bvector[0].y, Bvector[1].y, Bvector[2].y, Bvector[3].y); resultVector[1] += Avector * temp; temp = (float4)(Bvector[0].z, Bvector[1].z, Bvector[2].z, Bvector[3].z); resultVector[2] += Avector * temp; temp = (float4)(Bvector[0].w, Bvector[1].w, Bvector[2].w, Bvector[3].w); resultVector[3] += Avector * temp; } C[localIdy*dim + localIdx*VECTOR_SIZE] = resultVector[0].x + resultVector[0].y + resultVector[0].z + resultVector[0].w; C[localIdy*dim + localIdx*VECTOR_SIZE + 1] = resultVector[1].x + resultVector[1].y + resultVector[1].z + resultVector[1].w; C[localIdy*dim + localIdx*VECTOR_SIZE + 2] = resultVector[2].x + resultVector[2].y + resultVector[2].z + resultVector[2].w; C[localIdy*dim + localIdx*VECTOR_SIZE + 3] = resultVector[3].x + resultVector[3].y + resultVector[3].z + resultVector[3].w; } 110
  111. 111. But we don’t want to have C at all… 112
  112. 112. We don’t want to think about those hosts and devices… 113
  113. 113. We can use GPU partially.. 114
  114. 114. Project Sumatra • Research project 115
  115. 115. Project Sumatra • Research project • Focused on Java 8 116
  116. 116. Project Sumatra • Research project • Focused on Java 8 • … to be more precise, on streams 117
  117. 117. Project Sumatra • Research project • Focused on Java 8 • … to be more precise, on streams • … and even more precisely, on lambdas and .forEach() 118
  118. 118. AMD HSAIL 119
  119. 119. AMD HSAIL 120
  120. 120. AMD HSAIL • Detects the forEach() block • Generates HSAIL code with Graal • At a low level, supplies the kernel generated from the lambda to the GPU 121
  121. 121. AMD APU tries to solve the main issue.. 122 ©Wikipedia
  122. 122. But if we want some more general solution.. 123
  123. 123. IBM patched JVM for GPU • Focused on CUDA (for now) • Focused on Stream API • Created their own .parallel() 124
  124. 124. IBM patched JVM for GPU Imagine: void fooJava(float[] a, float[] b, int n) { // similar to: for (int i = 0; i < n; i++) IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0f; }); } 125
  125. 125. IBM patched JVM for GPU Imagine: void fooJava(float[] a, float[] b, int n) { // similar to: for (int i = 0; i < n; i++) IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0f; }); } … we would like the lambda to be automatically converted to GPU code… 126
  126. 126. IBM patched JVM for GPU When n is big, the lambda code is executed on the GPU: class Par { void foo(float[] a, float[] b, float[] c, int n) { IntStream.range(0, n).parallel() .forEach(i -> { b[i] = a[i] * 2.0f; c[i] = a[i] * 3.0f; }); } } *only lambdas over primitive types in one-dimensional arrays. 127
  127. 127. IBM patched JVM for GPU Optimized IBM JIT compiler: • Uses the read-only cache • Fewer writes to global GPU memory • Optimized host-to-device data copy rate • Less data to be copied • Eliminates exceptions as much as possible in the GPU kernel 128
  128. 128. IBM patched JVM for GPU • Success story: 129
  129. 129. IBM patched JVM for GPU • Officially: 130
  130. 130. IBM patched JVM for GPU • More info: https://github.com/IBMSparkGPU/GPUEnabler 131
  131. 131. But can we just write Java, and have it converted to OpenCL/CUDA for us? 132
  132. 132. Yes, you can! 133
  133. 133. Aparapi is there for you! 134
  134. 134. Aparapi • Short for «A PARallel API» 135
  135. 135. Aparapi • Short for «A PARallel API» • Works like Hibernate for databases 136
  136. 136. Aparapi • Short for «A PARallel API» • Works like Hibernate for databases • Dynamically converts JVM Bytecode to code for Host and Device 137
  137. 137. Aparapi • Short for «A PARallel API» • Works like Hibernate for databases • Dynamically converts JVM Bytecode to code for Host and Device • OpenCL under the cover 138
  138. 138. Aparapi • Started by AMD 139
  139. 139. Aparapi • Started by AMD • Then abandoned… 140
  140. 140. Aparapi • Started by AMD • Then abandoned… • Five years later, open-sourced under the Apache 2.0 license 141
  141. 141. Aparapi • Started by AMD • Then abandoned… • Five years later, open-sourced under the Apache 2.0 license • Back to life!!! 142
  142. 142. Aparapi – now it’s so much simpler! public static void main(String[] _args) { final int size = 512; final float[] a = new float[size]; final float[] b = new float[size]; for (int i = 0; i < size; i++) { a[i] = (float) (Math.random() * 100); b[i] = (float) (Math.random() * 100); } final float[] sum = new float[size]; Kernel kernel = new Kernel(){ @Override public void run() { int gid = getGlobalId(); sum[gid] = a[gid] + b[gid]; } }; kernel.execute(Range.create(size)); for (int i = 0; i < size; i++) { System.out.printf("%6.2f + %6.2f = %8.2f\n", a[i], b[i], sum[i]); } kernel.dispose(); } 143
  143. 143. But what about the clouds? 144
  144. 144. We can’t sell our product if it’s not cloud native! 145
  145. 145. nVidia is your friend! 146
  146. 146. nVidia GRID • Announced in 2012 • Already in production • Works on most of the hypervisors • … and in the clouds! 147
  147. 147. nVidia GRID 148
  148. 148. nVidia GRID 149
  149. 149. … AMD is a bit behind… 150
  150. 150. Anyway, it’s here! 151
  151. 151. It’s here: Nvidia GPU 152
  152. 152. It’s here: ATI Radeon 153
  153. 153. It’s here: AMD APU 154
  154. 154. It’s here: Intel Skylake 155
  155. 155. It’s here: Nvidia Tegra Parker 156
  156. 156. Intel with VEGA?? 157
  157. 157. But first read: 158
  158. 158. So use it! 159
  159. 159. So use it! If the task is suitable 160
  160. 160. …it’s hard, but worth it! 161
  161. 161. You will rule ’em all! 162
  162. 162. Thanks! Dank je! Merci beaucoup! 163
  163. 163. 164
