“The norm for data analytics is now to run them on commodity clusters with MapReduce-like abstractions. One only needs to read the popular blogs to see the evidence of this. We believe that we could now say that ‘nobody ever got fired for using Hadoop on a cluster’!”
Breaking News
IBM Keynote at JavaOne 2013: “Java Flies in Blue Skies and Open Clouds”
Java and GPUs open up a world of new opportunities for GPU accelerators and Java programmers alike.
Breaking News
Duimovich showed an example of GPU-accelerated sorting using standard NVIDIA CUDA libraries that are already available. The speedups are phenomenal, ranging from 2x to 48x.
Breaking News?
Breaking Hadoop
10,000x faster
Hadoop vs GPU
Hadoop & GPU
Hadoop + GPU
HPC
Big Data
GPGPU in Java
Heterogeneous systems
Horizontal and vertical scalability
Hadoop horizontal scalability
[Diagram: file01, file02, and file03 are split into ten HDFS blocks (01–10) and spread across Node 1–Node 3, which end up holding 3, 4, and 3 blocks respectively. Adding Node 4–Node 6 redistributes the same ten blocks so that each node holds only one or two (2, 2, 1, 1, 2, 2), spreading the work over twice as many machines.]
Use GPU to scale vertically
[Diagram: the same six-node cluster; instead of adding more nodes, a GPU is added to a node, cutting its processing time (e.g., from 2 time units down to 0.5) while the block layout stays the same (worked through in the sketch that follows).]
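To make the horizontal-versus-vertical trade-off concrete, here is a minimal Java sketch of the arithmetic behind these two diagrams. The per-node numbers are read straight from the figures above and are illustrative only; the class and method names (`ScalingSketch`, `makespan`) are ours.

```java
import java.util.Arrays;

/** Toy model: a MapReduce job is roughly as slow as its busiest node. */
public class ScalingSketch {

    /** Makespan = the largest per-node processing time. */
    static double makespan(double[] perNodeTime) {
        return Arrays.stream(perNodeTime).max().orElse(0);
    }

    public static void main(String[] args) {
        // Horizontal scaling: 10 blocks on 3 nodes vs. the same 10 blocks on 6 nodes
        // (block counts from the diagram above, 1 time unit per block).
        double[] threeNodes = {3, 4, 3};
        double[] sixNodes   = {2, 2, 1, 1, 2, 2};
        System.out.println("3 nodes: makespan = " + makespan(threeNodes)); // 4.0
        System.out.println("6 nodes: makespan = " + makespan(sixNodes));   // 2.0

        // Vertical scaling: keep 6 nodes, add GPUs to some of them;
        // per-node times as shown in the diagram above.
        double[] sixNodesWithGpus = {0.5, 1, 1, 0.5, 1, 1};
        System.out.println("6 nodes + GPUs: makespan = " + makespan(sixNodesWithGpus)); // 1.0
    }
}
```

Doubling the node count halves the makespan (4 to 2 time units); adding GPUs to the existing nodes halves it again (2 to 1) without buying more machines.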
Profit estimation
“Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU” by Intel
“OpenCL: the advantages of heterogeneous approach” by Intel
NVIDIA GTX 280 vs. Intel Core i7-960
How to use OpenCL with Hadoop?
Hadoop streaming: any executable that reads records from stdin and writes key/value pairs to stdout can act as a mapper or reducer, so a native OpenCL binary can be plugged straight into a job.
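As a rough illustration of the streaming route, the sketch below spells out the stdin/stdout contract a streaming mapper must follow; the class name and the placeholder per-record computation are ours, standing in for whatever OpenCL-accelerated step the job actually needs.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/**
 * Skeleton of a Hadoop-streaming mapper. Hadoop streaming feeds input records
 * to the mapper process on stdin and expects "key<TAB>value" lines on stdout,
 * so any executable that follows this contract can be a mapper, including a
 * native OpenCL program.
 */
public class StreamingMapperSkeleton {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            // Placeholder computation: in a real job this is where the record
            // (or a batch of records) would be handed to OpenCL-accelerated
            // code, e.g. through JOCL/JNI or a native helper process.
            int value = line.length();
            System.out.println(line.hashCode() + "\t" + value);
        }
    }
}
```

With that contract in place, the executable passed as the `-mapper` to the hadoop-streaming JAR can be this class or, just as well, the native OpenCL binary itself.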
Aparapi
Expands Java's “Write Once, Run Anywhere” to include APU and GPU devices by expressing data-parallel algorithms as extensions of the Kernel base class.
Execution flow for MyKernel.class: does the platform support OpenCL, and can the kernel's bytecode be converted to OpenCL? If so, convert it and execute the OpenCL kernel on the device; otherwise, execute it using a Java thread pool.
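As a concrete illustration of that flow, here is a minimal Aparapi kernel that squares an array of floats. It assumes the classic `com.amd.aparapi` package of that era (newer forks use `com.aparapi`); the class and variable names are ours.

```java
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

/**
 * Minimal Aparapi kernel: squares every element of a float array.
 * If the platform has OpenCL and the run() bytecode can be translated,
 * Aparapi executes it on the GPU/APU; otherwise it falls back to a
 * Java thread pool, exactly as in the flow described above.
 */
public class SquareKernelDemo {
    public static void main(String[] args) {
        final int size = 1_000_000;
        final float[] input = new float[size];
        final float[] output = new float[size];
        for (int i = 0; i < size; i++) {
            input[i] = i * 0.5f;
        }

        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                int gid = getGlobalId();              // index of this work-item
                output[gid] = input[gid] * input[gid];
            }
        };

        kernel.execute(Range.create(size));           // one work-item per element
        System.out.println("mode = " + kernel.getExecutionMode());
        kernel.dispose();
    }
}
```

Aparapi makes the decision at execute() time; getExecutionMode() reports whether the kernel actually ran on the GPU or fell back to the Java thread pool (JTP).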
Aparapi
Further directions: lambda (expressing kernels with Java 8 lambdas) and HSA (Heterogeneous System Architecture).
Aparapi
Characteristics of an ideal data-parallel workload:
Code that iterates over large arrays of primitives
- 32/64-bit data types preferred
- the order of iterations is not critical; avoid data dependencies between iterations
- each iteration contains sequential code (few branches)
A balance between data size (low) and compute (high)
- data transfer to/from the GPU can be costly (see the sketch after this list)
- trivial compute is not worth the transfer cost
- may still benefit by freeing up the CPU for other work(?)
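The transfer-cost point is worth a sketch. Assuming the same classic Aparapi API as before, the example below keeps an array resident on the device across many kernel passes via explicit buffer management (setExplicit, put, get), so the host-to-device copy is paid once rather than per pass; everything other than the Aparapi calls is our own naming.

```java
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

/**
 * Sketch of keeping data on the GPU across passes to amortize transfer cost.
 * With explicit buffer management, Aparapi only copies arrays when told to,
 * so a multi-pass computation pays for the host/device transfer once.
 */
public class ExplicitBufferDemo {
    public static void main(String[] args) {
        final int size = 4 * 1024 * 1024;
        final float[] data = new float[size];
        for (int i = 0; i < size; i++) {
            data[i] = i;
        }

        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                int gid = getGlobalId();
                // A few flops per element, no branches, no cross-iteration
                // dependencies: the shape of an "ideal" data-parallel body.
                data[gid] = data[gid] * 0.5f + 1.0f;
            }
        };

        kernel.setExplicit(true);          // we manage transfers ourselves
        kernel.put(data);                  // host -> device, once
        Range range = Range.create(size);
        for (int pass = 0; pass < 100; pass++) {
            kernel.execute(range);         // 100 passes, data stays on the device
        }
        kernel.get(data);                  // device -> host, once
        kernel.dispose();

        System.out.println("data[1] after 100 passes = " + data[1]);
    }
}
```

Without setExplicit(true), Aparapi conservatively copies the captured arrays to and from the device on every execute() call.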
HadoopCL (Rice University, AMD): integrates OpenCL with Hadoop MapReduce, using Aparapi to run map and reduce functions on heterogeneous devices.
HadoopCL
2 × six-core Intel Xeon X5660 (48 GB memory)
2 × NVIDIA Tesla M2050 (2 × 2.5 GB memory)
AMD A10-5800K APU (16 GB memory)
WHY?
Back to OpenCL, Aparapi, and heterogeneous computing
OpenCL, Aparapi, and heterogeneous computing
[Diagram: relative throughput along the data path: GPU cache, GPU GDDR5, CPU cache, SATA 3.0 (HDD), SATA 2.0 (SSD), 1 Gbit network.]
Formula in terms of time:
(CPU calc1) + disk read + disk write
>
(CPU calc2 + GPU calc + GPU write + GPU read) + disk read + disk write
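As a back-of-the-envelope check of that inequality, the sketch below plugs in made-up timings (assumptions for illustration, not measurements) and reports whether offloading to the GPU would pay off; all names are ours.

```java
/**
 * Evaluates the offload criterion from the slide:
 *   cpuCalc1 + diskRead + diskWrite > (cpuCalc2 + gpuCalc + gpuWrite + gpuRead) + diskRead + diskWrite
 * i.e. offloading wins only when the CPU work it replaces exceeds the remaining
 * CPU work plus the GPU compute and both host/device transfers.
 */
public class OffloadCriterion {

    static boolean gpuWorthIt(double cpuCalc1, double cpuCalc2, double gpuCalc,
                              double gpuWrite, double gpuRead) {
        // Disk read/write appear on both sides of the inequality and cancel out.
        return cpuCalc1 > cpuCalc2 + gpuCalc + gpuWrite + gpuRead;
    }

    public static void main(String[] args) {
        // Illustrative numbers only (seconds): heavy CPU compute, modest transfers.
        double cpuCalc1 = 120.0;  // CPU-only version of the computation
        double cpuCalc2 = 10.0;   // CPU work that remains after offloading
        double gpuCalc  = 8.0;    // GPU kernel time
        double gpuWrite = 4.0;    // host -> device transfer
        double gpuRead  = 4.0;    // device -> host transfer

        System.out.println("Offload to GPU pays off: "
                + gpuWorthIt(cpuCalc1, cpuCalc2, gpuCalc, gpuWrite, gpuRead)); // true
    }
}
```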
OpenCL future
http://streamcomputing.eu/
Questions?
Big Data Experts FB group
