GPU Programming with Java
@garysieling
IQVIA
https://www.findlectures.com
Goals
• GPU landscape: use cases, devices, Java libraries
• Example use case - concept search
Device Types
• CPUs
• GPUs
• ASIC – single purpose
• FPGAs – like an ASIC, but configurable / freezable
Major use cases
• Drawing triangles in videogames
• Video encoding (e.g. ffmpeg)
• Cryptocurrency
• Deep learning
• Speech recognition
Neural Networks
Image Tagging Network (GoogLeNet)
https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf
“For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time…”
“The network is 22 layers deep when counting only layers with parameters … The overall number of layers used for the construction of the network is about 100.”
“Although we used a CPU based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week, the main limitation being the memory usage.”
Pricing
Devices you can rent - AWS
Instance Size | GPUs (Tesla V100) | GPU Peer to Peer | GPU Memory (GB) | vCPUs | Memory (GB) | Network Bandwidth | EBS Bandwidth | Price/hr* | Price/mo*
p3.2xlarge    | 1                 | N/A              | 16              | 8     | 61          | Up to 10 Gbps     | 1.5 Gbps      | $3.06     | $2,233.8
p3.8xlarge    | 4                 | NVLink           | 64              | 32    | 244         | 10 Gbps           | 7 Gbps        | $12.24    | $8,935.2
p3.16xlarge   | 8                 | NVLink           | 128             | 64    | 488         | 25 Gbps           | 14 Gbps       | $24.48    | $17,870.4
Motivation: http://www.nvidia.com/content/events/geoInt2015/LBrown_DL.pdf
More examples
• “Train until network converges” (maybe never)
• ~$50k of cloud compute time to train a text summarization model
• 3-6 days for AlexNet
• 7-10 days for SqueezeNet
• Deeper convolutional neural networks such as GoogLeNet can take up to 7-14 days
Source: “Deep Learning for Computer Vision”
Note: estimates are for a single GPU (model unclear)
Scrypt
hashcat -m 8900 -b --force -D1
• This macbook:
• Device #1: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz, skipped.*
• Device #2: Iris Pro, 384/1536 MB allocatable, 40MCU*
• Device #3: GeForce GT 750M, 512/2048 MB allocatable,
• Speed.Dev.#1: 20,295 H/s (4.86ms)
• Speed.Dev.#2: 19,078 H/s (374.43ms)
• Speed.Dev.#3: 10,302 H/s (57.01 ms)
• p3.2xlarge (Tesla V100)
• 1,167,600 H/s (16.91ms)
CUDA
• Specialized instruction set in video cards / GPUs
• Requires NVIDIA SDK and a recent card ($100-$xx,xxx)
• Or, AWS Deep Learning AMI
• SIMD
Specialized libraries
• OpenCL (C Library)
• cuBLAS (Matrix Algebra)
• cuRAND (Random number generation)
• cuFFT (Fourier Transform)
• nvGRAPH (Graph Analytics)
• Thrust (Collections Library)
• GRE (NVIDIA GPU REST Engine)
OpenCL: Example C code
__kernel void matvec(__global const float *A, __global const float *x,
uint ncols, __global float *y)
{
size_t i = get_global_id(0);
__global float const *a = &A[i*ncols];
float sum = 0.f;
for (size_t j = 0; j < ncols; j++) {
sum += a[j] * x[j];
}
y[i] = sum;
}
Source: https://en.wikipedia.org/wiki/OpenCL
CPU side
vector<float> h_A(SIZE);
vector<float> h_B(SIZE);
vector<float> h_C(SIZE); // Initialize matrices on the host
for (int i=0; i<N; i++){
for (int j=0; j<N; j++){
h_A[i*N+j] = sin(i); h_B[i*N+j] = cos(j);
}
}
Source: https://www.quantstart.com/articles/Matrix-Matrix-
Multiplication-on-the-GPU-with-Nvidia-CUDA
CPU side
// Allocate memory on the device
dev_array<float> d_A(SIZE);
dev_array<float> d_B(SIZE);
dev_array<float> d_C(SIZE);
d_A.set(&h_A[0], SIZE);
d_B.set(&h_B[0], SIZE);
matrixMultiplication(d_A.getData(), d_B.getData(), d_C.getData(), N);
Source: https://www.quantstart.com/articles/Matrix-Matrix-
Multiplication-on-the-GPU-with-Nvidia-CUDA
Available Libraries
• jOCL (Java bindings for OpenCL, the “Open Computing Language”)
• Aparapi (AMD or CUDA – bytecode translation of Java; see the sketch below)
• jCuda (C API wrapper – has bindings for several CUDA libraries)
• ND4j/ND4s (Deeplearning4j – analogous to numpy)
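To illustrate what “bytecode translation” means in practice, a minimal Aparapi kernel might look like the sketch below (assumptions: the com.aparapi artifact is on the classpath, and the array names are invented for the example). The run() body is translated into an OpenCL kernel and executed on the GPU when one is available, falling back to a Java thread pool otherwise.

import com.aparapi.Kernel;
import com.aparapi.Range;

final int n = 1_000_000;
final float[] a = new float[n];
final float[] b = new float[n];
final float[] sum = new float[n];

Kernel kernel = new Kernel() {
    @Override
    public void run() {
        int i = getGlobalId();     // one work item per array index
        sum[i] = a[i] + b[i];
    }
};
kernel.execute(Range.create(n));   // dispatch n work items
kernel.dispose();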
jCuda
int deviceId = 0;
JCudaDriver.setExceptionsEnabled(true);
cuInit(0); // must be called before any other driver API call
CUdevice device = new CUdevice();
cuDeviceGet(device, deviceId);
long total[] = new long[]{ 0 };
long free[] = new long[]{ 0 };
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);
cuMemGetInfo(free, total);
cuCtxDestroy(context);
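To see the numbers that cuMemGetInfo filled in, you could print them before (or after) tearing down the context — a small addition that is not on the original slide:

System.out.printf("GPU memory: %d MB free of %d MB%n",
    free[0] / (1024 * 1024), total[0] / (1024 * 1024));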
The other side: training Word2Vec
new Word2Vec.Builder()
.minWordFrequency(5)
.iterations(1)
.layerSize(100)
.seed(42)
.windowSize(5)
.iterate(sentenceIterator)
.tokenizerFactory(tokenizer)
.build()
.fit();
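Once fit() has returned, the trained model can be queried. A minimal sketch, assuming the builder result is assigned to a variable named model (as in the later slides); the method names come from the Deeplearning4j Word2Vec API, though exact signatures vary by version:

import java.util.Collection;
import org.nd4j.linalg.api.ndarray.INDArray;

Collection<String> similar = model.wordsNearest("cat", 5);   // nearest terms in the learned space
INDArray catVector = model.getWordVectorMatrix("cat");       // the raw 100-dimensional vector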
INDArray
- The backend implementation (CPU or GPU) is selected by which dependency you include:
libraryDependencies +=
"org.nd4j" % "nd4j-cuda-8.0-platform" % nd4jVersion
libraryDependencies +=
"org.nd4j" % "nd4j-native" % nd4jVersion
How do you tell if your code is running on the GPU?
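One quick sanity check (a sketch; the exact class names printed depend on the ND4j version) is to ask ND4j which backend it loaded, and to watch nvidia-smi while the job runs — GPU utilization should be non-zero:

import org.nd4j.linalg.factory.Nd4j;

// Prints the CUDA backend class when nd4j-cuda-*-platform is on the classpath,
// and the native CPU backend class when nd4j-native is used instead
System.out.println(Nd4j.getBackend().getClass().getName());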
Nd4j: INDArray
- Stored as a flat data array plus dimensions (like numpy)
- Create one from an iterator, a data file, any shaped array, a collection… (see the sketch below)
- Separate CPU/GPU implementations
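For example, a small hand-built array (a sketch using the standard Nd4j factory methods):

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

// flat buffer of 6 floats, interpreted as 2 rows x 3 columns
INDArray m = Nd4j.create(new float[]{1, 2, 3, 4, 5, 6}, new int[]{2, 3});
System.out.println(java.util.Arrays.toString(m.shape()));   // [2, 3]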
Memory Structure
Goal: Compute the weighted average of the “meaning” of terms in a document.
Terms are weighted by frequency in a document and significance (rare words count more).
TF / DF * meaning
Words = [“the”, “cat”, “ran”, “up”, “the”, “hill”]
Term Frequency = [2, 1, 1, 1, 2, 1]
Document Frequency = [1000, 7, 57, 50, 1000, 9]
Meanings[“the”] = [0.1, 0.04, -0.4, ….]
Meanings[“cat”] = [0.9, -0.17, -0.3, ….]
Meanings[“ran”] = [0.5, 0.1, 0.4, ….]
Word2Vec
Concept Search
• Writing, NOT Code
• Excludes “writing css”, “writing php”
• Implies “poetry”, “fiction”, “copyediting”
Concept Search Problems
• Demo
• Crawling
• Search Use Cases
• Machine Learning
• Results “about” the chosen topic
• Determine whether multiple search terms are related (hiking, art)
• De-duplicating documents (e.g. the same announcement from different publications)
• Higher result variety (not 5 results on type systems, etc.)
Like Lucene
• Demo
• Crawling
• Search Use Cases
• Machine Learning
• Tokenize text
• Filter entities
• Rank results, weighting by term frequency
Data Setup
// Real solution uses Lucene
List<String> sentence = Arrays.asList("the cat ran up the hill".split(" "));
List<List<String>> allSentences = new ArrayList<>();
allSentences.add(sentence);
// … (more sentences added the same way)
Set<String> vocabulary = new HashSet<>();
vocabulary.addAll(sentence);
Term Frequency
Map<String, Long> tf =
sentence.stream()
.collect(
Collectors.groupingBy(
Function.identity(),
Collectors.counting())
);
Document Frequency
Map<String, Long> df =
allSentences.stream()
.flatMap(List::stream)
.collect(
Collectors.groupingBy(
Function.identity(),
Collectors.counting())
);
Goal: parallel computation (1 column / core)
   | the  | cat | ran | up | the  | hill
tf | 2    | 1   | 1   | 1  | 2    | 1
df | 1000 | 7   | 57  | 50 | 1000 | 9

tf * idf = 2 / log(1000) …
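As a side calculation for the first column (using Math.log, the natural log, to match the DF initialization code later in the deck):

double weight = 2.0 / Math.log(1000);   // tf("the") / ln(df("the")) ≈ 0.29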
                  | the    | the    | the     | … | cat  | cat   | cat   | …
tf                | 2      | 2      | 2       | … | 1    | 1     | 1     | …
df                | 1000   | 1000   | 1000    | … | 7    | 7     | 7     | …
meaning           | 0.1    | 0.04   | -0.04   | … | 0.9  | -0.17 | -0.3  | …
tf / df * meaning | 0.0002 | 0.0008 | -0.0008 | … | 0.12 | -0.02 | -0.04 | …
Initialize TF blocks
List<INDArray> tf_list =
words.stream().map(
(word) ->
Nd4j.zeros(widthOfWordVector)
.addi(tf.get(word))
).collect(Collectors.toList());
Initialize IDF blocks
List<INDArray> df_list =
words.stream().map(
(word) ->
Nd4j.zeros(widthOfWordVector)
.addi(
Math.log(df.get(word))
)
).collect(Collectors.toList());
Initialize Word2Vec block
List<INDArray> meaning_list =
words.stream().map(
(word) ->
model.getWordVectorMatrix(word)
).collect(Collectors.toList());
Nd4j – Flatten
INDArray data =
Nd4j.vstack(
Nd4j.hstack(tf_list),
Nd4j.hstack(df_list),
Nd4j.hstack(meaning_list)
);
Nd4j – Shape
int[] shape = data.shape();
for (int i = 0; i < shape.length; i++) {
System.out.println(shape[i]);
}
3
1500
Views
INDArray tfVec = data.getRow(0);
INDArray dfVec = data.getRow(1);
INDArray scoresVec = data.getRow(2);
Multiply TF*IDF and Word2Vec data
INDArray weighted = tfVec.div(dfVec).mul(scoresVec);
shape = weighted.shape();
for (int i = 0; i < shape.length; i++) {
System.out.println(shape[i]);
}
1
1500
Pivot
tf / df * meaning (the)    | 0.0002 | 0.0008 | -0.0008 | …
tf / df * meaning (cat)    | 0.12   | -0.02  | -0.04   | …
tf / df * meaning (ran)    | …      | …      | …       |
tf / df * meaning (up)     | …      |        |         |
Average meaning (document) | 0.20   | -0.1   | -0.5    |
Reshape (1 row per word)
INDArray wordVects =
weighted.reshape(
vocabulary.size(),
widthOfWordVector
);
shape = wordVects.shape();
for (int i = 0; i < shape.length; i++) {
System.out.println(shape[i]);
}
5
300
Produce Weighted Average
INDArray documentAverage =
wordVects.sum(0).div(vocabulary.size());
shape = documentAverage.shape();
for (int i = 0; i < shape.length; i++) {
System.out.println(shape[i]);
}
1
300
Result
Average meaning (document 1) | 0.20 | -0.1 | -0.5  | …
Average meaning (document 2) | …    | …    | …     | …
…
Average meaning (query)      | -0.2 | 0.9  | -.002 | …

Cosine of the angle between the document vector and the query vector = “score”
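In symbols, with $d$ the document average vector and $q$ the query average vector:

$$\text{score} = \cos\theta = \frac{d \cdot q}{\lVert d \rVert \, \lVert q \rVert}$$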
Example: Similarity
Number from [0, 1]
Image credit: https://engineering.aweber.com/cosine-similarity/
Distance between “query” and “document”
import org.nd4j.linalg.ops.transforms.Transforms;

double score =
    Transforms.cosineSim(
        documentAverage,
        queryAverage
    );
Other Lessons
- Tuning code requires detailed knowledge of GPU memory access patterns
- Parallelism model is similar to Akka (code is faster without locking)
- Forums report max ~200W power usage
- Inventing your own math does not work
- High-dimensional “objects” do not follow 2D/3D intuition
- Floating point math is not associative (see the sketch below)
- Follow a paper
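A two-line illustration of the associativity point (ordinary Java doubles; float arithmetic on the GPU behaves the same way):

System.out.println((0.1 + 0.2) + 0.3);   // 0.6000000000000001
System.out.println(0.1 + (0.2 + 0.3));   // 0.6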
Resources
• “Deep Learning – A Practitioner’s Approach”
• "Relevant Search"
Contact:
@garysieling
gary@garysieling.com
https://www.findlectures.com
https://www.garysieling.com


Editor's Notes

  • #13 If you're interested in these topics, here are some useful resources.
  • #14 If you're interested in these topics, here are some useful resources.
  • #27 The solution is to use a machine learning algorithm that can identify significant relationships in text. Word2Vec has become a famous algorithm, because it can learn implicit concepts, like gender, verb tenses, or the concept of a capital city. In one of the most well known examples, it identifies that king is to queen as man is to woman. You can add and subtract concepts mathematically, such as king - man + woman = queen, and find distances between concepts, to find things which are similar.
  • #28 If we query for a topic instead, we also get appropriate results.
  • #29 The dataset for this talk comes from our corporate lunch and learn. We look for general interest talks that stand alone and fit a lunch break.
  • #30 The dataset for this talk comes from our corporate lunch and learn. We look for general interest talks that stand alone and fit a lunch break.
  • #50 The solution is to use a machine learning algorithm that can identify significant relationships in text. Word2Vec has become a famous algorithm, because it can learn implicit concepts, like gender, verb tenses, or the concept of a capital city. In one of the most well known examples, it identifies that king is to queen as man is to woman. You can add and subtract concepts mathematically, such as king - man + woman = queen, and find distances between concepts, to find things which are similar.
  • #53 If you're interested in these topics, here are some useful resources.
  • #54 We're Hiring