GPU Programming with Java
@garysieling
IQVIA
https://www.findlectures.com
Goals
• GPU landscape: use cases, devices, Java libraries
• Example use case - concept search
Device Types
• CPUs
• GPUs
• ASIC – single purpose
• FPGAs – like an ASIC, but configurable / freezable
Major use cases
• Drawing triangles in videogames
• Video encoding (e.g. ffmpeg)
• Cryptocurrency
• Deep learning
• Speech recognition
Neural Networks
Image Tagging Network (GoogLeNet)
https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf
“For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time…”
“The network is 22 layers deep when counting only layers with parameters … The overall number of layers used for the construction of the network is about 100.”
“Although we used a CPU based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week, the main limitation being the memory usage.”
Pricing
Devices you can rent - AWS
Instance Size | GPUs (Tesla V100) | GPU Peer to Peer | GPU Memory (GB) | vCPUs | Memory (GB) | Network Bandwidth | EBS Bandwidth | Price/hr* | Price/mo*
p3.2xlarge    | 1                 | N/A              | 16              | 8     | 61          | Up to 10 Gbps     | 1.5 Gbps      | $3.06     | $2,233.8
p3.8xlarge    | 4                 | NVLink           | 64              | 32    | 244         | 10 Gbps           | 7 Gbps        | $12.24    | $8,935.2
p3.16xlarge   | 8                 | NVLink           | 128             | 64    | 488         | 25 Gbps           | 14 Gbps       | $24.48    | $17,870.4
Motivation: http://www.nvidia.com/content/events/geoInt2015/LBrown_DL.pdf
More examples
• “Train until network converges” (maybe never)
• ~$50k of cloud compute time to train a text summarization model
• 3-6 days for AlexNet
• 7-10 days for SqueezeNet
• Deeper convolutional neural networks such as GoogLeNet can take up to 7-14 days
Source: “Deep Learning for Computer Vision”
Note: estimates are for a single GPU (model unclear)
Scrypt
hashcat -m 8900 -b --force -D1
• This macbook:
• Device #1: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz, skipped.*
• Device #2: Iris Pro, 384/1536 MB allocatable, 40MCU*
• Device #3: GeForce GT 750M, 512/2048 MB allocatable,
• Speed.Dev.#1: 20,295 H/s (4.86ms)
• Speed.Dev.#2: 19,078 H/s (374.43ms)
• Speed.Dev.#3: 10,302 H/s (57.01 ms)
• p3.2xlarge (Tesla V100)
• 1,167,600 H/s (16.91ms)
CUDA
• Specialized instruction set in video cards / GPUs
• Requires NVIDIA SDK and a recent card ($100-$xx,xxx)
• Or, AWS Deep Learning AMI
• SIMD
Specialized libraries
• OpenCL (C Library)
• cuBLAS (Matrix Algebra)
• cuRAND (Random number generation)
• cuFFT (Fourier Transform)
• nvGRAPH (Graph Analytics)
• Thrust (Collections Library)
• GRE (NVIDIA GPU REST Engine)
OpenCL: Example C code
__kernel void matvec(__global const float *A, __global const float *x,
uint ncols, __global float *y)
{
size_t i = get_global_id(0);
__global float const *a = &A[i*ncols];
float sum = 0.f;
for (size_t j = 0; j < ncols; j++) {
sum += a[j] * x[j];
}
y[i] = sum;
}
Source: https://en.wikipedia.org/wiki/OpenCL
CPU side
vector<float> h_A(SIZE);
vector<float> h_B(SIZE);
vector<float> h_C(SIZE); // Initialize matrices on the host
for (int i=0; i<N; i++){
for (int j=0; j<N; j++){
h_A[i*N+j] = sin(i); h_B[i*N+j] = cos(j);
}
}
Source: https://www.quantstart.com/articles/Matrix-Matrix-
Multiplication-on-the-GPU-with-Nvidia-CUDA
CPU side
// Allocate memory on the device
dev_array<float> d_A(SIZE);
dev_array<float> d_B(SIZE);
dev_array<float> d_C(SIZE);
d_A.set(&h_A[0], SIZE);
d_B.set(&h_B[0], SIZE);
matrixMultiplication(d_A.getData(), d_B.getData(), d_C.getData(), N);
Source: https://www.quantstart.com/articles/Matrix-Matrix-
Multiplication-on-the-GPU-with-Nvidia-CUDA
Available Libraries
• jOCL (Java bindings for OpenCL, the “Open Computing Language”)
• Aparapi (AMD or CUDA – bytecode translation of Java; see the sketch below)
• jCuda (C API wrapper – has bindings for several CUDA libraries)
• ND4j/ND4s (Deeplearning4j – analogous to numpy)
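To illustrate what “bytecode translation” means in practice, a minimal Aparapi kernel might look like the sketch below (assumptions: the com.aparapi artifact is on the classpath, and the array names are invented for the example). The run() body is translated into an OpenCL kernel and executed on the GPU when one is available, falling back to a Java thread pool otherwise.

import com.aparapi.Kernel;
import com.aparapi.Range;

final int n = 1_000_000;
final float[] a = new float[n];
final float[] b = new float[n];
final float[] sum = new float[n];

Kernel kernel = new Kernel() {
    @Override
    public void run() {
        int i = getGlobalId();     // one work item per array index
        sum[i] = a[i] + b[i];
    }
};
kernel.execute(Range.create(n));   // dispatch n work items
kernel.dispose();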
jCuda
int deviceId = 0;
JCudaDriver.setExceptionsEnabled(true);
cuInit(0); // must be called before any other driver API call
CUdevice device = new CUdevice();
cuDeviceGet(device, deviceId);
long total[] = new long[]{ 0 };
long free[] = new long[]{ 0 };
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);
cuMemGetInfo(free, total);
cuCtxDestroy(context);
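To see the numbers that cuMemGetInfo filled in, you could print them before (or after) tearing down the context — a small addition that is not on the original slide:

System.out.printf("GPU memory: %d MB free of %d MB%n",
    free[0] / (1024 * 1024), total[0] / (1024 * 1024));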
The other side: training Word2Vec
new Word2Vec.Builder()
.minWordFrequency(5)
.iterations(1)
.layerSize(100)
.seed(42)
.windowSize(5)
.iterate(sentenceIterator)
.tokenizerFactory(tokenizer)
.build()
.fit();
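Once fit() has returned, the trained model can be queried. A minimal sketch, assuming the builder result is assigned to a variable named model (as in the later slides); the method names come from the Deeplearning4j Word2Vec API, though exact signatures vary by version:

import java.util.Collection;
import org.nd4j.linalg.api.ndarray.INDArray;

Collection<String> similar = model.wordsNearest("cat", 5);   // nearest terms in the learned space
INDArray catVector = model.getWordVectorMatrix("cat");       // the raw 100-dimensional vector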
INDArray
- The backend implementation (CPU or GPU) is selected by which dependency you include:
libraryDependencies +=
"org.nd4j" % "nd4j-cuda-8.0-platform" % nd4jVersion
libraryDependencies +=
"org.nd4j" % "nd4j-native" % nd4jVersion
How do you tell if your code is running on the GPU?
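One quick sanity check (a sketch; the exact class names printed depend on the ND4j version) is to ask ND4j which backend it loaded, and to watch nvidia-smi while the job runs — GPU utilization should be non-zero:

import org.nd4j.linalg.factory.Nd4j;

// Prints the CUDA backend class when nd4j-cuda-*-platform is on the classpath,
// and the native CPU backend class when nd4j-native is used instead
System.out.println(Nd4j.getBackend().getClass().getName());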
Nd4j: INDArray
- Stored as a flat data array plus dimensions (like numpy)
- Create one from an iterator, a data file, any shaped array, a collection… (see the sketch below)
- Separate CPU/GPU implementations
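For example, a small hand-built array (a sketch using the standard Nd4j factory methods):

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

// flat buffer of 6 floats, interpreted as 2 rows x 3 columns
INDArray m = Nd4j.create(new float[]{1, 2, 3, 4, 5, 6}, new int[]{2, 3});
System.out.println(java.util.Arrays.toString(m.shape()));   // [2, 3]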
Memory Structure
Goal: Compute the weighted average of the “meaning” of terms in a document.
Terms are weighted by frequency in a document and significance (rare words count more).
TF / DF * meaning
Words = [“the”, “cat”, “ran”, “up”, “the”, “hill”]
Term Frequency = [2, 1, 1, 1, 2, 1]
Document Frequency = [1000, 7, 57, 50, 1000, 9]
Meanings[“the”] = [0.1, 0.04, -0.4, ….]
Meanings[“cat”] = [0.9, -0.17, -0.3, ….]
Meanings[“ran”] = [0.5, 0.1, 0.4, ….]
Word2Vec
Concept Search
• Writing, NOT Code
• Excludes “writing css”, “writing php”
• Implies “poetry”, “fiction”, “copyediting”
Concept Search Problems
• Demo
• Crawling
• Search Use Cases
• Machine Learning
• Results “about” the chosen topic
• Determine whether multiple search terms are related (hiking, art)
• De-duplicating documents (e.g. the same announcement from different publications)
• Higher result variety (not 5 results on type systems, etc.)
Like Lucene
• Demo
• Crawling
• Search Use Cases
• Machine Learning
• Tokenize text
• Filter entities
• Rank results, weighting by term frequency
Data Setup
// Real solution uses Lucene
List<String> sentence = Arrays.asList("the cat ran up the hill".split(" "));
List<List<String>> allSentences = new ArrayList<>();
allSentences.add(sentence);
// … (more sentences added the same way)
Set<String> vocabulary = new HashSet<>();
vocabulary.addAll(sentence);
Term Frequency
Map<String, Long> tf =
sentence.stream()
.collect(
Collectors.groupingBy(
Function.identity(),
Collectors.counting())
);
Document Frequency
Map<String, Long> df =
allSentences.stream()
.flatMap(List::stream)
.collect(
Collectors.groupingBy(
Function.identity(),
Collectors.counting())
);
Goal: parallel computation (1 column / core)
   | the  | cat | ran | up | the  | hill
tf | 2    | 1   | 1   | 1  | 2    | 1
df | 1000 | 7   | 57  | 50 | 1000 | 9

tf * idf = 2 / log(1000) …
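As a side calculation for the first column (using Math.log, the natural log, to match the DF initialization code later in the deck):

double weight = 2.0 / Math.log(1000);   // tf("the") / ln(df("the")) ≈ 0.29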
                  | the    | the    | the     | … | cat  | cat   | cat   | …
tf                | 2      | 2      | 2       | … | 1    | 1     | 1     | …
df                | 1000   | 1000   | 1000    | … | 7    | 7     | 7     | …
meaning           | 0.1    | 0.04   | -0.04   | … | 0.9  | -0.17 | -0.3  | …
tf / df * meaning | 0.0002 | 0.0008 | -0.0008 | … | 0.12 | -0.02 | -0.04 | …
Initialize TF blocks
List<INDArray> tf_list =
words.stream().map(
(word) ->
Nd4j.zeros(widthOfWordVector)
.addi(tf.get(word))
).collect(Collectors.toList());
Initialize IDF blocks
List<INDArray> df_list =
words.stream().map(
(word) ->
Nd4j.zeros(widthOfWordVector)
.addi(
Math.log(df.get(word))
)
).collect(Collectors.toList());
Initialize Word2Vec block
List<INDArray> meaning_list =
words.stream().map(
(word) ->
model.getWordVectorMatrix(word)
).collect(Collectors.toList());
Nd4j – Flatten
INDArray data =
Nd4j.vstack(
Nd4j.hstack(tf_list),
Nd4j.hstack(df_list),
Nd4j.hstack(meaning_list)
);
Nd4j – Shape
int[] shape = data.shape();
for (int i = 0; i < shape.length; i++) {
System.out.println(shape[i]);
}
3
1500
Views
INDArray tfVec = data.getRow(0);
INDArray dfVec = data.getRow(1);
INDArray scoresVec = data.getRow(2);
Multiply TF*IDF and Word2Vec data
INDArray weighted = tfVec.div(dfVec).mul(scoresVec);
shape = weighted.shape();
for (int i = 0; i < shape.length; i++) {
System.out.println(shape[i]);
}
1
1500
Pivot
tf / df * meaning (the)    | 0.0002 | 0.0008 | -0.0008 | …
tf / df * meaning (cat)    | 0.12   | -0.02  | -0.04   | …
tf / df * meaning (ran)    | …      | …      | …       |
tf / df * meaning (up)     | …      |        |         |
Average meaning (document) | 0.20   | -0.1   | -0.5    |
Reshape (1 row per word)
INDArray wordVects =
weighted.reshape(
vocabulary.size(),
widthOfWordVector
);
shape = wordVects.shape();
for (int i = 0; i < shape.length; i++) {
System.out.println(shape[i]);
}
5
300
Produce Weighted Average
INDArray documentAverage =
wordVects.sum(0).div(vocabulary.size());
shape = documentAverage.shape();
for (int i = 0; i < shape.length; i++) {
System.out.println(shape[i]);
}
1
300
Result
Average meaning (document 1) | 0.20 | -0.1 | -0.5  | …
Average meaning (document 2) | …    | …    | …     | …
…
Average meaning (query)      | -0.2 | 0.9  | -.002 | …

Cosine of the angle between the document vector and the query vector = “score”
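In symbols, with $d$ the document average vector and $q$ the query average vector:

$$\text{score} = \cos\theta = \frac{d \cdot q}{\lVert d \rVert \, \lVert q \rVert}$$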
Example: Similarity
Number from [0, 1]
Image credit: https://engineering.aweber.com/cosine-similarity/
Distance between “query” and “document”
import org.nd4j.linalg.ops.transforms.Transforms;

double score =
    Transforms.cosineSim(
        documentAverage,
        queryAverage
    );
Other Lessons
- Tuning code requires detailed knowledge of GPU memory access patterns
- Parallelism model is similar to Akka (code is faster without locking)
- Forums report max ~200W power usage
- Inventing your own math does not work
- High-dimensional “objects” do not follow 2D/3D intuition
- Floating point math is not associative (see the sketch below)
- Follow a paper
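A two-line illustration of the associativity point (ordinary Java doubles; float arithmetic on the GPU behaves the same way):

System.out.println((0.1 + 0.2) + 0.3);   // 0.6000000000000001
System.out.println(0.1 + (0.2 + 0.3));   // 0.6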
Resources
• “Deep Learning – A Practitioner’s Approach”
• "Relevant Search"
Contact:
@garysieling
gary@garysieling.com
https://www.findlectures.com
https://www.garysieling.com


Editor's Notes

  • #13 If you're interested in these topics, here are some useful resources.
  • #14 If you're interested in these topics, here are some useful resources.
  • #27 The solution is to use a machine learning algorithm that can identify significant relationships in text. Word2Vec has become a famous algorithm, because it can learn implicit concepts, like gender, verb tenses, or the concept of a capital city. In one of the most well known examples, it identifies that king is to queen as man is to woman. You can add and subtract concepts mathematically, such as king - man + woman = queen, and find distances between concepts, to find things which are similar.
  • #28 If we query for a topic instead, we also get appropriate results.
  • #29 The dataset for this talk comes from our corporate lunch and learn. We look for general interest talks that stand alone and fit a lunch break.
  • #30 The dataset for this talk comes from our corporate lunch and learn. We look for general interest talks that stand alone and fit a lunch break.
  • #50 The solution is to use a machine learning algorithm that can identify significant relationships in text. Word2Vec has become a famous algorithm, because it can learn implicit concepts, like gender, verb tenses, or the concept of a capital city. In one of the most well known examples, it identifies that king is to queen as man is to woman. You can add and subtract concepts mathematically, such as king - man + woman = queen, and find distances between concepts, to find things which are similar.
  • #53 If you're interested in these topics, here are some useful resources.
  • #54 We're Hiring