Presentation delivered on 21-Feb-2019 about SIMD (vector) operations in Java, using a CNN (Convolutional Neural Network) as the example, along with the Deep Learning requirement to process massive arrays and matrices. We offer some insight into how to do that in Java, and some solutions using DeepLearning4J, after investigating CERN's Colt Java library.
16. SIMD
How much faster?
Hudson Mendes
SIMD, SINGLE INSTRUCTION MULTIPLE DATA (VECTORISATION)
SouJava & Belfast JUG
17. EXAMPLE WITH NUMPY
[Chart: elapsed time in ms (axis 0 to 500) for arrays of 500 x 5, 5000 x 5 and 50000 x 5; series: SISD, SIMD]
18. WELL, I DON’T DO AI OR NNS. DOES IT MATTER TO ME?
WHAT ARE IMAGES FOR NEURAL NETS / AI?
WHY DOES IT MATTER FOR US / FOR JAVA?
CERN’S COLT
JNI AND PANAMA
19. SIMD
Single “Instruction”?
20. SIMD
Single “Instruction”?
Add, Subtract, Multiply, Divide, but also Log, Exp, Sqrt, etc.
21. SIMD
Multiple “Data”?
22. SIMD
Multiple “Data”?
Numerical data: Int, Long, Float, Double, etc.
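To make "multiple data" concrete, the sketch below counts how many values of each numeric type fit into one vector register. The 256-bit register width used in the note after the code is an assumption for illustration (it matches AVX2 on x86), and `LaneCount` and `lanes` are hypothetical names, not part of any library.

```java
// Hypothetical helper: how many elements of a given width fit in one vector register.
final class LaneCount {
    private LaneCount() {}

    // registerBits: width of the vector register (an assumption chosen by the caller).
    // elementBits: width of one element, e.g. Integer.SIZE or Double.SIZE.
    static int lanes(int registerBits, int elementBits) {
        if (registerBits <= 0 || elementBits <= 0) {
            throw new IllegalArgumentException("widths must be positive");
        }
        return registerBits / elementBits;
    }
}
```

With an assumed 256-bit register, `LaneCount.lanes(256, Integer.SIZE)` and `LaneCount.lanes(256, Float.SIZE)` give 8 lanes, while `Long.SIZE` and `Double.SIZE` give 4.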
23. SISD IMPLEMENTATIONS: SUM STREAM OF DOUBLES
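The code shown on this slide is not part of the transcript. As a minimal sketch of what a scalar (SISD) sum of doubles can look like in plain Java, assuming nothing beyond the standard library (the class and method names are hypothetical):

```java
import java.util.stream.DoubleStream;

// SISD-style summation: the Java code handles one element at a time.
final class SumOfDoubles {
    private SumOfDoubles() {}

    // Plain loop: each iteration adds exactly one element to the running total.
    static double sumWithLoop(double[] values) {
        double total = 0.0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // The same reduction expressed with the Streams API.
    static double sumWithStream(double[] values) {
        return DoubleStream.of(values).sum();
    }
}
```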
24. SISD IMPLEMENTATIONS: ADDING ELEMENTS OF 2 ARRAYS
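The slide's code is missing from the transcript. A minimal sketch of a scalar element-wise addition of two arrays, with hypothetical names, might read:

```java
// SISD-style element-wise addition: one add per pair of elements.
final class ArrayAddition {
    private ArrayAddition() {}

    // Returns a new array where result[i] = a[i] + b[i].
    static double[] add(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double[] result = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            result[i] = a[i] + b[i];
        }
        return result;
    }
}
```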
25. SISD IMPLEMENTATIONS: EUCLIDEAN DISTANCE BETWEEN DOUBLES IN 2 ARRAYS
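The original code for this slide is not in the transcript. A minimal scalar sketch of the Euclidean distance between two equally sized arrays, using hypothetical names, could be:

```java
// SISD-style Euclidean distance: sqrt of the sum of squared differences.
final class EuclideanDistance {
    private EuclideanDistance() {}

    static double between(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff; // one subtract, one multiply, one add per element
        }
        return Math.sqrt(sumOfSquares);
    }
}
```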
26. SISD IMPLEMENTATIONS: LOGS OF MATRICES
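The slide's code did not survive in the transcript. As a sketch only, assuming the matrix is held as a `double[][]` and that "log" means the natural logarithm, a scalar version might look like this (names are hypothetical):

```java
// SISD-style element-wise natural logarithm over a matrix stored as double[][].
final class MatrixLog {
    private MatrixLog() {}

    // Returns a new matrix where result[r][c] = ln(matrix[r][c]); rows may differ in length.
    static double[][] log(double[][] matrix) {
        double[][] result = new double[matrix.length][];
        for (int r = 0; r < matrix.length; r++) {
            result[r] = new double[matrix[r].length];
            for (int c = 0; c < matrix[r].length; c++) {
                result[r][c] = Math.log(matrix[r][c]); // one call per element
            }
        }
        return result;
    }
}
```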
27. SISD (OR SINGLE INSTRUCTION SINGLE DATA) IN BYTECODE
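The bytecode listing from this slide is not in the transcript. As an illustration only, the sketch below pairs a scalar loop with the bytecode that `javac` typically emits for its body, written as a comment. The listing is approximate: exact output depends on the compiler version, and the class name is hypothetical.

```java
// A scalar loop and, in comments, roughly what `javap -c` prints for its body.
final class ScalarAddBytecode {
    private ScalarAddBytecode() {}

    // Assumes a and b hold at least out.length elements.
    static void add(double[] a, double[] b, double[] out) {
        for (int i = 0; i < out.length; i++) {
            // Approximate bytecode for the statement below (locals: 0=a, 1=b, 2=out, 3=i):
            //   aload_2, iload_3          push out and i (store target)
            //   aload_0, iload_3, daload  load a[i]
            //   aload_1, iload_3, daload  load b[i]
            //   dadd                      add ONE pair of doubles
            //   dastore                   store the result into out[i]
            out[i] = a[i] + b[i];
        }
    }
}
```

Each pass through the loop issues one `dadd` for one pair of values, which is the "single instruction, single data" pattern the slide title refers to.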
28. SISD (OR SINGLE INSTRUCTION SINGLE DATA) IN BYTECODE
▸ Given a vector N of size n
▸ Each element N[i] receives its own instruction (Single Instruction, Single Data)
▸ SISD cost of about 2n > SIMD cost of about n (both are O(n))
[Chart: elapsed time in ms (axis 0 to 500) for arrays of 500 x 5, 1000 x 5, 5000 x 5, 10000 x 5, 50000 x 5 and 100000 x 5; series: SISD, SIMD]
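One way to read the comparison on this slide is as a count of arithmetic instructions issued. The toy model below is an assumption for illustration, not a measurement and not the slide's own formula: it supposes a scalar loop issues one add per element, while a vector unit with a fixed number of lanes issues one add per group of elements. All names are hypothetical.

```java
// Toy instruction-count model: scalar adds versus vector adds over n elements.
final class InstructionCountModel {
    private InstructionCountModel() {}

    // SISD: one add instruction per element.
    static long scalarAdds(long n) {
        return n;
    }

    // SIMD: one add instruction per group of `lanes` elements, rounding up for the tail.
    static long vectorAdds(long n, int lanes) {
        if (lanes < 1) {
            throw new IllegalArgumentException("lanes must be at least 1");
        }
        return (n + lanes - 1) / lanes;
    }
}
```

Under this model both counts grow linearly with n, so the gain from SIMD is a constant factor set by the lane count rather than a change in complexity class.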
29. HOW TO DO SIMD IN JAVA THEN?
WHAT ARE IMAGES FOR NEURAL NETS / AI?
WHY DOES IT MATTER FOR US / FOR JAVA?
CERN’S COLT
JNI AND PANAMA
30. LET’S ASK THE DATA SCIENTISTS?
32. CLASSIC SCIENTIFIC COMPUTING: CERN’S COLT
1 GFLOPS = 1,000,000,000 floating-point operations per second
33. CLASSIC SCIENTIFIC COMPUTING: CERN’S COLT
1 GFLOPS = 1,000,000,000 floating-point operations per second
That looks pretty fast!
34. CLASSIC SCIENTIFIC COMPUTING: CERN’S COLT
1 GFLOPS = 1,000,000,000 floating-point operations per second
That looks pretty fast!
CERN’s Colt, Java SIMD Library?
35. CLASSIC SCIENTIFIC COMPUTING: CERN’S COLT
36. CLASSIC SCIENTIFIC COMPUTING: CERN’S COLT
37. CLASSIC SCIENTIFIC COMPUTING: CERN’S COLT
Not Faster than Serial, but WHY?
38. CLASSIC SCIENTIFIC COMPUTING: CERN’S COLT
39. CLASSIC SCIENTIFIC COMPUTING: CERN’S COLT
Digging Source Code:
Not SIMD!
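The source excerpt shown on this slide is not in the transcript. The sketch below is not Colt's code. It is a simplified stand-in, under the assumption that "not SIMD" here means element-wise operations written as an ordinary Java loop that applies a function object to one pair of values at a time. `ElementWiseAssign` and `assign` are hypothetical names.

```java
import java.util.function.DoubleBinaryOperator;

// Simplified stand-in (NOT Colt's source): element-wise update through a function object.
final class ElementWiseAssign {
    private ElementWiseAssign() {}

    // Overwrites target[i] with function(target[i], other[i]), one element per iteration.
    static void assign(double[] target, double[] other, DoubleBinaryOperator function) {
        if (target.length != other.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        for (int i = 0; i < target.length; i++) {
            target[i] = function.applyAsDouble(target[i], other[i]); // scalar call per element
        }
    }
}
```

A loop of this shape can be quick once the JIT has warmed up, yet each iteration still handles a single pair of values in Java code, which fits the slide's verdict.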
40. CLASSIC SCIENTIFIC COMPUTING: CERN’S COLT
1 GFLOPS = 1,000,000,000 floating-point operations per second
Fast, but not SIMD
CERN’s Colt, Java SIMD Library?
42. DEEP LEARNING DATA SCIENTIST: DEEPLEARNING4J
43. DEEP LEARNING DATA SCIENTIST: DEEPLEARNING4J
44. DEEP LEARNING DATA SCIENTIST: DEEPLEARNING4J
Faster than all of THEM!
45. DEEP LEARNING DATA SCIENTIST: DEEPLEARNING4J
Digging Source Code:
46. DEEP LEARNING DATA SCIENTIST: DEEPLEARNING4J
Digging Source Code:
47. DEEP LEARNING DATA SCIENTIST: DEEPLEARNING4J
Digging Source Code:
Yes, SIMD! Done with JNI
48. SO, IS JNI (JAVA NATIVE INTERFACE) THE ONLY WAY TO DO SIMD?
49. WAYS TO DO SIMD?
JAVA
1 + 2 = 3
BYTECODE
iconst_1
iconst_2
iadd
ASSEMBLER
MOV AX,@DATA
MOV DS,AX
MOV AX,OPR1
MOV BX,OPR2
CLC
ADD AX,BX
MOV DI,OFFSET RESULT
MOV [DI], AX
MOV AH,09H
MOV DX,OFFSET RESULT
INT 21H
MOV AH,4CH
INT 21H
END
50. WAYS TO DO SIMD?
JAVA
1 + 2 = 3
BYTECODE
iconst_1
iconst_2
iadd
ASSEMBLER
MOV AX,@DATA
MOV DS,AX
MOV AX,OPR1
MOV BX,OPR2
CLC
ADD AX,BX
MOV DI,OFFSET RESULT
MOV [DI], AX
MOV AH,09H
MOV DX,OFFSET RESULT
INT 21H
MOV AH,4CH
INT 21H
END
JAVA
VECTOR.SUM()
BYTECODE
ivector_1
vector_add
ASSEMBLER
EXPORT XCORR_KERNEL
xcorr_kernel PROC
VMOV.I32 q0, #0
CMP r3, #0
BLE xcorr_kernel_done
VLD1.16 {d3}, [r2]!
SUBS r3, r3, #4
BLE xcorr_kernel_process4_done
(…)
51. WAYS TO DO SIMD?
JAVA
VECTOR.SUM()
BYTECODE
ivector_1
vector_add
ASSEMBLER
EXPORT XCORR_KERNEL
xcorr_kernel PROC
VMOV.I32 q0, #0
CMP r3, #0
BLE xcorr_kernel_done
VLD1.16 {d3}, [r2]!
SUBS r3, r3, #4
BLE xcorr_kernel_process4_done
(…)
SO, YES: AT THE MINUTE, JNI IS PRETTY MUCH THE ONLY WAY…
52. IS THERE ANYTHING BETTER THAN JNI IN THE NEAR FUTURE?
53. IS THERE ANYTHING BETTER THAN JNI IN THE NEAR FUTURE?
YES!
54. JAVA 9+ SUPERWORD
Source: http://prestodb.rocks/code/simd/
55. JAVA 9+ SUPERWORD
Source: http://prestodb.rocks/code/simd/
"Exploiting Superword Level Parallelism with Multimedia Instruction Sets", Samuel Larsen and Saman Amarasinghe, MIT
HTTP://GROUPS.CSAIL.MIT.EDU/CAG/SLP/SLP-PLDI-2000.PDF
56. JAVA 9+ SUPERWORD
Source: http://prestodb.rocks/code/simd/
FOR SUPERWORD, THE LOOP MUST NOT HAVE:
• AN OR CONDITION AS THE LOOP CONDITION
• A NON-INLINED METHOD INSIDE THE LOOP
• AN ARBITRARY METHOD AS THE LOOP CONDITION
• MANUAL UNROLLING OF THE LOOP
• A LONG AS THE LOOP VARIABLE
• MULTIPLE EXIT POINTS
57. JAVA 9+ SUPERWORD
Source: http://prestodb.rocks/code/simd/
ON BY DEFAULT ON JAVA 9+
FOR SUPERWORD, THE LOOP MUST NOT HAVE:
• AN OR CONDITION AS THE LOOP CONDITION
• A NON-INLINED METHOD INSIDE THE LOOP
• AN ARBITRARY METHOD AS THE LOOP CONDITION
• MANUAL UNROLLING OF THE LOOP
• A LONG AS THE LOOP VARIABLE
• MULTIPLE EXIT POINTS
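The conditions listed above can be illustrated with two loops that compute the same result. This is a sketch with hypothetical names: the first loop is shaped to avoid every item on the list, while the second deliberately breaks two of them (a long loop variable and an extra exit point). Whether the JIT actually vectorises the first loop depends on the JVM version, its flags and the CPU, so treat that as an assumption to verify rather than a guarantee.

```java
// Two loops with the same result: one shaped for auto-vectorisation, one not.
final class SuperwordLoops {
    private SuperwordLoops() {}

    // int counter, simple bound, single exit, no method calls, no manual unrolling.
    // Assumes out has at least values.length elements.
    static void scaleFriendly(float[] values, float factor, float[] out) {
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] * factor;
        }
    }

    // Breaks two listed conditions: long loop variable and a second exit point.
    static void scaleUnfriendly(float[] values, float factor, float[] out) {
        for (long i = 0; i < values.length; i++) {
            if (i >= out.length) {
                break; // extra exit point
            }
            out[(int) i] = values[(int) i] * factor;
        }
    }
}
```

A JMH benchmark comparing the two methods on large arrays is one way to check the effect on a given JVM.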
58. PROJECT PANAMA
Source: http://openjdk.java.net/projects/panama/
59. PROJECT PANAMA
Source: http://openjdk.java.net/projects/panama/
A BETTER NATIVE API THAN JNI
60. Hudson Mendes
Lead Java Software Engineer @ AIQUDO
twitter.com/hudsonmendes
linkedin.com/in/hudsonmendes
medium.com/@hudsonmendes
THANK YOU!
JMH CODE AVAILABLE AT
HTTPS://GITHUB.COM/HUDSONMENDES/BELFASTJUG-SAMPLE-3
SOUJAVA & BELFASTJUG