
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection


Published at the 12th International Conference on the Principles and Practice of Programming in Java (PPPJ), September 2015.

  1. Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection. Akihiro Hayashi (Rice University), Kazuaki Ishizaki (IBM Research - Tokyo), Gita Koblents (IBM Canada), Vivek Sarkar (Rice University). ACM International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ'15).
  2. Background: Java 8 Parallel Stream APIs. Explicit parallelism with lambda expressions: IntStream.range(0, N).parallel().forEach(i -> <lambda>) (see the example below).
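A minimal, self-contained Java example of the parallel-stream pattern on this slide. The array names, element type, and problem size are illustrative choices, not taken from the paper.

```java
import java.util.stream.IntStream;

public class ParallelStreamExample {
    public static void main(String[] args) {
        int n = 1 << 20;
        float[] a = new float[n];
        float[] b = new float[n];
        float[] c = new float[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

        // Explicitly parallel loop: each index i may run on a different worker
        // thread, and (with the IBM Java 8 compiler described in these slides)
        // the lambda body is a candidate for GPU code generation.
        IntStream.range(0, n)
                 .parallel()
                 .forEach(i -> c[i] = a[i] + b[i]);

        System.out.println(c[42]); // expected: 42 + 84 = 126.0
    }
}
```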
  3. Background: Explicit Parallelism with Java. High-level parallel programming in Java offers opportunities for preserving portability and for enabling the compiler to perform parallel-aware optimizations and code generation. [Diagram: Java 8 programs written against the Java 8 Parallel Stream API (SW) can target multi-core CPUs, many-core GPUs, and FPGAs (HW).]
  4. Background: JIT Compilation for GPU Execution. The IBM Java 8 compiler is built on top of the production version of the IBM Java 8 runtime environment. [Diagram: a method is interpreted on the JVM for its first N invocations; at the (N+1)th invocation, native code is generated for multi-core CPUs and many-core GPUs.]
  5. Background: The compilation flow of the IBM Java 8 compiler. [Diagram: Java bytecode is translated into the compiler's IR; parallel streams are identified in the IR; existing analyses and optimizations run on it; for the GPU path (our new modules), NVVM IR is generated and compiled by NVIDIA's libnvvm into PTX, which a PTX-to-binary module turns into NVIDIA GPU native code, with runtime helpers and GPU feature extraction; the existing back end generates PowerPC native code.] Optimizations for GPUs: to improve performance, read-only cache utilization, buffer alignment, and data transfer optimizations; to support Java's language features, loop versioning for eliminating redundant exception checking on GPUs, and virtual method invocation support via de-virtualization and loop versioning (see the sketch below).
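The loop-versioning idea mentioned above can be illustrated with a hand-written Java analogue: a guarded "fast" version whose array accesses are provably safe, plus a fallback that keeps Java's normal exception checks. This is only a conceptual sketch; the actual transformation is performed by the JIT compiler on its IR, and the method and class names here are made up.

```java
import java.util.stream.IntStream;

public class LoopVersioningSketch {
    static void scale(float[] a, float[] b, int n, float k) {
        if (a != null && b != null && n >= 0 && n <= a.length && n <= b.length) {
            // Fast version: the guard proves every a[i]/b[i] access is in bounds
            // and the arrays are non-null, so a GPU kernel generated from this
            // loop needs no per-iteration exception checks.
            IntStream.range(0, n).parallel().forEach(i -> b[i] = k * a[i]);
        } else {
            // Safe fallback: plain sequential Java with the usual checks,
            // raising NullPointerException/ArrayIndexOutOfBoundsException as needed.
            for (int i = 0; i < n; i++) {
                b[i] = k * a[i];
            }
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = new float[3];
        scale(a, b, 3, 10f);
        System.out.println(b[2]); // expected: 30.0
    }
}
```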
  6. Motivation: Runtime CPU/GPU Selection. Selecting the faster hardware device is a challenging problem. [Diagram: after the Nth invocation, native code can be generated for either multi-core CPUs or many-core GPUs; the problem is deciding which one is faster.]
  7. Related Work: Linear regression. Regression-based cost estimation [1,2] is specific to an application. [Plots: kernel execution time (msec) vs. the dynamic number of IR instructions on an NVIDIA Tesla K40 GPU and an IBM POWER8 CPU, for App 1, BlackScholes, and App 2, vector addition; which device is faster differs between the two applications.] [1] Leung et al. Automatic Parallelization for Graphics Processing Units (PPPJ'09). [2] Kerr et al. Modeling GPU-CPU Workloads and Systems (GPGPU-3).
  8. Open Question: Is an accurate cost model required? Constructing an accurate cost model would require too much effort, and considerable effort would be needed to update performance models for future generations of hardware. To decide which of the multi-core CPU and the many-core GPU is faster, we instead use machine-learning-based performance heuristics.
  9. Our Approach: ML-based Performance Heuristics. A binary prediction model is constructed by supervised machine learning with support vector machines (SVMs). [Diagram: in offline model construction, training runs with the JIT compiler extract a feature vector from the bytecode of each (application, data set) pair; LIBSVM builds the prediction model from these features; at run time, the Java runtime uses the model to choose between CPU and GPU.] A conceptual sketch of the runtime decision follows.
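A conceptual sketch of how compiled code could consult the offline-trained model at run time. All class and method names here (Device, FeatureVector, PerformanceModel, KernelDispatcher) are hypothetical illustrations of the structure, not the IBM runtime's actual API.

```java
// Hypothetical sketch of the runtime CPU/GPU decision; names are illustrative.
enum Device { CPU, GPU }

// One record of the features listed on slides 10-12.
class FeatureVector {
    long range;                                      // parallel loop size
    long memoryIRs, arithmeticIRs, mathIRs, branchIRs, otherIRs;
    long coalesced, offset, stride, otherAccess;     // dynamic array accesses
    long h2dBytes, d2hBytes;                         // data transfer sizes
}

interface PerformanceModel {
    Device predict(FeatureVector f);                 // binary SVM decision
}

class KernelDispatcher {
    private final PerformanceModel model;
    KernelDispatcher(PerformanceModel model) { this.model = model; }

    void launch(FeatureVector f, Runnable cpuVersion, Runnable gpuVersion) {
        // Consult the offline-trained model with features extracted by the JIT.
        if (model.predict(f) == Device.GPU) {
            gpuVersion.run();
        } else {
            cpuVersion.run();
        }
    }
}
```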
  10. Features of the program that may affect performance: loop range (parallel loop size); the dynamic number of instructions (memory accesses, arithmetic operations, math methods, branch instructions, other instructions).
  11. Features of the program that may affect performance (cont'd): the dynamic number of array accesses (coalesced access a[i], aligned; offset access a[i+c]; stride access a[c*i]; other access a[b[i]]); data transfer size (H2D transfer size, D2H transfer size). [Plot: GPU memory bandwidth (GB/s) vs. c, the offset or stride size, for offset accesses array[i+c] and stride accesses array[c*i].]
  12. An example of a feature vector: { "title": "MRIQ.runGPULambda()V", "lineNo": 114, "features": { "range": 32768, "IRs": { "Memory": 89128, "Arithmetic": 61447, "Math": 6144, "Branch": 3074, "Other": 58384 }, "Array Accesses": { "Coalesced": 9218, "Offset": 0, "Stride": 0, "Other": 12288 }, "H2D Transfer": [131088,131088,131088,12304,12304,12304,12304,16,16], "D2H Transfer": [131072,131072,0,0,0,0,0,0,0,0] } }
  13. Applications:
      Application           | Source    | Field               | Max Size        | Data Type
      BlackScholes          |           | Finance             | 4,194,304       | double
      Crypt                 | JGF       | Cryptography        | Size C (N=50M)  | byte
      SpMM                  | JGF       | Numerical Computing | Size C (N=500K) | double
      MRIQ                  | Parboil   | Medical             | Large (64^3)    | float
      Gemm                  | Polybench | Numerical Computing | 2K x 2K         | int
      Gesummv               | Polybench |                     | 2K x 2K         | int
      Doitgen               | Polybench |                     | 256x256x256     | int
      Jacobi-1D             | Polybench |                     | N=4M, T=1       | int
      Matrix Multiplication |           |                     | 2K x 2K         | double
      Matrix Transpose      |           |                     | 2K x 2K         | double
      VecAdd                |           |                     | 4M              | double
  14. 14. Platform CPU IBM POWER8 @ 3.69GHz 20-cores 8 SMT threads per cores = up to 160 threads 256 GB of RAM GPU NVIDIA Tesla K40m 12GB of Global Memory 14
  15. Prediction Model Construction. We obtained 291 samples by running 11 applications with different data sets; the choice is either the GPU or 160 worker threads on the CPU. [Diagram: the same offline model-construction pipeline as before, with LIBSVM 3.2 building the prediction model from the features extracted by training runs with the JIT compiler.] A sketch of this offline training step follows.
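A hedged sketch of the offline training step using the LIBSVM Java API (the slides state LIBSVM 3.2 is used). The feature values and labels below are placeholders rather than the paper's 291 samples, the model file name is made up, and feature scaling and parameter tuning are omitted.

```java
import libsvm.*;

public class TrainDevicePredictor {

    // Turn one feature record (loop range, dynamic IR counts, array-access
    // counts, transfer sizes) into a LIBSVM feature vector.
    static svm_node[] toNodes(double[] features) {
        svm_node[] nodes = new svm_node[features.length];
        for (int i = 0; i < features.length; i++) {
            nodes[i] = new svm_node();
            nodes[i].index = i + 1;          // LIBSVM feature indices are 1-based
            nodes[i].value = features[i];
        }
        return nodes;
    }

    public static void main(String[] args) throws java.io.IOException {
        // Placeholder training data: each row is one (application, data size)
        // sample; labels are +1 = CPU (160 workers) faster, -1 = GPU faster.
        double[][] samples = {
            { 32768, 89128, 61447, 6144, 3074, 58384, 9218, 0, 0, 12288 },
            { 4194304, 1.2e7, 8.0e6, 0, 4.0e6, 2.0e6, 4.0e6, 0, 0, 0 },
        };
        double[] labels = { -1, +1 };

        svm_problem prob = new svm_problem();
        prob.l = samples.length;
        prob.y = labels;
        prob.x = new svm_node[samples.length][];
        for (int i = 0; i < samples.length; i++) {
            prob.x[i] = toNodes(samples[i]);
        }

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;    // binary classification
        param.kernel_type = svm_parameter.RBF;
        param.gamma = 1.0 / samples[0].length;
        param.C = 1.0;
        param.cache_size = 100;
        param.eps = 1e-3;

        svm_model model = svm.svm_train(prob, param);
        svm.svm_save_model("cpu_gpu_model.txt", model);

        // A runtime query on one feature vector:
        double label = svm.svm_predict(model, toNodes(samples[0]));
        System.out.println(label > 0 ? "CPU" : "GPU");
    }
}
```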
  16. Speedups and prediction accuracy with the max data size: 160 worker threads vs. GPU. [Bar chart, log scale, higher is better: speedup relative to sequential Java for 160 worker threads (fork/join) and for the GPU on each application; speedups reach up to 82.0x with 160 worker threads and up to 1164.8x on the GPU, while some kernels run slower on the GPU than on the CPU; the prediction is correct for 10 of the 11 applications.]
  17. How to evaluate prediction models: TRUE/FALSE POSITIVE/NEGATIVE. There are four types of binary prediction results (let POSITIVE be CPU and NEGATIVE be GPU). TRUE POSITIVE: correctly predicted that the CPU is faster. TRUE NEGATIVE: correctly predicted that the GPU is faster. FALSE POSITIVE: predicted the CPU is faster, but the GPU is actually faster. FALSE NEGATIVE: predicted the GPU is faster, but the CPU is actually faster.
  18. How to evaluate prediction models: accuracy, precision, and recall metrics. Accuracy: the percentage of selections predicted correctly, (TP + TN) / (TP + TN + FP + FN). Precision X (X = CPU or GPU) measures how precise the model is when it predicts that X is faster: Precision CPU = TP / (TP + FP), i.e. the number of samples correctly predicted CPU-faster over the total number of samples predicted CPU-faster; Precision GPU = TN / (TN + FN). Recall X measures how often the model predicts X when X is actually faster: Recall CPU = TP / (TP + FN), i.e. the number of samples correctly predicted CPU-faster over the total number of samples where the CPU is actually faster; Recall GPU = TN / (TN + FP). A small sketch that computes these metrics follows.
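A small helper that computes the metrics defined on this slide from the four confusion-matrix counts (POSITIVE = CPU faster, NEGATIVE = GPU faster). The counts in main() are made-up numbers, not the paper's results.

```java
public class PredictionMetrics {
    public static void main(String[] args) {
        int tp = 220, tn = 60, fp = 3, fn = 8;   // illustrative counts only

        double accuracy     = (double) (tp + tn) / (tp + tn + fp + fn);
        double precisionCpu = (double) tp / (tp + fp);  // predicted CPU, and correct
        double precisionGpu = (double) tn / (tn + fn);  // predicted GPU, and correct
        double recallCpu    = (double) tp / (tp + fn);  // CPU actually faster, caught
        double recallGpu    = (double) tn / (tn + fp);  // GPU actually faster, caught

        System.out.printf("Accuracy      %.1f%%%n", 100 * accuracy);
        System.out.printf("Precision CPU %.1f%%  Recall CPU %.1f%%%n",
                          100 * precisionCpu, 100 * recallCpu);
        System.out.printf("Precision GPU %.1f%%  Recall GPU %.1f%%%n",
                          100 * precisionGpu, 100 * recallGpu);
    }
}
```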
  19. How to evaluate prediction models: 5-fold cross-validation. Overfitting problem: the prediction model may be tailored to the eleven applications if the training data equals the testing data. To avoid overfitting, the accuracy of the prediction model is calculated using 5-fold cross-validation. [Diagram: the 291 training samples are split into 5 subsets; in each round one subset is held out for testing and a model is trained on the other four (e.g. trained on subsets 2-5 and tested on subset 1, then trained on subsets 1 and 3-5 and tested on subset 2, and so on), yielding per-round accuracy and precision figures.] A sketch using LIBSVM's built-in cross-validation helper follows.
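A sketch of 5-fold cross-validation with LIBSVM's built-in helper; it assumes an svm_problem and svm_parameter constructed as in the training sketch above, and reports only overall accuracy (precision and recall could be derived from the same predictions).

```java
import libsvm.*;

public class CrossValidate {
    static double fiveFoldAccuracy(svm_problem prob, svm_parameter param) {
        double[] predicted = new double[prob.l];
        // Splits the samples into 5 folds, repeatedly trains on 4 of them and
        // predicts the held-out fold, filling `predicted` with those predictions.
        svm.svm_cross_validation(prob, param, 5, predicted);

        int correct = 0;
        for (int i = 0; i < prob.l; i++) {
            if (predicted[i] == prob.y[i]) correct++;
        }
        return (double) correct / prob.l;
    }
}
```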
  20. Accuracies with cross-validation: 160 worker threads or GPU (total number of samples = 291, higher is better). Range: 79.0%; +=nIRs: 97.6%; +=dIRs: 99.0%; +=Array: 99.0%; ALL (+=DT): 97.2%.
  21. Precisions and recalls with cross-validation (higher is better):
      Feature set | Precision CPU | Recall CPU | Precision GPU | Recall GPU
      Range       | 79.0%         | 100%       | 0%            | 0%
      +=nIRs      | 97.8%         | 99.1%      | 96.5%         | 91.8%
      +=dIRs      | 98.7%         | 100%       | 100%          | 95.0%
      +=Arrays    | 98.7%         | 100%       | 100%          | 95.0%
      ALL         | 96.7%         | 100%       | 100%          | 86.9%
      All prediction models except Range rarely make a bad decision.
  22. Discussion. Based on the results with 291 samples, the feature set (range, # of detailed instructions, # of array accesses) shows the best accuracy; DT (data transfer size) does not contribute to improving the accuracy, since the data transfer optimizations do not make GPUs faster. Pros and cons: (+) future generations of hardware can be supported easily by re-running the applications; (+) just add more training data and rebuild the prediction model; (-) it takes time to collect training data. Loop range, the number of arithmetic instructions, and the number of coalesced accesses affect the decision.
  23. Related Work: Java + GPU.
      Work          | Lang  | JIT | GPU Kernel          | Device Selection
      JCUDA         | Java  | -   | CUDA                | GPU only
      Lime          | Lime  | ✔   | Override map/reduce | Static
      Firepile      | Scala | ✔   | reduce              | Static
      JaBEE         | Java  | ✔   | Override run        | GPU only
      Aparapi       | Java  | ✔   | map                 | Static
      Hadoop-CL     | Java  | ✔   | Override map/reduce | Static
      RootBeer      | Java  | ✔   | Override run        | Not described
      HJ-OpenCL     | HJ    | -   | forall              | Static
      PPPJ09 (auto) | Java  | ✔   | For-loop            | Dynamic with regression
      Our work      | Java  | ✔   | Parallel Stream     | Dynamic with machine learning
      None of these approaches considers Java 8 Parallel Stream APIs and dynamic device selection with machine learning.
  24. Conclusions. Machine-learning-based performance heuristics achieve up to 99% accuracy and are a promising way to build performance heuristics. Future work: exploration of further program features (e.g. the CFG); selection of the best configuration from 1 worker, 2 workers, ..., 160 workers, or the GPU; parallelizing the training phase. For more details on GPU code generation, see "Compiling and Optimizing Java 8 Programs for GPU Execution", PACT15, October 2015.
  25. Acknowledgements. Special thanks to IBM CAS, Marcel Mitran (IBM Canada), Jimmy Kwa (IBM Canada), the Habanero Group at Rice, and Yoichi Matsuyama (CMU).
