
[db analytics showcase Sapporo 2018] B33 H2O4GPU and GoAI: harnessing the power of GPUs.

Speaker: Mateusz Dymczyk (H2O.ai)


1. H2O4GPU and GoAI: harnessing the power of GPUs. Mateusz Dymczyk, Senior Software Engineer, H2O.ai, @mdymczyk
2. Agenda • About me • About H2O.ai • A bit of history: H2O-3 • Moving forward: feature engineering & Driverless AI • The need for GPUs • GPU overview • Machine Learning + GPUs = why? how? • About GoAI • About H2O4GPU • Q&A
3. About me • M.Sc. in Computer Science @ AGH UST in Poland • Ph.D. dropout (machine learning) • Previously NLP/ML @ Fujitsu Laboratories, Kanagawa • Currently Lead/Senior Machine Learning Engineer @ H2O.ai (remotely from Tokyo) • Conference speaker (Strata Beijing/NY/Singapore, Hadoop World Tokyo, etc.)
4. About H2O.ai
• Founded: 2012, Series C in Nov 2017
• Products: Driverless AI (automated machine learning); H2O open source machine learning; Sparkling Water; H2O4GPU (open-source ML GPU library)
• Mission: democratize AI
• Team: ~100 employees; several Kaggle Grandmasters; distributed-systems engineers doing machine learning; world-class visualization designers
• Offices: Mountain View, London, Prague
5. Community Adoption (* data from Google Analytics embedded in the end-user product)
6. Select Customers: Financial, Insurance, Marketing, Telecom, Healthcare, Retail, Advisory & Accounting. "Overall customer satisfaction is very high." (Gartner)
7. A bit of history: H2O-3
8. H2O-3 Overview • Distributed implementations of cutting-edge ML algorithms • Core algorithms written in high-performance Java • APIs available in R, Python, Scala, REST/JSON • Interactive web GUI called H2O Flow • Easily deploy models to production with H2O Steam
9. H2O-3 Distributed Computing
• Multi-node cluster with a shared memory model; all computations in memory
• Each node sees only some rows of the data; no limit on cluster size
• Distributed data frames (collections of vectors): columns are arrays distributed across nodes
• Works just like R's data.frame or a Python pandas DataFrame
[Diagram: H2O Frame / H2O Cluster]
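The same model from Python, as a minimal sketch (assuming the `h2o` package is installed; the file path is a placeholder):

```python
import h2o

h2o.init()  # start a local H2O cluster, or connect to an existing one

# Data is parsed into an H2OFrame whose columns are distributed across the
# cluster's nodes; computations on it run in-cluster, in memory.
frame = h2o.import_file("path/to/data.csv")
frame.describe()  # summary statistics, computed in-cluster
```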
10. H2O-3 Algorithms
Supervised Learning
• Statistical Analysis: Generalized Linear Models (Binomial, Gaussian, Gamma, Poisson and Tweedie); Naïve Bayes
• Ensembles: Distributed Random Forest (classification or regression models); Gradient Boosting Machine (produces an ensemble of decision trees with increasingly refined approximations)
• Deep Neural Networks: Deep Learning (multi-layer feed-forward neural networks, starting with an input layer followed by multiple layers of nonlinear transformations)
Unsupervised Learning
• Clustering: K-means (partitions observations into k clusters/groups of the same spatial size; automatically detects the optimal k)
• Dimensionality Reduction: Principal Component Analysis (linearly transforms correlated variables into independent components); Generalized Low Rank Models (extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data)
• Anomaly Detection: Autoencoders (find outliers using nonlinear dimensionality reduction via deep learning)
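A hedged sketch of training one of these algorithms (a GBM) through the H2O Python API; the file path and the choice of target column are placeholders:

```python
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
data = h2o.import_file("path/to/train.csv")      # placeholder path
train, valid = data.split_frame(ratios=[0.8])    # 80/20 train/validation split

# Placeholder column choices: last column as target, the rest as features
gbm = H2OGradientBoostingEstimator(ntrees=100, max_depth=5)
gbm.train(x=data.col_names[:-1], y=data.col_names[-1],
          training_frame=train, validation_frame=valid)
print(gbm.model_performance(valid))
```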
11. Driverless AI & Feature Engineering
12. The Need for Automation
"The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts." (McKinsey prediction for 2018)
13. Recipe for Success: Auto Feature Generation
Kaggle Grandmaster know-how out of the box. Feature transformations applied to the original features to generate new ones include: Automatic Text Handling, Frequency Encoding, Cross-Validation Target Encoding, Truncated SVD, Clustering, and more.
14. Recipe for Success
15. Recipe for Success: Driverless AI, "AI to do AI"
16. 3 Pillars: Speed, Accuracy, Interpretability
17. The need for GPUs
18. Moore's Law
[Chart: 40 Years of Microprocessor Trend Data. Transistor counts (thousands) keep climbing, while single-threaded performance growth has slowed from 1.5X per year to 1.1X per year. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; data for 2010-2015 by K. Rupp.]
19. GPU
[Chart, same data sources as the previous slide: GPU-computing performance grows 1.5X per year, on track for 1000X by 2025, versus 1.1X per year for single-threaded CPU performance. The gains come from the whole stack: applications, systems, algorithms, and the CUDA architecture.]
20. GoAI
21. GPU Shortcomings
[Diagram: the CPU launches kernels and copies data from host memory to GPU global memory via PCI-E, which is SLOW; on the GPU, threads have local memory and share per-block shared memory.]
22. GPU Open Analytics Initiative (GOAI): github.com/gpuopenanalytics
The GPU Data Frame (GDF) keeps data on the GPU across the whole pipeline: ingest/parse, exploratory analysis, feature engineering, ML/DL algorithms, grid search, scoring, and model export.
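A hedged illustration of the GDF idea, assuming the `pygdf` package (GOAI's reference GPU DataFrame at the time of this talk, later continued as RAPIDS cuDF); the data is a toy placeholder:

```python
import pandas as pd
import pygdf  # GOAI's GPU DataFrame library

pdf = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [0, 1, 0]})
gdf = pygdf.DataFrame.from_pandas(pdf)  # one host-to-GPU copy over PCI-E

# From here on, operations run against GPU memory, and the frame can be
# handed to a GPU ML/DL library without copying data back to the host.
print(gdf["x"].sum())  # reduction computed on the GPU
```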
23. GOAI Data Flow
24. GPU Overview
25. GPU Architecture: Low Latency vs. High Throughput
• GPU: optimized for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation
• CPU: optimized for low-latency access to cached data sets; control logic for out-of-order and speculative execution
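To make the contrast concrete, a hedged sketch using CuPy (a NumPy-compatible GPU array library, not part of H2O4GPU): the GPU pays off on large, data-parallel workloads, not on small latency-bound ones.

```python
import numpy as np
import cupy as cp  # NumPy-compatible GPU arrays

a_cpu = np.random.rand(4096, 4096).astype(np.float32)
a_gpu = cp.asarray(a_cpu)            # one host-to-device copy

c_cpu = a_cpu @ a_cpu                # CPU: low-latency cores, cached data
c_gpu = a_gpu @ a_gpu                # GPU: thousands of throughput-oriented cores
cp.cuda.Stream.null.synchronize()    # GPU calls are asynchronous; wait for results
```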
26. GPU-Enhanced Applications
[Diagram: application code is split so the GPU parallelizes the compute-intensive functions while the CPU runs the rest of the sequential code.]
27. Machine Learning on GPU
28. Machine Learning and GPUs
$$A_{m \times k} \; B_{k \times n} = C_{m \times n}$$
29. Matrix Multiplication
$$\begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & \dots & a_{1,k} \\ a_{2,1} & a_{2,2} & a_{2,3} & \dots & a_{2,k} \\ a_{3,1} & a_{3,2} & a_{3,3} & \dots & a_{3,k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & a_{m,3} & \dots & a_{m,k} \end{bmatrix} \begin{bmatrix} b_{1,1} & b_{1,2} & b_{1,3} & \dots & b_{1,n} \\ b_{2,1} & b_{2,2} & b_{2,3} & \dots & b_{2,n} \\ b_{3,1} & b_{3,2} & b_{3,3} & \dots & b_{3,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ b_{k,1} & b_{k,2} & b_{k,3} & \dots & b_{k,n} \end{bmatrix} = \begin{bmatrix} c_{1,1} & c_{1,2} & c_{1,3} & \dots & c_{1,n} \\ c_{2,1} & c_{2,2} & c_{2,3} & \dots & c_{2,n} \\ c_{3,1} & c_{3,2} & c_{3,3} & \dots & c_{3,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ c_{m,1} & c_{m,2} & c_{m,3} & \dots & c_{m,n} \end{bmatrix}$$
30.-33. Matrix Multiplication (animation: the same equation, stepping through the output elements C[0,0], C[0,1], ..., C[m,n] to show that each one is computed independently)
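This is why matrix multiplication maps so well onto GPUs: every output element C[i,j] is an independent dot product, so all m*n elements can be computed in parallel. A plain NumPy sketch of that decomposition:

```python
import numpy as np

m, k, n = 4, 3, 5
A = np.random.rand(m, k)
B = np.random.rand(k, n)

C = np.empty((m, n))
for i in range(m):                    # on a GPU, each (i, j) pair becomes one thread
    for j in range(n):
        C[i, j] = A[i, :] @ B[:, j]   # independent of every other C[i, j]

assert np.allclose(C, A @ B)
```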
34. Matrix Operations in ML: matrix multiplication! All black lines (in the network diagram) are matrix multiplications!
35. H2O4GPU
36. Practical Machine Learning
37. H2O4GPU
• Open source: https://github.com/h2oai/h2o4gpu
• Collection of important ML algorithms ported to the GPU (with a CPU fallback option): Gradient Boosted Machines, GLM, Truncated SVD, PCA, K-means, and (soon) Field-Aware Factorization Machines
• Performance-optimized, with multi-GPU support for certain algorithms
• Used within our own Driverless AI product to boost performance 30X
• Scikit-learn-compatible Python API (and now an R API); see the sketch below
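A minimal sketch of the drop-in usage, assuming `h2o4gpu.ElasticNet` mirrors `sklearn.linear_model.ElasticNet` (constructor arguments may differ between versions; the data is synthetic):

```python
import numpy as np
import h2o4gpu

# Synthetic regression data, just to show the fit/predict contract
X = np.random.rand(10000, 20).astype(np.float32)
y = X @ np.random.rand(20).astype(np.float32)

# Same interface as the scikit-learn solver; runs on the GPU when one is
# available, falling back to the CPU otherwise.
model = h2o4gpu.ElasticNet()
model.fit(X, y)
preds = model.predict(X)
```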
38. H2O4GPU Algorithm Speedups: 10X XGBoost, 5X GLM, 40X K-means, 5X SVD
39. Gradient Boosting Machines
• Based upon XGBoost
• Raw floating-point data is binned into quantiles
• Quantiles are stored compressed instead of as floats
• Compressed quantiles are efficiently transferred to the GPU
• Sparsity is handled directly, with high GPU efficiency
• Multi-GPU support by sharding rows using NVIDIA NCCL AllReduce
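Since the solver builds on XGBoost, here is a hedged sketch using XGBoost's GPU backend directly, where `tree_method="gpu_hist"` selects the binned, quantile-based GPU histogram algorithm described above (synthetic data; parameter names as in XGBoost's API of that era):

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100000, 50).astype(np.float32)
y = (X[:, 0] > 0.5).astype(int)          # synthetic binary target

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",           # GPU histogram/quantile algorithm
    "max_depth": 6,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```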
40. KMeans
• Significantly faster than the scikit-learn implementation (up to 50x)
• Significantly faster than other GPU implementations (5x-10x)
• Supports kmeans|| initialization
• Supports multiple GPUs by sharding the dataset
• Supports batching the data if it exceeds GPU memory
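A minimal sketch, assuming `h2o4gpu.KMeans` mirrors `sklearn.cluster.KMeans` as the project advertises (synthetic data):

```python
import numpy as np
import h2o4gpu

X = np.random.rand(1000000, 10).astype(np.float32)

model = h2o4gpu.KMeans(n_clusters=8, random_state=42)
model.fit(X)                          # clustering runs on available GPUs
print(model.cluster_centers_.shape)   # (8, 10)
```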
41. [Benchmark chart: K-means performance with kmeans|| initialization]
42. Truncated SVD & PCA
• Matrix decomposition methods
• Popular for text processing and dimensionality reduction
• GPUs accelerate the underlying linear algebra operations
43. Truncated SVD & PCA
• The intrinsic dimensionality of certain datasets is much lower than the original (e.g. here 4096 dimensions vs. an actual ~200)
• PCA can reduce the dimensionality while preserving most of the explained variance
• The result is a better input for further modeling, and training takes less time
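A hedged sketch of that reduction step, assuming `h2o4gpu.TruncatedSVD` follows the scikit-learn `TruncatedSVD` interface (the 4096-to-200 shapes echo the example above; the data is synthetic):

```python
import numpy as np
import h2o4gpu

X = np.random.rand(5000, 4096).astype(np.float32)  # wide, synthetic dataset

svd = h2o4gpu.TruncatedSVD(n_components=200)
X_reduced = svd.fit_transform(X)   # (5000, 200): smaller, faster input for modeling
print(X_reduced.shape)
```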
44. Field-Aware Factorization Machines (* under development)
• Click-Through Rate (CTR): one of the most important tasks in computational advertising; the percentage of users who actually click on ads
• Until recently solved with logistic regression, which is bad at finding feature conjunctions (it learns the effect of each variable/feature individually)

Clicked | Publisher (P) | Advertiser (A) | Gender (G)
Yes     | ESPN          | Nike           | Male
No      | NBC           | Adidas         | Male
45. Field-Aware Factorization Machines (* under development)
• Separates the data into fields (Publisher, Advertiser, Gender) and features (ESPN, NBC, Adidas, Nike, Male, Female)
• Uses a latent space for each (feature, field) pair to generate the model
• Used to win first prize in three CTR competitions hosted by Criteo, Avazu, and Outbrain, as well as third prize in the RecSys Challenge 2015
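For reference, the standard FFM model (Juan et al., the formulation these slides describe) scores a sample x as

$$\phi(\mathbf{w}, \mathbf{x}) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} \langle \mathbf{w}_{j_1, f_2}, \mathbf{w}_{j_2, f_1} \rangle \, x_{j_1} x_{j_2}$$

where $f_1$ and $f_2$ are the fields of features $j_1$ and $j_2$. Each feature keeps a separate latent vector per field it interacts with (e.g. a dedicated vector for how Nike pairs with Publisher versus with Gender), which is exactly what lets FFM capture the feature conjunctions that logistic regression misses.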
46. Demo
47. More info
• Documentation: http://docs.h2o.ai
• Online training: http://learn.h2o.ai
• Tutorials: https://github.com/h2oai/h2o-tutorials
• Slide decks: https://github.com/h2oai/h2o-meetups
• Video presentations: https://www.youtube.com/user/0xdata
• Events & meetups: http://h2o.ai/events
• Code: http://github.com/h2oai/
• Questions: https://stackoverflow.com/questions/tagged/h2o4gpu and https://gitter.im/h2oai/{h2o-3,h2o4gpu}
48. Thank you! @mdymczyk, mateusz@h2o.ai
49. Q&A
