
SNAP MACHINE LEARNING

  1. Introduction to Snap Machine Learning. Andreea Anghel, Post-doctoral Researcher, IBM Research – Zurich. June 2018. © 2018 IBM Corporation.
  2. Contents
     Introduction: What are GLMs? Why are GLMs useful? What is Snap Machine Learning?
     Snap ML Architecture: Multi-level parallelism; Data-parallel framework; Inter-node communication; Intra-node communication
     Implementation: GPU acceleration; Out-of-core learning (DuHL); Streaming pipeline; Software architecture
     Code Examples: snap-ml-local; snap-ml-mpi; snap-ml-spark
     Experimental Results: Application and datasets; Single-node experiments; Out-of-core performance (PCIe); Out-of-core performance (NVLINK); Dealing with slow networks; Terabyte-scale benchmark
     Conclusions
  3. Introduction: What are GLMs? Why are GLMs useful? What is Snap Machine Learning?
  4. What are GLMs? Generalized Linear Models cover both classification and regression; examples include Ridge Regression and Lasso Regression (regression) and Logistic Regression and Support Vector Machines (classification). They share the training objective shown below.
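     All four models minimize a single regularized training objective; in standard notation (mine, not from the deck):

         \min_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} \ell\big(w^{\top} x_i,\, y_i\big) \;+\; \lambda\, r(w)

     Squared loss \ell(z,y) = (z-y)^2 with r(w) = \|w\|_2^2 gives ridge regression; squared loss with r(w) = \|w\|_1 gives lasso; logistic loss \ell(z,y) = \log(1 + e^{-yz}) gives logistic regression; hinge loss \ell(z,y) = \max(0,\, 1-yz) gives the linear SVM.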
  5. Why are GLMs useful?
     Fast training: can scale to datasets with billions of examples and/or features.
     Less tuning: state-of-the-art algorithms for training linear models do not involve a step-size parameter.
     Widely used in industry: the Kaggle "State of Data Science" survey asked 16,000 data scientists and ML practitioners what tools and algorithms they use on a daily basis; 63.5% of respondents use Logistic Regression, versus 37.6% who use Neural Networks.
     Interpretability: new data protection regulations in Europe (GDPR) give E.U. citizens the right to "obtain an explanation of a decision reached" by an algorithm.
  6. What is Snap Machine Learning? Snap ML: a new framework for fast training of GLMs. (ML = Machine Learning, DL = Deep Learning.)

     Framework           | Models        | GPU Acceleration | Distributed Training | Sparse Data Support
     scikit-learn        | ML (excl. DL) | No               | No                   | Yes
     Apache Spark* MLlib | ML (excl. DL) | No               | Yes                  | Yes
     TensorFlow**        | ML            | Yes              | Yes                  | Limited
     Snap ML             | GLMs          | Yes              | Yes                  | Yes

     * The Apache Software Foundation (ASF) owns all Apache-related trademarks, service marks, and graphic logos on behalf of our Apache project communities, and the names of all Apache projects are trademarks of the ASF. ** TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.
  7. Snap ML Architecture: Multi-level parallelism; Data-parallel framework; Inter-node communication; Intra-node communication.
  8. Multi-level Parallelism. Level 1: parallelism across nodes connected via a network interface. Level 2: parallelism across GPUs within the same node, connected via an interconnect (e.g. NVLINK). Level 3: parallelism across the streaming multiprocessors of the GPU hardware.
  9. Data-Parallel Framework. A large dataset is first partitioned across nodes (Partition 0 on Node 0, Partition 1 on Node 1); within each node, the local partition is split again across the four GPUs, e.g. Partition (0,0) through Partition (0,3) on GPU 0 through GPU 3 of Node 0.
  10. Inter-node communication: Node 0 through Node 3 communicate over a 10 Gbit/s network.
  11. Intra-node communication. (POWER is a registered trademark of International Business Machines Corporation in the United States, other countries, or both.)
  12. Snap ML Implementation: GPU acceleration; Out-of-core learning (DuHL); Streaming pipeline; Software architecture.
  13. GPU Acceleration. Each local sub-task can be solved efficiently using stochastic coordinate descent (SCD) [Shalev-Shwartz 2013]. Recently, asynchronous variants of SCD have been proposed to run on multi-core CPUs [Liu 2013] [Tran 2015] [Hsieh 2015]. Twice-parallel asynchronous SCD (TPA-SCD) is another recent variant, designed to run on GPUs: it assigns each coordinate update to a different block of threads that execute asynchronously, and within each thread block the coordinate update is computed using many tightly coupled threads; a reference implementation of the underlying coordinate update is sketched below. [T. Parnell, C. Dünner, K. Atasu, M. Sifalakis and H. Pozidis, "Large-scale stochastic learning using GPUs", ParLearning 2017]
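     As a concrete reference point, here is a minimal single-threaded SCD for ridge regression in Python; this is my illustration of the basic coordinate update, not Snap ML code:

        import numpy as np

        def scd_ridge(X, y, lam=1.0, epochs=10, seed=0):
            # Plain single-threaded SCD for ridge regression, illustration only:
            # TPA-SCD runs many such coordinate updates asynchronously, one per
            # GPU thread block, with the threads of a block cooperating on each.
            rng = np.random.default_rng(seed)
            n, d = X.shape
            w = np.zeros(d)
            residual = y - X @ w           # maintained incrementally
            col_sq = (X ** 2).sum(axis=0)  # per-coordinate squared column norms
            for _ in range(epochs):
                for j in rng.permutation(d):
                    # closed-form minimizer of the ridge objective along coordinate j
                    w_j = (X[:, j] @ residual + col_sq[j] * w[j]) / (col_sq[j] + lam)
                    residual += X[:, j] * (w[j] - w_j)
                    w[j] = w_j
            return w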
  14. Out-of-Core Learning (DuHL). The local data lives in CPU main memory; only a subset of the data fits in GPU memory. The CPU randomly samples points and computes their duality gaps, periodically identifies the set of points with the largest gaps, and ensures these points are in GPU memory. The GPU updates the model using the subset of the data currently in its memory, and the model is periodically copied back to the CPU. A schematic of this loop follows. [C. Dünner, T. Parnell and M. Jaggi, "Efficient Use of Limited-Memory Resources to Accelerate Linear Learning", NIPS 2017]
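     A schematic of the DuHL selection loop in Python; train_on_gpu and duality_gap are hypothetical placeholders standing in for Snap ML's GPU solver and per-point gap computation:

        import numpy as np

        def duhl_train(X, y, gpu_capacity, n_rounds, train_on_gpu, duality_gap):
            # Schematic DuHL loop: keep the gpu_capacity points with the
            # largest duality gaps resident in GPU memory.
            n = X.shape[0]
            w = np.zeros(X.shape[1])
            rng = np.random.default_rng(0)
            active = rng.choice(n, size=gpu_capacity, replace=False)  # initial subset
            for _ in range(n_rounds):
                # GPU: update the model using the subset currently in device memory
                w = train_on_gpu(X[active], y[active], w)
                # CPU: randomly sample points and compute their duality gaps
                sample = rng.choice(n, size=min(n, 10 * gpu_capacity), replace=False)
                gaps = duality_gap(X[sample], y[sample], w)
                # refresh the GPU subset with the largest-gap points
                active = sample[np.argsort(gaps)[-gpu_capacity:]]
            return w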
  15. Streaming Pipeline. While DuHL can minimize the amount of data transferred between CPU and GPU, the transfer can still become a bottleneck, particularly at the beginning of training. Using CUDA streams, we have implemented a training pipeline whereby the next set of data is copied onto the GPU while the current set is being trained. TPA-SCD also requires a random permutation of the coordinates that are in GPU memory at any given time. To achieve this we introduce a third pipeline stage in which the CPU generates a set of random numbers for the points that are currently being copied; in the next stage, these random numbers are copied onto the GPU and sorted to produce the desired permutation. (Pipeline diagram: for each chunk i, "Train (i)" on the GPU overlaps with "Copy data (i+1)" and "Copy RN (i+1)" over the interconnect, while the CPU runs "RNG (i+1)" and the GPU sorts "RN (i)".) A sketch of this overlap follows.
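     A minimal sketch of the overlap, using a worker thread in place of CUDA streams; stage_chunk and train_chunk are hypothetical placeholders:

        from concurrent.futures import ThreadPoolExecutor

        def streaming_train(chunks, model, stage_chunk, train_chunk):
            # stage_chunk(chunk): CPU generates the random numbers for the chunk
            # and copies both chunk and random numbers onto the GPU (hypothetical).
            # train_chunk(staged, model): sorts the random numbers on the GPU to
            # obtain the permutation, then runs TPA-SCD on the chunk (hypothetical).
            with ThreadPoolExecutor(max_workers=1) as pool:
                pending = pool.submit(stage_chunk, chunks[0])
                for i in range(len(chunks)):
                    staged = pending.result()  # wait until chunk i is on the GPU
                    if i + 1 < len(chunks):
                        # overlap: stage chunk i+1 while chunk i is training
                        pending = pool.submit(stage_chunk, chunks[i + 1])
                    model = train_chunk(staged, model)
            return model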
  16. Software architecture.
     libglm: underlying C++/CUDA template library.
     snap-ml-local: small to medium-scale data; single-node deployment; multi-GPU support; scikit-learn-compatible Python API.
     snap-ml-mpi: large-scale data; multi-node deployment in HPC environments; many-GPU support; Python API.
     snap-ml-spark: large-scale data; multi-node deployment in Apache Spark environments; many-GPU support; Python/Scala/Java* API.
     * Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
  17. Code Examples: snap-ml-local; snap-ml-mpi; snap-ml-spark.
  18. Example: snap-ml-local. Accelerate existing scikit-learn applications by changing only 2 lines of code, as in the sketch below.
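     The slide's code screenshot did not survive the export. A minimal sketch of what the 2-line change could look like; the snap_ml import path and the use_gpu flag are assumptions, not taken from the slide:

        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split

        # line 1 of the 2-line change: swap the scikit-learn import for Snap ML
        # from sklearn.linear_model import LogisticRegression
        from snap_ml import LogisticRegression  # module name is an assumption

        X, y = make_classification(n_samples=10000, n_features=100)
        X_train, X_test, y_train, y_test = train_test_split(X, y)

        # line 2: construct the Snap ML estimator (use_gpu flag is an assumption)
        clf = LogisticRegression(use_gpu=True)
        clf.fit(X_train, y_train)
        print(clf.predict(X_test)[:10])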
  19. Example: snap-ml-mpi. Describe the application using high-level Python code, then launch it on 4 nodes using mpirun (4 GPUs per node); a sketch follows.
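     The screenshot is missing here too; a hypothetical sketch of the application script and launch command (the snap_ml_mpi import path, constructor arguments, and mpirun flags are all assumptions):

        # train.py
        from sklearn.datasets import load_svmlight_file
        from snap_ml_mpi import LogisticRegression  # assumed import path

        X, y = load_svmlight_file("criteo_train.svm")  # path is illustrative
        clf = LogisticRegression(use_gpu=True)         # flag is an assumption
        clf.fit(X, y)

        # Launch on 4 nodes using mpirun, 4 GPUs per node (flags illustrative):
        #   mpirun -np 4 --hostfile hosts.txt python train.py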
  20. Example: snap-ml-spark. Describe the application using high-level Python code, then launch it on a Spark cluster (1 GPU per Spark executor); a sketch follows.
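     A hypothetical sketch of the Spark variant, submitted with spark-submit; the snap_ml_spark module and class names are assumptions:

        # train.py
        from pyspark.sql import SparkSession
        from snap_ml_spark import LogisticRegression  # assumed import path

        spark = SparkSession.builder.appName("snap-ml-ctr").getOrCreate()
        data = spark.read.format("libsvm").load("criteo_train.svm")  # illustrative path

        clf = LogisticRegression(use_gpu=True)  # flag is an assumption
        model = clf.fit(data)

        # Launch on a Spark cluster, 1 GPU per executor (flags illustrative):
        #   spark-submit --master yarn --num-executors 16 train.py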
  21. Experimental Results: Application and datasets; Single-node experiments; Out-of-core performance (PCIe); Out-of-core performance (NVLINK); Dealing with slow networks; Terabyte-scale benchmark.
  22. Click-through Rate (CTR) Prediction. Can we train ML models to predict whether or not a user will click on an advert? CTR prediction is the core business of many internet giants, and labelled training examples are generated in real time.
     Dataset: criteo-kaggle. Number of features: 1 million. Number of examples: 45 million. Size: 40 Gigabytes (SVM Light).
     Dataset: criteo-tb. Number of features: 1 million. Number of examples: 4.2 billion. Size: 3 Terabytes (SVM Light).
  23. Single-node Performance.
     Dataset       | Examples   | Nodes                 | Total GPUs | Network | CPU-GPU interface
     criteo-kaggle | 45 million | 1x Power AC922 server | 1x V100    | n/a     | NVLINK 2.0
  24. Out-of-core performance (PCIe Gen3). (Pipeline timing: in each stage, training chunk (i) on the GPU takes 90 ms while copying chunk (i+1) onto the GPU takes 318 ms; with a 12 ms init, each stage takes 330 ms in total, so the pipeline is bound by the PCIe copy.)
     Dataset   | Examples    | Nodes                    | Total GPUs | Network | CPU-GPU interface
     criteo-tb | 200 million | 1x Intel Xeon* Gold 6150 | 1x V100    | n/a     | PCIe Gen3
     * Intel Xeon is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
  25. Out-of-core performance (NVLINK 2.0). (Pipeline timing: copying the next chunk takes 55 ms and is fully hidden behind the 90 ms of training; with a 3 ms init, each stage takes 93 ms in total, so the pipeline is bound by compute rather than transfer. A quick arithmetic check follows.)
     Dataset   | Examples    | Nodes                 | Total GPUs | Network | CPU-GPU interface
     criteo-tb | 200 million | 1x Power AC922 server | 1x V100    | n/a     | NVLINK 2.0
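     A back-of-envelope check of the per-chunk times on the two slides: the steady-state stage time is the slower of train and copy, plus the init overhead.

        # Per-chunk stage times read off slides 24 and 25
        pcie   = {"train_ms": 90, "copy_ms": 318, "init_ms": 12}
        nvlink = {"train_ms": 90, "copy_ms": 55,  "init_ms": 3}

        for name, t in [("PCIe Gen3", pcie), ("NVLINK 2.0", nvlink)]:
            # train and copy overlap, so the slower one sets the stage time
            stage = max(t["train_ms"], t["copy_ms"]) + t["init_ms"]
            bound = "copy-bound" if t["copy_ms"] > t["train_ms"] else "compute-bound"
            print(f"{name}: {stage} ms per chunk ({bound})")
        # PCIe Gen3: 330 ms per chunk (copy-bound)
        # NVLINK 2.0: 93 ms per chunk (compute-bound)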
  26. Dealing with slow networks.
     Dataset   | Examples  | Nodes                 | Total GPUs | Network        | CPU-GPU interface
     criteo-tb | 1 billion | 4x Power AC922 server | 16x V100   | 1Gbit Ethernet | NVLINK 2.0
     criteo-tb | 1 billion | 4x Power AC922 server | 16x V100   | InfiniBand     | NVLINK 2.0
  27. Terabyte-scale Benchmark.
     Dataset   | Examples    | Nodes                 | Total GPUs | Network    | CPU-GPU interface
     criteo-tb | 4.2 billion | 4x Power AC922 server | 16x V100   | InfiniBand | NVLINK 2.0
  28. Conclusions. Snap ML is a new framework for fast training of GLMs. It benefits from GPU acceleration and can be deployed in single-node and multi-node environments. The hierarchical structure of the framework makes it suitable for cloud-based deployments. It can leverage fast interconnects like NVLINK to achieve streaming, out-of-core performance. Snap ML significantly outperforms other software frameworks for training GLMs in both single-node and multi-node benchmarks; it can train a logistic regression classifier on the Criteo Terabyte Click Logs data in 1.5 minutes. Snap ML will be available to try in June as part of IBM PowerAI (Tech Preview).
     The Snap ML team: Celestine Dünner, Dimitrios Sarigiannis, Andreea Anghel, Nikolas Ioannou, Haris Pozidis, Thomas Parnell.
     https://www.zurich.ibm.com/snapml/