- 1. Machine Learning at the Limit John Canny*^ * Computer Science Division University of California, Berkeley ^ Yahoo Research Labs @Alpine Labs, March, 2015
- 2. Outline Scaling Inward: • What are the performance limits for machine learning on single machines, and how close can we get to them? (Single-node Roofline Design) Scaling Out: • What happens when we try to scale up from optimized single nodes to clusters? (Roofline Design for Clusters)
- 3. My Other Job(s) Yahoo [Chen, Pavlov, Canny, KDD 2009]* Ebay [Chen, Canny, SIGIR 2011]** Quantcast 2011-2013 Microsoft 2014 Yahoo 2015 * Best application paper prize ** Best paper honorable mention
- 4. Data Scientist’s Workflow Digging Around in Data Hypothesize Model Customize Large Scale Exploitation Evaluate Interpret Sandbox Production
- 6. Why Build a New ML Toolkit? • Performance: GPU performance pulling away from other platforms for *sparse* and dense data. Minibatch + SGD methods dominant on Big Data,… • Customizability: Great value in customizing models (loss functions, constraints,…) • Explore/Deploy: Explore fast, run the same code in prototype and production. Be able to run on clusters.
- 7. Desiderata • Performance: • Roofline Design (single machine and cluster) • General Matrix Library with full CPU/GPU acceleration • Customizability: • Modular Learner Architecture (reusable components) • Likelihood “Mixins” • Explore/Deploy: • Interactive, Scriptable, Graphical • JVM based (Scala) w/ optimal cluster primitives
- 8. Outline Scaling Inward: • What are the performance limits for machine learning on single machines, and how close can we get to them? (Single-node Roofline Design) Scaling Out: • What happens when we try to scale up from optimized single nodes to clusters? (Roofline Design for Clusters)
- 9. Roofline Design (Williams, Waterman, Patterson, 2009) • Roofline design establishes fundamental performance limits for a computational kernel. [Figure: roofline plot of throughput (Gflops) against operational intensity (flops/byte), log-log axes, with GPU and CPU ALU throughput ceilings]
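
The roofline bound can be sketched in a few lines. This is a minimal illustration, not code from BIDMach: the function name `attainable_gflops` is hypothetical, and the peak and bandwidth figures are assumptions chosen to resemble the GPU numbers quoted elsewhere in this deck (about 2 Tflops peak, about 200 GB/s main-memory bandwidth).

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: a kernel is limited either by the ALUs or by memory traffic.

    intensity     -- operational intensity of the kernel, in flops per byte
    peak_gflops   -- ALU ceiling of the device, in Gflops
    bandwidth_gbs -- memory bandwidth of the device, in GB/s
    """
    return min(peak_gflops, intensity * bandwidth_gbs)


# Assumed, illustrative hardware numbers (not measurements).
GPU_PEAK_GFLOPS = 2000.0
GPU_BANDWIDTH_GBS = 200.0

for intensity in (0.25, 1.0, 10.0, 100.0):
    bound = attainable_gflops(intensity, GPU_PEAK_GFLOPS, GPU_BANDWIDTH_GBS)
    print(f"{intensity:>6} flops/byte -> at most {bound} Gflops")
```

Below the "ridge point" (peak divided by bandwidth, 10 flops/byte with these numbers) the kernel is memory bound; above it, the ALU ceiling applies.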
- 10. A Tale of Two Architectures [Figure: block diagrams of an Intel CPU and an NVIDIA GPU; labels: Memory Controller, L3 Cache, Core/ALU (x4), L2 Cache, ALU (x6)]
- 11. CPU vs GPU Memory Hierarchy • Intel 8-core Sandy Bridge CPU: 4 kB registers, 512K L1 cache, 2 MB L2 cache, 8 MB L3 cache, 10s of GB main memory • NVIDIA GK110 GPU: 4 MB register file (!), 1 MB shared memory, 1 MB constant memory, 1.5 MB L2 cache, 4 GB main memory • Bandwidth labels in the figure: 40 TB/s, 13 TB/s, 5 TB/s, 1 TB/s, 1 TB/s, 500 GB/s, 500 GB/s, 200 GB/s, 20 GB/s
- 12. Roofline Design – Matrix kernels • Dense matrix multiply • Sparse matrix multiply [Figure: roofline plot of throughput (Gflops) against operational intensity (flops/byte), log-log axes, with GPU and CPU ALU throughput ceilings]
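
To see why the two kernels land at opposite ends of the roofline plot, one can estimate their operational intensity. The sketch below rests on simplifying assumptions that are not from the slides: single-precision values (4 bytes), 4-byte column indices, each dense matrix read or written exactly once, and a sparse matrix-vector product as the simplest sparse case. The function names are hypothetical.

```python
def dense_matmul_intensity(n, bytes_per_value=4):
    """Flops per byte for an n x n by n x n dense multiply with ideal reuse."""
    flops = 2 * n ** 3                          # one multiply and one add per inner step
    bytes_moved = 3 * n * n * bytes_per_value   # read A, read B, write C once each
    return flops / bytes_moved


def sparse_matvec_intensity(bytes_per_value=4, bytes_per_index=4):
    """Flops per byte for a sparse matrix-vector product, counted per non-zero."""
    flops = 2                                   # one multiply and one add per non-zero
    bytes_moved = bytes_per_value + bytes_per_index
    return flops / bytes_moved


print(dense_matmul_intensity(1200))   # grows linearly with n
print(sparse_matvec_intensity())      # constant, well under 1 flop/byte
```

Under these assumptions the dense kernel can be made compute bound by choosing large blocks, while the sparse kernel stays memory bound regardless of problem size.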
- 13. A Rooflined Machine Learning Toolkit (Zhao+Canny SIAM DM 13, KDD 13, BIGLearn 13) [Figure: a DataSource (JBOD disks, memory, or HDFS over network) feeds data blocks through CPU host code to the Learner; each GPU runs in its own thread (GPU 1 / thread 1, GPU 2 / thread 2, ...) with its own Model, Optimizer and Mixins] • Compressed disk streaming at ~0.1-2 GB/s • 100 HDFS nodes • 30 Gflops to 2 Teraflops per GPU
- 14. Matrix + Machine Learning Layers Written in the beautiful Scala language: • Interpreter with JIT, scriptable. • Open syntax (+, -, *, etc.), so math looks like math. • Java VM + Java codebase – runs on Hadoop, Yarn, Spark. • Hardware acceleration in C/C++ native code (CPU/GPU). • Easy parallelism: Actors, parallel collections. • Memory management (sort of). • Pre-built for multiple platforms (Windows, MacOS, Linux). Experience similar to Matlab, R, SciPy.
- 15. Benchmarks Recent benchmarks on some representative tasks: • Text Classification on Reuters news data (0.5 GB) • Click prediction on the Kaggle Criteo dataset (12 GB) • Clustering of handwritten digit images (MNIST) (25 GB) • Collaborative filtering on the Netflix prize dataset (4 GB) • Topic modeling (LDA) on a NY times collection (0.5 GB) • Random Forests on a UCI Year Prediction dataset (0.2 GB) • Pagerank on two social network graphs at 12GB and 48GB
- 16. Benchmarks Systems (single node) • BIDMach • VW (Vowpal Wabbit) from Yahoo/Microsoft • Scikit-Learn • LibLinear Cluster Systems • Spark v1.1 and v1.2 • Graphlab (academic version) • Yahoo’s LDA cluster
- 17. Benchmarks: Single-Machine Systems

  | System | Algorithm | Dataset | Dim | Time (s) | Cost ($) | Energy (KJ) |
  |---|---|---|---|---|---|---|
  | BIDMach | Logistic Reg. | RCV1 | 103 | 14 | 0.002 | 3 |
  | Vowpal Wabbit | Logistic Reg. | RCV1 | 103 | 130 | 0.02 | 30 |
  | LibLinear | Logistic Reg. | RCV1 | 103 | 250 | 0.04 | 60 |
  | Scikit-Learn | Logistic Reg. | RCV1 | 103 | 576 | 0.08 | 120 |

  RCV1: Text Classification, 103 topics (0.5GB). Algorithms were tuned to achieve similar accuracy.
- 18. Benchmarks: Cluster Systems

  | System A / B | Algorithm | Dataset | Dim (A / B) | Time (s) (A / B) | Cost ($) (A / B) | Energy (KJ) (A / B) |
  |---|---|---|---|---|---|---|
  | Spark-72 / BIDMach | Logistic Reg. | RCV1 | 1 / 103 | 30 / 14 | 0.07 / 0.002 | 120 / 3 |
  | Spark-64 / BIDMach | RandomForest | YearPred | 1 | 280 / 320 | 0.48 / 0.05 | 480 / 60 |
  | Spark-128 / BIDMach | Logistic Reg. | Criteo | 1 | 400 / 81 | 1.40 / 0.01 | 2500 / 16 |

  Spark-XX = System with XX cores. BIDMach ran on one node with GTX-680 GPU.
- 19. Benchmarks: Cluster Systems

  | System A / B | Algorithm | Dataset | Dim | Time (s) (A / B) | Cost ($) (A / B) | Energy (KJ) (A / B) |
  |---|---|---|---|---|---|---|
  | Spark-384 / BIDMach | K-Means | MNIST | 4096 | 1100 / 735 | 9.00 / 0.12 | 22k / 140 |
  | GraphLab-576 / BIDMach | Matrix Factorization | Netflix | 100 | 376 / 90 | 16 / 0.015 | 10k / 20 |
  | Yahoo-1000 / BIDMach | LDA (Gibbs) | NYtimes | 1024 | 220k / 300k | 40k / 60 | 4E10 / 6E7 |

  Spark-XX or GraphLab-XX = System with XX cores. Yahoo-1000 had 1000 nodes.
- 20. BIDMach at Scale Latent Dirichlet Allocation BIDMach outperforms cluster systems on this problem, and has run up to 10 TB on one node. Convergence on 1TB data
- 21. Benchmark Summary • BIDMach is at least 10x faster than other single-machine systems for comparable accuracy. • For Random Forests or single-class regression, BIDMach on a GPU node is comparable with 8-16 worker clusters. • For multi-class regression, factor models, clustering etc., GPU-assisted BIDMach is comparable to 100-1000-worker clusters. Larger problems correlate with larger values in this range.
- 22. Cost and Energy Use • BIDMach had a 10x-1000x cost advantage over the other systems. The ratio was higher for larger-scale problems. • Energy savings were similar to the cost savings, at 10x-1000x.
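
The quoted range can be checked against the cluster benchmark slides (18 and 19). The snippet below simply divides the cluster cost by the BIDMach cost for each row; it assumes that, in each A/B pair on those slides, the first value belongs to the cluster system and the second to BIDMach. The dictionary name `COSTS` is hypothetical.

```python
# (cluster cost in $, BIDMach cost in $), as read from slides 18 and 19.
COSTS = {
    "Logistic Reg. on RCV1 (Spark-72)": (0.07, 0.002),
    "RandomForest on YearPred (Spark-64)": (0.48, 0.05),
    "Logistic Reg. on Criteo (Spark-128)": (1.40, 0.01),
    "K-Means on MNIST (Spark-384)": (9.00, 0.12),
    "Matrix Factorization on Netflix (GraphLab-576)": (16.0, 0.015),
    "LDA on NYtimes (Yahoo-1000)": (40000.0, 60.0),
}

# Cost advantage of BIDMach = cluster cost divided by BIDMach cost.
cost_ratios = {task: cluster / bidmach for task, (cluster, bidmach) in COSTS.items()}

for task, ratio in sorted(cost_ratios.items(), key=lambda kv: kv[1]):
    print(f"{task}: about {ratio:.0f}x cheaper")
```

Under that reading, the ratios run from roughly 10x (Random Forests) to roughly 1000x (matrix factorization).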
- 23. In the Wild (Examples from Industry) • Multilabel regression problem (summer intern project): • Existing tool (single-machine) took ~ 1 week to build a model. • BIDMach on a GPU node takes 1 hour (120x speedup) • Iteration and feature engineering gave +15% accuracy. • Auction simulation problem (cluster job): • Existing tool simulates auction variations on log data. • On NVIDIA 3.0 devices (64 registers/thread) we achieve a 70x speedup over a reference implementation in Scala • On NVIDIA 3.5 devices (256 registers/thread) we can move auction state entirely into register storage and gain a 400x speedup.
- 24. In the Wild (Examples from Industry) • Classification (cluster job): • Cluster job (logistic regression) took 8 hours. • BIDMach version takes < 1 hour on a single node. • SVMs for image classification (single machine) • Large multi-label classification took 1 week with LibSVM. • BIDMach version (SGD-based SVM) took 90 seconds.
- 25. Performance Revisited • BIDMach had a 10x-1000x cost advantage over the other systems. The ratio was higher for larger-scale problems. • Energy savings were similar to the cost savings, at 10x-1000x. But why?? • We only expect about 10x from GPU acceleration?
- 26. The Exponential Growth of O(1) Once upon a Time… a = b + c // single instruction y = A[i] // single instruction Now five orders of magnitude
- 27. The Exponential Growth of O(1) By slight abuse of notation, let O(1) denote the ratio of times for a main memory access vs an ALU operation. Then O(1) is not only not “small,” but it's growing exponentially with time. A lot of non-rooflined code is too “memory-hungry” – you can often do (much) better with custom, memory-aware kernels. [Figure: log(memory access time / ALU op time) against time (years)]
- 28. Performance Creep • Code that isn't explicitly rooflined seems to be creeping further and further from its roofline limit. • With Intel’s profiling tools we can check the throughput of a variety of running programs: often 10s–100s of Mflops, i.e. memory bound. • Need well-designed kernels (probably hardware-aware code) to get to the limits. • Coding for performance: • Avoid: large hash tables, large binary trees, linked data structures. • Use: sorting, memory B-trees, packed data structures.
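
As an illustration of the "avoid hash tables, use sorting" advice, the sketch below counts key frequencies in two ways. It is only a structural illustration with hypothetical function names: in pure Python both versions go through the interpreter, so the memory-access benefit the slide describes would only show up in a native kernel, where the sorted version scans a packed array sequentially while the hashed version makes scattered accesses into a large table.

```python
from itertools import groupby


def count_with_hash(keys):
    """Count occurrences with a hash table: random access into a growing table."""
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    return counts


def count_with_sort(keys):
    """Count occurrences by sorting, then scanning runs of equal keys in order."""
    return {key: sum(1 for _ in run) for key, run in groupby(sorted(keys))}


feature_ids = [7, 3, 7, 1, 3, 7, 9]
print(count_with_hash(feature_ids))
print(count_with_sort(feature_ids))
```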
- 29. BIDMach ML Algorithms 1. Regression (logistic, linear) 2. Support Vector Machines 3. k-Means Clustering 4. Topic Modeling - Latent Dirichlet Allocation 5. Collaborative Filtering 6. NMF – Non-Negative Matrix Factorization 7. Factorization Machines 8. Random Forests 9. Multi-layer neural networks 10. IPTW (Causal Estimation) 11. ICA [Legend: highlighted entries = likely the fastest implementation available]
- 30. CPU vs GPU Memory Hierarchy • Intel 8-core Sandy Bridge CPU: 4 kB registers, 512K L1 cache, 2 MB L2 cache, 8 MB L3 cache, 10s of GB main memory • NVIDIA GK110 GPU: 4 MB register file (!), 1 MB shared memory, 1 MB constant memory, 1.5 MB L2 cache, 4 GB main memory • Bandwidth labels in the figure: 40 TB/s, 13 TB/s, 5 TB/s, 1 TB/s, 1 TB/s, 500 GB/s, 500 GB/s, 200 GB/s, 20 GB/s
- 31. CoDesign: Gibbs Sampling • Gibbs sampling is a very general method for inference in machine learning, but typically slow. • But it’s a particularly good match for SIMT hardware (GPUs), not difficult to get 10-100x speedups. • But many popular models have slow convergence: • Latent Dirichlet Allocation • 10 passes of online Variational Bayes ≈ • 1000 passes for collapsed Gibbs sampling
- 32. Consequences LDA graphical model: • Topic samples Z are discrete and sparse: very few samples for a specific topic in a particular document. • This gives a high-variance random walk through parameter space before the sampler converges. [Figure: LDA graphical model; plate label: Documents]
- 33. Gibbs Parameter Estimation Gibbs sampling in ML is often applied to distributions of the form $P(D, X, \Theta)$, where $D$ is the data, $\Theta$ are the parameters of the model, and $X$ are other hidden states we would like to integrate over. E.g. for LDA, $\Theta$ are the document-topic and topic-word parameters and $X$ are the assignments of words to topics (written $Z$ on the previous slide). • Gibbs sampling most often simply samples over both, which is slow. • But we really want to optimize over $\Theta$ and marginalize (integrate) over $X$.
- 34. State-Augmented Marginal Estimation (SAME) – Doucet et al. 2002 The marginal parameter distribution is $P(\Theta) = \int P(X, \Theta)\,dX$. We can sharpen it to $P^k(\Theta)$ as follows: $P^k(\Theta) = \int_X \int_Y \cdots \int_Z P(X, \Theta)\,P(Y, \Theta) \cdots P(Z, \Theta)\,dX\,dY \cdots dZ$. $P^k(\Theta)$ has the same peaks as $P(\Theta)$ and corresponds to a Gibbs distribution cooled to $T = 1/k$. We modify a standard, blocked Gibbs sampler to take $k$ samples instead of 1 for $X$, and then recompute $\Theta$ from these samples.
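
The sharpening effect is easy to see on a toy discrete distribution. Because the k replicated integrals factor, the sharpened marginal is the k-th power of the original marginal, up to normalization. The sketch below is not a SAME sampler; it only raises an assumed three-point distribution over parameter settings to the power k and renormalizes. The function name `sharpen` and the example probabilities are hypothetical.

```python
def sharpen(probabilities, k):
    """Raise a discrete distribution to the k-th power and renormalize (cooling to T = 1/k)."""
    powered = [p ** k for p in probabilities]
    total = sum(powered)
    return [p / total for p in powered]


marginal = [0.2, 0.5, 0.3]          # assumed toy marginal over three parameter settings
for k in (1, 2, 10):
    cooled = sharpen(marginal, k)
    print(k, [round(p, 4) for p in cooled])
```

The peak stays at the same parameter setting for every k, while the probability mass concentrates on it as k grows, which is what makes sampling behave more like optimization.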
- 35. SAME Sampling In the language of graphical models: run independent simulations with tied parameters Θ.
- 36. SAME sampling as cooling What cooling does: [Figure: likelihood P(Θ) in model parameter space (peaks are good models)]
- 39. Making SAME sampling fast The approach in general: • Wherever you would take 1 sample before, take k independent samples and average for parameter estimation. Instead of taking distinct samples, take counts of anonymous samples, so that complexity is independent of k. • Exact for LDA, HMM, Factorization Machines,… • Corresponds to a factored joint approximation for other models. Varying k allows for annealing optimization over Θ.
- 40. Results on LDA • We use k=100, make 10 passes over the dataset (effectively taking 1000 total samples). • Our SAME Gibbs sampler is within a factor of 3 of online Variational Bayes. More than 100x faster than standard Gibbs samplers. • The model was more accurate than any other LDA method we tested:
- 41. SAME sampling in general • SAME sampling is an alternative to other message-passing methods (VMP etc.) • But rather than using bounds on posterior distributions it: • Samples from the distribution of “uninteresting” variables • Sharpens the posteriors on parameter variables.
- 42. Current Work Inference on general graphical models with Gibbs sampling. Larger graphical models arise in many applications but are discarded for performance reasons. Existing tools (JAGS, Stan, Infer.net) are far from the roofline for this problem.
- 43. Inference for Bayesian Networks • (Main memory) Rooflined Approach: Sampling distribution can be computed with an SpMM at 1-2 Gflops on CPU and 5-10 Gflops on GPU. • Experiment on MOOC learning concept graph with ~300 nodes and ~350 edges, 3000 samples
- 44. Getting to Warp Speed The model for this problem (CPTs for all nodes) fits comfortably in GPU SHMEM. The state of thousands of particles fits comfortably in GPU register memory. It should be possible to gain another 10x to 100x in performance with this design, and approach ALU limits. 4 MB registers 1 MB Shared Mem
- 45. Outline Scaling Inward: • What are the performance limits for machine learning on single machines, and how close can we get to them? (Single-node Roofline Design) Scaling Out: • What happens when we try to scale up from optimized single nodes to clusters? (Roofline Design for Clusters)
- 46. Challenges with Scaling MB algorithms Can we combine node acceleration with clustering? It's harder than it seems. [Figure: a single-node minibatch learner applies minibatch updates from one dataset; a cluster-based learner partitions the dataset across nodes and combines updates with a Reduce (Allreduce)]
- 47. Allreduce Every node j has data Uj (a local copy of a global model). We want to compute an aggregate (e.g. sum) of all data vectors and distribute it to all nodes, i.e. to synchronize the model across nodes.
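
The semantics of the operation (not an efficient implementation) can be written down directly. In this minimal sketch, `allreduce_sum` is a hypothetical name and the "nodes" are just entries in a Python list; a real Allreduce would exchange messages over the network.

```python
def allreduce_sum(node_vectors):
    """Return, for every node, the element-wise sum of all nodes' vectors."""
    total = [sum(column) for column in zip(*node_vectors)]
    # Every node receives its own copy of the aggregate.
    return [list(total) for _ in node_vectors]


local_models = [
    [1.0, 0.0, 2.0],   # U_0
    [0.5, 1.0, 0.0],   # U_1
    [0.0, 3.0, 1.0],   # U_2
]
synchronized = allreduce_sum(local_models)
print(synchronized[0])
```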
- 48. Why Allreduce? • Speed: Minibatch SGD and related algorithms are the fastest way to solve many of these problems (deep learning, CF, topic modeling,…). But dataset sizes (several TB) make single machine training slow (> 1 day). • The more updates/sec the faster the method converges, but: • In practice minibatch dof ~ model dof is near-optimal. • That means the ideal B/W for model updates is at least the rate of data consumption (several Gb/s). • In other words, we want an Allreduce that consumes data at roughly the full NIC bandwidth.
- 49. Why Allreduce? • Scale: Some models won't fit on one machine (e.g. Pagerank). • Other models will fit but minibatch updates only touch a small fraction of features (Click Prediction). Updating only those is much faster than a full Allreduce. • Sparse Allreduce: Each node j pushes updates to a subset of features Uj and pulls a subset Vj from a distributed model.
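
The push/pull interface of a Sparse Allreduce can be sketched as follows. This shows only the semantics under simplifying assumptions: sparse vectors are Python dicts from feature index to value, the aggregate is a sum, and the "distributed model" is a single in-process dict. The name `sparse_allreduce` is hypothetical and does not refer to the Kylix API.

```python
def sparse_allreduce(pushes, pulls):
    """Sum sparse updates from all nodes, then return to each node only what it asked for.

    pushes -- one dict per node: feature index -> update value (the U_j)
    pulls  -- one iterable per node: feature indices the node wants back (the V_j)
    """
    model = {}
    for update in pushes:
        for index, value in update.items():
            model[index] = model.get(index, 0.0) + value
    # Indices nobody pushed are reported as 0.0.
    return [{index: model.get(index, 0.0) for index in wanted} for wanted in pulls]


pushes = [{0: 1.0, 5: 2.0}, {5: 0.5, 9: 1.0}]
pulls = [[5, 9], [0, 7]]
print(sparse_allreduce(pushes, pulls))
```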
- 50. Why Allreduce? • Worker driven (emergency room model) vs. aggregator-driven (doctor’s surgery)? • “Excuse me, worker task, the aggregator will consume your data now”
- 51. Lower Bounds for total message B/W. • Intuitively, every node needs to send D values somewhere, and receive D values for the final aggregate. • The data from each node needs to be caught somewhere if allreduce happens in the same network, and the final aggregate message needs to originate from somewhere. • This suggests a naïve lower bound of 2D values in or out of each node, or 2DN for an N-node network. • In fact more careful analysis shows the lower bound is 2D(N-1) for total bandwidth, and ~2D/B for latency (Patarasuk et al., “Bandwidth optimal all-reduce algorithms for clusters of workstations”).
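
One well-known algorithm that meets the 2D(N-1) total-bandwidth bound is the ring Allreduce (a reduce-scatter pass followed by an allgather pass). The simulation below is a sketch of that technique, not code from the talk: nodes are list entries, "sending" is a list copy, and D is assumed to be divisible by N. It counts the values transferred so the total can be compared with the bound.

```python
def ring_allreduce(vectors):
    """Simulate a ring Allreduce (sum). Returns (per-node results, values sent in total)."""
    n, d = len(vectors), len(vectors[0])
    assert d % n == 0, "sketch assumes the vector length is divisible by the node count"
    size = d // n
    data = [list(v) for v in vectors]
    sent = 0

    def chunk(k):
        return range(k * size, (k + 1) * size)

    # Reduce-scatter: after N-1 steps node i holds the full sum of chunk (i+1) mod N.
    for step in range(n - 1):
        messages = [((i + 1) % n, (i - step) % n) for i in range(n)]
        payloads = [[data[i][j] for j in chunk(k)] for i, (_, k) in enumerate(messages)]
        for (dst, k), payload in zip(messages, payloads):
            for j, value in zip(chunk(k), payload):
                data[dst][j] += value
            sent += len(payload)

    # Allgather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        messages = [((i + 1) % n, (i + 1 - step) % n) for i in range(n)]
        payloads = [[data[i][j] for j in chunk(k)] for i, (_, k) in enumerate(messages)]
        for (dst, k), payload in zip(messages, payloads):
            for j, value in zip(chunk(k), payload):
                data[dst][j] = value
            sent += len(payload)

    return data, sent


vectors = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60], [100, 200, 300, 400, 500, 600]]
result, values_sent = ring_allreduce(vectors)
print(result[0], values_sent, 2 * 6 * (3 - 1))
```

Each node sends 2(N-1) chunks of D/N values, so the total is 2D(N-1), matching the bound. The ring needs 2(N-1) sequential steps, though, so its per-message sizes shrink as N grows.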
- 52. Why not Parameter servers? • Potentially B/W optimal, but latency limited by net B/W of servers. [Figure: clients connected to servers] This network has at least 2D(N-1)/(PB) latency with P servers, suboptimal by (N-1)/P.
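
The two latency expressions can be compared numerically. The function names and the example values are hypothetical; D is taken as the number of values per node and B as the per-node bandwidth in values per second, which is one reading of the symbols on these slides.

```python
def allreduce_latency_bound(d, b):
    """Approximate latency lower bound for Allreduce: 2D / B."""
    return 2 * d / b


def parameter_server_latency(d, n, p, b):
    """Latency bound for N clients sharing P servers: 2D(N-1) / (P B)."""
    return 2 * d * (n - 1) / (p * b)


D, B = 1000, 10          # assumed: values per node, values per second per node
N, P = 65, 8             # assumed: number of clients and number of servers

optimal = allreduce_latency_bound(D, B)
servers = parameter_server_latency(D, N, P, B)
print(optimal, servers, servers / optimal)
```

The ratio of the two is (N-1)/P, so the gap closes only when the number of servers approaches the number of clients.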
- 53. A Roofline For networks • Optimal operating point [Figure: throughput (MB/s) against packet size (kBytes), log-log axes, with network capacity ceiling]
- 54. Allreduce (example: Click Prediction) • To run the most efficient solver (distributed SGD), we need to synchronize models across the network. • An Allreduce (reduce+broadcast) operation both synchronizes and combines the updates from each node. • For Logistic Regression, SVM and many other problems, we need a Sparse Allreduce.
- 55. What about MapReduce? The all-to-all Reduce implementations in Hadoop, Spark, Powergraph etc. have scaling problems because of message overhead: smaller messages lower throughput. During reduce, N nodes exchange $O(N^2)$ messages of diminishing size.
- 56. Solutions (Dense data) • For dense data, there are optimal (and fast in practice) methods for Allreduce on clusters. • They are implemented in MPI, but not always used by default. • In a recent Baidu paper (Wu et al.*), the authors achieved near-linear speedup on an Infiniband cluster using MPI all-to-all primitives. * “Deep Image: Scaling up Image Recognition”, arXiv 1501.02876v2
- 57. Kylix: A Sparse AllReduce for Commodity Clusters Huasha Zhao John Canny Department of Electrical Engineering and Computer Sciences University of California, Berkeley
- 58. Power-Law Data Term-Document Graph
- 59. Power-Law Data $F = r^{-\alpha}$, where F: frequency, r: rank, α: power-law exponent
- 60. Sparse AllReduce • Allreduce is a general primitive for distributed graph mining and machine learning. • Reduce, e.g. $v = \sum_{i=1}^{m} v_i$, or another commutative and associative aggregate operator such as mean, max etc. • The result is then redistributed back across nodes. • Fast Allreduce makes the cluster behave like a single giant node. • $v_i$ is sparse (power law) for big data problems, and each cluster node may not require all of the sum $v$ but only a sparse subset. • Kylix is an efficient Sparse Allreduce primitive for computing on commodity clusters.
- 61. Mini-batch Machine Learning • Loss function • The derivative of the loss, which defines the SGD update, is a scaled copy of $X_i$, with the same sparse non-zeros as $X_i$. [Figure: examples × features matrix with minibatches $X_i$, $X_{i+1}$, …]
- 62. Mini-batch Machine Learning [Figure: examples × features matrix with minibatches $X_i$, $X_{i+1}$, …] • Pull requests: non-zeros of next iteration • Push requests: non-zeros of current iteration
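
The sparsity argument can be made concrete with a small sketch. The slides do not state the loss function, so logistic loss is assumed here as an example; examples are dicts from feature index to value, and all function names are hypothetical. The gradient for one example is a scalar times the example itself, so it touches exactly the example's non-zero features, which in turn determines the push and pull index sets.

```python
import math


def logistic_gradient(weights, example, label):
    """Gradient of the logistic loss for one sparse example (label in {0, 1})."""
    margin = sum(weights.get(j, 0.0) * x for j, x in example.items())
    scale = 1.0 / (1.0 + math.exp(-margin)) - label
    return {j: scale * x for j, x in example.items()}   # same non-zeros as the example


def nonzero_features(minibatch):
    """Sorted feature indices that appear in any example of the minibatch."""
    return sorted({j for example in minibatch for j in example})


current_batch = [{0: 1.0, 4: 2.0}, {4: 1.0, 7: 0.5}]
next_batch = [{2: 1.0, 7: 1.0}]

push_indices = nonzero_features(current_batch)   # updates produced by this iteration
pull_indices = nonzero_features(next_batch)      # weights needed by the next iteration
print(push_indices, pull_indices)
```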
- 63. Kylix – Nested Heterogeneous Butterfly • Data split to 6 machines
- 64. Kylix – Nested Heterogeneous Butterfly • Each machine is responsible for reducing 1/6 of the indices.
- 65. Kylix – Nested Heterogeneous Butterfly • Scatter-Reduce along dimension 1 • data split by 3
- 66. Kylix – Nested Heterogeneous Butterfly • Scatter-Reduce along dimension 2 • Reduce for shorter range, but denser data grey scale indicates density
- 67. Kylix – Nested Heterogeneous Butterfly • Allgather in a reverse order • Redistribute reduced values grey scale indicates density
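
The scatter-reduce and allgather passes on the preceding slides can be simulated for the 6-machine example with layer degrees 3 and 2. This is a simplified dense sketch under stated assumptions, not the Kylix implementation: it ignores sparsity and index collapsing, splits index ranges evenly, and assumes the vector length is divisible by the number of nodes. The function name and the node numbering scheme are hypothetical.

```python
from math import prod


def butterfly_allreduce(vectors, degrees):
    """Simulate a heterogeneous-degree butterfly Allreduce (sum) on prod(degrees) nodes."""
    n, d = len(vectors), len(vectors[0])
    assert n == prod(degrees) and d % n == 0

    def coord(node, layer):
        # Mixed-radix digit of the node id at this layer.
        for deg in degrees[:layer]:
            node //= deg
        return node % degrees[layer]

    def peer(node, layer, j):
        # Node that differs from `node` only in its layer coordinate, set to j.
        return node + (j - coord(node, layer)) * prod(degrees[:layer])

    held = [dict(enumerate(v)) for v in vectors]   # index -> partial sum, per node
    lo = [0] * n                                   # start of the range each node owns
    width = d

    # Scatter-reduce: each layer splits the owned range by that layer's degree.
    for layer, deg in enumerate(degrees):
        width //= deg
        new_held = [dict() for _ in range(n)]
        new_lo = [0] * n
        for node in range(n):
            for j in range(deg):
                target = peer(node, layer, j)
                start = lo[node] + j * width
                new_lo[target] = start
                for i in range(start, start + width):
                    new_held[target][i] = new_held[target].get(i, 0) + held[node][i]
        held, lo = new_held, new_lo

    # Allgather: redistribute the reduced values through the layers in reverse order.
    for layer in reversed(range(len(degrees))):
        new_held = [dict(h) for h in held]
        for node in range(n):
            for j in range(degrees[layer]):
                new_held[peer(node, layer, j)].update(held[node])
        held = new_held

    return [[h[i] for i in range(d)] for h in held]


vectors = [[node + index for index in range(12)] for node in range(6)]
print(butterfly_allreduce(vectors, [3, 2])[0])
```

After the scatter-reduce pass each of the 6 nodes owns the fully reduced values for 1/6 of the indices, as on slide 64; the allgather pass then restores the full vector on every node.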
- 68. Kylix – Nested Heterogeneous Butterfly • Heterogeneous degree allows tailoring message size for each layer in order to achieve optimal message size (larger than the efficient message floor).
- 69. Kylix – Nested Heterogeneous Butterfly • Heterogeneous degree allows tailoring message size for each layer – achieve optimal message size. • Nesting allows indices to be collapsed down the network, so that communication is greatly reduced in the lower layers – works particularly well for power-law data.
- 70. Kylix - Summary • Each network node specifies a set of input indices to pull and a set of output indices to push. • Configuration step • Build the routing table, once for static graphs, multiple times for dynamic graphs • Partitioning • Union/Collapsing indices • Mapping • Reduction step • ConfigReduce • Can be performed in a single down-pass
- 71. Tuning Degrees of the Network • To compute the optimal degree for that layer, we need to know the amount of data per node. That will be the length of the partition range at that node times the density. • Given this data size P, we find the largest degree d such that P/d is larger than the message floor. • Degree / diameter trade-off • degree^diameter = const.
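
The degree rule for a single layer can be sketched as below. The function name, the cap `max_degree`, and the example sizes are hypothetical, and "larger than the message floor" is read here as "at least the floor".

```python
def largest_degree(data_per_node, message_floor, max_degree):
    """Largest degree d <= max_degree whose per-message size P/d stays at or above the floor.

    Falls back to degree 1 when even a single message would be below the floor.
    """
    best = 1
    for degree in range(1, max_degree + 1):
        if data_per_node / degree >= message_floor:
            best = degree
    return best


# Assumed example: 1,000,000 bytes per node, 64,000-byte efficient message floor.
print(largest_degree(1_000_000, 64_000, max_degree=64))
```

Because the data per node changes from layer to layer, the rule would be applied again at each layer with that layer's data size, which is what yields different degrees per layer.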
- 72. Data Volumes at Each Layer • Total communication across all layers is a small constant factor larger than that of the top layer, which is close to optimal. • Communication volume across layers has a characteristic Kylix shape.
- 73. Experiments (PageRank) • Twitter Followers’ Graph • 1.4 billion edges and 40 million vertices • Yahoo Web Graph • 6.7 billion edges and 1.4 billion vertices • EC2 cluster compute node (cc2.8xlarge) [Figure panels: 90-node Yahoo M45; 64-node EC2; 64-node EC2]
- 74. Summary • Roofline design exposes fundamental limits for ML tasks, and provides guidance to reaching them. • Roofline design gives BIDMach one to several orders of magnitude advantage over other tools – most of which are well below roofline limits. • Rooflining is a powerful tool for network primitives and allows cluster computing to approach linear speedups (strong scaling).
- 75. Software (version 1.0 just released) Code: github.com/BIDData/BIDMach Wiki: http://bid2.berkeley.edu/bid-data-project/overview/ BSD open source libs and dependencies. In this release: • Random Forests • Double-precision GPU matrices • Ipython/IScala Notebook • Simple DNNs (Wrapper for Berkeley’s Caffe coming soon)
- 76. Thanks Sponsors: Collaborators: