
Anima Anandkumar, Principal Scientist, Amazon Web Services, Endowed Professor, Caltech at MLconf SF 2017


Large-scale Machine Learning: Deep, Distributed and Multi-Dimensional:
Modern machine learning involves deep neural network architectures, which yield state-of-the-art performance on multiple domains such as computer vision, natural language processing and speech recognition. As the data and models scale, it becomes necessary to have multiple processing units for both training and inference. Apache MXNet is an open-source framework developed for distributed deep learning. I will describe the underlying lightweight hierarchical parameter server architecture that results in high efficiency in distributed settings.
Pushing the current boundaries of deep learning requires using multiple dimensions and modalities. These can be encoded into tensors, which are natural extensions of matrices. We present new deep learning architectures that preserve the multi-dimensional information in data end-to-end. We show that tensor contraction and regression layers are an effective replacement for fully connected layers in deep learning architectures. They result in significant space savings with negligible performance degradation. These functionalities are available in the TensorLy package, with an MXNet backend interface, for large-scale efficient learning.

Bio: Anima Anandkumar is a principal scientist at Amazon Web Services and a Bren professor in the CMS department at Caltech. Her research interests are in the areas of large-scale machine learning, non-convex optimization and high-dimensional statistics. In particular, she has been spearheading the development and analysis of tensor algorithms. She is the recipient of several awards such as the Alfred P. Sloan Fellowship, Microsoft Faculty Fellowship, Google Research Award, ARO and AFOSR Young Investigator Awards, NSF CAREER Award, Early Career Excellence in Research Award at UCI, Best Thesis Award from the ACM SIGMETRICS society, IBM Fran Allen PhD Fellowship, and several best paper awards. She has been featured in a number of forums such as YourStory, the Quora ML session, O'Reilly Media, and so on. She received her B.Tech in Electrical Engineering from IIT Madras in 2004 and her PhD from Cornell University in 2009. She was a postdoctoral researcher at MIT from 2009 to 2010, an assistant professor at U.C. Irvine between 2010 and 2016, and a visiting researcher at Microsoft Research New England in 2012 and 2014.


  1. Learning at Scale: Deep, Distributed and Multi-dimensional. Anima Anandkumar, Amazon AI & Caltech.
  2. Deep Learning: the "deep learning" trend of the past 10 years has significantly improved many applications across multiple domains: image understanding, speech recognition, natural language processing, autonomy, …
  3. Image Classification: multilevel feature extraction from raw pixels to semantic meaning (Layer 1 → Layer 2 → Output); convolution layers exploit spatial information.
  4. Image Classification: state-of-the-art networks have tens to hundreds of layers. § Hard to define the network: the definition of the Inception network has >1k lines of code in Caffe. § A single image requires billions of floating-point operations (Intel i7: ~500 GFLOPS; Nvidia Titan X: ~5 TFLOPS). § Memory consumption is linear in the number of layers.
  5. Outline: 1 Introduction, 2 Distributed Deep Learning Using MXNet, 3 Learning in Multiple Dimensions, 4 Conclusion.
  6. MXNet (image credit: Wikipedia): • Imperative and Declarative Programming • Language Support • Backend and Automatic Parallelization
  7. Writing Parallel Programs is Painful: each forward-backward-update involves O(num_layer), often 100 to 1,000, tensor computations and communications. Dependency graph for a 2-layer fully connected network on 2 GPUs:

        data = next_batch()

        # GPU 0: forward and backward on the first half of the batch
        data[gpu0].copyfrom(data[0:50])
        fc1[gpu0] = FullcForward(data[gpu0], fc1_weight[gpu0])
        fc2[gpu0] = FullcForward(fc1[gpu0], fc2_weight[gpu0])
        fc2_ograd[gpu0] = LossGrad(fc2[gpu0], label[0:50])
        fc1_ograd[gpu0], fc2_wgrad[gpu0] = FullcBackward(fc2_ograd[gpu0], fc2_weight[gpu0])
        _, fc1_wgrad[gpu0] = FullcBackward(fc1_ograd[gpu0], fc1_weight[gpu0])

        # GPU 1: the same on the second half of the batch
        data[gpu1].copyfrom(data[51:100])
        fc1[gpu1] = FullcForward(data[gpu1], fc1_weight[gpu1])
        fc2[gpu1] = FullcForward(fc1[gpu1], fc2_weight[gpu1])
        fc2_ograd[gpu1] = LossGrad(fc2[gpu1], label[51:100])
        fc1_ograd[gpu1], fc2_wgrad[gpu1] = FullcBackward(fc2_ograd[gpu1], fc2_weight[gpu1])
        _, fc1_wgrad[gpu1] = FullcBackward(fc1_ograd[gpu1], fc1_weight[gpu1])

        # CPU: aggregate gradients, apply the update, broadcast the new weights
        fc1_wgrad[cpu] = fc1_wgrad[gpu0] + fc1_wgrad[gpu1]
        fc2_wgrad[cpu] = fc2_wgrad[gpu0] + fc2_wgrad[gpu1]
        fc1_weight[cpu] -= lr * fc1_wgrad[cpu]
        fc2_weight[cpu] -= lr * fc2_wgrad[cpu]
        fc1_weight[cpu].copyto(fc1_weight[gpu0], fc1_weight[gpu1])
        fc2_weight[cpu].copyto(fc2_weight[gpu0], fc2_weight[gpu1])
  8. Auto Parallelization: write serial programs, run in parallel. The dependency engine sees that B and C depend only on A, so it computes them concurrently:

        >>> import mxnet as mx
        >>> A = mx.nd.ones((2,2)) * 2
        >>> C = A + 2
        >>> B = A + 1
        >>> D = B * C
        >>> D.wait_to_read()
  9. Data Parallelism with a key-value store: 1. read a data partition; 2. pull the parameters; 3. compute the gradient; 4. push the gradient; 5. update the parameters (a minimal sketch of this loop follows below).
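     A minimal sketch of that five-step loop using MXNet's key-value store API; the key, shapes, optimizer settings and the placeholder gradient below are illustrative, not the actual ResNet training code from the talk.

        import mxnet as mx

        kv = mx.kv.create('local')                    # 'dist_sync'/'dist_async' across machines
        kv.set_optimizer(mx.optimizer.SGD(learning_rate=0.1))
        shape = (128, 128)
        kv.init(0, mx.nd.zeros(shape))                # the store holds parameter key 0

        weight = mx.nd.zeros(shape)
        for step in range(5):                         # 1. read a data partition (omitted)
            kv.pull(0, out=weight)                    # 2. pull the current parameters
            grad = mx.nd.random.normal(shape=shape)   # 3. compute the gradient (placeholder)
            kv.push(0, grad)                          # 4. push the gradient
                                                      # 5. the store's optimizer updates the parameters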
  10. Scale to Multiple GPU Machines: a hierarchical parameter server, with workers, level-1 servers and level-2 servers spanning the GPUs and CPUs. Within a machine, 4 PCIe 3.0 16x links (15.75 GB/s each, 63 GB/s total) connect the GPUs through a PCIe switch to the CPU; machines communicate over 10 Gbit Ethernet (1.25 GB/s) through a network switch.
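     For reference, a rough sketch of how the same loop is pointed at a cluster; the host file, script name and launcher flags below are assumptions for illustration, not commands from the talk.

        import mxnet as mx

        # Each worker creates a distributed, synchronous key-value store; gradients
        # are aggregated through the (hierarchical) parameter servers.
        kv = mx.kv.create('dist_sync')
        # ... build the model and train with kvstore=kv, as in single-machine code.

        # Jobs are typically started with the launcher shipped in the MXNet repo, e.g.:
        #   python tools/launch.py -n 4 -H hosts python train_imagenet.py --kv-store dist_sync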
  11. Experiment Setup: ✧ 1.2 million images with 1000 classes ✧ ResNet 152-layer model ✧ EC2 P2.16xlarge instances (GPUs 0-15 connected to the CPU through PCIe switches) ✧ Minibatch SGD ✧ Synchronized updating.
  12. Scalability over Multiple Machines: [plot: time (sec)/batch vs. # of GPUs (0 to 128), with the communication cost broken out, for batch size/GPU = 2, 4, 8 and 16; 115x speedup]
  13. [timeline: before 2012 through 2017; imperative and symbolic programming, MXNet, Gluon]
  14. Back-end System: the front-end (imperative NDArray code or a declarative symbolic graph) is lowered to a common back-end that performs ✧ optimization (✓ memory optimization ✓ operator fusion) and ✧ scheduling (✓ auto-parallelization).

        # imperative front-end
        import mxnet as mx
        a = mx.nd.zeros((100, 50))
        b = mx.nd.ones((100, 50))
        c = a * b
        c += 1

        # declarative (symbolic) front-end
        import mxnet as mx
        net = mx.symbol.Variable('data')
        net = mx.symbol.FullyConnected(data=net, num_hidden=128)
        net = mx.symbol.SoftmaxOutput(data=net)
        texec = mx.module.Module(net)
        texec.forward(data=c)   # schematic: a Module must be bound and initialized
        texec.backward()        # before forward/backward in real code
  15. In summary: ✦ Symbolic: efficient & portable, but hard to use. ✦ Imperative: flexible, but may be slow. ✦ Gluon: imperative for developing, symbolic for deploying (see the sketch below).
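     A minimal Gluon sketch of that develop-imperatively, deploy-symbolically workflow; the layer sizes, input shape and export file name are arbitrary choices, not from the slides.

        import mxnet as mx
        from mxnet.gluon import nn

        net = nn.HybridSequential()
        net.add(nn.Dense(128, activation='relu'))
        net.add(nn.Dense(10))
        net.initialize()

        x = mx.nd.random.uniform(shape=(4, 50))
        y = net(x)            # imperative: runs eagerly while developing/debugging
        net.hybridize()       # switch to a cached symbolic graph
        y = net(x)            # same call, now executed symbolically
        net.export('model')   # writes model-symbol.json / model-0000.params for deployment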
  16. Outline: 1 Introduction, 2 Distributed Deep Learning Using MXNet, 3 Learning in Multiple Dimensions, 4 Conclusion.
  17. Tensors: Beyond the 2D World. Modern data is inherently multi-dimensional.
  18. Tensors: Beyond the 2D World. Modern data is inherently multi-dimensional (diagram: Input → Hidden 1 → Hidden 2 → Output).
  19. Tensor Contraction extends the notion of matrix product. Matrix product: Mv = Σ_j v_j M_{:,j}. Tensor contraction: T(u, v, ·) = Σ_{i,j} u_i v_j T_{i,j,:}.
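     A small NumPy check of the two formulas above (all shapes are arbitrary):

        import numpy as np

        # Matrix product as a sum over columns: Mv = sum_j v_j M[:, j]
        M = np.random.randn(4, 5)
        v = np.random.randn(5)
        assert np.allclose(M @ v, np.einsum('ij,j->i', M, v))

        # Tensor contraction over two modes: T(u, v, .) = sum_{i,j} u_i v_j T[i, j, :]
        T = np.random.randn(4, 5, 6)
        u = np.random.randn(4)
        out = np.einsum('i,j,ijk->k', u, v, T)
        assert out.shape == (6,)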
  20. Employing Tensor Contractions in AlexNet: replace the fully connected layer with a tensor contraction layer (TCL).
  21. Enabling the Tensor Contraction Layer in MXNet (a sketch of the idea follows below).
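     A rough sketch of what a tensor contraction layer computes, written with TensorLy's mode_dot; the activation shape and factor ranks are made up and are not the AlexNet/VGG settings used in the experiments.

        import numpy as np
        from tensorly.tenalg import mode_dot

        # The activation tensor keeps its (batch, channel, height, width) structure;
        # instead of flattening it for a fully connected layer, each non-batch mode
        # is contracted with a small factor matrix.
        X = np.random.randn(32, 256, 6, 6)
        factors = [np.random.randn(64, 256),   # channels: 256 -> 64
                   np.random.randn(4, 6),      # height:   6 -> 4
                   np.random.randn(4, 6)]      # width:    6 -> 4
        Y = X
        for mode, F in enumerate(factors, start=1):   # skip the batch mode 0
            Y = mode_dot(Y, F, mode=mode)
        print(Y.shape)                                # (32, 64, 4, 4), still multi-dimensional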
  22. Performance of the TCL (trained end-to-end): • on ImageNet with VGG: 65.9% space savings with a performance drop of only 0.6% • on ImageNet with AlexNet: 56.6% space savings with a performance improvement of 0.5%.
  23. Low-rank tensor regression. Tensor Regression Networks, J. Kossaifi, Z. C. Lipton, A. Khanna, T. Furlanello and A. Anandkumar, arXiv preprint (a sketch of the idea follows below).
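     A plain NumPy illustration of the low-rank tensor regression idea; the rank and tensor sizes are invented for the example, and this is not the network from the paper.

        import numpy as np

        # Rank-R CP weight tensor W[i,j,k] = sum_r A[i,r] B[j,r] C[k,r]:
        # only the small factor matrices need to be stored and learned.
        R, C, H, Wd = 8, 256, 6, 6
        A = np.random.randn(C, R)
        B = np.random.randn(H, R)
        Cf = np.random.randn(Wd, R)
        W = np.einsum('ir,jr,kr->ijk', A, B, Cf)   # full tensor, shown only for clarity

        X = np.random.randn(32, C, H, Wd)          # a batch of activation tensors
        y = np.einsum('nijk,ijk->n', X, W)         # one regression output per sample
        print(y.shape)                             # (32,)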
  24. Performance and rank.
  25. Speeding up Tensor Contractions. 1 Tensor contractions are a core primitive of multilinear algebra. 2 BLAS 3: unbounded compute intensity (no. of ops per I/O). Consider single-index contractions C = A B, e.g. Cmnp = Amnk Bkp (checked in the sketch below).
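     A NumPy sanity check that the single-index contraction above is exactly one GEMM after viewing A as a matrix (sizes are arbitrary):

        import numpy as np

        # C[m,n,p] = sum_k A[m,n,k] B[k,p]: reshape A to (m*n, k), multiply, reshape back.
        m, n, k, p = 30, 40, 50, 60
        A = np.random.randn(m, n, k)
        B = np.random.randn(k, p)

        C = (A.reshape(m * n, k) @ B).reshape(m, n, p)
        assert np.allclose(C, np.einsum('mnk,kp->mnp', A, B))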
  26. Speeding up Tensor Contractions: explicit permutation dominates, especially for small tensors. Consider Cmnp = Akm Bpkn evaluated as: 1 Akm → Amk, 2 Bpkn → Bkpn, 3 Cmnp → Cmpn, 4 Cm(pn) = Amk Bk(pn), 5 Cmpn → Cmnp. [plots: CPU (top) and GPU (bottom), fraction of time spent in copies/transpositions vs. n (100 to 500), for 1, 2, 3 and 6 transpositions]
  27. Existing Primitives. GEMM: suboptimal for many small matrices. Pointer-to-pointer BatchedGEMM (available in MKL 11.3β and cuBLAS 4.1): C[p] = α op(A[p]) op(B[p]) + β C[p], with signature

        cublas<T>gemmBatched(cublasHandle_t handle,
                             cublasOperation_t transA, cublasOperation_t transB,
                             int M, int N, int K,
                             const T* alpha, const T** A, int ldA,
                             const T** B, int ldB,
                             const T* beta, T** C, int ldC,
                             int batchCount)
  28. Tensor Contraction with Extended BLAS Primitives: Cmn[p] = Amk Bkn[p] maps to

        cublasDgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                  M, N, K, &alpha,
                                  A, ldA1, 0,
                                  B, ldB1, ldB2,
                                  &beta, C, ldC1, ldC2, P)
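     The same batched mapping expressed in NumPy as a sanity check: one GEMM per p-slice, which is what the strided-batched call performs in a single kernel (sizes are arbitrary).

        import numpy as np

        # C[m,n,p] = sum_k A[m,k] B[k,n,p], i.e. C[:,:,p] = A @ B[:,:,p] for each p.
        m, n, k, p = 16, 16, 16, 8
        A = np.random.randn(m, k)
        B = np.random.randn(k, n, p)

        C = np.stack([A @ B[:, :, i] for i in range(p)], axis=2)
        assert np.allclose(C, np.einsum('mk,knp->mnp', A, B))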
  29. Tensor Contraction with Extended BLAS Primitives: all single-index contractions Cmnp = A∗∗ × B∗∗∗ (with Cmnp ≡ C[m + n·ldC1 + p·ldC2]) and their GEMM kernel mappings (cases with no listed kernel have no direct mapping):

        Case  Contraction  Kernel 1              Kernel 2
        1.1   AmkBknp      Cm(np) = AmkBk(np)    Cmn[p] = AmkBkn[p]
        1.2   AmkBkpn      Cmn[p] = AmkBk[p]n    Cm[n]p = AmkBkp[n]
        1.3   AmkBnkp      Cmn[p] = AmkBnk[p]
        1.4   AmkBpkn      Cm[n]p = AmkBpk[n]
        1.5   AmkBnpk      Cm(np) = AmkB(np)k    Cmn[p] = AmkBn[p]k
        1.6   AmkBpnk      Cm[n]p = AmkBp[n]k
        2.1   AkmBknp      Cm(np) = AkmBk(np)    Cmn[p] = AkmBkn[p]
        2.2   AkmBkpn      Cmn[p] = AkmBk[p]n    Cm[n]p = AkmBkp[n]
        2.3   AkmBnkp      Cmn[p] = AkmBnk[p]
        2.4   AkmBpkn      Cm[n]p = AkmBpk[n]
        2.5   AkmBnpk      Cm(np) = AkmB(np)k    Cmn[p] = AkmBn[p]k
        2.6   AkmBpnk      Cm[n]p = AkmBp[n]k
        3.1   AnkBkmp      Cmn[p] = Bkm[p]Ank
        3.2   AnkBkpm      Cmn[p] = Bk[p]mAnk
        3.3   AnkBmkp      Cmn[p] = Bmk[p]Ank
        3.4   AnkBpkm
        3.5   AnkBmpk      Cmn[p] = Bm[p]kAnk
        3.6   AnkBpmk
        4.1   AknBkmp      Cmn[p] = Bkm[p]Akn
        4.2   AknBkpm      Cmn[p] = Bk[p]mAkn
        4.3   AknBmkp      Cmn[p] = Bmk[p]Akn
        4.4   AknBpkm
        4.5   AknBmpk      Cmn[p] = Bm[p]kAkn
        4.6   AknBpmk
        5.1   ApkBkmn      C(mn)p = Bk(mn)Apk    Cm[n]p = Bkm[n]Apk
        5.2   ApkBknm      Cm[n]p = Bk[n]mApk
        5.3   ApkBmkn      Cm[n]p = Bmk[n]Apk
        5.4   ApkBnkm
        5.5   ApkBmnk      C(mn)p = B(mn)kApk    Cm[n]p = Bm[n]kApk
        5.6   ApkBnmk
        6.1   AkpBkmn      C(mn)p = Bk(mn)Akp    Cm[n]p = Bkm[n]Akp
        6.2   AkpBknm      Cm[n]p = Bk[n]mAkp
        6.3   AkpBmkn      Cm[n]p = Bmk[n]Akp
        6.4   AkpBnkm
        6.5   AkpBmnk      C(mn)p = B(mn)kAkp    Cm[n]p = Bm[n]kAkp
        6.6   AkpBnmk
  30. A new primitive: StridedBatchedGEMM. Performance on par with pure GEMM (P100 and beyond).
  31. Applications: Tucker Decomposition, Tmnp = Gijk Ami Bnj Cpk. Main steps in the algorithm: Ymjk = Tmnp B^t_nj C^t_pk; Yink = Tmnp A^{t+1}_mi C^t_pk; Yijp = Tmnp A^{t+1}_mi B^{t+1}_nj. [plot: time (sec) vs. n for Tucker decomposition, comparing TensorToolbox, BTAS, Cyclops, CPU Batched and GPU Batched]
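     A small TensorLy sketch of the Tucker decomposition on this slide; the tensor size and ranks are made up, and the exact keyword names may differ across TensorLy versions.

        import numpy as np
        import tensorly as tl
        from tensorly.decomposition import tucker

        T = tl.tensor(np.random.randn(40, 40, 40))
        core, factors = tucker(T, rank=[10, 10, 10])     # G and the factors A, B, C
        T_hat = tl.tucker_to_tensor((core, factors))     # G x1 A x2 B x3 C
        print(core.shape, [f.shape for f in factors])
        print(tl.norm(T - T_hat) / tl.norm(T))           # relative reconstruction error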
  32. Tensor Sketches: randomized dimensionality reduction through sketching. ◮ Complexity independent of tensor order: exponential gain! [diagram: tensor T hashed into sketch s with ±1 signs] Applications: tensor decomposition via sketching; visual question answering (CNN image features and RNN question features, e.g. "What is the mustache made of?", combined by MCT pooling, then average pooling, FC, ReLU, BatchNorm, FC and Softmax, predicting "Banana").
  33. MCT in Visual Question & Answering: CNN image features and RNN question features ("What is the mustache made of?") are fused by MCT pooling, followed by FC, ReLU, BatchNorm, FC and Softmax, predicting "Banana".
  34. Multimodal Tensor Pooling: the image feature map (C × W × H) is reduced by a spatial count sketch to d1 × d2 × d3 and the text feature (length L) by a 1D count sketch; a 3D FFT and a 1D FFT take both sketches to the Fourier domain, where they are combined before a 3D IFFT (with an optional fourth dimension d4). A minimal count-sketch example follows below.
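     A minimal NumPy sketch of the count-sketch building block and its FFT-domain combination, i.e. the generic compact-pooling recipe rather than the exact layer from the talk; dimensions and hash functions are illustrative.

        import numpy as np

        d, D = 1024, 64                                  # input dim, sketch dim
        rng = np.random.default_rng(0)

        def make_hash():
            # one hash bucket and one random sign per input coordinate
            return rng.integers(0, D, size=d), rng.choice([-1.0, 1.0], size=d)

        def count_sketch(x, h, s):
            sk = np.zeros(D)
            np.add.at(sk, h, s * x)                      # scatter-add signed coordinates
            return sk

        (h1, s1), (h2, s2) = make_hash(), make_hash()    # independent hashes per modality
        u, v = rng.normal(size=d), rng.normal(size=d)    # e.g. image and text features

        # Sketch of the outer product u v^T = circular convolution of the two sketches,
        # computed in the Fourier domain (FFT, elementwise product, inverse FFT).
        pooled = np.fft.irfft(np.fft.rfft(count_sketch(u, h1, s1)) *
                              np.fft.rfft(count_sketch(v, h2, s2)), n=D)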
  35. Tensor Decompositions
  36. Extracting Topics from Documents: each document mixes topics (crime, sports, education) in different proportions, and each topic is a distribution over words (campus, police, witness, ...). Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, Y. K. Liu, "Two SVDs Suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation," NIPS 2012.
  37. Tensor Methods for Topic Modeling: the topic-word matrix P[word = i | topic = j] has linearly independent columns, and the moment tensor of word-triplet co-occurrences decomposes into one rank-1 term per topic (crime, sports, education). A toy construction of this moment tensor follows below.
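     A toy construction of that word-triplet moment tensor; the vocabulary and documents are made up.

        import numpy as np
        from itertools import permutations

        vocab = {'campus': 0, 'police': 1, 'witness': 2, 'crime': 3, 'sports': 4}
        docs = [['campus', 'police', 'witness'],
                ['police', 'witness', 'crime'],
                ['campus', 'sports', 'police']]

        V = len(vocab)
        M3 = np.zeros((V, V, V))                  # co-occurrence of word triplets
        for doc in docs:
            ids = [vocab[w] for w in doc]
            for i, j, k in permutations(ids, 3):  # ordered triplets of distinct positions
                M3[i, j, k] += 1
        M3 /= M3.sum()                            # empirical Pr[triplet = (i, j, k)]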
  38. Tensors vs. Variational Inference. Criterion: perplexity = exp[−log-likelihood]. Learning topics from PubMed on Spark (8 million articles): [plot: running time vs. perplexity, tensor vs. variational]. Learning network communities from social network data (Facebook n ∼ 20k, Yelp n ∼ 40k, DBLP-sub n ∼ 1e5, DBLP n ∼ 1e6): [plots: running time and error for FB, YP, DBLP-sub, DBLP]. F. Huang, U. N. Niranjan, M. Hakeem, A. Anandkumar, "Online tensor methods for training latent variable models," JMLR 2014.
  39. Tensors vs. Variational Inference (same experiments as the previous slide): tensor methods are orders of magnitude faster and more accurate than variational inference. F. Huang, U. N. Niranjan, M. Hakeem, A. Anandkumar, "Online tensor methods for training latent variable models," JMLR 2014.
  40. Outline: 1 Introduction, 2 Distributed Deep Learning Using MXNet, 3 Learning in Multiple Dimensions, 4 Conclusion.
  41. Conclusion. Distributed deep learning at scale: MXNet has many attractive features ◮ flexible programming ◮ portable ◮ highly efficient. Easy to deploy large-scale DL on the AWS cloud ◮ Deep Learning AMI ◮ CloudFormation templates. Tensors are the future of ML: tensor contractions bring space savings in deep architectures, and new primitives (extended BLAS) speed up tensor contractions.
