Modern machine learning involves deep neural network architectures that yield state-of-the-art performance in multiple domains such as computer vision, natural language processing, and speech recognition. As the data and models scale, it becomes necessary to have multiple processing units for both training and inference. Apache MXNet is an open-source framework developed for distributed deep learning. I will describe the underlying lightweight hierarchical parameter server architecture that results in high efficiency in distributed settings.

Pushing the current boundaries of deep learning requires using multiple dimensions and modalities. These can be encoded into tensors, which are natural extensions of matrices. We present new deep learning architectures that preserve the multi-dimensional information in data end-to-end. We show that tensor contraction and tensor regression layers are an effective replacement for fully connected layers in deep learning architectures. They result in significant space savings with negligible performance degradation. These functionalities are available in the TensorLy package with an MXNet backend interface for large-scale efficient learning.

Bio: Anima Anandkumar is a principal scientist at Amazon Web Services and a Bren professor in the Caltech CMS department. Her research interests are in the areas of large-scale machine learning, non-convex optimization, and high-dimensional statistics. In particular, she has been spearheading the development and analysis of tensor algorithms. She is the recipient of several awards, including the Alfred P. Sloan Fellowship, Microsoft Faculty Fellowship, Google Research Award, ARO and AFOSR Young Investigator Awards, NSF CAREER Award, Early Career Excellence in Research Award at UCI, Best Thesis Award from the ACM SIGMETRICS society, IBM Fran Allen PhD Fellowship, and several best paper awards. She has been featured in a number of forums such as YourStory, a Quora ML session, and O’Reilly Media. She received her B.Tech in Electrical Engineering from IIT Madras in 2004 and her PhD from Cornell University in 2009. She was a postdoctoral researcher at MIT from 2009 to 2010, an assistant professor at U.C. Irvine between 2010 and 2016, and a visiting researcher at Microsoft Research New England in 2012 and 2014.

Published in:
Technology


- 1. Learning at Scale: Deep, Distributed and Multi-dimensional. Anima Anandkumar, Amazon AI & Caltech
- 2. Deep Learning: the “deep learning” trend of the past 10 years has significantly improved many applications in multiple domains: image understanding, speech recognition, natural language processing, autonomy, …
- 3. Image Classification: multilevel feature extraction from raw pixels to semantic meanings (Layer 1, Layer 2, ..., Output); explore spatial information with convolution layers
- 4. Image Classification
  - Hard to define the network: the definition of the Inception network has >1k lines of code in Caffe
  - A single image requires billions of floating-point operations
    - Intel i7: ~500 GFLOPS
    - Nvidia Titan X: ~5 TFLOPS
  - Memory consumption is linear in the number of layers
  - State-of-the-art networks have tens to hundreds of layers
- 5. Outline: (1) Introduction, (2) Distributed Deep Learning Using MXNet, (3) Learning in Multiple Dimensions, (4) Conclusion
- 6. MXNet (image credit: Wikipedia)
  - Imperative and declarative programming
  - Language support
  - Backend and automatic parallelization
- 7. Writing Parallel Programs is Painful. Each forward-backward-update involves O(num_layer) tensor computations and communications, where num_layer is often 100 to 1,000. Dependency graph for a 2-layer neural network with 2 GPUs (pseudocode, listed so that each statement follows the ones it depends on):
  - `data = next_batch()`
  - `data[gpu0].copyfrom(data[0:50])`
  - `data[gpu1].copyfrom(data[50:100])`
  - `fc1[gpu0] = FullcForward(data[gpu0], fc1_weight[gpu0])`
  - `fc1[gpu1] = FullcForward(data[gpu1], fc1_weight[gpu1])`
  - `fc2[gpu0] = FullcForward(fc1[gpu0], fc2_weight[gpu0])`
  - `fc2[gpu1] = FullcForward(fc1[gpu1], fc2_weight[gpu1])`
  - `fc2_ograd[gpu0] = LossGrad(fc2[gpu0], label[0:50])`
  - `fc2_ograd[gpu1] = LossGrad(fc2[gpu1], label[50:100])`
  - `fc1_ograd[gpu0], fc2_wgrad[gpu0] = FullcBackward(fc2_ograd[gpu0], fc2_weight[gpu0])`
  - `fc1_ograd[gpu1], fc2_wgrad[gpu1] = FullcBackward(fc2_ograd[gpu1], fc2_weight[gpu1])`
  - `_, fc1_wgrad[gpu0] = FullcBackward(fc1_ograd[gpu0], fc1_weight[gpu0])`
  - `_, fc1_wgrad[gpu1] = FullcBackward(fc1_ograd[gpu1], fc1_weight[gpu1])`
  - `fc2_wgrad[cpu] = fc2_wgrad[gpu0] + fc2_wgrad[gpu1]`
  - `fc1_wgrad[cpu] = fc1_wgrad[gpu0] + fc1_wgrad[gpu1]`
  - `fc2_weight[cpu] -= lr * fc2_wgrad[cpu]`
  - `fc1_weight[cpu] -= lr * fc1_wgrad[cpu]`
  - `fc2_weight[cpu].copyto(fc2_weight[gpu0], fc2_weight[gpu1])`
  - `fc1_weight[cpu].copyto(fc1_weight[gpu0], fc1_weight[gpu1])`
- 8. Auto Parallelization: write serial programs, run in parallel
  - `>>> import mxnet as mx`
  - `>>> A = mx.nd.ones((2,2)) * 2`
  - `>>> C = A + 2`
  - `>>> B = A + 1`
  - `>>> D = B * C`
  - `>>> D.wait_to_read()`
  - Dependency graph: A = 2 feeds both C = A + 2 and B = A + 1, which both feed D = B ⨉ C
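
The dependency graph on the slide above can be turned into a schedule by grouping operations whose inputs are already available. The following is a minimal sketch of that idea in plain Python; the `deps` dictionary and the `parallel_waves` helper are hypothetical names for illustration and do not reflect how the MXNet engine is actually implemented.

```python
# Hypothetical sketch: group operations into "waves" that could run in parallel.
# deps maps each result to the results it reads; the names mirror the slide.
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}


def parallel_waves(deps):
    """Return lists of operations whose inputs are all ready at the same step."""
    done, waves = set(), []
    while len(done) < len(deps):
        ready = sorted(
            op for op, inputs in deps.items()
            if op not in done and all(i in done for i in inputs)
        )
        if not ready:
            raise ValueError("cyclic dependency")
        waves.append(ready)
        done.update(ready)
    return waves


waves = parallel_waves(deps)
print(waves)  # [['A'], ['B', 'C'], ['D']]
```

B and C land in the same wave because neither reads the other's result, which is the parallelism the serial program leaves implicit.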
- 9. Data Parallelism (with a key-value store holding the parameters; each worker processes its own examples)
  - 1. Read a data partition
  - 2. Pull the parameters
  - 3. Compute the gradient
  - 4. Push the gradient
  - 5. Update the parameters
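
The five steps above can be sketched with a toy in-memory store and a linear regression model. Everything here is an assumption made for illustration: the `KVStore` class is a hypothetical stand-in (not the MXNet KVStore API), the two "workers" run sequentially in one process, and the pushed gradients are averaged before a plain SGD step.

```python
import numpy as np


class KVStore:
    """Toy in-memory key-value store (hypothetical; not the MXNet API)."""

    def __init__(self, lr):
        self.params, self.grads, self.lr = {}, {}, lr

    def init(self, key, value):
        self.params[key] = value.copy()

    def pull(self, key):
        return self.params[key].copy()

    def push(self, key, grad):
        self.grads.setdefault(key, []).append(grad)

    def update(self, key):
        # Average the pushed gradients, then take one SGD step.
        self.params[key] -= self.lr * np.mean(self.grads.pop(key), axis=0)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

store = KVStore(lr=0.1)
store.init("w", np.zeros(3))
partitions = [(X[:50], y[:50]), (X[50:], y[50:])]  # 1. one partition per worker

for step in range(200):
    for Xp, yp in partitions:
        w = store.pull("w")                           # 2. pull the parameters
        grad = 2 * Xp.T @ (Xp @ w - yp) / len(yp)     # 3. gradient of the MSE
        store.push("w", grad)                         # 4. push the gradient
    store.update("w")                                 # 5. update the parameters

w_learned = store.pull("w")
```

Because both partitions have the same size, the averaged gradient equals the full-batch gradient, so this synchronized scheme follows the same trajectory as single-worker gradient descent.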
- 10. Scale to Multiple GPU Machines: hierarchical parameter server
  - Diagram labels: workers, level-1 servers, level-2 servers, GPUs, CPUs, PCIe switch, network switch
  - PCIe 3.0 16x: 15.75 GB/s per link
  - 4 × PCIe 3.0 16x: 63 GB/s
  - 10 Gbit Ethernet: 1.25 GB/s
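
One way to read the two server levels is as a two-stage reduction. The sketch below assumes (this is not stated on the slide) that level-1 servers sum gradients across the GPUs inside one machine over PCIe, and level-2 servers sum the per-machine results over the much slower network. The sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_machines, gpus_per_machine, dim = 4, 4, 8   # illustrative sizes
grads = rng.normal(size=(num_machines, gpus_per_machine, dim))

# Level 1 (assumed): each machine sums the gradients of its own GPUs.
level1 = grads.sum(axis=1)            # shape (num_machines, dim)
# Level 2 (assumed): the per-machine sums are combined across machines.
level2 = level1.sum(axis=0)           # shape (dim,)

# A flat scheme would send every GPU's gradient over the network instead.
flat = grads.reshape(-1, dim).sum(axis=0)
network_messages_hierarchical = num_machines
network_messages_flat = num_machines * gpus_per_machine
```

The result is identical to the flat sum, but under these assumptions only one gradient per machine crosses the 1.25 GB/s Ethernet link rather than one per GPU.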
- 11. Experiment Setup
  - 1.2 million images with 1000 classes
  - ResNet 152-layer model
  - EC2 P2.16xlarge (diagram: GPUs 0-15, PCIe switches, CPU)
  - Minibatch SGD
  - Synchronized updating
- 12. Scalability over Multiple Machines. [Figure: time (sec) per batch (0 to 1) versus number of GPUs (0 to 128), showing communication cost and curves for batch size/GPU = 2, 4, 8, 16; annotation: 115x]
- 13. [Timeline figure: before 2012, 2013, 2014, 2015, 2016, 2017; labels: mxnet, imperative, symbolic, gluon]
- 14. Back-end System
  - Optimization: memory optimization, operator fusion
  - Scheduling: auto-parallelization
  - Front-end, imperative example:
    - `import mxnet as mx`
    - `a = mx.nd.zeros((100, 50))`
    - `b = mx.nd.ones((100, 50))`
    - `c = a * b`
    - `c += 1`
  - Front-end, symbolic example:
    - `import mxnet as mx`
    - `net = mx.symbol.Variable('data')`
    - `net = mx.symbol.FullyConnected(data=net, num_hidden=128)`
    - `net = mx.symbol.SoftmaxOutput(data=net)`
    - `texec = mx.module.Module(net)`
    - `texec.forward(data=c)`
    - `texec.backward()`
  - [Figure: both front-ends feed one back-end graph with nodes a, b, 1, +, ⨉, c, fullc, softmax, weight, bias]
- 15. In summary
  - Symbolic: efficient & portable, but hard to use
  - Imperative: flexible, may be slow
  - Gluon: imperative for developing, symbolic for deploying
- 16. Outline: (1) Introduction, (2) Distributed Deep Learning Using MXNet, (3) Learning in Multiple Dimensions, (4) Conclusion
- 17. Tensors: Beyond 2D world Modern data is inherently multi-dimensional
- 18. Tensors: Beyond 2D world Modern data is inherently multi-dimensional Input Hidden 1 Hidden 2 Output
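- 19. Tensor Contraction: extends the notion of matrix product
  - Matrix product: $Mv = \sum_j v_j M_j$, a weighted sum of the columns $M_j$
  - Tensor contraction: $T(u, v, \cdot) = \sum_{i,j} u_i v_j T_{i,j,:}$, a weighted sum of the fibers $T_{i,j,:}$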
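
The two formulas on this slide can be checked numerically. This is a small NumPy sketch with arbitrary illustrative shapes; it writes each product out as the weighted sum from the slide and compares it with the equivalent built-in call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Matrix product M v as a weighted sum of the columns M_j.
M = rng.normal(size=(3, 4))
v = rng.normal(size=4)
Mv = sum(v[j] * M[:, j] for j in range(4))

# Tensor contraction T(u, v, .) as a weighted sum of the fibers T[i, j, :].
T = rng.normal(size=(2, 4, 5))
u = rng.normal(size=2)
Tuv = sum(u[i] * v[j] * T[i, j, :] for i in range(2) for j in range(4))

# The same contraction expressed with einsum.
Tuv_einsum = np.einsum("i,j,ijk->k", u, v, T)
```

Contracting two of the three modes leaves a vector of length 5, just as contracting one of the two modes of a matrix leaves a vector.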
- 20. Employing Tensor Contractions in AlexNet: replace a fully connected layer with a tensor contraction layer
- 21. Enabling the Tensor Contraction Layer in MXNet
- 22. Performance of the TCL
  - Trained end-to-end
  - On ImageNet with VGG: 65.9% space savings, performance drop of only 0.6%
  - On ImageNet with AlexNet: 56.6% space savings, performance improvement of 0.5%
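
To see where space savings of this kind can come from, here is a minimal NumPy sketch of a tensor contraction layer. It assumes the layer multiplies each non-batch mode of the activation tensor by its own small factor matrix; the shapes are made up for illustration and are unrelated to the VGG and AlexNet figures above.

```python
import numpy as np


def tcl(x, factors):
    """Contract every non-batch mode of x with its factor matrix (r_k x d_k)."""
    w1, w2, w3 = factors
    return np.einsum("bijk,pi,qj,rk->bpqr", x, w1, w2, w3)


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 6, 6))              # (batch, d1, d2, d3), illustrative
shapes = [(4, 8), (3, 6), (3, 6)]              # (r_k, d_k) per mode, illustrative
factors = [rng.normal(size=s) for s in shapes]
out = tcl(x, factors)                          # shape (2, 4, 3, 3)

# Parameters: one small matrix per mode versus one dense matrix on the
# flattened activation (bias terms ignored in both counts).
tcl_params = sum(r * d for r, d in shapes)     # 32 + 18 + 18
fc_params = (8 * 6 * 6) * (4 * 3 * 3)          # 288 * 36
```

The layer computes the same map as a fully connected layer whose weight matrix is the Kronecker product of the three factors, which is why it needs 68 parameters here instead of 10,368.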
- 23. Low-rank tensor regression. Reference: J. Kossaifi, Z. C. Lipton, A. Khanna, T. Furlanello, and A. Anandkumar, “Tensor Regression Networks,” arXiv preprint.
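
A low-rank tensor regression layer can be sketched as follows. The slide does not spell out the parameterization, so this example assumes the regression weights are stored in Tucker form (a small core plus one factor matrix per mode); all shapes and ranks are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, d3, n_out = 8, 6, 6, 10      # activation shape and outputs (illustrative)
r1, r2, r3, r4 = 3, 2, 2, 4          # assumed Tucker ranks (illustrative)

core = rng.normal(size=(r1, r2, r3, r4))
U = [rng.normal(size=s) for s in [(d1, r1), (d2, r2), (d3, r3), (n_out, r4)]]
bias = np.zeros(n_out)

x = rng.normal(size=(5, d1, d2, d3))
# Project the activations onto the input factors, then apply the small core
# and the output factor. The full weight tensor is never materialized.
z = np.einsum("bijk,ip,jq,kr->bpqr", x, U[0], U[1], U[2])
y = np.einsum("bpqr,pqrs,os->bo", z, core, U[3]) + bias

low_rank_params = core.size + sum(u.size for u in U)
full_params = d1 * d2 * d3 * n_out
```

With these sizes the factored form stores 136 numbers where a dense regression weight tensor would store 2,880, and lowering the ranks trades accuracy for further savings.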
- 24. Performance and rank
- 25. Speeding up Tensor Contractions
  - 1. Tensor contractions are a core primitive of multilinear algebra.
  - 2. BLAS 3: unbounded compute intensity (number of operations per I/O)
  - Consider single-index contractions $C = A B$, e.g. $C_{mnp} = A_{mnk} B_{kp}$
  - [Figure: $C = A B$ drawn slice by slice with $A(:,1,:)$ and $A(:,2,:)$; labels $A_{422}$, $B_{21}$, $C_{421}$]
- 26. Speeding up Tensor Contraction. Explicit permutation dominates, especially for small tensors. Consider $C_{mnp} = A_{km} B_{pkn}$:
  - 1. $A_{km} \to A_{mk}$
  - 2. $B_{pkn} \to B_{kpn}$
  - 3. $C_{mnp} \to C_{mpn}$
  - 4. $C_{m(pn)} = A_{mk} B_{k(pn)}$
  - 5. $C_{mpn} \to C_{mnp}$
  - [Figure: fraction of time spent in copies/transpositions (0 to 1) versus n (100 to 500); top: CPU, bottom: GPU; lines are shown with 1, 2, 3, and 6 transpositions]
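
The permute-then-GEMM recipe above can be reproduced in NumPy to make the extra copies visible. This is a sketch with illustrative sizes; NumPy arrays are row-major while the slide follows the column-major BLAS convention, so the example mirrors the index bookkeeping rather than the exact memory layout.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, k = 3, 4, 5, 6
A = rng.normal(size=(k, m))          # A_km
B = rng.normal(size=(p, k, n))       # B_pkn

# Steps 1 and 2: explicit copies that only reorder the data.
A_mk = np.ascontiguousarray(A.T)
B_kpn = np.ascontiguousarray(B.transpose(1, 0, 2))

# Step 3 reorders the output buffer to C_mpn; here the GEMM result is simply
# allocated in that layout, which is enough when nothing is accumulated into C.
# Step 4: a single matrix multiply on the flattened (pn) index.
C_mpn = (A_mk @ B_kpn.reshape(k, p * n)).reshape(m, p, n)

# Step 5: one more copy to return to the requested C_mnp layout.
C_mnp = np.ascontiguousarray(C_mpn.transpose(0, 2, 1))

reference = np.einsum("km,pkn->mnp", A, B)
```

Only one of these lines does arithmetic; the others move data, which is the overhead the slide's plot measures.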
- 27. Existing Primitives
  - GEMM: suboptimal for many small matrices
  - Pointer-to-pointer BatchedGEMM: available in MKL 11.3β and cuBLAS 4.1
  - $C[p] = \alpha \, \mathrm{op}(A[p]) \, \mathrm{op}(B[p]) + \beta \, C[p]$
  - `cublas<T>gemmBatched(cublasHandle_t handle, cublasOperation_t transA, cublasOperation_t transB, int M, int N, int K, const T* alpha, const T** A, int ldA, const T** B, int ldB, const T* beta, T** C, int ldC, int batchCount)`
- 28. Tensor Contraction with Extended BLAS Primitives: $C_{mn[p]} = A_{mk} B_{kn[p]}$
  - `cublasDgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K, &alpha, A, ldA1, 0, B, ldB1, ldB2, &beta, C, ldC1, ldC2, P)`
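
The argument pattern of the strided batched call above (leading dimension plus a per-batch stride, with stride 0 for the shared matrix A) can be emulated with a plain reference loop. This sketch is a Python illustration of the indexing only, not the cuBLAS implementation; `gemm_strided_batched` is a hypothetical helper that handles the no-transpose case on flat column-major buffers.

```python
import numpy as np


def gemm_strided_batched(M, N, K, alpha, A, ldA, strideA, B, ldB, strideB,
                         beta, C, ldC, strideC, batch):
    """Reference loop over flat column-major buffers (no transposes)."""
    for p in range(batch):
        for n in range(N):
            for m in range(M):
                acc = 0.0
                for k in range(K):
                    acc += (A[p * strideA + m + k * ldA]
                            * B[p * strideB + k + n * ldB])
                idx = p * strideC + m + n * ldC
                C[idx] = alpha * acc + beta * C[idx]


rng = np.random.default_rng(0)
M, N, K, P = 2, 3, 4, 5
A = rng.normal(size=(M, K))
B = rng.normal(size=(K, N, P))

A_buf = A.ravel(order="F")           # A[m + k*M]
B_buf = B.ravel(order="F")           # B[k + n*K + p*K*N]
C_buf = np.zeros(M * N * P)          # C[m + n*M + p*M*N]

# Stride 0 for A reuses the same matrix in every batch entry.
gemm_strided_batched(M, N, K, 1.0, A_buf, M, 0, B_buf, K, K * N,
                     0.0, C_buf, M, M * N, P)

C = C_buf.reshape((M, N, P), order="F")
reference = np.einsum("mk,knp->mnp", A, B)
```

No pointer arrays and no data movement are needed: the batch index p is reached purely through the strides, which is what distinguishes this primitive from the pointer-to-pointer batched GEMM.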
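- 29. Tensor Contraction with Extended BLAS Primitives: $C_{mnp} = A_{**} \times B_{***}$, with $C_{mnp} \equiv C[m + n \cdot \mathrm{ldC1} + p \cdot \mathrm{ldC2}]$. Parentheses mark a flattened index pair and square brackets mark the batched index. The slide's Kernel1 and Kernel2 columns are merged into one column here.

  | Case | Contraction | Kernels |
  |------|-------------|---------|
  | 1.1 | $A_{mk}B_{knp}$ | $C_{m(np)} = A_{mk}B_{k(np)}$; $C_{mn[p]} = A_{mk}B_{kn[p]}$ |
  | 1.2 | $A_{mk}B_{kpn}$ | $C_{mn[p]} = A_{mk}B_{k[p]n}$; $C_{m[n]p} = A_{mk}B_{kp[n]}$ |
  | 1.3 | $A_{mk}B_{nkp}$ | $C_{mn[p]} = A_{mk}B_{nk[p]}$ |
  | 1.4 | $A_{mk}B_{pkn}$ | $C_{m[n]p} = A_{mk}B_{pk[n]}$ |
  | 1.5 | $A_{mk}B_{npk}$ | $C_{m(np)} = A_{mk}B_{(np)k}$; $C_{mn[p]} = A_{mk}B_{n[p]k}$ |
  | 1.6 | $A_{mk}B_{pnk}$ | $C_{m[n]p} = A_{mk}B_{p[n]k}$ |
  | 2.1 | $A_{km}B_{knp}$ | $C_{m(np)} = A_{km}B_{k(np)}$; $C_{mn[p]} = A_{km}B_{kn[p]}$ |
  | 2.2 | $A_{km}B_{kpn}$ | $C_{mn[p]} = A_{km}B_{k[p]n}$; $C_{m[n]p} = A_{km}B_{kp[n]}$ |
  | 2.3 | $A_{km}B_{nkp}$ | $C_{mn[p]} = A_{km}B_{nk[p]}$ |
  | 2.4 | $A_{km}B_{pkn}$ | $C_{m[n]p} = A_{km}B_{pk[n]}$ |
  | 2.5 | $A_{km}B_{npk}$ | $C_{m(np)} = A_{km}B_{(np)k}$; $C_{mn[p]} = A_{km}B_{n[p]k}$ |
  | 2.6 | $A_{km}B_{pnk}$ | $C_{m[n]p} = A_{km}B_{p[n]k}$ |
  | 3.1 | $A_{nk}B_{kmp}$ | $C_{mn[p]} = B_{km[p]}A_{nk}$ |
  | 3.2 | $A_{nk}B_{kpm}$ | $C_{mn[p]} = B_{k[p]m}A_{nk}$ |
  | 3.3 | $A_{nk}B_{mkp}$ | $C_{mn[p]} = B_{mk[p]}A_{nk}$ |
  | 3.4 | $A_{nk}B_{pkm}$ | (none listed) |
  | 3.5 | $A_{nk}B_{mpk}$ | $C_{mn[p]} = B_{m[p]k}A_{nk}$ |
  | 3.6 | $A_{nk}B_{pmk}$ | (none listed) |
  | 4.1 | $A_{kn}B_{kmp}$ | $C_{mn[p]} = B_{km[p]}A_{kn}$ |
  | 4.2 | $A_{kn}B_{kpm}$ | $C_{mn[p]} = B_{k[p]m}A_{kn}$ |
  | 4.3 | $A_{kn}B_{mkp}$ | $C_{mn[p]} = B_{mk[p]}A_{kn}$ |
  | 4.4 | $A_{kn}B_{pkm}$ | (none listed) |
  | 4.5 | $A_{kn}B_{mpk}$ | $C_{mn[p]} = B_{m[p]k}A_{kn}$ |
  | 4.6 | $A_{kn}B_{pmk}$ | (none listed) |
  | 5.1 | $A_{pk}B_{kmn}$ | $C_{(mn)p} = B_{k(mn)}A_{pk}$; $C_{m[n]p} = B_{km[n]}A_{pk}$ |
  | 5.2 | $A_{pk}B_{knm}$ | $C_{m[n]p} = B_{k[n]m}A_{pk}$ |
  | 5.3 | $A_{pk}B_{mkn}$ | $C_{m[n]p} = B_{mk[n]}A_{pk}$ |
  | 5.4 | $A_{pk}B_{nkm}$ | (none listed) |
  | 5.5 | $A_{pk}B_{mnk}$ | $C_{(mn)p} = B_{(mn)k}A_{pk}$; $C_{m[n]p} = B_{m[n]k}A_{pk}$ |
  | 5.6 | $A_{pk}B_{nmk}$ | (none listed) |
  | 6.1 | $A_{kp}B_{kmn}$ | $C_{(mn)p} = B_{k(mn)}A_{kp}$; $C_{m[n]p} = B_{km[n]}A_{kp}$ |
  | 6.2 | $A_{kp}B_{knm}$ | $C_{m[n]p} = B_{k[n]m}A_{kp}$ |
  | 6.3 | $A_{kp}B_{mkn}$ | $C_{m[n]p} = B_{mk[n]}A_{kp}$ |
  | 6.4 | $A_{kp}B_{nkm}$ | (none listed) |
  | 6.5 | $A_{kp}B_{mnk}$ | $C_{(mn)p} = B_{(mn)k}A_{kp}$; $C_{m[n]p} = B_{m[n]k}A_{kp}$ |
  | 6.6 | $A_{kp}B_{nmk}$ | (none listed) |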
- 30. A new primitive: StridedBatchedGEMM Performance on par with pure GEMM (P100 and beyond).
- 31. Applications: Tucker Decomposition, $T_{mnp} = G_{ijk} A_{mi} B_{nj} C_{pk}$ (summation over the repeated indices $i, j, k$)
  - Main steps in the algorithm:
    - $Y_{mjk} = T_{mnp} B^{t}_{nj} C^{t}_{pk}$
    - $Y_{ink} = T_{mnp} A^{t+1}_{mi} C^{t}_{pk}$
    - $Y_{ijp} = T_{mnp} B^{t+1}_{nj} A^{t+1}_{mi}$
  - [Figure: performance on Tucker decomposition; time (sec, log scale from $10^{-2}$ to $10^{6}$) versus n (20 to 120) for TensorToolbox, BTAS, Cyclops, CPU Batched, GPU Batched]
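
The three contractions above form the inner loop of an alternating Tucker algorithm. The slide lists only the contractions, so the sketch below assumes the remaining step is the one used in higher-order orthogonal iteration: each factor is set to the leading left singular vectors of the matching unfolding of Y. Function names and sizes are illustrative.

```python
import numpy as np


def leading_left_singular_vectors(mat, rank):
    u, _, _ = np.linalg.svd(mat, full_matrices=False)
    return u[:, :rank]


def tucker_hooi(T, ranks, iters=10):
    """Alternating Tucker fit; assumes SVD-based factor updates (HOOI)."""
    m, n, p = T.shape
    i, j, k = ranks
    # Initialize B and C from the unfoldings of T.
    B = leading_left_singular_vectors(T.transpose(1, 0, 2).reshape(n, -1), j)
    C = leading_left_singular_vectors(T.transpose(2, 0, 1).reshape(p, -1), k)
    for _ in range(iters):
        Y = np.einsum("mnp,nj,pk->mjk", T, B, C)
        A = leading_left_singular_vectors(Y.reshape(m, -1), i)
        Y = np.einsum("mnp,mi,pk->ink", T, A, C)
        B = leading_left_singular_vectors(Y.transpose(1, 0, 2).reshape(n, -1), j)
        Y = np.einsum("mnp,nj,mi->ijp", T, B, A)
        C = leading_left_singular_vectors(Y.transpose(2, 0, 1).reshape(p, -1), k)
    G = np.einsum("mnp,mi,nj,pk->ijk", T, A, B, C)
    return G, A, B, C


# Build a tensor with exact multilinear rank (2, 3, 2) and recover it.
rng = np.random.default_rng(0)
G0 = rng.normal(size=(2, 3, 2))
A0, B0, C0 = (rng.normal(size=s) for s in [(6, 2), (7, 3), (5, 2)])
T = np.einsum("ijk,mi,nj,pk->mnp", G0, A0, B0, C0)

G, A, B, C = tucker_hooi(T, ranks=(2, 3, 2))
T_hat = np.einsum("ijk,mi,nj,pk->mnp", G, A, B, C)
```

Each einsum in the loop is a pair of single-index contractions of the kind tabulated on the previous slides, which is why faster contraction primitives translate directly into a faster decomposition.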
- 32. Tensor Sketches: randomized dimensionality reduction through sketching
  - Complexity independent of tensor order: exponential gain!
  - [Figure: tensor T mapped to sketch s with random signs +1, +1, -1]
  - Applications: tensor decomposition via sketching; visual question answering
  - [Figure labels: CNN (image feature, C × W × H) and RNN (question feature, length L; “What is the mustache made of?”), MCT, average pooling, FC, ReLU, BatchNorm, FC, Softmax, answer “Banana”]
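
One standard way to sketch a tensor without ever forming it is to combine count sketches in the frequency domain. The example below is a minimal sketch of that construction for a rank-1 matrix $u v^\top$; the hash and sign arrays are random illustrative choices, and whether the MCT layer in the talk uses exactly this variant is an assumption.

```python
import numpy as np


def count_sketch(x, h, s, b):
    """Hash each entry of x into one of b buckets with a random sign."""
    out = np.zeros(b)
    np.add.at(out, h, s * x)   # out[h[i]] += s[i] * x[i]
    return out


rng = np.random.default_rng(0)
n, b = 20, 8
u, v = rng.normal(size=n), rng.normal(size=n)
h1, h2 = rng.integers(0, b, size=n), rng.integers(0, b, size=n)
s1, s2 = rng.choice([-1.0, 1.0], size=n), rng.choice([-1.0, 1.0], size=n)

# Fast path: sketch each factor, then multiply in the frequency domain.
fast = np.fft.ifft(np.fft.fft(count_sketch(u, h1, s1, b))
                   * np.fft.fft(count_sketch(v, h2, s2, b))).real

# Slow path: sketch the full n x n outer product with the combined hash
# H(i, j) = (h1[i] + h2[j]) mod b and sign s1[i] * s2[j].
H = (h1[:, None] + h2[None, :]) % b
S = s1[:, None] * s2[None, :]
direct = count_sketch(np.outer(u, v).ravel(), H.ravel(), S.ravel(), b)
```

The fast path touches each factor once and never builds the outer product, so adding more factors adds more FFTs rather than multiplying the size of the object being sketched.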
- 33. MCT in Visual Question & Answering. [Figure, text garbled in extraction: the same pipeline as on the previous slide, with labels CNN, RNN, a question beginning “What is the musta…”, FC, ReLU, FC, Softmax]
- 34. Multimodal Tensor Pooling. [Figure labels: text feature (L) and image feature (C × W × H); spatial sketch and count sketch to dimensions d1, d2, d3 and d4; 3D FFT, 1D FFT, 3D IFFT (optional)]
- 35. Tensor Decompositions
- 36. Extracting Topics from Documents. [Figure: documents containing words such as “campus”, “police”, “witness” are mapped to topic proportions over topics such as crime, sports, education.] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, Y. K. Liu, “Two SVDs Suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation,” NIPS 2012.
- 37. Tensor Methods for Topic Modeling
  - Topic-word matrix: $P[\text{word} = i \mid \text{topic} = j]$, with linearly independent columns
  - Moment tensor: co-occurrence of word triplets
  - [Figure: the moment tensor written as a sum of rank-1 terms, one per topic (crime, sports, education), over words such as “campus”, “police”, “witness”]
- 38. Tensors vs. Variational Inference. Criterion: Perplexity = exp[−likelihood].
  - Learning topics from PubMed on Spark, 8 million articles. [Figure: running time (0 to 10 ×10^4) and perplexity (10^3 to 10^5) for Tensor vs. Variational]
  - Learning network communities from social network data: Facebook n ∼ 20k, Yelp n ∼ 40k, DBLP-sub n ∼ 1e5, DBLP n ∼ 1e6. [Figure: running time (10^2 to 10^6) and error (10^-2 to 10^1) on FB, YP, DBLPsub, DBLP]
  - F. Huang, U. N. Niranjan, M. Hakeem, A. Anandkumar, “Online tensor methods for training latent variable models,” JMLR 2014.
- 39. Tensors vs. Variational Inference (same plots and reference as slide 38), annotated: orders of magnitude faster, more accurate. F. Huang, U. N. Niranjan, M. Hakeem, A. Anandkumar, “Online tensor methods for training latent variable models,” JMLR 2014.
- 40. Outline: (1) Introduction, (2) Distributed Deep Learning Using MXNet, (3) Learning in Multiple Dimensions, (4) Conclusion
- 41. Conclusion
  - Distributed deep learning at scale
    - MXNet has many attractive features: flexible programming, portable, highly efficient
    - Easy to deploy large-scale DL on AWS cloud: Deep Learning AMI, CloudFormation templates
  - Tensors are the future of ML
    - Tensor contractions: space savings in deep architectures
    - New primitives speed up tensor contractions: extended BLAS
