
Anima Anandkumar at AI Frontiers: Modern ML: Deep, Distributed, Multi-Dimensional


As data and models scale, multiple processing units become necessary for both training and inference. SignSGD is a gradient compression algorithm that transmits only the sign of the stochastic gradients during distributed training, using 32 times less communication per iteration than distributed SGD. We show that signSGD obtains a free lunch both in theory and in practice: no loss in accuracy while yielding speedups. Pushing the current boundaries of deep learning also requires using multiple dimensions and modalities. These can be encoded into tensors, which are natural extensions of matrices. These functionalities are available in the TensorLy package with multiple backend interfaces for large-scale deep learning.
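The 32x figure follows from arithmetic: a float32 gradient entry costs 32 bits, while its sign costs 1 bit. A minimal sketch of packing gradient signs into bits (function names are illustrative, not from the signSGD paper's code):

```python
# Sketch: why transmitting only gradient signs cuts communication ~32x.
# A float32 entry is 32 bits; a sign is 1 bit. Names here are illustrative.

def pack_signs(gradient):
    """Pack the signs of a list of floats into a bytearray, 1 bit each."""
    packed = bytearray((len(gradient) + 7) // 8)
    for i, g in enumerate(gradient):
        if g >= 0:                        # encode sign(g) >= 0 as bit 1
            packed[i // 8] |= 1 << (i % 8)
    return packed

def unpack_signs(packed, n):
    """Recover +1/-1 signs for n entries."""
    return [1 if (packed[i // 8] >> (i % 8)) & 1 else -1 for i in range(n)]

grad = [0.3, -1.2, 0.0, -0.7, 2.5, -0.1, 0.9, -3.3, 0.2]
bits = pack_signs(grad)
signs = unpack_signs(bits, len(grad))

float32_bytes = 4 * len(grad)   # 36 bytes if sent as float32
packed_bytes = len(bits)        # 2 bytes of sign bits
```

In a real system the packed bits would be sent over the network in place of the dense gradient; the momentum and error-feedback details of the full algorithm are omitted here.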



  1. Anima Anandkumar: MODERN ML: DEEP, DISTRIBUTED, MULTI-DIMENSIONAL
  2. TRINITY OF AI: DATA, COMPUTE, ALGORITHMS
  3. COMPUTE INFRASTRUCTURE FOR AI: GPU. Moore’s law, a supercharged law: more than a billion operations per image. NVIDIA GPUs enable parallel operations, enabling large-scale AI.
  4. DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION. A parameter server coordinates GPU 1 and GPU 2, each training with 1/2 of the data.
  5. DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION. Can the gradients exchanged between each GPU and the parameter server be compressed?
  6. DISTRIBUTED TRAINING BY MAJORITY VOTE. Each GPU sends sign(g) to the parameter server, which broadcasts back sign[sum(sign(g))]. Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, Yu-Xiang Wang, Anima Anandkumar.
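The majority-vote scheme on this slide can be sketched in a few lines; the function names below are illustrative, not from the paper's code:

```python
# Minimal sketch (not the authors' implementation) of signSGD with
# majority vote: each worker sends sign(g); the server returns
# sign of the coordinate-wise sum of the workers' signs.

def sign(x):
    return 1 if x >= 0 else -1

def worker_message(gradient):
    # Each worker transmits only 1 bit per coordinate.
    return [sign(g) for g in gradient]

def server_aggregate(messages):
    # Majority vote over workers, coordinate by coordinate.
    return [sign(sum(col)) for col in zip(*messages)]

# Three workers with noisy gradients of the same model.
g1 = [0.5, -0.2, 1.0]
g2 = [0.3, 0.1, -0.4]
g3 = [-0.1, -0.6, 0.8]
vote = server_aggregate([worker_message(g) for g in (g1, g2, g3)])
# Each worker then applies the update w[i] -= lr * vote[i].
```

Because the server also sends back only signs, communication is 1 bit per coordinate in both directions.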
  7. SIGNSGD PROVIDES A “FREE LUNCH”: throughput gain with almost the same accuracy (p3.2xlarge machines on AWS, ResNet-50 on ImageNet).
  8. SIGNSGD ACROSS DOMAINS AND ARCHITECTURES: huge throughput gain!
  9. SIGNSGD IS BYZANTINE FAULT TOLERANT: the majority vote makes signSGD robust to faulty or adversarial workers.
  10. TAKE-AWAYS FOR SIGNSGD • Convergence even under biased gradients and noise. • Faster convergence than SGD in theory and in practice. • For distributed training, similar variance reduction to SGD. • In practice, similar accuracy but with far less communication. PyTorch code at https://github.com/PermiJW/signSGD-with-Majority-Vote
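To make the convergence claim concrete, here is a toy sketch (not the repository's code) of the plain signSGD update, w -= lr * sign(grad), driving a simple quadratic toward its minimum; the step size and iteration count are arbitrary choices for illustration:

```python
# Toy sketch: plain signSGD minimizing f(w) = sum(w_i^2).
# The update moves each coordinate by a fixed step lr in the
# direction opposite its gradient's sign, so |w_i| shrinks until
# it oscillates within ~lr of the optimum.

def sign(x):
    return 1.0 if x >= 0 else -1.0

w = [3.0, -2.0, 0.5]
lr = 0.01
for _ in range(1000):
    grad = [2.0 * wi for wi in w]                       # gradient of sum(w_i^2)
    w = [wi - lr * sign(g) for wi, g in zip(w, grad)]
# After enough steps every coordinate hovers near 0 within ~lr.
```

The fixed-magnitude step is also why signSGD in practice pairs with learning-rate schedules: the final oscillation radius is proportional to lr.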
  11. TENSORS: MULTI-DIMENSIONAL PROCESSING. Image: 3 dimensions (width × height × channels). Video: 4 dimensions (width × height × channels × time).
  12. TENSOR: EXTENSION OF MATRIX
  13. OPERATIONS ON TENSORS: TENSOR CONTRACTION
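A tensor contraction sums a tensor against a matrix (or another tensor) along one mode, generalizing matrix multiplication. A small NumPy sketch with illustrative shapes:

```python
import numpy as np

# Sketch of a tensor contraction: contract a 3-way tensor T (I x J x K)
# with a matrix M (J x R) along mode 2 (the J axis), producing an
# I x R x K tensor. Shapes are illustrative.
T = np.arange(24, dtype=float).reshape(2, 3, 4)   # I=2, J=3, K=4
M = np.ones((3, 5))                               # J=3, R=5

# out[i, r, k] = sum_j T[i, j, k] * M[j, r]
out = np.einsum('ijk,jr->irk', T, M)
```

With M all ones, each output entry is simply the sum of T over the contracted mode, which makes the operation easy to check by hand.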
  14. DEEP NEURAL NETS: TRANSFORMING TENSORS
  15. DEEP TENSORIZED NETWORKS. Jean Kossaifi, Zack Chase Lipton, Aran Khanna, Tommaso Furlanello, Anima Anandkumar. PyTorch notebooks: https://github.com/JeanKossaifi/tensorly-notebooks
  16. SPACE SAVING IN DEEP TENSORIZED NETWORKS
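The space saving comes from parameter counting: a dense fully connected layer stores one weight per input-output pair, while a Tucker-style factorization stores a small core plus one factor matrix per mode. A back-of-the-envelope sketch with illustrative sizes and ranks (none of these numbers are from the slides):

```python
# Illustrative parameter count: a dense layer mapping a flattened
# 8x8x256 activation to 4096 units, versus a Tucker-style factorization
# of the same weight viewed as a 4-way tensor (8, 8, 256, 4096)
# with ranks (4, 4, 64, 512). All sizes/ranks are made up for the sketch.

dense_params = (8 * 8 * 256) * 4096               # full weight matrix

core = 4 * 4 * 64 * 512                           # Tucker core tensor
factors = 8 * 4 + 8 * 4 + 256 * 64 + 4096 * 512   # one factor per mode
tucker_params = core + factors

compression = dense_params / tucker_params        # > 20x at these ranks
```

The achievable compression depends entirely on the chosen ranks; lower ranks save more space but constrain the layer more.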
  17. TENSORLY: HIGH-LEVEL API FOR TENSOR ALGEBRA • Python programming • User-friendly API • Multiple backends: flexible + scalable • Example notebooks in repository
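One basic primitive such a tensor-algebra API exposes is mode-n unfolding (TensorLy's `tl.unfold`), which flattens a tensor into a matrix with the chosen mode as rows. A self-contained NumPy sketch of that semantics, so the example carries no extra dependency:

```python
import numpy as np

# Sketch of mode-n unfolding, the matricization primitive behind
# tensor-algebra APIs like TensorLy's tl.unfold; implemented here
# directly in NumPy for illustration.

def unfold(tensor, mode):
    """Unfold a tensor into a matrix whose rows index the given mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

T = np.arange(24).reshape(2, 3, 4)
U0 = unfold(T, 0)   # shape (2, 12): mode-0 fibers as rows
U1 = unfold(T, 1)   # shape (3, 8):  mode-1 fibers as rows
```

Unfolding is what lets tensor contractions and decompositions be delegated to ordinary matrix kernels on any backend.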
  18. TENSORS: TOPIC DETECTION IN TEXT. Co-occurrence of word triplets turns a library of news articles into a list of topics (example topic words: storm, World Series, Australia, stock market, Washington, health crisis, machine learning); used in Amazon Comprehend.
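The "co-occurrence of word triplets" is an empirical third-order moment: a vocabulary-cubed tensor counting how often each ordered triple of distinct words appears together in a document. A minimal sketch on a toy corpus (illustrative only, not the production implementation behind Amazon Comprehend):

```python
from itertools import permutations

import numpy as np

# Sketch: build the empirical third-order word co-occurrence tensor
# used by tensor methods for topic modeling. Toy vocabulary and
# documents; real pipelines use sparse representations.

vocab = {'storm': 0, 'market': 1, 'stock': 2}
docs = [['storm', 'market', 'stock'],
        ['stock', 'market', 'storm'],
        ['market', 'stock', 'storm']]

V = len(vocab)
M3 = np.zeros((V, V, V))
for doc in docs:
    ids = [vocab[w] for w in doc]
    # Count every ordered triple of distinct word positions in the doc.
    for i, j, k in permutations(range(len(ids)), 3):
        M3[ids[i], ids[j], ids[k]] += 1
M3 /= len(docs)   # average over documents
```

Decomposing this (suitably whitened) tensor into rank-1 components recovers the topics; the counting step above is the data-facing half of the method.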
  19. UNSUPERVISED LEARNING OF TOPIC MODELS THROUGH TENSOR METHODS. Example topics: justice, education, sports.
  20. TENSOR-BASED LDA TRAINING IS FASTER • Mallet is an open-source framework for topic modeling • Benchmarks on the AWS SageMaker platform • Built into the AWS Comprehend NLP service. [Charts: training time in minutes vs. number of topics for NYTimes (300,000 documents) and PubMed (8 million documents), comparing the spectral (tensor) method with Mallet; the spectral method is 12x–22x faster on average.]
  21. A New Vision for Autonomy: Center for Autonomous Systems and Technologies
  22. CAST @ CALTECH: DRONE TESTING LAB
  23. CAST @ CALTECH: LEARNING TO LAND
  24. RESEARCH LEADERS AT NVIDIA (slide marked NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.) Chief Scientist: Bill Dally. Learning & Perception: Jan Kautz. Graphics: Dave Luebke, Alex Keller, Aaron Lefohn. Architecture: Steve Keckler, Dave Nellans, Mike O’Connor. Programming: Michael Garland. VLSI: Brucek Khailany. Circuits: Tom Gray. Networks: Larry Dennison. Robotics: Dieter Fox. Computer vision: Sanja Fidler. Core ML: me! Applied research: Bryan Catanzaro.
