
Trinity of AI: data, algorithms and cloud

AI at scale requires a perfect storm of data, algorithms and cloud infrastructure. Modern deep learning requires large amounts of training data. We develop methods that improve data collection, aggregation and augmentation, involving active learning, partial feedback, crowdsourcing, and generative models. We analyze large-scale machine learning methods for distributed training, and show that gradient quantization can yield the best of both worlds: accuracy and communication efficiency. We extend matrix methods to higher dimensions using tensor-algebraic techniques and show superior performance. Finally, at AWS, we are developing robust software frameworks and AI cloud services at all levels of the stack.

Published in: Technology


  1. ADVANCES IN TRINITY OF AI: DATA, ALGORITHMS & CLOUD. Anima Anandkumar, Principal Scientist at Amazon AI and Bren Professor at Caltech.
  2. TRINITY FUELING ARTIFICIAL INTELLIGENCE: ALGORITHMS, DATA, INFRASTRUCTURE.
  3. TRINITY FUELING ARTIFICIAL INTELLIGENCE. DATA: collection, aggregation, augmentation.
  4. TRINITY FUELING ARTIFICIAL INTELLIGENCE. ALGORITHMS: optimization, scalability, multi-dimensionality. DATA: collection, aggregation, augmentation.
  5. TRINITY FUELING ARTIFICIAL INTELLIGENCE. ALGORITHMS: optimization, scalability, multi-dimensionality. DATA: collection, aggregation, augmentation. ML STACK ON CLOUD: application services, ML platform, ML frameworks.
  6. DATA. Collection: active learning, partial labels. Aggregation: crowdsourcing models. Augmentation: generative models, symbolic expressions.
  7. ACTIVE LEARNING. Goal: reach SOTA with a smaller labeled dataset. Active learning is well analyzed in theory, but in practice has been used only with small classical models. Can it work at scale with deep learning? [Diagram: labeled vs. unlabeled data pools.]
  8. TASK: NAMED ENTITY RECOGNITION.
  9. RESULTS. NER task on the largest open benchmark (OntoNotes). Active learning heuristics: least confidence (LC); max. normalized log-probability (MNLP). Deep active learning matches SOTA with just 25% of the data on English and 30% on Chinese, and matches the best shallow model (trained on full data) with 12% of the data on English and 17% on Chinese. [Plots: test F1 score vs. % of words annotated, English and Chinese; curves for MNLP, LC and RAND against the best deep and shallow models.]
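The two acquisition heuristics named above, least confidence (LC) and max. normalized log-probability (MNLP), can be sketched in a few lines. This is a minimal illustration with hypothetical helper names, not the authors' implementation; it assumes each sentence is summarized by the per-token log-probabilities of the model's chosen tags.

```python
import numpy as np

def least_confidence(token_log_probs):
    """LC score: negative log-probability of the model's most likely
    tag sequence. Higher score = less confident = more informative."""
    return -float(np.sum(token_log_probs))

def mnlp(token_log_probs):
    """Max. normalized log-probability: LC divided by sentence length,
    so long sentences are not selected merely for being long."""
    return -float(np.mean(token_log_probs))

def select_batch(sentences, scores, k):
    """Pick the k highest-scoring (most uncertain) sentences to annotate."""
    order = np.argsort(scores)[::-1]
    return [sentences[i] for i in order[:k]]

# Toy example: a short, confident sentence vs. a longer, uncertain one.
short = [-0.05, -0.02]      # model confident on every token
long_ = [-0.4] * 10         # model uncertain on every token
scores = [mnlp(short), mnlp(long_)]
picked = select_batch(["short", "long"], scores, k=1)  # -> ["long"]
```

Under plain LC the longer sentence would win almost automatically; MNLP's length normalization is what makes the comparison fair in the low-data regime.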
  10. TAKE-AWAYS. Uncertainty sampling works; normalizing for sentence length helps in the low-data regime. With active learning, deep beats shallow even in the low-data regime. With active learning, SOTA is reached with far fewer samples.
  11. ACTIVE LEARNING WITH PARTIAL FEEDBACK. Hierarchical class labeling: labeling effort is proportional to the number of binary questions asked (e.g. "dog?" splits the label space into dog vs. non-dog, yielding partial labels). Actively pick the most informative questions.
  12. RESULTS ON TINY IMAGENET (100K SAMPLES). Yields 8% higher accuracy at 30% of the questions (vs. uniform). Obtains full annotation with 40% fewer binary questions. Methods compared: Uniform (inactive data, inactive questions), AL-ME (active data, inactive questions), AQ-ERC (inactive data, active questions), ALPF-ERC (active data, active questions). [Plot: accuracy vs. number of questions asked.]
  13. TWO TAKE-AWAYS. Don’t annotate from scratch: select questions actively based on the learned model. Don’t sleep on partial labels: re-train the model from partial labels.
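The "re-train from partial labels" idea can be illustrated with the standard partial-label objective: maximize the total probability mass the model assigns to the surviving candidate set after each answered question. This is a sketch with hypothetical names, not the paper's implementation.

```python
import numpy as np

def partial_label_loss(logits, candidate_set):
    """Negative log of the probability mass assigned to the candidate
    set (e.g. all dog breeds after answering "dog? yes"). With a single
    candidate this reduces to ordinary cross-entropy."""
    z = logits - logits.max()              # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return -np.log(probs[list(candidate_set)].sum())

logits = np.array([2.0, 1.0, 0.1, -1.0])
full = partial_label_loss(logits, {0})        # exact label known
partial = partial_label_loss(logits, {0, 1})  # narrowed to two classes
```

Coarser feedback can only lower the loss (`partial <= full`), so the model gets a usable training signal even before the label is fully resolved.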
  14. CROWDSOURCING: AGGREGATION OF CROWD ANNOTATIONS. Majority rule: simple and common, but wasteful; it ignores the varying quality of different workers.
  15. CROWDSOURCING: AGGREGATION OF CROWD ANNOTATIONS. Majority rule: simple and common, but wasteful; it ignores the varying quality of different workers. Annotator-quality models: can improve accuracy, but are hard; quality must be estimated without ground truth.
  16. SOME INTUITIONS. Use majority rule to estimate annotator quality (probability of correctness). Justification: majority rule approaches the ground truth when there are enough workers. Downside: requires a large number of annotations per example for majority rule to be correct.
  17. OUR APPROACH. Use the ML model being trained as a stand-in for ground truth to estimate annotator quality (probability of correctness). Use the annotator-quality model to train the ML model with a weighted loss. Justification: as iterations proceed, the ML model agrees more with the correct workers. Works with a single annotation per sample!
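One alternating round of this approach can be sketched as follows: estimate each worker's quality from agreement with the current model, then weight each example's loss by its annotator's estimated quality. A minimal sketch with hypothetical names, assuming single annotation per sample, not the authors' code.

```python
import numpy as np

def update_worker_quality(model_preds, labels, worker_ids, n_workers, eps=1e-6):
    """Estimate each worker's probability of being correct by how often
    their labels agree with the current model's predictions (the model
    stands in for the missing ground truth)."""
    agree = np.zeros(n_workers)
    total = np.zeros(n_workers)
    for pred, lab, w in zip(model_preds, labels, worker_ids):
        total[w] += 1
        agree[w] += float(pred == lab)
    return (agree + eps) / (total + 2 * eps)   # smoothed agreement rate

def example_weights(quality, worker_ids):
    """Weight each (singly-annotated) example's loss by its annotator's
    estimated quality, down-weighting unreliable workers."""
    return quality[np.asarray(worker_ids)]

# Toy round: worker 0 always agrees with the model, worker 1 never does.
preds   = [1, 0, 1, 1, 0, 1]
labels  = [1, 0, 1, 0, 1, 0]
workers = [0, 0, 0, 1, 1, 1]
q = update_worker_quality(preds, labels, workers, n_workers=2)
w = example_weights(q, workers)
```

Re-training the model with the weighted loss and repeating the quality update is what drives the convergence argument on this slide: correct workers gain weight as the model improves.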
  18. LABELING ONCE IS OPTIMAL: BOTH IN THEORY AND PRACTICE. Theorem: under a fixed budget, generalization error is minimized with a single annotation per sample. Assumptions: the best predictor is accurate enough (under no label noise); simplified case where all workers have the same quality; probability of being correct > 83%. [Plot: MS-COCO dataset, fixed budget of 35k annotations, accuracy vs. number of workers per sample; single annotation is 5% better than majority rule.]
  19. DATA AUGMENTATION. To improve generalization, augment the training data. Mostly used in computer vision, with simple techniques: rotation, cropping, etc. Can we use more sophisticated approaches?
  20. DATA AUGMENTATION. Synthetic data: merits are high-quality rendering, full annotation for free, and infinite data generation; perils are domain mismatch and rendering tuned for visual appeal rather than classification. GANs: merits are capturing the statistics of natural images and being learnable; perils are the low quality of generated images and introduced artifacts. Our GAN-based framework, Mr.GANs, narrows the gap between synthetic and real data.
  21. MIXED-REALITY GENERATIVE ADVERSARIAL NETWORKS (MR-GAN). CIFAR-DA 4: ~1 million synthetic images from 3D models, ~0.25% real images from CIFAR-100, 4 classes. Improvement over training on real data alone: real + synthetic: 5.43% (stage 0); real + CycleGAN on synthetic: 5.09% (stage 1); real + Mr.GAN on synthetic: 8.85% (stage 2). Each stage maps the synthetic distribution closer to the real one: D(P(real-2) || P(synt-2)) < D(P(real-1) || P(synt-1)) < D(P(real) || P(synt)).
  22. DATA AUGMENTATION 2: SYMBOLIC EXPRESSIONS. Goal: learn a domain of functions (sin, cos, log, add, ...). Training on numerical input-output pairs alone does not generalize. Data augmentation with symbolic expressions efficiently encodes relationships between functions. Solution: design networks that use both symbolic and numerical data.
  23. ARCHITECTURE: TREE LSTM. Symbolic expression trees encode identities such as sin²(x) + cos²(x) = 1; function evaluation trees encode evaluations such as sin(−2.5) ≈ −0.6. Decimal trees encode numbers via their decimal representation (e.g. a decimal tree for 2.5). Together these can encode any expression, function evaluation, and number.
  24. RESULTS. Vastly improved numerical evaluation: 90% over a function-fitting baseline. Generalization to verifying symbolic equations of greater depth. Combining symbolic and numerical data improves generalization on both tasks, symbolic and numerical evaluation. Accuracy: LSTM (symbolic): 76.40%; TreeLSTM (symbolic): 93.27%; TreeLSTM (symbolic + numeric): 96.17%.
  25. ALGORITHMS. Optimization: analysis of convergence. Scalability: gradient quantization. Multi-dimensionality: tensor algebra.
  26. OPTIMIZATION IN DEEP LEARNING. Data; a deep, layered model with lots of parameters; descend the gradient of the objective function.
  27. THE DEEP LEARNING COOKBOOK. Randomly augmented data, with {insert arbitrary} regularizer applied; a deep, layered model with lots of parameters, dropped out / batch-normalized / bypassed; descend a running average of rescaled, clipped stochastic gradients.
  28. DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION. [Diagram: a parameter server coordinating GPU 1 and GPU 2, each training on 1/2 of the data.]
  29. DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION. [Same diagram: can we compress what is sent over each link to the parameter server?]
  30. DISTRIBUTED TRAINING BY MAJORITY VOTE. Each GPU sends sign(g) to the parameter server; the server broadcasts back sign[sum(sign(g))].
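The majority-vote scheme on this slide can be sketched in a few lines. This is a single-process NumPy simulation with hypothetical helper names, not a real parameter server.

```python
import numpy as np

def worker_message(grad):
    """Each worker sends only the sign of its stochastic gradient:
    1 bit per parameter instead of a 32-bit float."""
    return np.sign(grad)

def server_aggregate(sign_grads):
    """The parameter server takes an element-wise majority vote over the
    workers' signs and broadcasts the result, again 1 bit per parameter."""
    return np.sign(np.sum(sign_grads, axis=0))

def signsgd_step(params, sign_grads, lr):
    """One signSGD update using the majority-vote direction."""
    return params - lr * server_aggregate(sign_grads)

# Three workers with noisy gradients; the vote recovers the shared direction.
g1 = np.array([0.3, -0.2, 0.1])
g2 = np.array([0.5,  0.4, 0.2])
g3 = np.array([0.1, -0.6, 0.3])
votes = [worker_message(g) for g in (g1, g2, g3)]
agg = server_aggregate(votes)   # -> [ 1., -1.,  1.]
```

Note that communication is compressed in both directions: workers upload signs, and the server broadcasts only the sign of the vote, which is the setting the theory slides analyze.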
  31. SIGN QUANTIZATION COMPETITIVE ON IMAGENET. Communication gain without performance loss. Signum: the sign of the momentum. Signum's accuracy is similar to Adam's.
  32. SGD THEORY VS. SIGNSGD THEORY. Notation: the gradient vector; the Lipschitz scalar (curvature); the variance scalar; and their density analogues (gradient density, Lipschitz density, variance density). SignSGD wins when the objective is mostly flat (sparse curvature vector); SGD wins when the objective has sparse gradients.
  33. DISTRIBUTED SIGNSGD: MAJORITY VOTE THEORY. If gradients are unimodal and symmetric (reasonable by the central limit theorem), majority vote with M workers converges at a rate with the same variance reduction as SGD.
  34. TAKE-AWAYS FOR SIGNSGD. Converges even under biased gradients and noise. Faster than SGD along sparse curvature directions. For distributed training, the same variance reduction as SGD. In practice, accuracy similar to Adam but with far less communication.
  35. KNOWLEDGE DISTILLATION. Knowledge distillation conveys "dark knowledge" from teacher to student.
  36. INTUITIONS: GROUND-TRUTH BASELINE. Only the correct category contributes to the loss function. [Diagram: one-hot ground truth vs. model outputs, and their contributions to the cross-entropy loss.]
  37. KNOWLEDGE DISTILLATION. All categories in the teacher's output contribute to the loss. If the teacher is highly accurate and confident, this is similar to the original labels. What is the role of the wrong categories? [Diagram: teacher outputs vs. student outputs, and their contributions to the cross-entropy loss.]
  38. CONFIDENCE WEIGHTED BY TEACHER MAX (CWTM). The contribution of the max category is isolated. [Diagram: loss with a high-confidence teacher vs. a low-confidence teacher.]
  39. DARK KNOWLEDGE WITH PERMUTED PREDICTIONS (DKPP). Permutation destroys pairwise correlations, but some higher-order information is invariant. [Diagram: permuted teacher outputs vs. student outputs, and their contributions to the cross-entropy loss.]
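The distillation targets compared in the preceding slides can be sketched concretely: standard KD trains the student against the teacher's full distribution, while a permuted variant shuffles the non-max teacher probabilities. This is our own simplified construction for illustration, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def kd_loss(student_logits, teacher_probs):
    """Cross-entropy of the student against the teacher's full output
    distribution: every class, not just the argmax, contributes."""
    return -np.sum(teacher_probs * np.log(softmax(student_logits) + 1e-12))

def permuted_teacher(teacher_probs):
    """DKPP-style target (a sketch): keep the teacher's max in place and
    permute the remaining class probabilities, destroying pairwise class
    correlations while preserving the overall probability profile."""
    t = teacher_probs.copy()
    top = np.argmax(t)
    rest = [i for i in range(len(t)) if i != top]
    t[rest] = rng.permutation(t[rest])
    return t

teacher = softmax(np.array([4.0, 2.0, 1.0, 0.5]))
student_logits = np.array([3.0, 1.5, 1.0, 0.2])
loss_kd = kd_loss(student_logits, teacher)
loss_dkpp = kd_loss(student_logits, permuted_teacher(teacher))
```

Comparing a student trained on `loss_kd` against one trained on `loss_dkpp` is what isolates how much of "dark knowledge" lives in the pairwise class correlations.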
  40. BORN-AGAIN NETWORKS. Knowledge distillation between identical neural network architectures: the model trained at one step becomes the teacher for the next (step 0 → step 1 → step 2).
  41. RESULTS ON CIFAR-100. Best results on CIFAR-100 using simple KD on DenseNet (BAN). Multiple BAN stages helped for smaller networks. Similar behavior in other experiments: ResNet to DenseNet, Penn Treebank. CWTM is worse than DKPP: information about higher-order moments helps. [Table: test accuracy on DenseNet.] Born again: the student surpasses the teacher!
  42. TENSORS FOR LEARNING IN MANY DIMENSIONS. Tensors: beyond the 2D world. Modern data is inherently multi-dimensional.
  43. UNSUPERVISED LEARNING OF TOPIC MODELS THROUGH TENSOR METHODS. Extracting topics from documents (e.g. justice, education, sports).
  44. TENSOR-BASED LDA TRAINING IS FASTER. Mallet is an open-source framework for topic modeling; benchmarks run on the AWS SageMaker platform; built into the AWS Comprehend NLP service. Spectral (tensor) training beats Mallet by 12x to 22x on average across NYTimes (300,000 documents) and PubMed (8 million documents). [Plots: training time in minutes vs. number of topics for both corpora, spectral vs. Mallet.]
  45. DEEP NEURAL NETS: TRANSFORMING TENSORS.
  46. DEEP TENSORIZED NETWORKS.
  47. SPACE SAVINGS IN DEEP TENSORIZED NETWORKS.
  48. TENSORS FOR LONG-TERM FORECASTING. Tensor-train RNNs and LSTMs. Challenges: long-term dependencies, high-order correlations, error propagation.
  49. TENSOR LSTM FOR LONG-TERM FORECASTING. [Plots: traffic dataset and climate dataset.]
  50. TENSORLY: HIGH-LEVEL API FOR TENSOR ALGEBRA. Python programming; user-friendly API; multiple backends for flexibility and scalability; example notebooks in the repository.
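The kinds of primitives a tensor-algebra API like TensorLy exposes can be illustrated without the library itself. Below is a minimal NumPy sketch of two core operations, tensor unfolding and the mode-n product (function names are ours; they mirror, but are not, TensorLy's API).

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding: matricize a tensor by bringing one axis
    to the front and flattening the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_dot(tensor, matrix, mode):
    """Mode-`mode` product: contract `matrix` with the tensor along one
    axis; the building block of Tucker-style tensorized layers."""
    out = np.tensordot(matrix, np.moveaxis(tensor, mode, 0), axes=([1], [0]))
    return np.moveaxis(out, 0, mode)

T = np.arange(24, dtype=float).reshape(2, 3, 4)   # a 3rd-order tensor
M = np.ones((5, 3))                                # factor matrix for mode 1
U = unfold(T, 1)        # shape (3, 8)
Y = mode_dot(T, M, 1)   # shape (2, 5, 4)
```

Applying a small factor matrix along each mode instead of one large dense weight matrix is exactly where the space savings of deep tensorized networks come from.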
  51. INFRASTRUCTURE AND FRAMEWORKS ON AWS. Infrastructure: GPU (P3 instances), CPU, mobile, IoT (Greengrass). Frameworks: Apache MXNet, Gluon, TensorFlow, PyTorch, Caffe2 & Caffe, Keras, Cognitive Toolkit, and the AWS Deep Learning AMI. Platform services: Amazon SageMaker, AWS DeepLens, Amazon Machine Learning, Mechanical Turk, Spark & EMR. Application services: vision (Rekognition Image, Rekognition Video), speech (Polly, Transcribe), language (Lex, Translate, Comprehend).
  52. AMAZON SAGEMAKER SOLVES THESE PROBLEMS. Build: pre-built notebooks for common problems; built-in, high-performance algorithms. Train: one-click training; hyperparameter optimization. Deploy: one-click deployment; fully managed hosting with auto-scaling.
  53. A New Vision for Autonomy: Center for Autonomous Systems and Technologies.
  54. CAST: BRINGING DRONES, ROBOTICS AND AI TOGETHER.
  55. CONCLUSION: AI needs integration of data, algorithms and infrastructure. DATA: collection (active learning and partial feedback); aggregation (crowdsourcing models); augmentation (graphics rendering + GANs, symbolic expressions). ALGORITHMS: convergence (signSGD has good rates in theory and practice); scalability (signSGD has the same variance reduction as SGD in the multi-machine setting); multi-dimensionality (tensor algebra for neural networks and probabilistic models). INFRASTRUCTURE: frameworks (TensorLy is a high-level API for deep tensorized networks); platform (SageMaker is an ML platform with automatic parallelization).
  56. COLLABORATORS (LIMITED LIST).
  57. Thank you.
