
Intel® AI: Parameter Efficient Training


In this presentation, we describe a heuristic for modifying the structure of sparse deep convolutional networks during training. The heuristic allows us to train sparse networks directly to accuracies on par with those obtained by compressing/pruning large dense models. We show that exploring the network structure during training is essential to reach the best accuracies, even when the optimal network structure is known a priori.


  1. Parameter-Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization. Hesham Mostafa & Xin Wang, Office of the CTO, Artificial Intelligence Products Group, Intel Corporation
  2. Main results. We developed a training method for deep sparse convolutional neural networks (CNNs) that: ▪ Uses structure exploration to directly train sparse, compact CNNs to accuracy levels similar to those of sparse CNNs obtained by pruning large dense models ▪ For residual architectures, yields the best-performing CNNs for a given training-time parameter budget ▪ Shows that training-time structure exploration is crucial for effective learning of sparse networks, and that it is often more effective to add structural degrees of freedom than to add extra parameters
  3. Obtaining compact models. Current wisdom: behind every successful small model, there is a big model. [Figure: three routes from a big model to a compact one: distillation (the big model provides target labels), weight quantization (e.g., 32-bit to 2-bit weights), and pruning/sparsification.]
  4. Gradient descent in higher dimensions. High-cost local minima become less likely in higher dimensions; saddle points dominate. [Dauphin et al., 'Identifying and attacking the saddle point problem in high-dimensional non-convex optimization', NIPS '14] [Choromanska et al., 'The loss surfaces of multilayer networks', AISTATS '15]
  5. Training-time structure exploration (prune/grow). Motivation: SGD alone may not be able to find good loss minima in small networks, so we augment SGD with heuristics that modify the network structure during training.
  6. Details of the structure exploration scheme. Start with a random sparse network, and every few hundred mini-batches: (1) Prune small weights based on an adaptive threshold. Motivation: unimportant weights receive small loss gradients and are pulled toward zero by L2 weight decay, so get rid of them. (2) Grow the same number of weights elsewhere, favoring layers that are less sparse ('rich gets richer'). Motivation: in sparser layers most parameters were judged unimportant and pruned, so do not allocate many new weights to them. (A code sketch of this prune/grow step is given after the slide list.)
  7. The baselines. (1) 'Pruned networks': sparse networks obtained by iteratively and slowly pruning a large, dense network. (2) 'Thin dense networks': dense networks with smaller layer dimensions (to match the parameter count of our networks). (3) 'Static sparse networks': sparse networks whose sparsity pattern is randomly initialized and then fixed. (4) 'DeepR'[1]: a prior structure exploration scheme based on a random walk in parameter space. (5) 'SET'[2]: a prior structure exploration scheme based on alternating prune and growth phases; it does not move parameters between layers. [1] Bellec, Guillaume, et al. 'Deep rewiring: Training very sparse deep networks.' arXiv:1711.05136 (2017). [2] Mocanu, Decebal Constantin, et al. 'Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science.' Nature Communications 9.1 (2018): 2383.
  8. WideResNet-28-2 on CIFAR-10. Our directly trained sparse networks slightly exceed the performance of pruned models, and significantly outperform static sparse networks with the same number of parameters. [Figure: test accuracy (%) vs. number of parameters in the sparse models (161K to 741K, corresponding to global sparsity s from 0.9 to 0.5), comparing the full-sized model, pruned model, dynamic sparse (ours), thin dense, static sparse, DeepR, and SET.] [Sergey Zagoruyko and Nikos Komodakis, 'Wide residual networks', arXiv 2016]
  9. ResNet-50 on ImageNet. Our method converges to non-uniform per-layer sparsity patterns and outperforms previous structure exploration methods that used fixed per-layer sparsity. [Figure: top-1 accuracy of the full model vs. sparse models with 7.3M parameters (s = 0.8) and 5.1M parameters (s = 0.9), comparing ours, pruned, SET, thin dense, DeepR, and static sparse.]
  10. Is knowing the structure enough to directly train high-performance sparse networks? For sparse networks discovered through pruning, evidence[1,2] suggests that the weights of these sparse networks can be re-initialized (while keeping the structure) and trained to high accuracy. [Diagram: train and prune → copy structure and reinitialize → train → similar accuracy.] [1] Jonathan Frankle and Michael Carbin, 'The lottery ticket hypothesis: Training pruned neural networks', arXiv 2018. [2] Liu, Zhuang, et al. 'Rethinking the value of network pruning.' arXiv:1810.05270 (2018).
  11. Can a network with static structure match the accuracy of our training scheme? Train a sparse network using our method, keep the final network structure, reinitialize the weights, and train the network while keeping the structure fixed. [Diagram: SGD + our structure exploration method → copy structure and reinitialize → train → accuracy?] (A sketch of this fixed-structure control is given after the slide list.)
  12. Training-time structure exploration is crucial. The high performance of sparse networks discovered by our method cannot be matched by static-connectivity networks that copy the final sparse structure, or the final sparse structure together with its weight initialization. [Figure: test accuracy vs. global sparsity (0.8, 0.9) for WRN-28-2 on CIFAR-10 and top-1 test accuracy for ResNet-50 on ImageNet, comparing: random structure with random initialization; discovered structure with random initialization; discovered structure with original initialization; dynamic sparse (ours).]
  13. Summary. Our method trains networks under a strict memory budget, challenging the 'train big then compress' approach. Given a training-time memory budget, we are better off using part of it to specify and explore connectivity than spending it all on conventional weights. Our results strengthen the case for native hardware support of sparse operations, since training can be done directly in the sparse domain: training and inference can operate solely on sparse tensors for a reduced parameter memory footprint.
  14. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. © 2019 Intel Corporation. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
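
To make the prune/grow heuristic of slide 6 concrete, the following is a minimal sketch of one structure-exploration step, written in PyTorch. It is not the authors' implementation: the names (prune_and_grow, weights, masks), the use of explicit binary masks over dense tensors, and the multiplicative rule for adapting the pruning threshold are illustrative assumptions. Only the overall scheme (magnitude pruning against a global adaptive threshold, then regrowing the same number of connections in proportion to each layer's surviving weights) follows the slide.

    import torch

    @torch.no_grad()
    def prune_and_grow(weights, masks, threshold, target_frac=0.01, tol=0.1):
        """One structure-exploration step over dicts of dense weight tensors
        and same-shaped binary masks (hypothetical layout)."""
        # 1) Prune: deactivate currently active weights whose magnitude
        #    falls below the global adaptive threshold.
        n_pruned = 0
        for name, w in weights.items():
            small = (w.abs() < threshold) & masks[name].bool()
            n_pruned += int(small.sum())
            masks[name][small] = 0
            w[small] = 0.0

        n_active = sum(int(m.sum()) for m in masks.values())

        # 2) Adapt the threshold so roughly `target_frac` of the active
        #    weights are pruned per step (assumed controller, not the exact
        #    rule): raise it if too few were pruned, lower it if too many.
        target = target_frac * (n_active + n_pruned)
        if n_pruned < (1 - tol) * target:
            threshold *= 2.0
        elif n_pruned > (1 + tol) * target:
            threshold *= 0.5

        # 3) Grow: hand the freed parameter budget back to the layers in
        #    proportion to their surviving weight counts ("rich gets richer"),
        #    activating randomly chosen zero positions with zero-valued weights.
        for name, m in masks.items():
            n_grow = int(round(n_pruned * int(m.sum()) / max(n_active, 1)))
            zeros = (m == 0).nonzero(as_tuple=False)
            if n_grow == 0 or len(zeros) == 0:
                continue
            pick = zeros[torch.randperm(len(zeros))[:n_grow]]
            idx = tuple(pick.t())
            m[idx] = 1
            weights[name][idx] = 0.0
        return threshold

In a full training loop, a step like this would run every few hundred mini-batches, with the masks re-applied to the weights after each optimizer update so that pruned connections stay at zero in between.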
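
The control experiment of slide 11 reuses only the connectivity pattern discovered by dynamic sparse training. Below is a hedged sketch, again in PyTorch with assumed names (train_with_fixed_mask, masks keyed by parameter name, loader, loss_fn): the mask is applied to a freshly reinitialized model and re-applied after every optimizer step, so the surviving weights learn but the structure is never explored.

    import torch

    def train_with_fixed_mask(model, masks, loss_fn, loader, epochs=1, lr=0.1):
        """Train `model` while holding the sparsity pattern in `masks` fixed.
        `masks` maps parameter names to binary tensors of matching shape."""
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        params = dict(model.named_parameters())

        def apply_masks():
            # Zero out pruned positions so only the chosen connections remain.
            with torch.no_grad():
                for name, m in masks.items():
                    params[name].mul_(m)

        apply_masks()  # start from reinitialized weights restricted to the mask
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
                apply_masks()  # keep the structure static after every update
        return model

Slide 12 then compares networks trained this way, with either the discovered or a random structure, against networks trained with the full dynamic scheme.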
