
What Would Shannon Do?

A Talk on Bayesian Compression of Neural Networks at the Deep Learning & Reinforcement Learning Summer School 2017


  1. WHAT WOULD SHANNON DO? Bayesian Compression for DL. Karen Ullrich, University of Amsterdam. Deep Learning and Reinforcement Learning Summer School, Montreal 2017. (Karen Ullrich, Jun 2017)
  2. 01 MOTIVATION
  3. Motivation: 1 Wh costs 0.0225 cent, so running a Titan X for 1 h costs 5.625 cent. Facebook has 1.86 billion active users; VGG takes ~147 ms per batch of 16 predictions; making one prediction for every user costs ~20 k€.
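As a back-of-envelope check on the scale involved, the slide's figures alone pin down the raw GPU time needed to serve one prediction per user. A sketch; serial single-GPU processing at the stated batch size is a simplifying assumption:

```python
# Back-of-envelope from the slide's figures.
users = 1.86e9          # Facebook active users
batch_time_s = 0.147    # VGG: ~147 ms per batch of 16 predictions
batch_size = 16

seconds = users / batch_size * batch_time_s
hours = seconds / 3600
print(f"GPU time: {hours:,.0f} h (~{hours / 24:,.0f} Titan X days)")
```

At the slide's electricity price this GPU time alone is already a substantial energy bill, before accounting for datacenter overhead or repeated predictions.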
  4. Motivation, summary: mobile devices have limited hardware; predictions cost energy; transmitting models costs bandwidth; real-time processing needs faster inference; there is also a relation to privacy.
  5. 02 RECENT WORK: a practical view on compression. A: Sparsity learning. Structured pruning: CR = |w| / |w≠0|. (Unstructured) pruning: store only the nonzero values of A plus index arrays, e.g. IC = [0 1 3 0 1 2], IR = [3, 0, 2, 3, …, 1], so CR ≈ |w| / (2 |w≠0|).
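The index arrays on this slide are the column-index and row-pointer arrays of the CSR sparse format, which is what makes unstructured pruning pay off in storage: one value plus roughly one index per surviving weight. A sketch with a made-up pruned weight matrix, using `scipy.sparse`:

```python
import numpy as np
from scipy import sparse

# A dense weight matrix after unstructured pruning (most entries zeroed).
W = np.array([[0.0, 1.7, 0.0,  0.0],
              [0.0, 0.0, 0.0,  0.0],
              [0.0, 0.3, 0.0, -0.9],
              [2.1, 0.0, 0.0,  0.0]], dtype=np.float32)

# CSR keeps the nonzero values plus a column index per value and
# per-row pointers into the value array.
W_csr = sparse.csr_matrix(W)
print(W_csr.data)     # nonzero values
print(W_csr.indices)  # column index per value (the slide's IC)
print(W_csr.indptr)   # row pointers (the slide's IR)

# Compression ratio: dense stores |w| floats; CSR stores roughly
# 2 * |w != 0| numbers (one value + one index each, ignoring row pointers).
cr = W.size / (2 * W_csr.nnz)
print(f"compression ratio ~ {cr:.1f}x")
```

With 4 survivors out of 16 weights this gives the slide's CR ≈ |w| / (2 |w≠0|) = 2.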
  6. 02 RECENT WORK: B: Bit-per-weight reduction. Precision quantisation: CR = 32/10 = 3; pro: fast inference; con: the savings are not too big. Set quantisation by clustering, with a codebook such as 1 → -3.5667, 2 → 0.4545, 3 → 0.8735, 4 → 0.33546: CR = 32/4 = 8; pro: extremely compressible, e.g. with further Huffman encoding; con: inference?
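Set quantisation by clustering can be sketched as a small k-means over the flattened weights: each weight is replaced by a codebook index, so with k = 16 clusters indices need 4 bits, matching the slide's CR of 32/4 = 8 (the slide's 4-entry codebook would even allow 2-bit indices). The weights below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)  # stand-in for trained weights

k = 16  # 16 clusters -> 4-bit indices -> CR = 32 / 4 = 8
codebook = np.linspace(w.min(), w.max(), k)

# A few Lloyd iterations: assign each weight to its nearest centroid,
# then move each centroid to the mean of its assigned weights.
for _ in range(20):
    idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
    for j in range(k):
        if np.any(idx == j):
            codebook[j] = w[idx == j].mean()

# Reconstruction is a codebook lookup per stored index.
w_quant = codebook[idx]
print("max quantisation error:", np.abs(w - w_quant).max())
print("compression ratio:", 32 / np.ceil(np.log2(k)))
```

This is the "con: inference?" trade-off on the slide: the stored model is tiny, but a lookup is needed to materialise weights before (or during) each forward pass.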
  7. 02 RECENT WORK, summary of properties. Set quantisation, bit quantisation, unstructured pruning: highest compression, but flop and energy savings are moderate. Structured pruning: lowest expected compression, BUT will save a considerable amount of flops and thus energy.
  8. 02 RECENT WORK, summary of applications. Set quantisation, bit quantisation, unstructured pruning: a "ZIP" format for NNs, transmitting models via limited channels, storing millions of nets efficiently. Structured pruning: inference at scale, real-time predictions, hardware-limited devices.
  9. 03 MDL PRINCIPLE + VARIATIONAL LEARNING: the variational lower bound. log p(D) ≥ L(q(w)) = E_q(w)[log p(D, w) − log q(w)] = E_q(w)[log p(D|w)] − KL(q(w) || p(w)). [Hinton, Geoffrey E., and Drew Van Camp. "Keeping the neural networks simple by minimizing the description length of the weights." Proceedings of the Sixth Annual Conference on Computational Learning Theory. ACM, 1993.]
  10. The MDL principle (Jorma Rissanen, 1978): the best model is the one that compresses the data best. There are two costs, one for transmitting the model and one for reporting the data misfit.
  11. The variational lower bound, annotated: in log p(D) ≥ E_q(w)[log p(D|w)] − KL(q(w) || p(w)), the expected log-likelihood term is the cost of transmitting the data misfit, and the KL term is the cost of transmitting the model. [Hinton & Van Camp, 1993.]
  12. The variational lower bound, again: log p(D) ≥ E_q(w)[log p(D|w)] − KL(q(w) || p(w)). [Hinton & Van Camp, 1993.]
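When both q and the prior are factorised Gaussians, the KL term of this bound is available in closed form, so only the likelihood term needs Monte Carlo. A minimal numeric sketch (the posterior parameters are made up) checks the closed form against a Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_gauss(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over weights."""
    return np.sum(0.5 * (mu**2 + sigma**2 - 1.0) - np.log(sigma))

mu = np.array([0.5, -1.0])
sigma = np.array([0.7, 0.2])
kl_exact = kl_gauss(mu, sigma)

# Monte Carlo estimate of the same quantity: E_q[log q(w) - log p(w)].
w = mu + sigma * rng.standard_normal((200000, 2))
log_q = -0.5 * ((w - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
log_p = -0.5 * w**2 - 0.5 * np.log(2 * np.pi)
kl_mc = np.mean(np.sum(log_q - log_p, axis=1))

print(kl_exact, kl_mc)
```

In MDL terms, this KL value is (up to the choice of log base) the number of nats spent on transmitting the model.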
  13. Recap of the properties summary (slide 7): quantisation and unstructured pruning give the highest compression with moderate flop and energy savings; structured pruning gives the lowest expected compression but saves considerable flops and thus energy.
  14. 04 CONTRIBUTED WORK: Soft weight-sharing for NN compression (Karen Ullrich, Edward Meeds & Max Welling, ICLR 2017). Solution: train a neural network with a Gaussian mixture model prior, with a factorised point-mass posterior q(w) = ∏ q(w_i) = ∏ δ(w_i | μ_i). Pruning: set one component to zero with a high mixing proportion. [Nowlan, Steven J., and Geoffrey E. Hinton. "Simplifying neural networks by soft weight-sharing." Neural Computation 4.4 (1992): 473-493.]
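How the GMM prior acts during training can be sketched as an extra loss term, −log p(w): weights near the zero-pinned, high-mixing-proportion component are cheap to encode, stragglers are expensive. The mixture parameters below are made up for illustration, not taken from the paper:

```python
import numpy as np

def gmm_neg_log_prior(w, pi, mu, sigma):
    """-log p(w) under a 1-D Gaussian mixture, summed over all weights."""
    # Component densities for each weight: shape (n_weights, n_components).
    d = (np.exp(-0.5 * ((w[:, None] - mu[None, :]) / sigma[None, :]) ** 2)
         / (sigma[None, :] * np.sqrt(2 * np.pi)))
    return -np.sum(np.log(d @ pi))

# One component pinned at zero with a high mixing proportion: weights that
# collapse onto it can be pruned, the rest cluster at the other means.
pi = np.array([0.9, 0.05, 0.05])
mu = np.array([0.0, -0.4, 0.6])
sigma = np.array([0.05, 0.05, 0.05])

cost_near_zero = gmm_neg_log_prior(np.full(10, 0.01), pi, mu, sigma)
cost_spread = gmm_neg_log_prior(np.linspace(-1, 1, 10), pi, mu, sigma)
print(cost_near_zero, cost_spread)
```

Adding this penalty to the task loss (with the mixture parameters also learned) is the "soft" part of soft weight-sharing: weights are pulled toward the component means rather than hard-assigned to them.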
  15. Soft weight-sharing for NN compression (Ullrich, Meeds & Welling, ICLR 2017). (figure slide)
  16. Recap of the properties summary (slide 7).
  17. 04 CONTRIBUTED WORK: Bayesian Compression for Deep Learning (Christos Louizos, Karen Ullrich & Max Welling, under submission, NIPS 2017). Idea: use dropout to learn the architecture; the variational version of dropout learns the dropout rate. Solution: learn a dropout rate for each weight structure; when a structure has a high dropout rate, we can safely ignore it. The uncertainty in the left-over weights is used to compute the bit precision. [Kingma, Diederik P., Tim Salimans, and Max Welling. "Variational dropout and the local reparameterization trick." NIPS 2015. Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. "Variational Dropout Sparsifies Deep Neural Networks." arXiv preprint arXiv:1701.05369 (2017).]
  18. Bayesian Compression, the posterior: q(w|z) = ∏ q(w_i|z_i) = ∏ N(w_i | z_i μ_i, z_i² σ_i²) and q(z) = ∏ q(z_i) = ∏ N(z_i | μ_i^z, α_i). The objective pushes weights to zero for high dropout rates, while the prior forces high dropout rates.
  19. Bayesian Compression, the prior: marginalising the scales gives p(w) = ∫ p(z) p(w|z) dz, with the same factorised q(w|z) and q(z) as on the previous slide.
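The factorised posterior on this slide can be sampled with the reparameterization trick, and the learned per-structure noise level α then serves as the pruning signal. A rough sketch with hypothetical learned parameters; the threshold α ≥ 1 is an illustrative rule of thumb, not the paper's exact criterion:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned parameters for four weight structures: posterior
# means mu, scale means mu_z, and per-structure variances alpha of z.
mu = np.array([1.2, -0.8, 0.03, 0.5])
mu_z = np.ones(4)                           # posterior means of the scales z
alpha = np.array([0.01, 0.05, 50.0, 0.02])  # large alpha = high dropout rate

# Reparameterized samples: z ~ N(mu_z, alpha), then w = z * mu
# (dropping the small additive variance of q(w|z) for illustration).
z = mu_z + np.sqrt(alpha) * rng.standard_normal((10000, 4))
w_samples = z * mu

# Prune structures whose dropout noise dominates the signal.
keep = alpha < 1.0
print("kept structures:", np.flatnonzero(keep))
print("pruned means:", mu * keep)
```

Because z multiplies every weight in a structure, a large α removes whole rows, columns, or filters at once, which is exactly what gives the structured (flop-saving) sparsity from the earlier summary slides.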
  20.-22. Bayesian Compression for Deep Learning. (figure slides)
  23. 05 WARNING: don't be too enthusiastic! These algorithms are merely proposals; little of this can be realised in common frameworks today: architecture pruning, sparse matrix support (partial in the big frameworks), reduced bit precision (NVIDIA is starting), clustering.
  24. Thank you for your attention. Any questions? KARENULLRICH.INFO, _KARRRN_
