What Would Shannon Do?

A Talk on Bayesian Compression of Neural Networks at the Deep Learning & Reinforcement Learning Summer School 2017

Slide 1: What Would Shannon Do? Bayesian Compression for Deep Learning. Karen Ullrich, University of Amsterdam. Deep Learning and Reinforcement Learning Summer School, Montreal, June 2017.
Slide 2 (01 Motivation): Motivation
Slide 3 (01 Motivation): Motivation
- 1 Wh costs 0.0225 cent; running a Titan X for 1 h therefore costs 5.625 cent
- Facebook has 1.86 billion active users
- VGG takes ~147 ms per batch of 16 predictions
- making one prediction for all users costs about 20 k€
Slide 4 (01 Motivation): Motivation, summary
- mobile devices have limited hardware
- energy costs of making predictions
- bandwidth needed to transmit models
- speeding up inference for real-time processing
- relation to privacy
Slide 5 (02 Recent work): A practical view on compression. A: sparsity learning
- Structured pruning: remove whole structures (rows, columns, filters), so the dense weight matrix simply shrinks; compression rate CR = |w| / |w≠0|
- Unstructured pruning: keep individual nonzero weights and store them in a sparse format, with a value array A = [ ... ] and index arrays IC = [0 1 3 0 1 2] and IR = [3, 0, 2, 3, …, 1]; because the indices must be stored alongside the values, CR ≈ |w| / (2 |w≠0|)
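As a concrete illustration of the two compression rates above (my own sketch, not material from the talk), the snippet below prunes a toy weight matrix by magnitude and stores the survivors in SciPy's CSR format; its value, column-index and row-pointer arrays play the role of A, IC and IR, and storing those indices is what costs the extra factor of roughly 2:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy dense weight matrix standing in for a layer's weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)

# Unstructured pruning: zero out small-magnitude weights (threshold is arbitrary).
threshold = 1.0
W_pruned = W * (np.abs(W) > threshold)

# CSR keeps the nonzero values plus column indices and row pointers,
# which is why the effective rate is roughly |w| / (2 * |w != 0|).
W_sparse = csr_matrix(W_pruned)
dense_bytes = W.size * 4                      # 32-bit floats
sparse_bytes = (W_sparse.data.nbytes          # values (A)
                + W_sparse.indices.nbytes     # column indices (IC)
                + W_sparse.indptr.nbytes)     # row pointers (IR)

print("nonzeros:", W_sparse.nnz)
print("naive CR |w|/|w!=0| :", W.size / W_sparse.nnz)
print("CSR CR dense/sparse :", dense_bytes / sparse_bytes)
```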
Slide 6 (02 Recent work): A practical view on compression. B: bits-per-weight reduction
- Precision quantisation (store each weight at lower precision)
  - CR: 32/10 ≈ 3
  - PRO: fast inference
  - CON: the savings are not too big
- Set quantisation by clustering (store a small codebook, e.g. 1: -3.5667, 2: 0.4545, 3: 0.8735, 4: 0.33546, plus a short index per weight)
  - CR: 32/4 = 8
  - PRO: extremely compressible, e.g. with further Huffman encoding
  - CON: inference?
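A small sketch of both options (my illustration, not code from the talk), assuming scikit-learn's KMeans for the clustering step: precision quantisation simply stores the weights in a narrower float format, while set quantisation replaces each weight by a short index into a learned codebook:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)    # toy weight vector

# A: precision quantisation -- store weights at lower precision.
w_fp16 = w.astype(np.float16)
print("precision CR:", 32 / 16)                   # the slide's 32/10 assumes a 10-bit format

# B: set quantisation -- cluster weights, store a codebook plus per-weight indices.
k = 16                                            # 16 clusters -> 4-bit indices, CR = 32/4 = 8
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(w.reshape(-1, 1))
codebook = km.cluster_centers_.ravel()            # k float32 values
indices = km.labels_.astype(np.uint8)             # 4 bits per weight would suffice
w_quantised = codebook[indices]                   # reconstruction used at inference time

bits_per_weight = np.ceil(np.log2(k))
print("set-quantisation CR:", 32 / bits_per_weight)
print("max abs error:", np.abs(w - w_quantised).max())
```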
Slide 7 (02 Recent work): A practical view on compression. Summary of properties
- set quantisation, bit quantisation, unstructured pruning: highest compression; moderate savings in flops and energy
- structured pruning: lowest expected compression, BUT saves a considerable amount of flops and thus energy
Slide 8 (02 Recent work): A practical view on compression. Summary of applications
- set quantisation, bit quantisation, unstructured pruning: a "ZIP" format for neural networks; transmitting models over limited channels; storing millions of nets efficiently
- structured pruning: inference at scale; real-time predictions; hardware-limited devices
Slide 9 (03 MDL principle + variational learning): Variational lower bound
\log p(\mathcal{D}) \geq \mathcal{L}(q(w)) = \mathbb{E}_{q(w)}\!\left[\log \frac{p(\mathcal{D}, w)}{q(w)}\right] = \mathbb{E}_{q(w)}[\log p(\mathcal{D} \mid w)] - \mathrm{KL}\big(q(w) \,\|\, p(w)\big)
Hinton, G. E., and Van Camp, D. "Keeping the neural networks simple by minimizing the description length of the weights." COLT 1993.
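The bound is stated without derivation on the slide; it follows from Jensen's inequality applied to the log marginal likelihood:

```latex
\log p(\mathcal{D})
  = \log \int p(\mathcal{D}, w)\, dw
  = \log \mathbb{E}_{q(w)}\!\left[\frac{p(\mathcal{D}, w)}{q(w)}\right]
  \geq \mathbb{E}_{q(w)}\!\left[\log \frac{p(\mathcal{D}, w)}{q(w)}\right]
  = \mathbb{E}_{q(w)}[\log p(\mathcal{D} \mid w)] - \mathrm{KL}\big(q(w)\,\|\,p(w)\big)
  = \mathcal{L}(q(w)).
```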
Slide 10 (03 MDL principle + variational learning): MDL principle and variational learning
"The best model is the one that compresses the data best." There are two costs: one for transmitting the model and one for reporting the data misfit. (Jorma Rissanen, 1978)
Slide 11 (03 MDL principle + variational learning): Variational lower bound
\log p(\mathcal{D}) \geq \mathcal{L}(q(w)) = \mathbb{E}_{q(w)}\!\left[\log \frac{p(\mathcal{D}, w)}{q(w)}\right] = \mathbb{E}_{q(w)}[\log p(\mathcal{D} \mid w)] - \mathrm{KL}\big(q(w) \,\|\, p(w)\big)
The KL term is the cost of transmitting the model; the expected log-likelihood term is the cost of transmitting the data misfit. (Hinton & Van Camp, 1993)
Slide 12 (03 MDL principle + variational learning): Variational lower bound
\log p(\mathcal{D}) \geq \mathcal{L}(q(w)) = \mathbb{E}_{q(w)}\!\left[\log \frac{p(\mathcal{D}, w)}{q(w)}\right] = \mathbb{E}_{q(w)}[\log p(\mathcal{D} \mid w)] - \mathrm{KL}\big(q(w) \,\|\, p(w)\big)
(Hinton & Van Camp, 1993)
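To make the two transmission costs concrete in code, here is a hypothetical minimal sketch for a linear softmax classifier with a fully factorised Gaussian posterior and a standard normal prior; it is schematic and not the training code behind any result in this talk:

```python
import torch
import torch.nn.functional as F

def neg_elbo(mu, log_sigma, x, y, n_data):
    """Negative variational bound for a linear softmax classifier with a
    fully factorised Gaussian posterior q(w) = N(mu, sigma^2) and prior N(0, 1)."""
    sigma = log_sigma.exp()
    # Reparameterised sample of the weight matrix: w = mu + sigma * eps.
    w = mu + sigma * torch.randn_like(mu)

    # "Transmitting the data misfit": expected negative log-likelihood,
    # one-sample Monte Carlo estimate, rescaled from the minibatch to the full data set.
    logits = x @ w
    data_misfit = n_data * F.cross_entropy(logits, y)

    # "Transmitting the model": KL(q(w) || p(w)), closed form for two Gaussians.
    kl = 0.5 * torch.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * log_sigma)
    return data_misfit + kl

# Toy usage: 64 examples, 20 features, 10 classes.
x, y = torch.randn(64, 20), torch.randint(0, 10, (64,))
mu = torch.zeros(20, 10, requires_grad=True)
log_sigma = torch.full((20, 10), -3.0, requires_grad=True)
loss = neg_elbo(mu, log_sigma, x, y, n_data=60000)
loss.backward()
```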
Slide 13 (02 Recent work): A practical view on compression, summary of properties (recap of slide 7).
Slide 14 (04 Contributed work): Soft weight-sharing for NN compression. Karen Ullrich, Edward Meeds & Max Welling, ICLR 2017.
- Solution: train the neural network with a Gaussian mixture model prior over its weights
- variational posterior: q(w) = \prod_i q(w_i) = \prod_i \delta(w_i \mid \mu_i)
- pruning: fix one mixture component at zero with a high mixing proportion, and prune the weights assigned to it
Nowlan, S. J., and Hinton, G. E. "Simplifying neural networks by soft weight-sharing." Neural Computation 4.4 (1992): 473-493.
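A minimal sketch of the complexity cost this adds to the task loss (my paraphrase of the idea, with illustrative names and hyperparameters): the negative log-probability of all weights under a learnable one-dimensional Gaussian mixture whose first component is pinned at zero with a large mixing proportion:

```python
import math
import torch

def gmm_neg_log_prior(w, log_pi, mu, log_sigma):
    """-log p(w) under a 1-D Gaussian mixture prior, summed over all weights.
    Component 0 is meant to be pinned at mu = 0 with a large mixing proportion,
    so weights pulled onto it can be pruned after training."""
    w = w.reshape(-1, 1)                              # (n_weights, 1)
    log_mix = torch.log_softmax(log_pi, dim=0)        # (n_components,)
    # Log-density of every weight under every mixture component.
    log_comp = (-0.5 * (w - mu) ** 2 * torch.exp(-2.0 * log_sigma)
                - log_sigma - 0.5 * math.log(2.0 * math.pi))
    return -torch.logsumexp(log_mix + log_comp, dim=1).sum()

# Sketch of usage: loss = task_loss + tau * gmm_neg_log_prior(weights, log_pi, mu, log_sigma),
# where tau is a small trade-off factor and (log_pi, mu, log_sigma) are trained jointly
# with the network while component 0 stays fixed at zero.
```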
Slide 15 (04 Contributed work): Soft weight-sharing for NN compression. Karen Ullrich, Edward Meeds & Max Welling, ICLR 2017.
Slide 16 (02 Recent work): A practical view on compression, summary of properties (recap of slide 7).
Slide 17 (04 Contributed work): Bayesian Compression for Deep Learning. Christos Louizos, Karen Ullrich & Max Welling, under submission to NIPS 2017.
- Idea: use dropout to learn the architecture; the variational version of dropout learns the dropout rate
- Solution: learn a dropout rate for each weight structure; when a structure has a high dropout rate, we can safely ignore it
- use the uncertainty in the leftover weights to compute the required bit precision
Kingma, D. P., Salimans, T., and Welling, M. "Variational dropout and the local reparameterization trick." NIPS 2015.
Molchanov, D., Ashukha, A., and Vetrov, D. "Variational dropout sparsifies deep neural networks." arXiv:1701.05369, 2017.
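To illustrate what pruning by dropout rate could look like after training, here is a hypothetical post-processing step: per-group noise parameters give a log dropout rate log α = log(σ_z² / μ_z²), and groups above a threshold (3 is a common choice in the variational-dropout literature, used here purely for illustration) are sliced away. Function and variable names are mine, not the paper's:

```python
import torch

def prune_groups(weight, z_mu, z_logvar, thresh=3.0):
    """Drop output groups (e.g. rows of a dense layer) whose multiplicative
    noise z has a high dropout rate, i.e. log alpha = log(var / mu^2) > thresh."""
    log_alpha = z_logvar - 2.0 * torch.log(torch.abs(z_mu) + 1e-8)
    keep = log_alpha < thresh                      # boolean mask over groups
    # Fold the surviving scales into the weights and slice away pruned rows.
    pruned_weight = (weight * z_mu.unsqueeze(1))[keep]
    return pruned_weight, keep

# Example: a 100x50 layer with one scale z per output unit.
w = torch.randn(100, 50)
z_mu, z_logvar = torch.randn(100), torch.randn(100)
w_small, mask = prune_groups(w, z_mu, z_logvar)
print("kept", int(mask.sum()), "of", mask.numel(), "units")
```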
Slide 18 (04 Contributed work): Bayesian Compression for Deep Learning. Christos Louizos, Karen Ullrich & Max Welling, under submission to NIPS 2017.
- q(w \mid z) = \prod_i q(w_i \mid z_i) = \prod_i \mathcal{N}(w_i \mid z_i \mu_i, z_i^2 \sigma_i^2): the multiplicative scale pushes weights to zero for high dropout rates
- q(z) = \prod_i q(z_i) = \prod_i \mathcal{N}(z_i \mid \mu^z_i, \alpha_i): the prior on z is chosen to force high dropout rates
Slide 19 (04 Contributed work): Bayesian Compression for Deep Learning. Christos Louizos, Karen Ullrich & Max Welling, under submission to NIPS 2017.
- the weight prior is obtained by marginalising over the scales: p(w) = \int p(z)\, p(w \mid z)\, dz
- q(w \mid z) = \prod_i q(w_i \mid z_i) = \prod_i \mathcal{N}(w_i \mid z_i \mu_i, z_i^2 \sigma_i^2)
- q(z) = \prod_i q(z_i) = \prod_i \mathcal{N}(z_i \mid \mu^z_i, \alpha_i)
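A hypothetical sketch of sampling weights under this parameterisation with the standard reparameterisation trick, where each group scale z_i multiplies a per-weight Gaussian; the variable names and the log-variance parameterisation of q(z) are my own choices, not necessarily the paper's:

```python
import torch

def sample_weights(w_mu, w_logvar, z_mu, z_logvar):
    """Sample w ~ q(w), where q(z_i) is a Gaussian per-group scale and
    q(w_ij | z_i) = N(z_i * mu_ij, z_i^2 * sigma_ij^2)."""
    z = z_mu + torch.exp(0.5 * z_logvar) * torch.randn_like(z_mu)    # (groups,)
    eps = torch.randn_like(w_mu)                                     # (groups, fan_out)
    # w = z * (mu + sigma * eps)  <=>  w ~ N(z * mu, z^2 * sigma^2) given z.
    return z.unsqueeze(1) * (w_mu + torch.exp(0.5 * w_logvar) * eps)

# Toy usage: 100 output groups, 50 weights per group.
w_mu, w_logvar = torch.randn(100, 50), torch.full((100, 50), -4.0)
z_mu, z_logvar = torch.ones(100), torch.full((100,), -2.0)
w = sample_weights(w_mu, w_logvar, z_mu, z_logvar)
```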
Slides 20-22 (04 Contributed work): Bayesian Compression for Deep Learning. Christos Louizos, Karen Ullrich & Max Welling, under submission to NIPS 2017.
Slide 23 (05 Warning): Don't be too enthusiastic!
These algorithms are still mostly proposals; little of the following can be realised in common frameworks today:
- architecture pruning
- sparse matrix support (partially available in the big frameworks)
- reduced bit precision (NVIDIA is starting to support it)
- clustering
Slide 24: Thank you for your attention. Any questions? KARENULLRICH.INFO / _KARRRN_