
Probabilistic modeling in deep learning



  1. Probabilistic modeling in Deep Learning. Dzianis Dus, Lead Data Scientist at InData Labs
  2. How we will spend the next 60 minutes? In thinking about the following topics: (1) What does "probabilistic modeling" mean? (2) Why is it cool (sometimes)? (3) How we can use it to build: (a) more robust and powerful models, (b) models with predefined properties, (c) models without overfitting (o_O), (d) infinite ensembles of models (o_O). (4) Deep Learning.
  3. Problem statement: Empirical way. Suppose that we want to solve a classical regression problem. Typical approach: (1) choose a functional family for F(...); (2) choose an appropriate loss function; (3) choose an optimization algorithm; (4) minimize the loss on (X, Y); (5) ... (A standard way to write this down is sketched below.)
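The formulas on these problem-statement slides are images, so as a point of reference only, here is the usual way to write the empirical setup in my own notation (not necessarily the slide's):

```latex
% Observed pairs (x_i, y_i), a functional family F(x; \theta), an empirical loss (here MSE):
\hat{\theta} \;=\; \arg\min_{\theta}\; \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - F(x_i;\theta)\bigr)^2
```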
  4. Problem statement: Probabilistic way. Define a "probability model" (it describes how your data was generated). Having the model, you can calculate the "likelihood" of your data. Here we are working with i.i.d. data sharing the same variance.
  5. Problem statement: Probabilistic way. Data log-likelihood and maximum likelihood estimation: for i.i.d. data sharing the same variance, maximum likelihood estimation is exactly MSE loss minimization (derivation below).
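The likelihood formulas here are also slide images; the standard derivation behind the "MSE loss minimization" annotation, assuming i.i.d. Gaussian noise with a shared variance, is:

```latex
y_i = F(x_i;\theta) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0,\sigma^2)\ \text{i.i.d.}

\log p(Y \mid X, \theta)
  = \sum_{i=1}^{N} \log \mathcal{N}\bigl(y_i \mid F(x_i;\theta), \sigma^2\bigr)
  = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}\bigl(y_i - F(x_i;\theta)\bigr)^2 - \frac{N}{2}\log(2\pi\sigma^2)
```

Maximizing the log-likelihood over theta therefore minimizes the sum of squared errors.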
  6. Problem statement: Probabilistic way. Log-likelihood maximization = empirical loss minimization. (1) MAE minimization = likelihood maximization for i.i.d. Laplace-distributed variables (see below). (2) For each empirically stated problem there exists an appropriate probability model. (3) An empirical loss is often just a particular case of a wider probability model. (4) A wider model = wider opportunities!
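For completeness, the Laplace counterpart of the Gaussian/MSE correspondence, again in assumed notation since the slide formula is an image:

```latex
p(y_i \mid x_i, \theta) = \frac{1}{2b}\exp\!\Bigl(-\frac{|y_i - F(x_i;\theta)|}{b}\Bigr)
\;\Rightarrow\;
\log p(Y \mid X, \theta) = -\frac{1}{b}\sum_{i=1}^{N}\bigl|y_i - F(x_i;\theta)\bigr| - N\log 2b
```

so maximizing the Laplace likelihood is exactly MAE minimization.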
  7. Probabilistic modeling: Wider opportunities for Flo. Suppose that we have: (1) N unique users in the training set; (2) for each user, a time series of user states collected on a daily basis; (3) for each user, a time series of cycle lengths; (4) we predict the time series of lengths Y based on the time series of states X.
  8. Probabilistic modeling: Wider opportunities for Flo. We want to maximize the data likelihood: the probability that user i has a cycle of length y at day j (just another notation for the same thing). The cycle length of user i at day j has a Gaussian distribution, and the parameters of that distribution at day j depend on the model parameters and on all features up to day j, which can be easily modeled with a deep RNN (sketch below)! Note that we don't need any labels to predict the variance!
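The slides contain no code; purely as an illustration, here is a minimal hypothetical PyTorch sketch of the idea: an RNN that outputs a per-day Gaussian (mean and log-variance) over cycle length and is trained by minimizing the Gaussian negative log-likelihood. All class and function names are made up.

```python
import torch
import torch.nn as nn

class GaussianRNN(nn.Module):
    """Maps a sequence of daily user states to a Gaussian over cycle length per day."""
    def __init__(self, n_features, hidden_size=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden_size, batch_first=True)
        self.mean_head = nn.Linear(hidden_size, 1)
        self.log_var_head = nn.Linear(hidden_size, 1)  # variance needs no extra labels

    def forward(self, x):                  # x: (batch, time, n_features)
        h, _ = self.rnn(x)                 # h at step j summarizes features up to day j
        mean = self.mean_head(h).squeeze(-1)
        log_var = self.log_var_head(h).squeeze(-1)
        return mean, log_var

def gaussian_nll(y, mean, log_var):
    """Negative Gaussian log-likelihood (up to a constant); minimizing it = MLE."""
    return 0.5 * (log_var + (y - mean) ** 2 / log_var.exp()).mean()
```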
  9. Probabilistic modeling: Wider opportunities for Flo. Real-life example (plot on the slide).
  10. Parameter estimation theory. "Estimation theory is a branch of statistics that deals with estimating the values of parameters based on measured empirical data that has a random component." © Wikipedia. Commonly used estimators: ● maximum likelihood estimator (MLE) - the Ugly (where we are now) ● maximum a posteriori estimator (MAP) - the Bad ● Bayesian estimator - the Good (the way we go).
  11. Maximum a posteriori estimator. Until now we've been talking about the maximum likelihood estimator. Now assume that a prior distribution over the parameters exists. Then we can apply Bayes' rule (written out below): the posterior distribution over model parameters is the data likelihood for specific parameters (which can be modeled with a deep network!) times the prior distribution over parameters (which describes our prior knowledge and/or our desires for the model), divided by the Bayesian evidence - a powerful quantity for model selection, but as a rule this integral is intractable :( (you can essentially never integrate it).
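The Bayes-rule decomposition the callouts refer to, in standard notation (the slide's exact formula is an image):

```latex
\underbrace{p(\theta \mid X, Y)}_{\text{posterior}}
  = \frac{\overbrace{p(Y \mid X, \theta)}^{\text{likelihood}}\;\overbrace{p(\theta)}^{\text{prior}}}
         {\underbrace{p(Y \mid X)}_{\text{evidence}}},
\qquad p(Y \mid X) = \int p(Y \mid X, \theta)\, p(\theta)\, d\theta
```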
  12. Maximum a posteriori estimator. The core idea: the evidence doesn't depend on the model parameters, so it can be dropped; the prior term is the only (but powerful!) difference from MLE. (1) MAP estimates the model parameters as the mode of the posterior distribution. (2) MAP estimation with a non-informative prior = MLE. (3) MAP restricts the search space of possible models. (4) With MAP you can put restrictions not only on model weights but also on many interactions inside the network.
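In symbols (standard notation, reconstructing the image on the slide):

```latex
\theta_{\mathrm{MAP}}
  = \arg\max_{\theta}\; p(\theta \mid X, Y)
  = \arg\max_{\theta}\; \bigl[\log p(Y \mid X, \theta) + \log p(\theta)\bigr]
```

The evidence p(Y | X) is dropped because it does not depend on theta.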
  13. Probabilistic modeling: Regularization. "Regularization is a process of introducing additional information in order to solve an ill-posed problem or prevent overfitting." © Wikipedia. Put differently: regularization is a process of introducing additional information in order to restrict the model to have predefined properties. It is closely connected to "prior distributions" on weights / activations / ... and to MAP estimation!
  14. Probabilistic modeling: Regularization. Weight decay (L2 regularization): the appropriate probability model is a Gaussian prior on the weights. Writing out the model log-likelihood, it decomposes into the data log-likelihood (which we've already calculated), a term that doesn't depend on the model parameters, the squared L2 norm of the parameters, and the regularization constant. So it is clear that L2 regularization is just MAP estimation with a Gaussian prior (see the derivation below).
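Spelled out, under the Gaussian noise model from earlier and a zero-mean Gaussian prior on the weights (assumed notation):

```latex
p(\theta) = \mathcal{N}(\theta \mid 0, \tau^2 I)
\;\Rightarrow\;
-\log p(\theta \mid X, Y)
  = \frac{1}{2\sigma^2}\sum_{i=1}^{N}\bigl(y_i - F(x_i;\theta)\bigr)^2
  + \frac{1}{2\tau^2}\lVert\theta\rVert_2^2 + \mathrm{const}
```

With one common scaling convention the regularization constant works out to lambda = sigma^2 / tau^2.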
  15. Probabilistic modeling: Regularization. (1) A Laplace distribution as the prior = L1 regularization. (2) It can be shown that Dropout is also a form of a particular probability model... (3) ...a Bayesian one :) ... (4) ...and therefore can be used not only as a regularization technique! (5) Do you want to pack your network weights into a few kilobytes? (6) OK, all you need is MAP! MAP is all you need!
  16. Weights packing: Empirical way. Song Han et al. - "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding" (2015). Modern neural networks can be dramatically compressed.
  17. Weights packing: Soft Weight-Sharing. (1) Define the prior distribution over weights as a Gaussian mixture model (written out below). (2) Force one of the Gaussian components to be centered at zero (the component that pruned weights collapse onto). (3) Optionally define a Gamma prior on the variances (for numerical stability). (4) Then just find the MAP estimate of both the model parameters and the free mixture parameters!
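The mixture-of-Gaussians prior has this general form (my transcription; the slide's exact formula is an image):

```latex
p(w) = \prod_{i}\;\sum_{j=0}^{J}\pi_j\,\mathcal{N}\bigl(w_i \mid \mu_j, \sigma_j^2\bigr),
\qquad \mu_0 = 0 \ \text{held fixed (the "pruning" component)}
```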
  18. Weights packing: Soft Weight-Sharing. Karen Ullrich et al. - "Soft Weight-Sharing for Neural Network Compression" (2017).
  19. Maximum a posteriori estimation. (1) A pretty cool and powerful technique. (2) You can build hierarchical models (put priors on priors of priors of...). (3) You can put priors on the activations of layers (sparse autoencoders). (4) It leads to "Empirical Bayes". (5) Thinking about how to restrict your model? Try to find an appropriate prior!
  20. True Bayesian Modeling: Recap. (1) The posterior can easily be found in the case of conjugate distributions. (2) But for most real-life models the denominator is intractable. (3) In MAP the denominator is ignored entirely. (4) Can we find a good approximation of the posterior?
  21. True Bayesian Modeling: Approximation. Two main ideas: (1) MCMC (Markov chain Monte Carlo) - a tricky one; (2) variational inference - a "black magic" one. Other ideas exist as well: Monte Carlo Dropout, stochastic gradient Langevin dynamics, ...
  22. True Bayesian Modeling: MCMC. (1) The key idea is to construct a Markov chain that has the posterior distribution as its equilibrium distribution. (2) Then you burn in the Markov chain (convergence to equilibrium) and sample from the posterior. (3) It sounds tricky, but it is a well-defined procedure. (4) PyMC3 = Bayesian modeling and probabilistic machine learning in Python. (5) Unfortunately, it is not scalable. (6) So you can't apply it directly to complex models (like neural networks). (7) But implicit scaling is possible: "Bayesian Learning via Stochastic Gradient Langevin Dynamics" (2011) - see the update rule below.
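For reference, the SGLD update from Welling & Teh (2011) that the last point cites, written here in the supervised notation used above (minibatch of size n out of N points):

```latex
\Delta\theta_t = \frac{\epsilon_t}{2}\Bigl(\nabla_{\theta}\log p(\theta_t)
  + \frac{N}{n}\sum_{i=1}^{n}\nabla_{\theta}\log p(y_{t_i} \mid x_{t_i}, \theta_t)\Bigr)
  + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \epsilon_t)
```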
  23. True Bayesian Modeling: Variational Inference. In the true posterior, the likelihood is modeled with a deep neural network and the denominator is an intractable integral :( So let's find a good approximation: explicitly define a distribution family for the approximation (e.g., a multivariate Gaussian) with variational parameters (e.g., a mean vector and a covariance matrix). Speaking mathematically, we want to minimize the Kullback-Leibler divergence (a measure of distribution dissimilarity) between the approximation and the true posterior (formalized below) - but the true posterior is unknown :(
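In symbols (standard variational-inference notation, since the slide formulas are images):

```latex
q_{\phi}(\theta) \approx p(\theta \mid X, Y), \qquad
\phi^{*} = \arg\min_{\phi}\; \mathrm{KL}\bigl(q_{\phi}(\theta)\,\Vert\, p(\theta \mid X, Y)\bigr)
```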
  24. Achtung! A lot of math is coming!
  25. True Bayesian Modeling: Variational Inference. The derivation (the equations themselves are slide images; a reconstruction follows below): expand the KL divergence and rewrite the true posterior using Bayes' rule. The evidence term doesn't depend on theta (theta is integrated out), so it is a constant and has no effect on the minimization problem. Group the remaining terms together and multiply by (-1): one group is a KL divergence, the other is an expectation over q(...). The result is an equivalent problem built from the likelihood of your data (your neural network works here!), the prior on the network weights (you define this!), and the approximate posterior (you define the form of this!). We want to optimize it with respect to the approximate posterior's parameters, so we need to calculate the gradient of this objective.
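Reconstructed in my own notation, the standard derivation these annotations walk through is (the maximized quantity is usually called the ELBO, the evidence lower bound):

```latex
\begin{align*}
\mathrm{KL}\bigl(q_{\phi}(\theta)\,\Vert\,p(\theta \mid X, Y)\bigr)
  &= \mathbb{E}_{q_{\phi}}\bigl[\log q_{\phi}(\theta) - \log p(\theta \mid X, Y)\bigr] \\
  &= \mathbb{E}_{q_{\phi}}\bigl[\log q_{\phi}(\theta) - \log p(Y \mid X, \theta) - \log p(\theta)\bigr]
     + \underbrace{\log p(Y \mid X)}_{\text{constant in }\phi}
\end{align*}

\min_{\phi}\,\mathrm{KL}\bigl(q_{\phi}\,\Vert\,p(\theta \mid X, Y)\bigr)
  \;\Longleftrightarrow\;
  \max_{\phi}\;\underbrace{\mathbb{E}_{q_{\phi}}\bigl[\log p(Y \mid X, \theta)\bigr]
  - \mathrm{KL}\bigl(q_{\phi}(\theta)\,\Vert\,p(\theta)\bigr)}_{\text{ELBO}}
```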
  26. True Bayesian Modeling: Variational Inference. Gradient calculation: rewrite the gradient as an expectation (for convenience). Oops - the term inside is modeled with a deep network, and this integral is intractable too :( (God damn!) If the integrand were weighted by just q(...), we could approximate it with the Monte Carlo method. Luke, the log-derivative trick: the gradient of q equals q times the gradient of log q(...), and the integral of q itself is just 1, so the whole gradient turns back into an expectation over q(...) and can be approximated with Monte Carlo (see below)!
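The log-derivative (score-function) trick in standard notation, again reconstructing the slide images; the "this is just = 1" step is the fact that q integrates to 1, which makes the leftover term vanish when the integrand itself contains log q:

```latex
\nabla_{\phi}\,\mathbb{E}_{q_{\phi}(\theta)}\bigl[f(\theta)\bigr]
  = \int f(\theta)\,\nabla_{\phi} q_{\phi}(\theta)\,d\theta
  = \int f(\theta)\,q_{\phi}(\theta)\,\nabla_{\phi}\log q_{\phi}(\theta)\,d\theta
  = \mathbb{E}_{q_{\phi}}\bigl[f(\theta)\,\nabla_{\phi}\log q_{\phi}(\theta)\bigr]
  \approx \frac{1}{S}\sum_{s=1}^{S} f\bigl(\theta^{(s)}\bigr)\,\nabla_{\phi}\log q_{\phi}\bigl(\theta^{(s)}\bigr),
  \quad \theta^{(s)} \sim q_{\phi}
```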
  27. Bayesian Networks: Step by step. Define a functional family for the approximate posterior (e.g., Gaussian). Solve the optimization problem (with doubly stochastic gradient ascent). Having the approximate posterior, you can sample network weights (as many as you want) - see the sketch below!
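A minimal, hypothetical PyTorch-style sketch of what sampling weights from the approximate posterior buys you at prediction time (model_fn, q_mean, and q_std are made-up names for a functional network and a diagonal-Gaussian posterior): average many stochastic forward passes to get a prediction plus a confidence measure.

```python
import torch

@torch.no_grad()
def bayesian_predict(model_fn, q_mean, q_std, x, n_samples=100):
    """Monte Carlo prediction: sample weight vectors from q(theta) = N(q_mean, q_std^2)
    and average the resulting network outputs."""
    preds = []
    for _ in range(n_samples):
        theta = q_mean + q_std * torch.randn_like(q_mean)  # theta ~ q(theta)
        preds.append(model_fn(x, theta))                   # forward pass with sampled weights
    preds = torch.stack(preds)
    return preds.mean(dim=0), preds.std(dim=0)             # prediction + uncertainty
```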
  28. Bayesian Networks: Pros and Cons. As a result you have: (1) an infinite ensemble of neural networks! (2) no overfitting problem (in the classical sense)! (3) no adversarial-examples problem! (4) a measure of prediction confidence! (5) ... No free hunch: (1) a lot of work is still hidden in "scalability" and "convergence"! (2) very (very!) expensive predictions!
  29. Bayesian Networks Examples: BRNN. Meire Fortunato et al. - "Bayesian Recurrent Neural Networks" (2017).
  30. Bayesian Networks Examples: SegNet. Alex Kendall et al. - "Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding" (2016).
  31. Bayesian Networks in (near) Production: Uber. Lingxue Zhu et al. - "Deep and Confident Prediction for Time Series at Uber" (2017). How it works: (1) an LSTM network; (2) Monte Carlo Dropout (sketched below); (3) daily complete trips prediction; (4) anomaly detection for various metrics.
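A minimal hypothetical PyTorch sketch of the Monte Carlo Dropout inference scheme mentioned here (not the paper's code): keep dropout active at test time and average several stochastic forward passes.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=50):
    """Monte Carlo Dropout: average stochastic forward passes for a predictive
    mean and an uncertainty estimate."""
    model.train()  # keeps nn.Dropout stochastic (assumes no BatchNorm layers, whose behaviour train() would also change)
    preds = torch.stack([model(x) for _ in range(n_samples)])
    model.eval()
    return preds.mean(dim=0), preds.std(dim=0)
```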
  32. Bayesian Networks in (near) Production: Flo. Predicted distributions of cycle length for 40 independent users (plots on the slide). Switched to Empirical Bayes for now.
  33. Speech Summary. (1) Probabilistic modeling is a powerful tool with a strong math background. (2) Many techniques are currently not widely used in Deep Learning. (3) You can improve many aspects of your model within the same framework. (4) Scalability, stability of convergence, and inference cost are the main constraints. (5) The future of Deep Learning looks Bayesian... ...(for the moment, for me).
  34. Thank you for your (attention)! I hope you have a lot of questions :) Dzianis Dus, Lead Data Scientist at InData Labs
