
Auto-Encoders and PCA, a brief psychological background

A Psychological background on how we think and store memory to explain the motivation behind the Autoencoders and then comparing the performance, in terms of reconstruction error, of the PCA against the Autoencoders.


Auto-Encoders and PCA, a brief psychological background

  1. 1. Auto-Encoders and PCA, a brief psychological background Self-taught Learning
  2. 2. •How do humans learn? And why not replicate that? •How do babies think? Long Term Slide 2 of 77
  3. 3. •“We might expect that babies would have really powerful learning mechanisms. And in fact, the baby's brain seems to be the most powerful learning computer on the planet. •But real computers are actually getting to be a lot better. And there's been a revolution in our understanding of machine learning recently. And it all depends on the ideas of this guy, the Reverend Thomas Bayes, who was a statistician and mathematician in the 18th century.” Alison Gopnik is an American professor of psychology and affiliate professor of philosophy at the University of California, Berkeley. How do babies think Slide 3 of 77
  4. 4. •“And essentially what Bayes did was to provide a mathematical way using probability theory to characterize, describe, the way that scientists find out about the world. •So what scientists do is they have a hypothesis that they think might be likely to start with. They go out and test it against the evidence. •The evidence makes them change that hypothesis. Then they test that new hypothesis and so on and so forth.” Alison Gopnik is an American professor of psychology and affiliate professor of philosophy at the University of California, Berkeley. How do babies think Slide 4 of 77
  5. 5. •P(ω|X) ∝ P(X|ω) · P(ω) •Posterior ∝ Likelihood × Prior •If this is how our brain works, why not continue in this way! Bayes’ Theorem Slide 5 of 77
  6. 6. •P(ω|X) ∝ P(X|ω) · P(ω) Bayes’ Theorem – Issues Slide 6 of 77
  7. 7. •P(ω|X) ∝ P(X|ω) · P(ω) •To build the likelihood, we need tons of data (The Law of Large Numbers) Bayes’ Theorem – Issues Slide 6 of 77
  8. 8. •P(ω|X) ∝ P(X|ω) · P(ω) •To build the likelihood, we need tons of data (The Law of Large Numbers) •Not any data, labeled data! Bayes’ Theorem – Issues Slide 6 of 77
  9. 9. •P(ω|X) ∝ P(X|ω) · P(ω) •To build the likelihood, we need tons of data (The Law of Large Numbers) •Not any data, labeled data! •We need to solve for features. Bayes’ Theorem – Issues Slide 6 of 77
  10. 10. •P(ω|X) ∝ P(X|ω) · P(ω) •To build the likelihood, we need tons of data (The Law of Large Numbers) •Not any data, labeled data! •We need to solve for features. •How should we decide on which features to use? Bayes’ Theorem – Issues Slide 6 of 77
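The update on these slides can be made concrete with a few numbers. Below is a minimal sketch (not from the deck) of the posterior ∝ likelihood × prior computation for two hypothetical classes; all probabilities are made up purely for illustration.

```python
import numpy as np

# Hypothetical two-class example: priors P(w) and likelihoods P(X | w)
# for one observed input X. The numbers are illustrative only.
prior = np.array([0.7, 0.3])        # P(w0), P(w1)
likelihood = np.array([0.2, 0.9])   # P(X | w0), P(X | w1)

unnormalized = likelihood * prior   # the posterior is proportional to this
posterior = unnormalized / unnormalized.sum()

print(posterior)  # e.g. [0.3415 0.6585] -- evidence shifts belief toward w1
```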
  11. 11. Vision Example Slide 11 of 77
  12. 12. Vision Example Slide 12 of 77
  13. 13. Vision Example Slide 13 of 77
  14. 14. Vision Example Slide 14 of 77
  15. 15. Vision Example Slide 15 of 77
  16. 16. Feature Representation – Vision Slide 16 of 77
  17. 17. Feature Representation – Audio Slide 17 of 77
  18. 18. Feature Representation – NLP Slide 18 of 77
  19. 19. The “One Learning Algorithm” Hypothesis Slide 19 of 77
  20. 20. The “One Learning Algorithm” Hypothesis Slide 20 of 77
  21. 21. The “One Learning Algorithm” Hypothesis Slide 21 of 77
  22. 22. On Computer Perception •The Adult visual system computes an incredibly complicated function of the input. Slide 22 of 77
  23. 23. On Computer Perception •The Adult visual system computes an incredibly complicated function of the input. •We can try to implement most of this incredibly complicated function (hand-engineer features) Slide 22 of 77
  24. 24. On Computer Perception •The Adult visual system computes an incredibly complicated function of the input. •We can try to implement most of this incredibly complicated function (hand-engineer features) •OR, we can learn this function instead. Slide 22 of 77
  25. 25. Self-taught Learning Slide 23 of 77
  26. 26. Self-taught Learning Slide 23 of 77
  27. 27. First Stage of Visual Processing – V1 Slide 24 of 77
  28. 28. Feature Learning via Sparse Coding •Sparse coding (Olshausen & Field, 1996). Originally developed to explain early visual processing in the brain (edge detection). •Input: Images X(1), X(2), …, X(m) (each in ℝ^{n×n}) •Learn: A dictionary of bases Φ1, Φ2, …, Φk (also in ℝ^{n×n}), so that each input X can be approximately decomposed as X ≈ Σ_{j=1}^{k} a_j φ_j, s.t. the coefficients a_j are mostly zero (“sparse”) Slide 25 of 77
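The sparse-coding objective above (X ≈ Σ a_j φ_j with mostly-zero a_j) is not implemented in the deck, but a rough sketch of the same idea using scikit-learn's DictionaryLearning could look like this; the patches are random stand-ins, and the choices of k and the penalty weight alpha are arbitrary.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# m image patches of size n x n, flattened to vectors (random stand-ins here).
m, n = 200, 8
X = np.random.randn(m, n * n)

# Learn k basis functions (the dictionary phi_1..phi_k) with a sparsity penalty
# that keeps most activation coefficients a_j at zero ("sparse").
k = 25
learner = DictionaryLearning(n_components=k, alpha=1.0, max_iter=50)
codes = learner.fit_transform(X)        # a: shape (m, k), mostly zeros
dictionary = learner.components_        # phi: shape (k, n*n)

reconstruction = codes @ dictionary     # X is approximated by sum_j a_j * phi_j
print("fraction of nonzero coefficients:", np.mean(codes != 0))
```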
  33. 33. Feature Learning via Sparse Coding Slide 26 of 77
  34. 34. Feature Learning via Sparse Coding Slide 27 of 77
  35. 35. Sparse Coding applied to Audio Slide 28 of 77
  36. 36. Learning Features Hierarchy Slide 29 of 77
  37. 37. Learning Features Hierarchy Slide 30 of 77
  38. 38. Features Hierarchy: Trained on face images Slide 31 of 77
  39. 39. Features Hierarchy: Trained on diff. categories Slide 32 of 77
  40. 40. Applications in Machine learning Slide 33 of 77
  41. 41. Phoneme Classification (TIMIT benchmark) Slide 34 of 77
  42. 42. State-of-the-art Slide 35 of 77
  43. 43. Brain Operation Modes Slide 36 of 77
  44. 44. Brain Operation Modes Slide 37 of 77 •Professor Daniel Kahneman, the hero of psychology. •Won the Nobel Prize in Economics in 2002. •He now teaches psychology at Princeton.
  45. 45. Brain Operation Modes Slide 38 of 77 •What do you see? •Angry Girl.
  46. 46. Brain Operation Modes Slide 39 of 77 •Now, What do you see? •Needs effort.
  47. 47. Slide 40 of 77 System One System Two
  48. 48. System One Slide 41 of 77 •It’s Automatic •Perceiving things + Skills = Answer •It is an intuitive process. •Intuition is Recognition
  49. 49. System One: Memory Slide 42 of 77
  50. 50. System One: Memory Slide 43 of 77 •By the age of three we all learned that “Big things can’t go inside small things”.
  51. 51. System One: Memory Slide 43 of 77 •By the age of three we all learned that “Big things can’t go inside small things”. •All of us have tried to save our favorite movie on the computer, and we know that those two hours require gobs of space.
  52. 52. System One: Memory Slide 44 of 77
  53. 53. System One: Memory Slide 45 of 77 •How do we cram the vast universe of our experience in a relatively small storage compartment between our ears?
  54. 54. System One: Memory Slide 45 of 77 •How do we cram the vast universe of our experience in a relatively small storage compartment between our ears? •We Cheat! •We compress memories into a critical thread and key features. •Ex: “Dinner was disappointing”, “Tough steak”
  55. 55. System One: Memory Slide 45 of 77 •How do we cram the vast universe of our experience in a relatively small storage compartment between our ears? •We Cheat! •We compress memories into a critical thread and key features. •Ex: “Dinner was disappointing”, “Tough steak” •Later, when we want to remember our experience, our brains reweave, rather than retrieve, the scenes using the extracted features.
  56. 56. System One: Memory Slide 46 of 77 Daniel Todd Gilbert is Professor of Psychology at Harvard University. In this experiment, two groups of people sat down to watch a set of slides: the question group and the no question group. The slides showed two cars approaching a yield sign; one car turns right and then the two cars collide.
  59. 59. System One: Memory Slide 47 of 77 •The no question group wasn’t asked any questions. •The question group was asked the following question: •Did another car pass by the blue car while it stopped at the stop sign? •Both groups were then asked to pick which set of slides they had seen, the one with the yield sign or the one with the stop sign.
  60. 60. System One: Memory Slide 47 of 77 •90% of the no question group chose the yield sign •80% of the question group chose the stop sign
  61. 61. System One: Memory Slide 47 of 77 •90% of the no question group chose the yield sign •80% of the question group chose the stop sign •The general finding: our brains compress experiences into key features and later fill in details that were not actually stored. This is the basic idea behind auto-encoders.
  62. 62. Sparse Auto-encoders Slide 48 of 77
  63. 63. •An auto-encoder neural network is an unsupervised learning algorithm that applies backpropagation to a set of unlabeled training examples {x(1), x(2), x(3), …}, where x(i) ∈ ℝⁿ, by setting the target values to be equal to the inputs. [6] •i.e. it uses y(i) = x(i) •Original contributions to backpropagation were made by Hinton and colleagues in the 1980s, and more recently by Hinton, Salakhutdinov, Bengio, LeCun and Erhan (2006-2010) Sparse Auto-encoder Slide 49 of 77
  64. 64. •Before we get further into the details of the algorithm, we need to quickly go through neural networks. •To describe neural networks, we will begin with the simplest possible neural network, one that comprises a single "neuron." We will use the following diagram to denote a single neuron [5] Neural Network Single Neuron [8] Slide 50 of 77
  65. 65. •This "neuron" is a computational unit that takes as input x1, x2, x3 (and a +1 intercept term), and outputs •h_{W,b}(x) = f(Wᵀx) = f( Σ_{i=1}^{3} W_i x_i + b ), where f: ℝ → ℝ is called the activation function. [5] Neural Network Slide 51 of 77
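A minimal numpy sketch of the single-neuron computation above, using a sigmoid for the activation function f; the weights, bias and inputs are arbitrary values chosen only to show the shapes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One "neuron": three inputs x1, x2, x3, one weight per input, and a bias (+1 intercept term).
x = np.array([0.5, -1.0, 2.0])
W = np.array([0.1, 0.4, -0.3])
b = 0.2

h = sigmoid(W @ x + b)   # h_{W,b}(x) = f(W^T x + b)
print(h)
```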
  66. 66. •The activation function can be: [8] •1) Sigmoid function: f(z) = 1 / (1 + exp(−z)), with outputs in [0, 1] Sigmoid Activation Function Sigmoid Function [8] Slide 52 of 77
  67. 67. •2) Tanh function: f(z) = tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ), with outputs in [−1, 1] Tanh Activation Function Tanh Function [8] Slide 53 of 77
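The two activation functions can be checked directly; this short snippet (an illustration, not slide code) confirms the output ranges and that the exponential form of tanh matches numpy's built-in.

```python
import numpy as np

z = np.linspace(-5, 5, 11)
sigmoid = 1.0 / (1.0 + np.exp(-z))                          # values in (0, 1)
tanh = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))  # values in (-1, 1)

print(sigmoid.min(), sigmoid.max())      # stays inside (0, 1)
print(np.allclose(tanh, np.tanh(z)))     # True: the formula matches numpy's tanh
```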
  68. 68. •Neural network parameters are: •(W, b) = (W(1), b(1), W(2), b(2)), where we write W_ij^(l) to denote the parameter (or weight) associated with the connection between unit j in layer l and unit i in layer l+1. •b_i^(l) is the bias associated with unit i in layer l+1. •a_i^(l) will denote the activation (meaning output value) of unit i in layer l. •Given a fixed setting of the parameters W, b, our neural network defines a hypothesis h_{W,b}(x) that outputs a real number. Neural Network Model Slide 54 of 77
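Using that (W, b) notation, a forward pass through a small two-layer network might look like the following sketch; the layer sizes and random weights are arbitrary, and a^(2), a^(3) follow the activation notation from the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes: 3 inputs, 4 hidden units, 1 output (arbitrary choices).
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)   # W^(1), b^(1): input -> hidden
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)   # W^(2), b^(2): hidden -> output

x = np.array([1.0, 0.5, -0.5])
a2 = sigmoid(W1 @ x + b1)      # a^(2): activations of layer 2 (the hidden layer)
a3 = sigmoid(W2 @ a2 + b2)     # a^(3) = h_{W,b}(x), the network's output
print(a3)
```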
  69. 69. Cost Function Slide 55 of 77
  70. 70. •The auto-encoder tries to learn a function h_{W,b}(x) ≈ x. In other words, it is trying to learn an approximation to the identity function, so that its output x̂ is similar to x. •Placing constraints on the network, such as limiting the number of hidden units or imposing a sparsity constraint on the hidden units, leads it to discover interesting structure in the data, even if the number of hidden units is large. Auto-encoders and Sparsity Slide 56 of 77
  71. 71. •Assumptions: 1. The neurons are inactive most of the time (a neuron is "active" or "firing" if its output value is close to 1, and "inactive" if its output value is close to 0), and the activation function is the sigmoid function. 2. Recall that a_j^(2) denotes the activation of hidden unit j in layer 2 of the auto-encoder. 3. a_j^(2)(x) denotes the activation of this hidden unit when the network is given a specific input x. 4. Let ρ̂_j = (1/m) Σ_{i=1}^{m} a_j^(2)(x(i)) be the average activation of hidden unit j (averaged over the training set). •Objective: •We would like to (approximately) enforce the constraint ρ̂_j = ρ, where ρ is a sparsity parameter, a small value close to zero. Auto-encoders and Sparsity Algorithm Slide 57 of 77
  72. 72. •To achieve this, we will add an extra penalty term to our optimization objective that penalizes ρ̂_j deviating significantly from ρ: •Σ_{j=1}^{s2} [ ρ log(ρ / ρ̂_j) + (1 − ρ) log((1 − ρ) / (1 − ρ̂_j)) ], where s2 is the number of neurons in the hidden layer and the index j sums over the hidden units in the network. [6] •It can also be written Σ_{j=1}^{s2} KL(ρ || ρ̂_j), where KL(ρ || ρ̂_j) = ρ log(ρ / ρ̂_j) + (1 − ρ) log((1 − ρ) / (1 − ρ̂_j)) is the Kullback–Leibler (KL) divergence between a Bernoulli random variable with mean ρ and a Bernoulli random variable with mean ρ̂_j. [6] •KL divergence is a standard function for measuring how different two distributions are. Autoencoders and Sparsity Algorithm Slide 58 of 77
  73. 73. •The KL penalty function has the following property: KL(ρ || ρ̂_j) = 0 if ρ̂_j = ρ, and otherwise it increases monotonically as ρ̂_j diverges from ρ. •For example, if we plot KL(ρ || ρ̂_j) for a range of values of ρ̂_j (with ρ = 0.2), we see that the KL divergence reaches its minimum of 0 at ρ̂_j = ρ and approaches ∞ as ρ̂_j approaches 0 or 1. •Thus, minimizing this penalty term has the effect of driving ρ̂_j close to ρ. Auto-encoders and Sparsity Algorithm – cont’d KL Function Slide 59 of 77
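A small sketch of computing the average activations ρ̂_j and the KL sparsity penalty defined above; it assumes sigmoid hidden units, random inputs and weights as stand-ins, and a hypothetical sparsity target ρ = 0.05.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_sparsity_penalty(rho, rho_hat):
    # sum over the s2 hidden units of
    # rho*log(rho/rho_hat_j) + (1-rho)*log((1-rho)/(1-rho_hat_j))
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# Hidden-layer activations a^(2) for m examples (random stand-in data/weights).
m, n_in, s2 = 100, 64, 25
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n_in))
W1, b1 = 0.1 * rng.standard_normal((s2, n_in)), np.zeros(s2)

A2 = sigmoid(X @ W1.T + b1)        # shape (m, s2)
rho_hat = A2.mean(axis=0)          # average activation of each hidden unit j
rho = 0.05                         # sparsity target, a small value close to zero

print(kl_sparsity_penalty(rho, rho_hat))  # 0 only if every rho_hat_j equals rho
```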
  74. 74. Sparse Auto-encoders Cost Function to minimize Slide 60 of 77
  75. 75. Gradient Checking Slide 61 of 77
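The slide carries only the title, so here is a generic sketch of the centred-difference gradient check commonly used to verify backpropagation gradients; the toy objective J(θ) = Σθ² is chosen only because its analytic gradient is known.

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    # Centred difference: dJ/dtheta_i ~ (J(theta + eps*e_i) - J(theta - eps*e_i)) / (2*eps)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

# Toy check on J(theta) = sum(theta^2), whose analytic gradient is 2*theta.
theta = np.array([1.0, -2.0, 0.5])
J = lambda t: np.sum(t ** 2)
analytic = 2 * theta
numeric = numerical_gradient(J, theta)
print(np.max(np.abs(analytic - numeric)))   # should be tiny, e.g. < 1e-7
```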
  76. 76. •We implemented a sparse auto-encoder, trained with 8×8 image patches using the L-BFGS optimization algorithm Auto-encoder Implementation A random sample of 200 patches from the dataset. Slide 62 of 77
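This is not the authors' code; it is a compact sketch, assuming a single hidden layer of 25 sigmoid units and random 8×8 "patches" standing in for the real dataset, of how a sparse-autoencoder cost and gradient (squared reconstruction error + weight decay + the KL sparsity penalty) can be handed to SciPy's L-BFGS optimizer.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(theta, n, s2):
    # theta holds W1 (s2 x n), W2 (n x s2), b1 (s2), b2 (n), flattened in that order
    i = 0
    W1 = theta[i:i + s2 * n].reshape(s2, n); i += s2 * n
    W2 = theta[i:i + n * s2].reshape(n, s2); i += n * s2
    b1 = theta[i:i + s2]; i += s2
    b2 = theta[i:i + n]
    return W1, W2, b1, b2

def cost_and_grad(theta, X, s2, lam=1e-4, rho=0.05, beta=3.0):
    m, n = X.shape
    W1, W2, b1, b2 = unpack(theta, n, s2)

    A2 = sigmoid(X @ W1.T + b1)           # hidden activations a^(2)
    A3 = sigmoid(A2 @ W2.T + b2)          # reconstruction; the target is X itself
    rho_hat = A2.mean(axis=0)             # average activation of each hidden unit

    cost = (0.5 * np.mean(np.sum((A3 - X) ** 2, axis=1))
            + 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
            + beta * np.sum(rho * np.log(rho / rho_hat)
                            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))))

    # Backpropagation, including the sparsity term in the hidden-layer delta
    delta3 = (A3 - X) * A3 * (1 - A3)
    sparse_term = -rho / rho_hat + (1 - rho) / (1 - rho_hat)
    delta2 = (delta3 @ W2 + beta * sparse_term) * A2 * (1 - A2)

    W2g = delta3.T @ A2 / m + lam * W2
    W1g = delta2.T @ X / m + lam * W1
    grad = np.concatenate([W1g.ravel(), W2g.ravel(),
                           delta2.mean(axis=0), delta3.mean(axis=0)])
    return cost, grad

# Random 8x8 "patches" as stand-ins for the real image patches used on the slide.
rng = np.random.default_rng(0)
X = rng.random((1000, 64))
s2 = 25
theta0 = rng.standard_normal(2 * 64 * s2 + s2 + 64) * 0.01

result = minimize(cost_and_grad, theta0, args=(X, s2),
                  jac=True, method="L-BFGS-B", options={"maxiter": 400})
W1 = unpack(result.x, 64, s2)[0]          # learned features, one row per hidden unit
```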
  77. 77. Auto-encoder Implementation Slide 63 of 77 •We have trained it using digits from 0 to 9
  78. 78. AutoEncoder Visualization Slide 64 of 77
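Visualizations like the one on this slide are typically produced by displaying each hidden unit's input weights as a normalized patch. The helper below is hypothetical (not the deck's code) and assumes a weight matrix W1 shaped like the one learned in the training sketch above.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_hidden_units(W1, patch_side=8):
    # Each row of W1 holds one hidden unit's input weights; normalizing the row
    # and reshaping it shows the patch that maximally activates that unit.
    k = W1.shape[0]
    cols = int(np.ceil(np.sqrt(k)))
    fig, axes = plt.subplots(cols, cols, figsize=(6, 6))
    for ax, row in zip(axes.ravel(), W1):
        img = row / (np.linalg.norm(row) + 1e-8)
        ax.imshow(img.reshape(patch_side, patch_side), cmap="gray")
        ax.axis("off")
    for ax in axes.ravel()[k:]:      # hide any unused subplots
        ax.axis("off")
    plt.show()

# e.g. plot_hidden_units(W1) with the W1 learned in the previous sketch
```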
  79. 79. Auto-encoder Implementation Slide 65 of 77 •We have trained it with faces.
  80. 80. Auto-encoder with PCA flavor Slide 66 of 77 (plot: eigenvectors vs. percentage of variance retained)
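The "percentage of variance retained" curve referenced on this slide is usually computed from the PCA eigenvalue spectrum; a quick sketch with scikit-learn's PCA on stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(500, 64)                 # stand-in for the image data
pca = PCA().fit(X)

# Cumulative percentage of variance retained by the first k eigenvectors
retained = np.cumsum(pca.explained_variance_ratio_) * 100
for k in (10, 25, 50):
    print(f"{k} components retain {retained[k - 1]:.1f}% of the variance")
```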
  81. 81. Autoencoder Implementation Slide 67 of 77
  82. 82. Auto-encoder Performance Slide 68 of 77
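The comparison in this talk is reconstruction error. As a hedged sketch, here is how the PCA side of that comparison can be measured on the same stand-in patches; the autoencoder's number would be the analogous mean squared error of its reconstruction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((1000, 64))                    # same stand-in patches as before

k = 25                                        # match the autoencoder's hidden size
pca = PCA(n_components=k).fit(X)
X_pca = pca.inverse_transform(pca.transform(X))
pca_mse = np.mean((X_pca - X) ** 2)

# For the autoencoder trained earlier, the comparable number is the mean
# squared error of its reconstruction A3 against the inputs X:
#   ae_mse = np.mean((A3 - X) ** 2)
print("PCA reconstruction MSE with", k, "components:", pca_mse)
```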
  83. 83. In Progress Work (Future Results) •Given that our dataset of facial features is small •We first train the neural network on a random dataset, in the hope that the resulting weights provide a good starting point for the fine-tuning phase •We then fine-tune with the smaller dataset of facial features Slide 69 of 77
  84. 84. Wrap up Slide 70 of 77
  85. 85. Slide 71 of 77 [Andrew Ng]
  86. 86. Data – Now Slide 72 of 77 •Twitter: •Facebook:
  87. 87. Data – Now Slide 72 of 77 •Twitter: 7 terabytes of data / day •Facebook:
  88. 88. Data – Now Slide 72 of 77 •Twitter: 7 terabytes of data / day •Facebook: 500 terabytes of data / day
  89. 89. •NASA announced its square kilometer telescope. Data – Tomorrow Slide 73 of 77
  90. 90. •NASA announced its square kilometer telescope. Data – Tomorrow Slide 73 of 77 •It will generate 700 terabytes of data every second.
  91. 91. •NASA announced its square kilometer telescope. Data – Tomorrow Slide 73 of 77 •It will generate 700 terabytes of data every second. •In two days, it will generate as much data as the entire internet holds today. •Do you know how long it would take Google, with all its resources, just to index the data this beast generates in a year? Three whole months, 90 days!
  92. 92. Slide 74 of 77 [Andrew Ng]
  93. 93. Thanks! Q? Slide 75 of 77
