CS 188: Artificial Intelligence
Optimization and Neural Nets
Instructors: Mesut Xiaocheng Yang, Carl Qi --- University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Reminder: Linear Classifiers
 Inputs are feature values
 Each feature has a weight
 Sum is the activation: $\text{activation}_w(x) = \sum_i w_i \cdot f_i(x) = w \cdot f(x)$
 If the activation is:
 Positive, output +1
 Negative, output -1

[Diagram: feature values $f_1, f_2, f_3$ are weighted by $w_1, w_2, w_3$ and summed; the classifier outputs +1 if the sum is $> 0$.]
How to get probabilistic decisions?
 Activation: $z = w \cdot f(x)$
 If very positive  want probability going to 1
 If very negative  want probability going to 0
 Sigmoid function: $\phi(z) = \dfrac{1}{1 + e^{-z}}$, giving $P(y = +1 \mid x; w) = \dfrac{1}{1 + e^{-w \cdot f(x)}}$
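A minimal NumPy sketch of this decision rule; the weights and features below are made-up values for illustration, not from the course code:

```python
import numpy as np

def sigmoid(z):
    """Map an activation z = w . f(x) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def prob_positive(w, f):
    """P(y = +1 | x; w) for binary logistic regression."""
    return sigmoid(np.dot(w, f))

w = np.array([1.0, -2.0, 0.5])   # illustrative weights
f = np.array([3.0, 1.0, 2.0])    # illustrative feature values
print(prob_positive(w, f))       # activation = 2.0 -> probability ~0.88
```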
Multiclass Logistic Regression
 Multi-class linear classification
 A weight vector for each class: $w_y$
 Score (activation) of a class $y$: $w_y \cdot f(x)$
 Prediction w/ highest score wins: $y = \arg\max_y \; w_y \cdot f(x)$
 How to make the scores into probabilities? The softmax: $P(y \mid x; w) = \dfrac{e^{\,w_y \cdot f(x)}}{\sum_{y'} e^{\,w_{y'} \cdot f(x)}}$
original activations $z_1, z_2, z_3$  softmax activations $\dfrac{e^{z_i}}{\sum_j e^{z_j}}$
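A sketch of the softmax transformation in NumPy; subtracting the maximum score before exponentiating is a standard numerical-stability trick, not something the slide specifies:

```python
import numpy as np

def softmax(scores):
    """Turn a vector of class scores w_y . f(x) into probabilities."""
    z = scores - np.max(scores)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

original = np.array([2.0, 1.0, -1.0])   # original activations
print(softmax(original))                # ~[0.71, 0.26, 0.04], sums to 1
```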
Best w?
 Maximum likelihood estimation: $\max_w \; \ell\ell(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$
with: $P(y^{(i)} \mid x^{(i)}; w) = \dfrac{e^{\,w_{y^{(i)}} \cdot f(x^{(i)})}}{\sum_{y} e^{\,w_y \cdot f(x^{(i)})}}$
= Multi-Class Logistic Regression
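As a sketch, the objective can be computed as below; the array shapes and function name are my own assumptions, not a course convention:

```python
import numpy as np

def log_likelihood(W, X, y):
    """Sum over examples of log P(y_i | x_i; W) under the softmax model.

    W: (num_classes, num_features) -- one weight vector per class.
    X: (num_examples, num_features) -- feature vectors f(x_i) as rows.
    y: (num_examples,) -- integer class labels.
    """
    scores = X @ W.T                                      # class scores per example
    scores = scores - scores.max(axis=1, keepdims=True)   # stability shift
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(y)), y].sum()          # pick out the true classes
```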
This Lecture
 Optimization
 i.e., how do we solve: $\max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$
Hill Climbing
 General idea
 Start wherever
 Repeat: move to the best neighboring state
 If no neighbors better than current, quit
 What’s particularly tricky when hill-climbing for multiclass logistic regression?
• Optimization over a continuous space
• Infinitely many neighbors!
• How to do this efficiently?
1-D Optimization
 Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$
 Then step in best direction
 Or, evaluate derivative: $\dfrac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \dfrac{g(w_0 + h) - g(w_0 - h)}{2h}$
 Tells which direction to step in
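A small sketch of both options on a made-up 1-D objective: estimate the derivative numerically, then repeatedly step uphill.

```python
def numerical_derivative(g, w0, h=1e-5):
    """Central-difference estimate of dg/dw at w0."""
    return (g(w0 + h) - g(w0 - h)) / (2 * h)

g = lambda w: -(w - 3.0) ** 2   # toy objective with its maximum at w = 3
w, step = 0.0, 0.1
for _ in range(100):
    w += step * numerical_derivative(g, w)   # move in the uphill direction
print(w)   # converges toward 3.0
```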
2-D Optimization
Source: offconvex.org
Gradient Ascent
 Perform update in uphill direction for each coordinate
 The steeper the slope (i.e. the higher the derivative), the bigger the step for that coordinate
 E.g., consider: $g(w_1, w_2)$
 Updates: $w_1 \leftarrow w_1 + \alpha \dfrac{\partial g}{\partial w_1}(w_1, w_2)$ and $w_2 \leftarrow w_2 + \alpha \dfrac{\partial g}{\partial w_2}(w_1, w_2)$
 Updates in vector notation: $w \leftarrow w + \alpha \nabla_w g(w)$
with: $\nabla_w g(w) = \left[ \dfrac{\partial g}{\partial w_1}(w), \; \dfrac{\partial g}{\partial w_2}(w) \right]^\top$ = gradient
 Idea:
 Start somewhere
 Repeat: Take a step in the gradient direction
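A minimal sketch of this loop on a toy two-dimensional objective; the function, its gradient, and the iteration count are made up for illustration:

```python
import numpy as np

def grad_g(w):
    """Gradient of g(w) = -(w[0]-1)^2 - (w[1]+2)^2, maximized at (1, -2)."""
    return np.array([-2.0 * (w[0] - 1.0), -2.0 * (w[1] + 2.0)])

w = np.zeros(2)     # start somewhere
alpha = 0.1         # step size
for _ in range(200):
    w = w + alpha * grad_g(w)   # take a step in the gradient direction
print(w)            # close to [1, -2]
```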
Gradient Ascent
Figure source: Mathworks
What is the Steepest Direction?
 First-Order Taylor Expansion: $g(w + \Delta) \approx g(w) + \nabla g(w)^\top \Delta$
 Steepest Ascent Direction: $\max_{\Delta : \|\Delta\| \le \varepsilon} \; g(w) + \nabla g(w)^\top \Delta$
 Recall: $\max_{\Delta : \|\Delta\| \le \varepsilon} \nabla g(w)^\top \Delta$ is attained at $\Delta = \varepsilon \, \dfrac{\nabla g(w)}{\|\nabla g(w)\|}$
 Hence, solution: Gradient direction = steepest ascent direction!
Gradient in n dimensions: $\nabla_w g(w) = \left[ \dfrac{\partial g}{\partial w_1}(w), \ldots, \dfrac{\partial g}{\partial w_n}(w) \right]^\top$
Optimization Procedure: Gradient Ascent
 init $w$
 for iter = 1, 2, …
 $w \leftarrow w + \alpha \, \nabla_w g(w)$
 $\alpha$: learning rate --- a tweaking parameter that needs to be chosen carefully
 How? Try multiple choices
 Crude rule of thumb: each update should change $w$ by about 0.1 – 1%
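A sketch of trying multiple learning rates on a toy objective; the specific values of $\alpha$ and the objective are illustrative assumptions:

```python
import numpy as np

g      = lambda w: -np.sum((w - 2.0) ** 2)   # toy objective, maximum at w = 2
grad_g = lambda w: -2.0 * (w - 2.0)

for alpha in [1e-3, 1e-2, 1e-1, 1.0]:        # try multiple choices
    w = np.zeros(3)
    for _ in range(100):
        w = w + alpha * grad_g(w)
    print(f"alpha={alpha:g}  g(w)={g(w):.6f}")  # too small: slow; too big: oscillates
```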
Batch Gradient Ascent on the Log Likelihood Objective
 init $w$
 for iter = 1, 2, …
 $w \leftarrow w + \alpha \sum_i \nabla_w \log P(y^{(i)} \mid x^{(i)}; w)$
What will gradient ascent do?
 For a single example $(x, y)$, the gradient with respect to class $y'$'s weights is $\nabla_{w_{y'}} \log P(y \mid x; w) = f(x)\big(\mathbf{1}[y' = y] - P(y' \mid x; w)\big)$, where $\mathbf{1}[y' = y]$ is the one-hot label vector's $y$-th element.
 It adds $f$ to the correct class's weights.
 For each class $y'$, it subtracts $f$ from the $y'$ weights in proportion to the probability the current weights give to $y'$.
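A sketch of the batch update for multiclass logistic regression in NumPy; the shapes and helper names are my own assumptions, not the course's reference code:

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax of a score matrix."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def batch_gradient(W, X, y):
    """Gradient of sum_i log P(y_i | x_i; W) with respect to W.

    Row y' of the result is sum_i f(x_i) * (1[y' = y_i] - P(y' | x_i; W)):
    add f to the correct class's weights, and subtract f from each class
    y' in proportion to the probability the current weights give to y'.
    """
    P = softmax_rows(X @ W.T)          # (n, num_classes) predicted probs
    Y = np.zeros_like(P)
    Y[np.arange(len(y)), y] = 1.0      # one-hot labels (1 in the y-th element)
    return (Y - P).T @ X               # (num_classes, num_features)

def batch_gradient_ascent(X, y, num_classes, alpha=0.01, iters=500):
    W = np.zeros((num_classes, X.shape[1]))
    for _ in range(iters):
        W = W + alpha * batch_gradient(W, X, y)   # full-batch update
    return W
```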
Stochastic Gradient Ascent on the Log Likelihood Objective
 init $w$
 for iter = 1, 2, …
 pick random $j$
 $w \leftarrow w + \alpha \, \nabla_w \log P(y^{(j)} \mid x^{(j)}; w)$
Observation: once the gradient on one training example has been computed, we might as well incorporate it before computing the next one.
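A stochastic variant of the same sketch, reusing the hypothetical batch_gradient helper defined above on a single randomly chosen example:

```python
import numpy as np

def sgd(X, y, num_classes, alpha=0.01, iters=5000, seed=0):
    """Stochastic gradient ascent: one random training example per update."""
    rng = np.random.default_rng(seed)
    W = np.zeros((num_classes, X.shape[1]))
    for _ in range(iters):
        j = rng.integers(len(y))       # pick random j
        W = W + alpha * batch_gradient(W, X[j:j+1], y[j:j+1])
    return W
```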
Mini-Batch Gradient Ascent on the Log Likelihood Objective
 init $w$
 for iter = 1, 2, …
 pick random subset of training examples $J$
 $w \leftarrow w + \alpha \sum_{j \in J} \nabla_w \log P(y^{(j)} \mid x^{(j)}; w)$
Observation: the gradient over a small set of training examples (= a mini-batch) can be computed in parallel, so we might as well do that instead of using a single example.
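The mini-batch version, again reusing the batch_gradient helper sketched earlier; the batch size of 32 is an illustrative choice:

```python
import numpy as np

def minibatch_ascent(X, y, num_classes, alpha=0.01, iters=1000,
                     batch_size=32, seed=0):
    """Mini-batch gradient ascent: a random subset J per update."""
    rng = np.random.default_rng(seed)
    W = np.zeros((num_classes, X.shape[1]))
    for _ in range(iters):
        J = rng.choice(len(y), size=min(batch_size, len(y)), replace=False)
        W = W + alpha * batch_gradient(W, X[J], y[J])   # vectorized over J
    return W
```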
How about computing all the derivatives?
 We’ll talk about that once we’ve covered neural networks, which are a generalization of logistic regression
Neural Networks
Multi-class Logistic Regression
 = special case of neural network
[Diagram: features $f_1(x), \ldots, f_K(x)$ are combined linearly into class scores $z_1, z_2, z_3$, which pass through a softmax layer to give probabilities.]
Deep Neural Network = Also learn the features!
[Diagram: the same network, but now the features $f_1(x), \ldots, f_K(x)$ feeding the softmax are themselves computed by earlier network layers rather than hand-designed.]
Deep Neural Network = Also learn the features!
[Diagram: inputs $x_1, \ldots, x_L$ pass through several hidden layers to produce learned features $f_1(x), \ldots, f_K(x)$, which feed a softmax output layer.]
Each hidden unit applies a nonlinear activation function $g$ to a weighted sum of the previous layer's outputs: $z_i^{(k)} = g\big(\sum_j W_{i,j}^{(k-1,k)} \, z_j^{(k-1)}\big)$
Deep Neural Network = Also learn the features!
[Same diagram as above: inputs $x_1, \ldots, x_L$ flow through hidden layers with nonlinear activation function $g$ into a softmax output.]
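A minimal forward pass for such a network in NumPy; the layer sizes, the ReLU choice for $g$, and the function names are illustrative assumptions, not the course's reference implementation:

```python
import numpy as np

def relu(z):
    """One common choice for the nonlinearity g."""
    return np.maximum(0.0, z)

def forward(x, weights):
    """Forward pass: hidden layers compute learned features, then softmax."""
    z = x
    for W in weights[:-1]:
        z = relu(W @ z)            # z^{(k)} = g(W^{(k)} z^{(k-1)})
    scores = weights[-1] @ z       # final layer = logistic regression scores
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()             # softmax probabilities over classes

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 8)),    # input x in R^8 -> 16 hidden units
           rng.normal(size=(16, 16)),   # second hidden layer
           rng.normal(size=(3, 16))]    # 3 output classes
print(forward(rng.normal(size=8), weights))
```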
Common Activation Functions
[source: MIT 6.S191 introtodeeplearning.com]
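The usual definitions of these activation functions, as a quick NumPy sketch:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))   # output in (0, 1)
tanh    = np.tanh                              # output in (-1, 1)
relu    = lambda z: np.maximum(0.0, z)         # zero for negative inputs

z = np.linspace(-3.0, 3.0, 7)
for name, g in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)]:
    print(f"{name:8s}", np.round(g(z), 3))
```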
Deep Neural Network: Also Learn the Features!
 Training the deep neural network is just like logistic regression: $\max_w \; \ell\ell(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$
just $w$ tends to be a much, much larger vector 
just run gradient ascent
+ stop when the log likelihood of held-out data starts to decrease (early stopping)
Neural Networks Properties
 Theorem (Universal Function Approximators). A two-layer neural
network with a sufficient number of neurons can approximate
any continuous function to any desired accuracy.
 Practical considerations
 Can be seen as learning the features
 Large number of neurons
 Danger of overfitting
 (hence early stopping!)
Universal Function Approximation Theorem*
 In words: Given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x).
Cybenko (1989) “Approximation by Superpositions of a Sigmoidal Function”
Hornik (1991) “Approximation Capabilities of Multilayer Feedforward Networks”
Leshno and Schocken (1991) “Multilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function”
Fun Neural Net Demo Site
 Demo-site:
 http://playground.tensorflow.org/
How about computing all the derivatives?
 Derivatives tables: [table of standard derivative rules for elementary functions]
[source: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html]
How about computing all the derivatives?
 But a neural net f is never one of those?
 No problem: CHAIN RULE: If $f(x) = g(h(x))$, then $f'(x) = g'(h(x)) \, h'(x)$
 Derivatives can be computed by following well-defined procedures
Automatic Differentiation
 Automatic differentiation software
 e.g. PyTorch, TensorFlow, JAX
 Only need to program the function g(x,y,w)
 Can automatically compute all derivatives w.r.t. all entries in w
 This is typically done by caching info during the forward computation pass of f, and then doing a backward pass = “backpropagation”
 Autodiff / backpropagation can often be done at computational cost comparable to the forward pass
 Need to know this exists
 How this is done is outside the scope of CS 188
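A small sketch using JAX, one of the libraries named above; the function g here is made up for illustration. You only program the function, and jax.grad returns a function that computes its exact gradient via the chain rule:

```python
import jax
import jax.numpy as jnp

def g(w, x):
    """An arbitrary differentiable function of weights w and input x."""
    return jnp.tanh(jnp.dot(w, x)) ** 2

grad_g = jax.grad(g)          # gradient w.r.t. the first argument, w
w = jnp.array([0.5, -1.0])
x = jnp.array([1.0, 2.0])
print(grad_g(w, x))           # all partial derivatives of g w.r.t. entries of w
```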
Summary of Key Ideas
 Optimize probability of label given input
 Continuous optimization
 Gradient ascent:
 Compute steepest uphill direction = gradient (= just vector of partial derivatives)
 Take step in the gradient direction
 Repeat (until held-out data accuracy starts to drop = “early stopping”)
 Deep neural nets
 Last layer = still logistic regression
 Now also many more layers before this last layer
 = computing the features
  the features are learned rather than hand-designed
 Universal function approximation theorem
 If neural net is large enough
 Then neural net can represent any continuous mapping from input to output with arbitrary accuracy
 But remember: need to avoid overfitting / memorizing the training data  early stopping!
 Automatic differentiation gives the derivatives efficiently (how is outside the scope of CS 188)
Next Time: Backpropagation

Editor's Notes
1. Please retain proper attribution, including the reference to ai.berkeley.edu. Thanks!