Deep learning architectures
Lizhi97@gmail.com
§ Linear Classifier
§ Feed Forward
§ LSTMs
§ GANs
§ Auto-encoders
§ Convolutional Neural Networks (CNN)
§ RNN (Recurrent)
§ RNN (Recursive)
§ Strategy
1. Select a network structure appropriate for the problem
2. Check for implementation bugs with gradient checks
3. Parameter initialization
4. Optimization
§ Gradient Descent
§ Stochastic Gradient Descent (SGD)
§ Mini-batch Stochastic Gradient Descent (SGD)
§ Momentum
§ Adagrad
5. Check if the model is powerful enough to overfit
§ The linear classifier is the building block of neural networks: it maps an input (e.g., the
pixels of a photo) to class scores with a single weighted sum.
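As a minimal sketch (NumPy; the shapes and names are illustrative, not from the original slides), a linear classifier is just a weight matrix, a bias, and a softmax over the resulting scores:

```python
import numpy as np

def softmax(z):
    # Shift for numerical stability before exponentiating.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def linear_classifier(x, W, b):
    """Score an input vector x (e.g., flattened photo pixels)
    against each class and return class probabilities."""
    return softmax(W @ x + b)

# Toy example: 4-dimensional input, 3 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
x = rng.normal(size=4)
print(linear_classifier(x, W, b))  # three probabilities summing to 1
```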
§ A feed-forward network is an artificial neural network wherein connections between the
units do not form a cycle. Information moves in only one direction, forward, from the
input nodes, through the hidden nodes (if any), to the output nodes. There are no cycles
or loops in the network.
§ Kinds
§ Single-Layer Perceptron
§ The inputs are fed directly to the outputs via a series of weights. By adding a logistic activation
function to the outputs, the model is identical to a classical logistic regression model.
§ Multi-Layer Perceptron
§ This class of networks consists of multiple layers of computational units, usually interconnected in
a feed-forward way. Each neuron in one layer has directed connections to the neurons of the
subsequent layer. In many applications the units of these networks apply a sigmoid function as an
activation function.
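A sketch of the forward pass of such a multi-layer perceptron, assuming one hidden layer of sigmoid units and illustrative shapes:

```python
import numpy as np

def mlp_forward(x, params):
    """Forward pass of a single-hidden-layer MLP with sigmoid units,
    matching the description above."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(params["W1"] @ x + params["b1"])     # hidden layer
    return sigmoid(params["W2"] @ h + params["b2"])  # output layer

# Illustrative shapes: 3-dim input, 5 hidden units, 2 outputs.
rng = np.random.default_rng(0)
params = {"W1": rng.normal(scale=0.1, size=(5, 3)), "b1": np.zeros(5),
          "W2": rng.normal(scale=0.1, size=(2, 5)), "b2": np.zeros(2)}
y = mlp_forward(rng.normal(size=3), params)
```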
§ An LSTM is well-suited to learning from experience to classify, process and predict time
series given time lags of unknown size and duration between important events. Relative
insensitivity to gap length gives LSTMs an advantage over alternative RNNs, hidden
Markov models and other sequence learning methods in numerous applications.
§ Long short-term memory (LSTM) is a type of recurrent neural network (RNN); its
recurrent connections feed information back into the network, letting it persist across
time steps rather than flowing strictly forward.
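A single LSTM time step can be sketched as follows (NumPy; the parameter names and the concatenated-input formulation are illustrative conventions, not from the original slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step. params holds weight matrices W_* (acting on
    [h_prev; x]) and biases b_* for the input, forget, output gates and
    the candidate cell state."""
    z = np.concatenate([h_prev, x])
    i = sigmoid(params["W_i"] @ z + params["b_i"])   # input gate
    f = sigmoid(params["W_f"] @ z + params["b_f"])   # forget gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])   # output gate
    g = np.tanh(params["W_g"] @ z + params["b_g"])   # candidate cell state
    c = f * c_prev + i * g    # gates decide what to forget and what to write
    h = o * np.tanh(c)        # hidden state exposed to the next layer
    return h, c

# Illustrative shapes: 3-dim input, 2-dim hidden state.
rng = np.random.default_rng(0)
params = {k: rng.normal(scale=0.1, size=(2, 5)) for k in ("W_i", "W_f", "W_o", "W_g")}
params |= {k: np.zeros(2) for k in ("b_i", "b_f", "b_o", "b_g")}
h, c = lstm_step(rng.normal(size=3), np.zeros(2), np.zeros(2), params)
```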
§ GANs or Generative Adversarial Networks are a class of artificial intelligence
algorithms used in unsupervised machine learning, implemented by a system of two
neural networks contesting with each other in a zero-sum game framework.
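Formally, this is the minimax game of Goodfellow et al. (2014): the discriminator D maximizes, and the generator G minimizes, the value function

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] +
  \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```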
§ The aim of an autoencoder is to learn a representation (encoding) for a set of data,
typically for the purpose of dimensionality reduction. Recently, the autoencoder
concept has become more widely used for learning generative models of data.
§ An autoencoder is an artificial neural network used for unsupervised learning of efficient codings.
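A minimal encoder/decoder sketch (illustrative shapes: a 64-dimensional input compressed to an 8-dimensional code); the reconstruction error is what training would minimize:

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.1, size=(8, 64))
W_dec = rng.normal(scale=0.1, size=(64, 8))

def encode(x):
    return np.tanh(W_enc @ x)          # the representation (the "code")

def decode(code):
    return W_dec @ code                # the reconstruction

x = rng.normal(size=64)
x_hat = decode(encode(x))
loss = np.mean((x - x_hat) ** 2)       # reconstruction error to minimize
```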
§ Convolutional neural networks (CNNs) have applications in image and video recognition,
recommender systems and natural language processing.
§ Typical building blocks:
§ Pooling
§ Convolution
§ Subsampling
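Naive sketches of the convolution and pooling operations named above (NumPy; 'valid' padding and 2×2 max pooling are illustrative choices):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D convolution (strictly, cross-correlation, as in
    most deep learning libraries): slide the kernel over the image and
    take a weighted sum at each position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool2(x):
    """2x2 max pooling: keep the strongest response in each block."""
    H, W = x.shape
    x = x[:H - H % 2, :W - W % 2]      # trim to even dimensions
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).max(axis=(1, 3))
```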
§ A recurrent neural network (RNN) is a class of artificial neural network where connections
between units form a directed cycle. This allows it to exhibit dynamic temporal behavior.
Unlike feedforward neural networks, RNNs can use their internal memory to process
arbitrary sequences of inputs.
§ This makes them applicable to tasks such as unsegmented, connected handwriting
recognition or speech recognition.
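A vanilla RNN step as a sketch; the hidden state h is the internal memory carried across an arbitrary-length sequence (shapes are illustrative):

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One step of a vanilla RNN: the new hidden state mixes the current
    input with the previous hidden state, which acts as internal memory."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b)

# Process an arbitrary-length sequence by carrying h forward.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(4, 3))
W_hh = rng.normal(scale=0.1, size=(4, 4))
b = np.zeros(4)
h = np.zeros(4)
for x in rng.normal(size=(7, 3)):      # 7 time steps of 3-dim inputs
    h = rnn_step(x, h, W_xh, W_hh, b)
```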
§ RNNs have been successful, for instance, in learning sequence and tree structures in
natural language processing, mainly continuous phrase and sentence representations
based on word embeddings.
§ A recursive neural network is a kind of deep neural network created by applying the same
set of weights recursively over a structure, to produce a structured prediction over variable-size input
structures, or a scalar prediction on it, by traversing a given structure in topological
order.
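A sketch of that idea for a binary parse tree: the same composition weights are applied at every internal node, bottom-up in topological order (the words, shapes and toy tree are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # illustrative embedding size
W = rng.normal(scale=0.1, size=(d, 2 * d))
b = np.zeros(d)

def compose(left, right):
    """Apply the same weights at every internal tree node."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Tree for "(the (cat sat))": leaves are word vectors, and the same
# composition function is reused bottom-up.
the, cat, sat = (rng.normal(size=d) for _ in range(3))
sentence_vec = compose(the, compose(cat, sat))
```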
§ Structure: single words, fixed windows, sentence-based, document level; bag of
words, recursive vs. recurrent, CNN
§ Nonlinearity (Activation Functions)
§ Implement your gradient
§ Implement a finite-difference computation by looping through the parameters of your
network, adding and subtracting a small epsilon (∼10⁻⁴), and estimating derivatives
§ Compare the two and make sure they are almost the same
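A sketch of such a gradient check (NumPy; central differences with ε = 10⁻⁴, compared here against a toy loss whose analytic gradient is known):

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-4):
    """Central-difference estimate of df/dtheta, one parameter at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        old = theta.flat[i]
        theta.flat[i] = old + eps
        plus = f(theta)
        theta.flat[i] = old - eps
        minus = f(theta)
        theta.flat[i] = old                  # restore the parameter
        grad.flat[i] = (plus - minus) / (2 * eps)
    return grad

# Compare against the analytic gradient: d/dt sum(t^2) = 2t.
theta = np.random.default_rng(0).normal(size=5)
f = lambda t: np.sum(t ** 2)
assert np.allclose(numerical_grad(f, theta), 2 * theta, atol=1e-6)
```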
§ Using Gradient Checks
§ If your gradient check fails and you don’t know why:
§ Simplify your model until you have no bug!
§ What now? Create a very tiny synthetic model and dataset
§ Example: Start from simplest model then go to what you want:
§ Only softmax on fixed input
§ Backprop into word vectors and softmax
§ Add single unit single hidden layer
§ Add multi unit single layer
§ Add second layer single unit, add multiple units, bias
§ Add one softmax on top, then two softmax layers
§ Add bias
§ Initialize hidden layer biases to 0 and output (or reconstruction) biases to optimal
value if weights were 0 (e.g., mean target or inverse sigmoid of mean target).
§ Initialize weights ∼ Uniform(−r, r), with r inversely proportional to fan-in (previous layer
size) and fan-out (next layer size), as sketched below.
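One standard instantiation consistent with this description (assumed here; the original slide's formula is not reproduced) is Xavier/Glorot uniform initialization:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Glorot & Bengio (2010) uniform init: r = sqrt(6 / (fan_in + fan_out)),
    so r shrinks as fan-in and fan-out grow."""
    rng = rng or np.random.default_rng()
    r = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-r, r, size=(fan_out, fan_in))

W1 = xavier_uniform(fan_in=300, fan_out=100)   # hidden-layer weights
b1 = np.zeros(100)                             # hidden biases start at 0
```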
§ Gradient Descent
§ Stochastic Gradient Descent (SGD)
§ Mini-batch Stochastic Gradient Descent (SGD)
§ Momentum
§ Adagrad
§ Gradient descent is a first-order iterative optimization algorithm for finding the minimum
of a function. To find a local minimum of a function using gradient descent, one takes
steps proportional to the negative of the gradient (or of the approximate gradient)
of the function at the current point. If instead one takes steps proportional to the
positive of the gradient, one approaches a local maximum of that function; the
procedure is then known as gradient ascent.
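The corresponding update rule, with learning rate η and objective J:

```latex
\theta_{t+1} = \theta_t - \eta\, \nabla_\theta J(\theta_t)
```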
§ Gradient descent uses the total gradient over all examples per update; SGD updates
after only one or a few examples, as in the update rule below:
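A standard form of the per-example SGD update (assumed here; the slide's own formula is not reproduced), for a training pair (x⁽ⁱ⁾, y⁽ⁱ⁾):

```latex
\theta_{t+1} = \theta_t - \eta\, \nabla_\theta J\big(\theta_t;\, x^{(i)}, y^{(i)}\big)
```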
§ Ordinary gradient descent as a batch method is very slow and should never be used.
Use a 2nd-order batch method such as L-BFGS instead.
§ On large datasets, SGD usually wins over all batch methods. On smaller datasets
L-BFGS or Conjugate Gradients win. Large-batch L-BFGS extends the reach of L-BFGS
[Le et al. ICML 2011].
§ Gradient descent uses the total gradient over all examples per update; SGD updates
after only 1 example
§ Most commonly used now; size of each mini-batch B: 20 to 1000
§ Helps parallelize any model by computing gradients for multiple elements of the
batch in parallel
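A sketch of the mini-batch SGD loop (the grad_fn signature and hyperparameters are illustrative assumptions):

```python
import numpy as np

def minibatch_sgd(grad_fn, theta, X, y, lr=0.01, B=32, epochs=5, seed=0):
    """Plain mini-batch SGD: shuffle, slice batches of size B, and step
    against the gradient of each batch. grad_fn(theta, Xb, yb) is assumed
    to return dJ/dtheta on the batch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, B):
            idx = order[start:start + B]
            theta = theta - lr * grad_fn(theta, X[idx], y[idx])
    return theta
```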
§ Idea: add a fraction of the previous update v to the current one. When the gradient keeps
pointing in the same direction, this will increase the size of the steps taken towards
the minimum.
§ Reduce the global learning rate when using a lot of momentum
§ Update rule (see below)
§ v is initialized at 0
§ Momentum is often increased after some epochs (0.5 → 0.99)
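A standard form of the momentum update consistent with the bullets above, with momentum fraction μ, learning rate η, and velocity v initialized at 0:

```latex
v_{t+1} = \mu\, v_t - \eta\, \nabla_\theta J(\theta_t), \qquad
\theta_{t+1} = \theta_t + v_{t+1}
```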
§ Adaptive learning rates for each parameter!
§ The learning rate adapts differently for each parameter, and rare parameters get
larger updates than frequently occurring parameters. Word vectors!
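A per-parameter Adagrad update sketch (NumPy; the cache accumulates squared gradients, so rarely updated parameters keep relatively large steps):

```python
import numpy as np

def adagrad_update(theta, grad, cache, lr=0.01, eps=1e-8):
    """Adagrad: scale each parameter's step by the inverse square root
    of its accumulated squared gradients."""
    cache += grad ** 2
    theta -= lr * grad / (np.sqrt(cache) + eps)
    return theta, cache
```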
§ If the model cannot overfit the training data, change the model structure or make the model “larger”
§ If you can overfit: Regularize to prevent overfitting:
§ Simple first step: Reduce model size by lowering number of units and layers and other
parameters
§ Standard L1 or L2 regularization on weights
§ Early Stopping: Use parameters that gave best validation error
§ Sparsity constraints on hidden activations, e.g., an L1 or KL-divergence penalty added to the cost
§ Dropout
§ Training time: at each instance of evaluation (in online SGD training), randomly set 50% of the inputs to
each neuron to 0
§ Test time: halve the model weights (twice as many inputs are now active). This prevents feature
co-adaptation: a feature cannot be useful only in the presence of particular other features
§ In a single layer: a kind of middle ground between Naïve Bayes (where all feature weights are set
independently) and logistic regression models (where weights are set in the context of all others)
§ Can be thought of as a form of model bagging
§ It also acts as a strong regularizer
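A sketch of the dropout scheme described above (NumPy; note it scales at test time, as on the slide, rather than using the now more common inverted dropout):

```python
import numpy as np

def dropout_forward(x, p_drop=0.5, train=True, rng=None):
    """Vanilla dropout: zero activations with probability p_drop at
    training time; scale by (1 - p_drop) at test time so expected
    activations match."""
    if train:
        rng = rng or np.random.default_rng()
        mask = rng.random(x.shape) >= p_drop
        return x * mask
    return x * (1.0 - p_drop)   # test time: 'halve the weights' for p=0.5
```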
