The document discusses challenges in training deep neural networks and solutions to those challenges. Training deep neural networks with many layers and parameters can be slow and prone to overfitting. A key challenge is the vanishing gradient problem, where gradients shrink exponentially as they propagate backward through many layers, making the earlier layers very slow to train. Solutions include initialization techniques such as He initialization and activation functions such as ReLU and leaky ReLU, which do not saturate for positive inputs and so help keep gradients from vanishing. Later improvements include the ELU activation function.
2. Training Deep Neural Nets
Training Deep Neural Nets
● In the previous chapter
○ We introduced artificial neural networks and
○ Trained our first deep neural network
○ It was a shallow NN
■ With only two hidden layers
○ This shallow neural network will not work if
■ We have to deal with complex problems such as
■ Detecting hundreds of objects in high-resolution images
3. Training Deep Neural Nets
● In that case, we may need to train a deeper neural network containing
○ Many layers
○ Each layer containing hundreds of neurons
○ Connected by hundreds of thousands of connections
4. Training Deep Neural Nets
Question
What will be the challenges in training such a
deep neural network?
5. Training Deep Neural Nets
● We may face the problem of vanishing gradients (which we will cover
shortly)
● Training such a large network will take a lot of time
● Such a model, with millions of parameters, may be prone to overfitting
6. Training Deep Neural Nets
● In this chapter we will
○ Go through the vanishing gradients problem
■ And explore solutions to it
○ Look at various optimizers that can speed up training large models
● We will also look at
○ Popular regularization techniques for large neural networks
8. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● As discussed earlier
○ The backpropagation algorithm works by going from the
○ Output layer to the input layer
○ Propagating the error gradient on the way
9. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Once the algorithm computes the gradient of the cost function
○ With regard to each parameter in the network
○ Then it uses these gradients to update each parameter
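The update step can be sketched as plain gradient descent. This is a minimal illustration only: the learning rate `eta`, the helper name `gradient_descent_step`, and the toy one-parameter cost are assumptions made for the example, not taken from the slides.

```python
def gradient_descent_step(params, grads, eta=0.1):
    """Return updated parameters: each one moves against its own gradient."""
    return [p - eta * g for p, g in zip(params, grads)]


# Hypothetical toy cost J(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = [0.0]
for _ in range(100):
    grad = [2.0 * (w[0] - 3.0)]
    w = gradient_descent_step(w, grad)

print(w[0])  # approaches the minimum at w = 3
```

Note that the size of each update is proportional to the gradient, so a parameter whose gradient is close to zero barely moves.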
10. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Here the problem is that
○ Gradients often get smaller and smaller
○ As the algorithm progresses down to the early layers
11. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Because of this,
○ The connection weights of the lower layers remain virtually unchanged
○ And training never converges to a good solution
○ This is called the vanishing gradients problem
12. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s understand the vanishing gradients problem with an example
13. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s recall the sigmoid function
○ A popular activation function for ANNs in classification
○ Its output lies between 0 and 1
Check the code to plot the sigmoid function in the notebook
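The notebook itself is not reproduced here. As a stand-in, the following minimal sketch evaluates the sigmoid at a few points; it prints values instead of plotting so that it needs only the standard library, and the function name `sigmoid` is chosen for this example.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: maps any real x into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Sample a few inputs: large negative values approach 0, large positive approach 1.
for x in (-6, -2, 0, 2, 6):
    print(f"sigmoid({x:>2}) = {sigmoid(x):.4f}")
```

The curve flattens out at both ends, which is where the function is said to saturate.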
15. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s look at the derivative of sigmoid function
Sigmoid function: $S(z) = \frac{1}{1 + e^{-z}}$
Derivative of sigmoid: $S'(z) = S(z)\,(1 - S(z))$
16. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s plot the derivative of sigmoid function
Derivative of Sigmoid function
17. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s plot the derivative of sigmoid function
● As we can see
○ The output of the derivative of the Sigmoid function is
○ Always between 0 and ¼ (0.25)
Derivative of Sigmoid function
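A quick numerical check of the claim above, as a minimal NumPy sketch (the function names and the grid of z values are illustrative choices, not taken from the notebook):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid S(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative S'(z) = S(z) * (1 - S(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Evaluate the derivative on a grid that includes z = 0.
z = np.linspace(-10, 10, 2001)
d = sigmoid_derivative(z)

print("largest derivative on the grid:", d.max())
print("smallest derivative on the grid:", d.min())
print("derivative at z = 10:", sigmoid_derivative(10.0))
```

The largest value sits at z = 0, and the derivative is already tiny at z = 10, where the sigmoid has saturated.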
18. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Now let’s look at the below univariate neural network
○ It has 2 hidden layers
○ act() is a sigmoid activation function
○ J returns the aggregate error of the model
Univariate 2-layer Neural Network
19. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Now as per the chain rule in backpropagation
○ Rate of change in error because of weight w1 is
Univariate 2-layer Neural Network
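Written out, and assuming the network is input → hidden1 → hidden2 → output with one unit per layer, the chain rule gives the product below. The names $z_3$ and $w_3$ for the output unit's pre-activation and weight are hypothetical labels introduced here, and biases are ignored:

```latex
\frac{\partial J}{\partial w_1}
  = \frac{\partial J}{\partial \text{output}}
    \cdot \frac{\partial \text{output}}{\partial \text{hidden}_2}
    \cdot \frac{\partial \text{hidden}_2}{\partial \text{hidden}_1}
    \cdot \frac{\partial \text{hidden}_1}{\partial w_1}

% One individual factor, with output = act(z_3) and z_3 = w_3 \cdot hidden_2:
\frac{\partial \text{output}}{\partial \text{hidden}_2}
  = \text{act}'(z_3) \cdot w_3
```

Each individual factor is a sigmoid derivative times a weight, which is why the sizes of both matter in the slides that follow.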
20. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s focus on individual derivative for now
Univariate 2-layer Neural Network
21. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● A traditional approach to weight initialization in a neural network is to
○ Draw the weights from a normal distribution with
■ Mean of 0 and
■ Standard deviation of 1
○ Hence, most weights (about 68% of them) lie
■ Between -1 and 1
22. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Now let’s come back to our individual derivative
● As we have seen earlier
○ The derivative of the sigmoid function lies between 0 and ¼
● And we have just discussed that
○ Most weights in the neural network lie between -1 and 1
Annotation on the derivative: sigmoid derivative ≤ ¼, |weight| < 1
23. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Important - If we multiply two numbers between 0 and 1
○ Then the result is always smaller than either of them
○ For example
○ ⅓ * ¼ = 1/12 (which is less than both ⅓ and ¼)
● Here we are multiplying 2 values whose magnitudes are between 0 and 1
○ So the resulting gradient will be smaller
Annotation on the derivative: sigmoid derivative ≤ ¼, |weight| < 1
24. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Now let’s take another individual derivative
Univariate 2-layer Neural Network
25. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● This derivative has
○ Two sigmoid activation functions
○ And here we multiply 4 values whose magnitudes are between 0 and 1
○ So this gradient will be much smaller than
○ The earlier derivative (∂output / ∂hidden2)
Annotation on the derivative: each sigmoid derivative ≤ ¼, each |weight| < 1
26. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● So we can see that in backpropagation, as we move backward
○ The gradient becomes smaller and smaller at every layer
○ And it becomes tiny in the early layers (the layers closest to the input)
○ This is called the vanishing gradient problem
Annotation on the derivative: each sigmoid derivative ≤ ¼, each |weight| < 1
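To see the shrinkage numerically, here is a small sketch of a chain of one-unit sigmoid layers. The helper name, the input value, and the uniform weight range are assumptions made for illustration; the slides only say that weights mostly lie between -1 and 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_factors(n_layers, x=0.5, seed=0):
    """Run a forward pass through a chain of one-unit sigmoid layers and
    return |d output / d (input of layer k)| for every layer k
    (index 0 is the layer closest to the input)."""
    rng = np.random.default_rng(seed)
    weights = rng.uniform(-1.0, 1.0, size=n_layers)  # assumed weight range
    h = x
    local = []  # per-layer factor act'(z) * w from the chain rule
    for w in weights:
        z = w * h
        h = sigmoid(z)
        local.append(h * (1.0 - h) * w)
    # Multiply the factors from the output back toward the input.
    return np.abs(np.cumprod(local[::-1]))[::-1]

grads = backprop_factors(n_layers=10)
for k, g in enumerate(grads, start=1):
    print("layer {:2d}: gradient magnitude {:.3e}".format(k, g))
```

Each factor is at most ¼ in magnitude, so the value for layer k is bounded by (¼) raised to the number of layers between it and the output.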
27. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Let’s understand it once again
● Below is a neural network with 2 hidden layers
Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer (backpropagation runs from the output layer back to the input layer)
28. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Gradients will be largest in the output layer
○ Hence the output layer is the easiest to train
Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer, with the largest gradients in the output layer
29. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Hidden layer 2 has
○ Smaller gradients than the output layer
Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer, with smaller gradients in hidden layer 2 than in the output layer
30. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Hidden layer 1 has
○ Smaller gradients than hidden layer 2
Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer, with smaller gradients in hidden layer 1 than in hidden layer 2
31. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● As the layer nearest the input is the farthest from the output layer
○ Its derivative will be the longest expression (using the chain rule)
○ Hence it will contain the most sigmoid derivatives
○ And it will have the smallest derivative
○ This makes the lower layers the slowest to train
Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer, with the smallest derivative at the input end
33. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
First problem
● Since the gradient becomes very small in the early layers (those nearest the input)
○ The early layers become very slow to train
Figure: loss surface. Flat region: small gradients, Gradient Descent converges slowly. Steep region: larger gradients, Gradient Descent converges fast.
34. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
First problem
● Also, because of the small steps
○ Training may get stuck in a local minimum instead of reaching the global minimum
Figure: loss surface. Flat region: small gradients, Gradient Descent converges slowly. Steep region: larger gradients, Gradient Descent converges fast.
35. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
Second problem
● Since the later layers depend on the early layers
○ If the early layers are not accurate
○ Then the later (upper) layers just build on this inaccuracy
○ And the entire neural net gets corrupted
● Early layers are responsible for
○ Detecting simple patterns and are
○ The building blocks of the neural network
○ Hence it is important that the early layers are accurate
36. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
Second problem
● For example, in face recognition
○ Early layers detect the edges
○ Which get combined to form facial features later in the network
● And if the early layers get it wrong
○ The result built up by the neural network will be wrong
Figure: the original image next to the image as seen by the neural network
37. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
Exploding Gradients Problem
● Alongside the vanishing gradients problem
○ We can also have the exploding gradients problem
○ If the factors in the chain-rule product are larger than 1 in magnitude (repeatedly
multiplying numbers greater than 1 makes the product grow exponentially)
○ Because of this, some layers may get extremely large weight updates and
○ The algorithm diverges instead of converging
○ This is called the exploding gradients problem
38. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● As we have seen, deep neural networks suffer from unstable gradients
○ Different layers may learn at widely different speeds
● Because of the vanishing gradients problem
○ Deep neural networks were abandoned for a long time
○ Training the early layers correctly is the foundation of the network
○ But it proved too difficult at that time because of
○ The available activation functions and hardware
39. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● In 2010, Xavier Glorot and Yoshua Bengio published a paper titled
○ “Understanding the Difficulty of Training Deep Feedforward
Neural Networks”
● Authors of this paper suggested that root cause of vanishing gradient
problem is
○ Nature of the sigmoid activation function derivative
40. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● If the input is large in magnitude (positive or negative),
○ The sigmoid function saturates at 0 or 1
○ And its derivative becomes extremely close to 0
41. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Thus when backpropagation kicks in
○ There is no gradient to propagate back through the network
○ And the little gradient that exists gets diluted as
○ Backpropagation reaches the early layers
○ So there is nothing left for early layers
42. Training Deep Neural Nets
Question
So what is the solution of vanishing gradients
problem?
43. Training Deep Neural Nets
Answer:
Good strategy for initializing weights
&
Use better activation functions
44. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
● Kaiming He et al. suggested a strategy for initializing the weights
○ To avoid the vanishing gradients problem
○ It’s called He initialization
○ The scale of the initial weights depends on the activation function and on the layer's number of inputs
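As a rough illustration of the idea for ReLU layers, here is a NumPy sketch of He initialization with a normal distribution, where the standard deviation is sqrt(2 / fan_in). The function name `he_normal` and the seed are illustrative choices; the layer sizes mirror the MNIST example used in these slides.

```python
import numpy as np

def he_normal(fan_in, fan_out, seed=None):
    """Draw a (fan_in, fan_out) weight matrix from N(0, 2 / fan_in),
    the He initialization commonly used with ReLU activations."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))

# Same sizes as the MNIST example: 28 * 28 inputs, 300 hidden units.
W1 = he_normal(28 * 28, 300, seed=42)
print("shape:", W1.shape)
print("empirical std:", W1.std(), "target std:", np.sqrt(2.0 / (28 * 28)))
```

Compared with a fixed standard deviation of 1, the weights get smaller as the number of inputs grows, which keeps the scale of each layer's output from drifting.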
45. Training Deep Neural Nets
Vanishing / Exploding Gradients Problem
He Initialization
import tensorflow as tf  # TensorFlow 1.x API
reset_graph()  # assumed to be a helper from the notebook that resets the default graph
n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
he_init = tf.contrib.layers.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")
46. Training Deep Neural Nets
ReLU Activation Function
● It turns out that ReLU activation function works better for Deep Neural
Networks
○ Because it does not saturate for positive values
○ And it is quite fast to compute
ReLU (z) = max (0, z)
47. Training Deep Neural Nets
ReLU Activation Function
Derivative of ReLU activation function
● It is not differentiable at z = 0
Figure: derivative of the ReLU activation function. For positive inputs, the derivative is always 1
48. Training Deep Neural Nets
ReLU Activation Function
Derivative of ReLU activation function
● So with ReLU the activation derivative does not shrink the gradient
● As long as the inputs are positive
Figure: derivative of the ReLU activation function. For positive inputs, the derivative is always 1
49. Training Deep Neural Nets
Question
Do you see any problem with the derivative of
ReLU activation function?
50. Training Deep Neural Nets
ReLU Activation Function
● ReLU suffers from a problem known as dying ReLUs
● For negative inputs the derivative is zero
Figure: derivative of the ReLU activation function. For negative inputs, the derivative is always 0
51. Training Deep Neural Nets
ReLU Activation Function
Dying ReLUs
● During training
○ Some neurons effectively die
○ They stop outputting anything other than 0
○ Since the gradient is 0 for negative inputs, a dead neuron blocks backpropagation through it and is unlikely to recover
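A minimal NumPy sketch of a dead unit. The inputs, weights, and the large negative bias are all hypothetical values chosen to force every pre-activation below zero:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Convention used here: the derivative at z = 0 is taken to be 0.
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 3)))  # 100 non-negative input rows
w = np.array([0.5, 0.2, 0.1])          # hypothetical weights
b_dead = -100.0                        # hypothetical bias pushed far negative

z = X @ w + b_dead                     # pre-activations, all negative
outputs = relu(z)
# Gradient of the unit's output w.r.t. its weights, averaged over the batch.
grad_w = (relu_grad(z)[:, None] * X).mean(axis=0)

print("any nonzero output:", bool(np.any(outputs != 0)))
print("gradient w.r.t. weights:", grad_w)
```

Because the gradient with respect to the weights is exactly zero, Gradient Descent has nothing to update and the unit stays dead.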
53. Training Deep Neural Nets
Leaky ReLU
● To solve the dying ReLUs problem we use
○ A variant of ReLU known as leaky ReLU
● Leaky ReLU has a small, nonzero gradient when the input is negative
Leaky ReLU: $\text{LeakyReLU}_\alpha(z) = \max(\alpha z, z)$
$\alpha$ is the hyperparameter which defines how much the function “leaks”; it is typically set to 0.01
54. Training Deep Neural Nets
Leaky ReLU
● This small gradient ensures that
○ Leaky ReLUs never die
● Recent research has shown that
○ Setting α = 0.2 (huge leak) can result in better performance than α = 0.01 (small leak)
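For reference, a NumPy sketch of leaky ReLU and its derivative. The function names and the sample inputs are illustrative; the max(αz, z) form assumes 0 < α < 1.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z); valid for 0 < alpha < 1."""
    return np.maximum(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    """Slope is 1 for positive inputs and alpha otherwise
    (alpha is used at z = 0 by convention here)."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print("outputs:  ", leaky_relu(z))
print("gradients:", leaky_relu_grad(z))
print("gradients with alpha = 0.2:", leaky_relu_grad(z, alpha=0.2))
```

Unlike plain ReLU, the gradient is never exactly zero, so a unit that drifts into the negative region can still be updated.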
56. Training Deep Neural Nets
Leaky ReLU
# Implementing Leaky ReLU in TensorFlow (1.x API)
reset_graph()
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
def leaky_relu(z, name=None):
    # max(alpha * z, z) with alpha = 0.01
    return tf.maximum(0.01 * z, z, name=name)
hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu,
                          name="hidden1")
57. Training Deep Neural Nets
Leaky ReLU
Follow the code in the notebook to train a
neural network on MNIST using the Leaky
ReLU
58. Training Deep Neural Nets
ELU Activation Function
● In 2015, Djork-Arné Clevert et al proposed a new activation function
○ ELU - Exponential Linear Unit
● It outperformed all the ReLU variants in their experiments
○ Training time was reduced and
○ Neural network performed better on the test set
60. Training Deep Neural Nets
ELU Activation Function
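ELU equation: $\text{ELU}_\alpha(z) = \alpha(\exp(z) - 1)$ if $z < 0$, and $z$ if $z \ge 0$
● In the ELU equation, the hyperparameter α defines the value
○ That the ELU function approaches (namely -α) when z is a large negative number
○ α is usually set to 1
○ But we can tweak it like any other hyperparameter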
61. Training Deep Neural Nets
ELU Activation Function
Advantage over ReLU
● It has a nonzero gradient for z < 0
○ Which avoids the dying units issue
ELU ReLU
62. Training Deep Neural Nets
ELU Activation Function
Advantage over ReLU
● With α = 1, it is smooth everywhere, including around z = 0
○ This helps speed up Gradient Descent
Figure: ELU and ReLU plots
63. Training Deep Neural Nets
ELU Activation Function
Drawback compared to ReLU
● Because of the exponential function
○ ELU is slower to compute than ReLU
● During training this slowness is compensated by
○ The faster convergence rate
● However, at test time
○ ELU networks are slower than ReLU networks
64. Training Deep Neural Nets
ELU Activation Function
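# ELU plot
import numpy as np
import matplotlib.pyplot as plt
z = np.linspace(-5, 5, 200)  # input range (assumed; it matches the axis limits below)
def elu(z, alpha=1):
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)
plt.plot(z, elu(z), "b-", linewidth=2)
plt.plot([-5, 5], [0, 0], 'k-')  # horizontal axis
plt.plot([-5, 5], [-1, -1], 'k--')  # asymptote at -alpha
plt.plot([0, 0], [-2.2, 3.2], 'k-')  # vertical axis
plt.grid(True)
plt.title(r"ELU activation function ($\alpha=1$)", fontsize=14)
plt.axis([-5, 5, -2.2, 3.2])
plt.show()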
65. Training Deep Neural Nets
ELU Activation Function
# Implementing ELU in TensorFlow
reset_graph()
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu,
                          name="hidden1")
66. Training Deep Neural Nets
SELU Activation Function
● In June 2017, Günter Klambauer, Thomas Unterthiner, Andreas Mayr and Sepp Hochreiter
○ Proposed the SELU (Scaled ELU) activation function
○ In their experiments it outperformed the other activation functions
○ Very significantly for deep neural networks
○ Even for a 100-layer deep neural network
67. Training Deep Neural Nets
SELU Activation Function
SELU Function in Python
def selu(z,
         scale=1.0507009873554804934193349852946,
         alpha=1.6732632423543772848170429916717):
    # Scaled ELU; relies on the elu() function defined earlier
    return scale * elu(z, alpha)
68. Training Deep Neural Nets
SELU Activation Function
Plot SELU Function
plt.plot(z, selu(z), "b-", linewidth=2)
plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([-5, 5], [-1.758, -1.758], 'k--')
plt.plot([0, 0], [-2.2, 3.2], 'k-')
plt.grid(True)
plt.title(r"SELU activation function", fontsize=14)
plt.axis([-5, 5, -2.2, 3.2])
plt.show()
69. Training Deep Neural Nets
SELU Activation Function
● With this activation function
○ Even a 100 layer deep neural network
○ Preserves roughly mean 0 and standard deviation 1 across all layers
○ Avoiding the exploding/vanishing gradients problem
70. Training Deep Neural Nets
SELU Activation Function
Check the mean and standard deviation in the deep layers
np.random.seed(42)
Z = np.random.normal(size=(500, 100))  # standardized inputs
for layer in range(100):
    W = np.random.normal(size=(100, 100), scale=np.sqrt(1/100))
    Z = selu(np.dot(Z, W))
    means = np.mean(Z, axis=1)
    stds = np.std(Z, axis=1)
    if layer % 10 == 0:
        print("Layer {}: {:.2f} < mean < {:.2f}, {:.2f} < std deviation < {:.2f}".format(
            layer, means.min(), means.max(), stds.min(), stds.max()))
71. Training Deep Neural Nets
SELU Activation Function
Follow the code in the notebook to create a
neural net for MNIST using the SELU activation
function
73. Training Deep Neural Nets
Which Activation Function to Use?
Answer
In general,
SELU > ELU > Leaky ReLU > ReLU > tanh > logistic
Vanishing gradient
74. Training Deep Neural Nets
Which Activation Function to Use?
● If runtime performance is important, then
○ Prefer leaky ReLUs over ELUs
● Also, instead of tuning the hyperparameter α
○ We may use the default suggested values
■ 0.2 for leaky ReLU and
■ 1 for ELU
● If we have spare time and computing power
○ Use cross-validation to evaluate the other activation functions
76. Training Deep Neural Nets
Batch Normalization
● Using He initialization and a proper activation function
○ Like ELU or any variant of ReLU
○ The vanishing / exploding gradients problem is significantly reduced at the start of training
○ But there is no guarantee that
○ This problem will not come back during training
● In 2015, Sergey Ioffe and Christian Szegedy
○ Proposed a technique called Batch Normalization (BN)
○ To address the vanishing / exploding gradients problems
77. Training Deep Neural Nets
Batch Normalization
● Batch Normalization helps with
○ The vanishing gradients problem and
○ It also helps the neural network learn faster
● Let’s understand Batch Normalization
78. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● As discussed earlier in machine learning projects
○ Gradient Descent does not work well
○ If the input features are on very different scales
○ For example, the number of miles an individual has driven in the last 5 years
■ This feature can vary over a wide range
■ As someone might have driven 100,000 miles
■ While another person might have driven 100 miles
■ So here the range is 100 to 100,000
79. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● One of the techniques of feature scaling is
○ Standardization
● In standardization, features are rescaled
○ So that the output has the properties of
○ A standard normal distribution with
■ Zero mean and
■ Unit variance
Notation: μ is the mean, σ is the standard deviation
80. Training Deep Neural Nets
Batch Normalization - Feature Scaling
Standardization
● The general method of calculation
○ Calculate the mean and standard deviation of each feature
○ Subtract the mean from each value of the feature
○ Divide the result of the previous step by the feature's standard
deviation
Standardized value: $z = \frac{x - \mu}{\sigma}$
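The three steps above can be sketched in NumPy as follows. The function name, the small `eps` guard against constant features, and the sample mileage values are illustrative assumptions:

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Standardize each column (feature) of X to zero mean and unit variance."""
    mean = X.mean(axis=0)            # step 1: per-feature mean
    std = X.std(axis=0)              # step 1: per-feature standard deviation
    return (X - mean) / (std + eps)  # steps 2 and 3

# Hypothetical "miles driven in the last 5 years" feature.
miles = np.array([[100.0], [5_000.0], [40_000.0], [100_000.0]])
Z = standardize(miles)

print("standardized values:", Z.ravel())
print("mean:", Z.mean(), "std:", Z.std())
```

After standardization the feature no longer spans 100 to 100,000, so it contributes to the gradient on the same scale as other standardized features.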
81. Training Deep Neural Nets
Batch Normalization - Feature Scaling
Standardization
● As a preprocessing step
○ We apply standardization to the input dataset
○ So that all the features will have same scale
■ With 0 mean
■ And unit standard deviation
○ And Gradient Descent converges faster
82. Training Deep Neural Nets
Batch Normalization - Feature Scaling
Input Layer Output LayerHidden Layer 1 Hidden Layer 2
Two Layer - Neural Network
Normalized Input Features
83. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● As we have discussed, if we normalize input features
○ It helps in converging faster
● If we normalize hidden layers also in deep neural network
○ Then it will speed up the learning
○ This is what we do in Batch normalization
■ We normalize hidden layers
○ Now let’s understand how batch normalization is done in deep
neural networks
86. Training Deep Neural Nets
Batch Normalization - Algorithm
Algorithm
for T in 1 ... number of mini-batches:
    Compute forward propagation for mini-batch X(T)
        In each hidden layer, normalize the inputs using the mini-batch mean and standard deviation
    Use backpropagation and update the parameters
87. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Let’s say we have a simple network and
● Here normalizing the input features helps to learn W and b more
efficiently
Figure: inputs x1, x2, x3 feed a single unit with parameters W, b that outputs ŷ. Step 1 - Calculate mean
90. Training Deep Neural Nets
Batch Normalization - Feature Scaling
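● Normalize the input features
Step 1 - Calculate the mean of each feature
Step 2 - Calculate the standard deviation (SD) of each feature
Step 3 - Normalize: subtract the mean and divide by the SD
Figure: inputs x1, x2, x3 feed a single unit with parameters W, b that outputs ŷ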
91. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
$\mu_B$ is the mean, evaluated over the whole mini-batch B
92. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
$\sigma_B$ is the standard deviation, evaluated over the whole mini-batch B
93. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
$m_B$ is the number of instances in the mini-batch B
94. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
$\hat{x}^{(i)}$ is the normalized (zero-centered and rescaled) value for instance i
95. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
ε is a tiny number (a smoothing term) added to avoid division by zero
96. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
γ and β are parameters
which are learnt during
training
97. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
$z^{(i)}$ is the output of the BN operation: a scaled and shifted version of the normalized value
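Putting the symbols from the preceding slides together, a training-time Batch Normalization forward pass can be sketched in NumPy as below. This is a minimal illustration under the usual formulation (mean, variance, normalize, then scale and shift); the function name, the value of `eps`, and the sample data are assumptions, and the backward pass is omitted.

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    """Batch Normalization over a mini-batch X of shape (m_B, n_features).

    mu_B      = mean of the mini-batch
    sigma_B^2 = variance of the mini-batch
    x_hat     = (x - mu_B) / sqrt(sigma_B^2 + eps)
    z         = gamma * x_hat + beta
    """
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    Z = gamma * X_hat + beta
    return Z, mu, var

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # one mini-batch, 4 features
gamma = np.ones(4)   # scale, learned during training
beta = np.zeros(4)   # offset, learned during training

Z, mu, var = batch_norm_forward(X, gamma, beta)
print("per-feature mean after BN:", Z.mean(axis=0))
print("per-feature std after BN: ", Z.std(axis=0))
```

With γ = 1 and β = 0 the output is simply the standardized mini-batch; during training the network is free to learn other values if a different scale or offset works better.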
98. Training Deep Neural Nets
Batch Normalization - Feature Scaling
● Generalize previous steps for the deep neural network
● In general, four parameters are learned for each batch-normalized layer
○ μ (mean) and σ (SD), estimated with moving averages during training
○ γ (scale) and β (offset), trained by backpropagation
99. Training Deep Neural Nets
Question
At test time, how do we run a deep neural network trained with batch
normalization, when there is no mini-batch from which to compute the
mean and standard deviation?
100. Training Deep Neural Nets
Answer
By using estimates of the whole training set’s mean and standard deviation,
computed as moving averages of the mini-batch statistics during training
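One way to picture this is the sketch below, which keeps exponential moving averages of the mini-batch statistics during training and uses them to normalize a single instance at test time. The class name, the momentum value of 0.9, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

class RunningBatchStats:
    """Moving averages of mini-batch mean and variance (hypothetical helper)."""

    def __init__(self, n_features, momentum=0.9):
        self.momentum = momentum
        self.mean = np.zeros(n_features)
        self.var = np.ones(n_features)

    def update(self, X_batch):
        # Called once per training mini-batch.
        m = self.momentum
        self.mean = m * self.mean + (1 - m) * X_batch.mean(axis=0)
        self.var = m * self.var + (1 - m) * X_batch.var(axis=0)

    def normalize(self, X, gamma, beta, eps=1e-5):
        # Used at test time: no mini-batch statistics are needed.
        X_hat = (X - self.mean) / np.sqrt(self.var + eps)
        return gamma * X_hat + beta

rng = np.random.default_rng(0)
stats = RunningBatchStats(n_features=2)
for _ in range(200):  # simulated training mini-batches
    stats.update(rng.normal(loc=3.0, scale=2.0, size=(32, 2)))

single = np.array([[3.0, 3.0]])  # one test instance
out = stats.normalize(single, gamma=np.ones(2), beta=np.zeros(2))
print("running mean:", stats.mean, "running var:", stats.var)
print("normalized test instance:", out)
```

The test instance is normalized with statistics gathered over many mini-batches, so a prediction for one sample does not depend on which other samples happen to be in the batch.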
101. Training Deep Neural Nets
Follow code in the notebook to implement
Batch Normalization with TensorFlow
102. Training Deep Neural Nets
Batch Normalization
Drawbacks
● With batch normalization,
○ The neural network makes slower predictions
○ Due to the extra computations required at each layer
● If we need fast predictions
○ We should first check
■ How plain ELU + He initialization performs
■ Before trying batch normalization
104. Training Deep Neural Nets
Gradient Clipping
● We can reduce the exploding gradients problem
○ By clipping the gradients during backpropagation
○ So that they never exceed some threshold
○ This is called Gradient Clipping
105. Training Deep Neural Nets
Gradient Clipping
>>> threshold = 1.0
>>> optimizer = tf.train.GradientDescentOptimizer(learning_rate)
>>> grads_and_vars = optimizer.compute_gradients(loss)
>>> capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
...               for grad, var in grads_and_vars]
>>> training_op = optimizer.apply_gradients(capped_gvs)
How to apply Gradient Clipping?
Step 1
● Specify threshold and optimizer
106. Training Deep Neural Nets
Gradient Clipping
How to apply Gradient Clipping?
Step 2
● Call the optimizer’s compute_gradients() method
107. Training Deep Neural Nets
Gradient Clipping
How to apply Gradient Clipping?
Step 3
● Create an operation to clip the gradients using
● clip_by_value() function
108. Training Deep Neural Nets
Gradient Clipping
How to apply Gradient Clipping?
Step 4
● Create an operation to apply the
○ Clipped gradients using the optimizer’s
○ apply_gradients() method
109. Training Deep Neural Nets
Gradient Clipping
How to apply Gradient Clipping?
Step 5
● Run this training_op at every training step
○ It will compute gradients
○ Clip them between –1.0 and 1.0, and apply them
○ Note that threshold is a hyperparameter and can be tuned
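What clipping does to the numbers can be seen in a framework-free NumPy sketch. `clip_by_value` below mirrors the element-wise clipping used in the slides; `clip_by_norm` is a different, commonly used variant that is not covered in the slides and is shown only for contrast. Both function names and the sample gradient are illustrative.

```python
import numpy as np

def clip_by_value(grad, threshold):
    """Clip every component of the gradient to [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, max_norm):
    """Alternative (not from the slides): rescale the whole gradient so that
    its L2 norm is at most max_norm, which preserves its direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

grad = np.array([0.3, -4.0, 12.0])  # a hypothetical exploding gradient
by_value = clip_by_value(grad, threshold=1.0)
by_norm = clip_by_norm(grad, max_norm=1.0)

print("clipped by value:", by_value)
print("clipped by norm: ", by_norm, "norm:", np.linalg.norm(by_norm))
```

Clipping by value can change the direction of the gradient, because large components are cut while small ones are left alone; that is one reason the threshold is worth tuning.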
110. Training Deep Neural Nets
Gradient Clipping
Follow code in the notebook to create a simple
neural net for MNIST and add gradient clipping
112. Training Deep Neural Nets
Reusing Pretrained Layers
Transfer Learning
● It is generally not a good idea to train a very large DNN from scratch
● We should try to find an existing neural network
○ That accomplishes a task similar to the one we are trying to tackle
● If we can find such a network
○ Then we just reuse the lower layers (early layers) of this network
○ This is called Transfer Learning
113. Training Deep Neural Nets
Reusing Pretrained Layers
Transfer Learning
● There are two major advantages of Transfer Learning
○ It speeds up training considerably
○ It requires much less training data
114. Training Deep Neural Nets
Reusing Pretrained Layers
Transfer Learning - Examples
● Let’s say we have found an existing DNN
○ That was trained to classify pictures
○ Into 100 different categories like
■ Animals,
■ Plants,
■ Vehicles and
■ Everyday objects
115. Training Deep Neural Nets
Reusing Pretrained Layers
Transfer Learning - Examples
● Now we want to train a DNN to classify specific types of vehicles
● This task is similar to the one the existing DNN was trained on, so
● We should try to reuse the pretrained layers of the existing network
Figure: reusing pretrained layers
116. Training Deep Neural Nets
Reusing Pretrained Layers
Transfer Learning
● If the input pictures in our task do not have the same size as the ones used by the
existing network
● Then we have to add a preprocessing step to resize them to the size
○ Expected by the existing model
● Also, transfer learning works well only when the inputs in our task
○ Have low-level features similar to those in the existing model's task
117. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing a TensorFlow Model
● If the original model was trained using TensorFlow
○ We can simply restore it and train it on the new task
118. Training Deep Neural Nets
Reusing Pretrained Layers
Let’s see example of how to reuse a
TensorFlow model
119. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing a TensorFlow Model - Step 1
● To reuse the model
○ First we need to load graph structure
○ Using import_meta_graph()
>>> reset_graph()
>>> saver = tf.train.import_meta_graph("model_ckps/my_model_final.ckpt.meta")
120. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing a TensorFlow Model - Step 2
● Next, get a handle on all the operations we will need for training
● If we do not know the graph structure, then
○ List all the operations using the code below
>>> for op in tf.get_default_graph().get_operations():
...     print(op.name)
121. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing a TensorFlow Model - Step 3
● Once we know which operations we need
○ We can get a handle on them using the graph’s
■ get_operation_by_name() or
■ get_tensor_by_name() methods
>>> X = tf.get_default_graph().get_tensor_by_name("X:0")
>>> y = tf.get_default_graph().get_tensor_by_name("y:0")
>>> accuracy = tf.get_default_graph().get_tensor_by_name("eval/accuracy:0")
>>> training_op = tf.get_default_graph().get_operation_by_name("GradientDescent")
122. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing a TensorFlow Model - Step 4
● Now we can start a session, restore the model's state and continue
training on our data
with tf.Session() as sess:
    saver.restore(sess, "model_ckps/my_model_final.ckpt")
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)
    save_path = saver.save(sess, "model_ckps/my_new_model_final.ckpt")
123. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
124. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● In general, we restore only part of the original model
○ Specifically the early layers
○ Let’s restore only hidden layers 1, 2 and 3
125. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Get all trainable variables in hidden layers 1 to 3
126. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Create a dictionary mapping the name of each variable in the original
model to its name in the new model
127. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Create a Saver that will restore only the reused variables (layers 1 to 3) from the original model
128. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Create another Saver to save the entire new model, not just layers 1 to 3
129. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Start the session
130. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Restore the variables from the original model’s layers 1 to 3
131. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Train the new model
132. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing only part of the original model
● Restore only hidden layers 1, 2 and 3
Save the whole model
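The selection logic behind these steps can be illustrated without TensorFlow. In the sketch below, plain dictionaries stand in for checkpoints; the variable names, shapes, and the `select_reusable` helper are all hypothetical, and the real TensorFlow code is in the notebook.

```python
import re
import numpy as np

def select_reusable(checkpoint, pattern=r"hidden[123]/"):
    """Keep only entries whose name starts with hidden1/, hidden2/ or hidden3/."""
    return {name: value for name, value in checkpoint.items()
            if re.match(pattern, name)}

# Hypothetical "original model" checkpoint: variable name -> trained values.
original = {
    "hidden1/kernel": np.full((4, 3), 0.1),
    "hidden2/kernel": np.full((3, 3), 0.2),
    "hidden3/kernel": np.full((3, 3), 0.3),
    "hidden4/kernel": np.full((3, 3), 0.4),
    "outputs/kernel": np.full((3, 2), 0.5),
}

# Hypothetical new model, freshly initialized (zeros stand in for random init).
new_model = {name: np.zeros_like(value) for name, value in original.items()}

reused = select_reusable(original)
new_model.update(reused)  # layers 1 to 3 now carry the original weights

print("restored variables:", sorted(reused))
```

Only the first three hidden layers are copied; hidden layer 4 and the output layer keep their fresh initialization and are learned on the new task.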
133. Training Deep Neural Nets
Reusing Pretrained Layers
Follow the complete code to restore only
hidden layers 1, 2 and 3 in the notebook
134. Training Deep Neural Nets
Reusing Pretrained Layers
Reusing Models from Other Frameworks
135. Training Deep Neural Nets
Reusing Models from Other Frameworks
● If the model was trained using another framework
○ Such as Theano
○ Then we need to load the weights manually
● Let’s see the example of
○ How we would copy the weights and biases from the first hidden layer
of a model trained using another framework
136. Training Deep Neural Nets
Reusing Models from Other Frameworks
● Step 1
Load the weights from the other framework manually
137. Training Deep Neural Nets
Reusing Models from Other Frameworks
● Step 2
● Find the initializer’s assignment operation for every variable
○ That we want to reuse
138. Training Deep Neural Nets
Reusing Models from Other Frameworks
● Step 2
● The weights variable created by the tf.layers.dense() function is called
"kernel"
139. Training Deep Neural Nets
Reusing Models from Other Frameworks
● Step 2
● Get the initialization value of every variable that we want to reuse
140. Training Deep Neural Nets
Reusing Models from Other Frameworks
● Step 3
● When we run the initializer, we replace the initialization values with the
ones we want, using a feed_dict
141. Training Deep Neural Nets
Reusing Models from Other Frameworks
Check the complete code of “reusing models
from other frameworks” in the notebook
143. Training Deep Neural Nets
Freezing the Lower Layers
● As discussed earlier, the lower layers detect low-level details
○ So we can reuse these lower layers as they are
○ Keeping their weights fixed is called freezing the lower layers
● While training a new DNN
○ We generally freeze the lower-layer weights
○ So that the higher-layer weights will be easier to train
○ Because they won’t have to learn a moving target
144. Training Deep Neural Nets
Freezing the Lower Layers
● To freeze the lower layers during training
○ We give the optimizer the list of variables to train, excluding the variables
from the lower layers
>>> train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
...                                scope="hidden[34]|outputs")
>>> training_op = optimizer.minimize(loss, var_list=train_vars)
145. Training Deep Neural Nets
Freezing the Lower Layers
● Freeze the lower layers - Step 1
>>> train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
scope="hidden[34]|outputs")
>>> training_op = optimizer.minimize(loss, var_list=train_vars)
● The first line gets the list of all the trainable variables
○ In the hidden layers 3 and 4 and
○ In the output layer
● This leaves out the variables
○ In the hidden layers 1 and 2
146. Training Deep Neural Nets
Freezing the Lower Layers
● Freeze the lower layers - Step 2
>>> train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
scope="hidden[34]|outputs")
>>> training_op = optimizer.minimize(loss, var_list=train_vars)
● Next we provide this restricted list of trainable variables
○ To the optimizer’s minimize() function
● That’s it
○ Now hidden layers 1 and 2 are frozen
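To see which variables the scope string selects, here is a plain-Python sketch. It assumes the scope is treated as a regular expression matched against the start of each variable name, which is how the TensorFlow 1.x documentation describes the scope filter; the variable names below are hypothetical.

```python
import re

# Hypothetical variable names for a network with four hidden layers.
variable_names = [
    "hidden1/kernel:0", "hidden1/bias:0",
    "hidden2/kernel:0", "hidden2/bias:0",
    "hidden3/kernel:0", "hidden3/bias:0",
    "hidden4/kernel:0", "hidden4/bias:0",
    "outputs/kernel:0", "outputs/bias:0",
]

scope = "hidden[34]|outputs"

# Variables handed to the optimizer (trained) versus left out (frozen).
train_vars = [name for name in variable_names if re.match(scope, name)]
frozen_vars = [name for name in variable_names if name not in train_vars]

print("trained:", train_vars)
print("frozen: ", frozen_vars)
```

Changing the character class, for example to "hidden[234]|outputs", would unfreeze one more layer without touching the rest of the training code.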
147. Training Deep Neural Nets
Reusing Pretrained Layers
Tweaking, Dropping, or Replacing the Upper
Layers
148. Training Deep Neural Nets
Tweaking, Dropping, or Replacing the Upper Layers
● While training a new DNN using existing DNN
○ The output layer of the original model is usually replaced
○ As it is most likely not useful at all for the new task
○ Also it may not even have the right number of
○ Outputs/classes for the new task
● Also the upper hidden layers of the original model
○ Are less likely to be useful
○ As compared to early layers
149. Training Deep Neural Nets
Tweaking, Dropping, or Replacing the Upper Layers
Question
How do we find out the right number of layers to
reuse?
150. Training Deep Neural Nets
Tweaking, Dropping, or Replacing the Upper Layers
● Try freezing all the copied layers first
○ Then train the model and see how it performs
● Then try unfreezing one or two top hidden layers
○ So that backpropagation can tweak them
○ And see if performance improves
● The more training data we have, the more layers we can unfreeze
151. Training Deep Neural Nets
Tweaking, Dropping, or Replacing the Upper Layers
● If we still cannot get good performance and we have little training data
○ Then try dropping the top hidden layers
○ And freeze all remaining hidden layers again
● We can iterate until we find the right number of layers to reuse
● If we have plenty of training data then
○ Try replacing the top hidden layers
○ Instead of dropping them
○ And even add more hidden layers to get good performance
153. Training Deep Neural Nets
Model Zoos
● As we discussed we can reuse the existing pretrained neural network for
our new tasks
● But where can we find a trained neural network for the task similar to
ours?
154. Training Deep Neural Nets
Model Zoos
● The first place to look is our own catalog of models
○ This is why we should save all our models and
○ Organize them properly so that
○ We can retrieve them later
● Another option is to search in a model zoo
○ Many people after training their models
○ Release the trained models to the public
155. Training Deep Neural Nets
Model Zoos
● TensorFlow has its own model zoo available at
○ https://github.com/tensorflow/models
● It contains most of the image classification nets such as
○ VGG, Inception and ResNet
■ Including the code
■ The pretrained models and
■ Tools to download popular image datasets
156. Training Deep Neural Nets
Model Zoos
● Another popular model zoo is Caffe’s Model Zoo
○ https://github.com/BVLC/caffe/wiki/Model-Zoo
● It contains many computer vision models trained on various datasets
● We can also use below converter
○ To convert Caffe models to TensorFlow models
○ https://github.com/ethereon/caffe-tensorflow
158. Training Deep Neural Nets
Unsupervised Pretraining
● If we want to train a model for a complex task
○ And we do not have much labeled training data
○ And we could not find a pretrained model for a similar task
● Then in this case how should we tackle the task?
159. Training Deep Neural Nets
Unsupervised Pretraining
● Try to gather more labeled training data
○ But if it is too hard or too expensive to get the training data
○ Then try to perform unsupervised pretraining
160. Training Deep Neural Nets
Unsupervised Pretraining
● If we have plenty of unlabelled training data then
○ Try to train the layers one by one
○ Starting with the lowest layer and then going up
○ Using an unsupervised feature detector algorithm such as
■ Restricted Boltzmann Machines (RBMs) or autoencoders
161. Training Deep Neural Nets
Unsupervised Pretraining
● Each layer is trained on the output of the
○ Previously trained layers
○ All layers except the one being trained are frozen
162. Training Deep Neural Nets
Unsupervised Pretraining
● Once all layers have been trained
○ We can fine-tune the network
○ Using supervised learning (with backpropagation)
● This is a long and tedious process
○ But often works well
163. Training Deep Neural Nets
Unsupervised Pretraining
● This technique was used by Geoffrey Hinton and his team in 2006
● It led to the revival of neural networks and the success of Deep Learning
● Until 2010, unsupervised pretraining (typically using RBMs)
○ Was the norm for deep nets
● Only after the vanishing gradients problem was alleviated
○ Did it become much more common to train
○ DNNs purely using backpropagation
164. Training Deep Neural Nets
Unsupervised Pretraining
● Unsupervised pretraining
○ Using autoencoders rather than RBMs (Restricted Boltzmann Machines) is still
a good option when we have a complex task to solve
■ And no similar pretrained model is available
■ And there is little labeled training data but a lot of unlabeled
training data is available
165. Training Deep Neural Nets
Reusing Pretrained Layers
Pretraining on an Auxiliary Task
166. Training Deep Neural Nets
Pretraining on an Auxiliary Task
● Let’s say we want to build a system to recognize faces
● And as a training set
○ We may only have a few pictures of each individual
○ Clearly not enough to train a good classifier
○ And gathering hundreds of pictures of each person will not be practical
Solution??
167. Training Deep Neural Nets
Pretraining on an Auxiliary Task
Solution -
● We can download a lot of pictures of random people from the internet
● And train a first neural network to detect
○ Whether two different pictures are of the same person
● Such a network would learn good feature detectors for faces
● So reusing its lower layers would allow us to train
○ A good face classifier
○ Using the little training data we have
168. Training Deep Neural Nets
Pretraining on an Auxiliary Task
● It is cheap to gather unlabeled training data
○ Like in the previous example
○ We could download images from the internet for almost free
○ But it is quite expensive to label them
● A common technique is to
○ Label all the training examples as “good”
○ And then generate many new labeled training instances
○ By corrupting the good ones and
○ Label these corrupted instances as “bad”
169. Training Deep Neural Nets
Pretraining on an Auxiliary Task
● And then we can train a neural network
○ To classify these instances as good or bad
● For example
○ Download millions of sentences
○ Then label all of them as “good”
○ Then randomly change a word in each sentence
○ And label the resulting sentence as “bad”
170. Training Deep Neural Nets
Pretraining on an Auxiliary Task
● Now if a neural network can tell that
○ “The dog sleeps” is a good sentence and
○ “The dog they” is a bad sentence
○ Then it probably knows a lot about language
● Reusing its lower layers will help in many language processing tasks
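The corruption trick described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the helper name make_auxiliary_dataset, the tiny vocabulary and the two example sentences are made up, and a real pipeline would work on millions of sentences.

```python
import random

def make_auxiliary_dataset(sentences, vocabulary, seed=0):
    """Return (sentence, label) pairs: originals are "good", corrupted copies are "bad"."""
    rng = random.Random(seed)
    dataset = []
    for sentence in sentences:
        words = sentence.split()
        dataset.append((sentence, "good"))
        # Replace one randomly chosen word with a different word from the vocabulary.
        position = rng.randrange(len(words))
        candidates = [w for w in vocabulary if w != words[position]]
        corrupted = list(words)
        corrupted[position] = rng.choice(candidates)
        dataset.append((" ".join(corrupted), "bad"))
    return dataset

# Hypothetical toy data.
sentences = ["The dog sleeps", "The cat eats fish"]
vocabulary = ["The", "dog", "cat", "sleeps", "eats", "fish", "they"]

for text, label in make_auxiliary_dataset(sentences, vocabulary):
    print(label, "|", text)
```

A random replacement can occasionally produce a sentence that is still valid, so the "bad" labels are noisy; with a large corpus this is usually accepted as the price of getting labels for free.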
172. Training Deep Neural Nets
Faster Optimizers
1. Training a deep neural network can be painfully slow
2. So far we have seen four ways to speed up training
2.1. Applying a good initialization strategy for the connection weights
2.2. Using a good activation function
2.3. Using Batch Normalization
2.4. Reusing parts of a pretrained network
173. Training Deep Neural Nets
Faster Optimizers
● Speed boost also comes from using a faster optimizer
○ Than the Gradient Descent optimizer
● Popular optimizers are
○ Momentum optimization
○ Nesterov Accelerated Gradient
○ AdaGrad
○ RMSProp and
○ Adam optimization
Increasing order of performance
174. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
Analogy
● Imagine a bowling ball rolling down a gentle slope on a smooth surface
● It will start out slowly, but it will quickly pick up momentum until it
eventually reaches terminal velocity.
● This is the very simple idea behind Momentum optimization, proposed by
Boris Polyak in 1964
175. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
How is Momentum optimization different from Gradient Descent
● Regular Gradient Descent will simply take small regular steps down
the slope, so it will take much more time to reach the bottom.
● Gradient Descent simply updates the weights θ by directly subtracting
the gradient of the cost function J(θ) with regards to the weights, ∇θJ(θ),
multiplied by the learning rate η.
176. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
How is Momentum optimization different from Gradient Descent
● The equation of Gradient descent is: θ ← θ – η∇θJ(θ).
● It does not care about what the earlier gradients were. If the local
gradient is tiny, it goes very slowly
● Momentum optimization cares a great deal about what previous
gradients were
177. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
How does Momentum optimization work ?
● At each iteration, it adds the local gradient to the momentum vector
m, multiplied by the learning rate η,
● And it updates the weights by simply subtracting this momentum vector.
178. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
How does Momentum optimization work ?
● In other words, the gradient is used as an acceleration, not as a speed.
● To simulate some sort of friction mechanism and prevent the momentum
from growing too large, the algorithm introduces a new
hyperparameter β, simply called the momentum, which must be set
between 0 (high friction) and 1 (no friction).
● A typical momentum value is 0.9.
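The two update rules described above (m ← βm + η∇θJ(θ), then θ ← θ – m) can be compared with plain Gradient Descent on a toy problem. The sketch below is not TensorFlow code; the elongated quadratic cost, the starting point and the step count are assumptions chosen only to make the difference visible.

```python
def gradient(theta):
    """Gradient of the toy cost J(theta) = 0.5 * (10 * theta[0]**2 + theta[1]**2)."""
    return [10.0 * theta[0], theta[1]]

def gradient_descent(theta, eta, steps):
    for _ in range(steps):
        g = gradient(theta)
        # theta <- theta - eta * gradient
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

def momentum_descent(theta, eta, beta, steps):
    m = [0.0] * len(theta)  # momentum vector, starts at zero
    for _ in range(steps):
        g = gradient(theta)
        # Step 1: add the local gradient (times eta) to the decayed momentum vector.
        m = [beta * mi + eta * gi for mi, gi in zip(m, g)]
        # Step 2: update the weights by subtracting the momentum vector.
        theta = [t - mi for t, mi in zip(theta, m)]
    return theta

start = [1.0, 1.0]
print("gradient descent:", gradient_descent(start, eta=0.01, steps=100))
print("momentum (0.9):  ", momentum_descent(start, eta=0.01, beta=0.9, steps=100))
```

With the same learning rate and the same number of steps, the momentum version ends up much closer to the optimum at (0, 0), mainly because it moves faster along the gentle direction.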
179. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
Advantages of Momentum optimization
● Gradient Descent goes down the steep slope quite fast, but then it takes
a very long time to go down the valley.
● Whereas Momentum optimization will roll down the bottom of the valley
faster and faster until it reaches the bottom (the optimum)
● In deep neural networks that don’t use Batch Normalization, the upper
layers will often end up having inputs with very different scales, so using
Momentum optimization helps a lot.
● It can also help roll past local optima.
180. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
Disadvantage of Momentum optimization
● The one drawback of Momentum optimization is that it adds yet another
hyperparameter to tune.
● However, the momentum value of 0.9 usually works well in practice and
almost always goes faster than Gradient Descent.
181. Training Deep Neural Nets
Faster Optimizers - Momentum optimization
Implementing Momentum optimization
Implementing Momentum optimization in TensorFlow is easy : just replace
the GradientDescentOptimizer with the MomentumOptimizer
>>> optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
...                                        momentum=0.9)
182. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
● Nesterov Accelerated Gradient is a small variant of Momentum optimization,
proposed by Yurii Nesterov in 1983, that is almost always faster than vanilla
Momentum optimization.
183. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
● The idea of Nesterov Momentum optimization, or Nesterov Accelerated
Gradient (NAG), is to
○ Measure the gradient of the cost function not at the local position but
slightly ahead in the direction of the momentum.
○ The only difference from vanilla Momentum optimization is that the
gradient is measured at θ + βm rather than at θ
184. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
● This small tweak works because in general the momentum vector will be
pointing in the right direction (i.e., toward the optimum),
● So it will be slightly more accurate to use the gradient measured a bit
farther in that direction rather than using the gradient at the original
position
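A minimal sketch of the Nesterov update on a one-dimensional toy cost is given below. It assumes the sign convention in which m stores the downhill direction, so that the look-ahead point is θ + βm as stated above and the weights are updated with θ ← θ + m. If m is instead subtracted from the weights, as in the Momentum description earlier, the look-ahead point becomes θ – βm. The cost function and hyperparameter values are assumptions for illustration.

```python
def gradient(theta):
    """Gradient of the toy cost J(theta) = 0.5 * theta**2."""
    return theta

def nesterov_descent(theta, eta, beta, steps):
    m = 0.0  # m points downhill in this convention, so it is added to theta
    for _ in range(steps):
        # Measure the gradient slightly ahead, in the direction of the momentum.
        lookahead = theta + beta * m
        m = beta * m - eta * gradient(lookahead)
        theta = theta + m
    return theta

print(nesterov_descent(theta=1.0, eta=0.01, beta=0.9, steps=200))
```

The only change from vanilla Momentum optimization is the point at which the gradient is evaluated.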
185. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
● ∇1 represents the gradient of the cost function measured at the starting point θ
● ∇2 represents the gradient at the point located at θ + βm
186. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
● The Nesterov update ends up slightly closer to the optimum.
● After a while, these small improvements add up and NAG ends up being
significantly faster than regular Momentum optimization
187. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
● Note that when the momentum pushes the weights across a valley, ∇1 continues
to push further across the valley, while ∇2 pushes back toward the bottom of
the valley.
● This helps reduce oscillations and thus converges faster.
188. Training Deep Neural Nets
Faster Optimizers - Nesterov Accelerated Gradient
Implementing Nesterov Accelerated Gradient
NAG will almost always speed up training compared to regular Momentum
optimization. To use it, simply set use_nesterov=True when creating the
MomentumOptimizer:
>>> optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
...                                        momentum=0.9, use_nesterov=True)
189. Training Deep Neural Nets
Faster Optimizers - AdaGrad
● Gradient Descent starts by quickly going down the steepest slope, then
slowly goes down the bottom of the valley
● It would be nice if the algorithm could detect this early on and correct its
direction to point a bit more toward the global optimum
● The AdaGrad algorithm achieves this by scaling down the gradient vector
along the steepest dimensions
190. Training Deep Neural Nets
Faster Optimizers - AdaGrad
How does AdaGrad work ?
● The first step accumulates the square of the gradients into the vector s:
s ← s + ∇θJ(θ) ⊗ ∇θJ(θ)
● The ⊗ symbol represents the element-wise multiplication
● This vectorized form is equivalent to computing sᵢ ← sᵢ + (∂J(θ)/∂θᵢ)² for
each element sᵢ of the vector s
191. Training Deep Neural Nets
Faster Optimizers - AdaGrad
How does AdaGrad work ?
● In other words, each sᵢ accumulates the squares of the partial derivative
of the cost function with regards to parameter θᵢ
● If the cost function is steep along the i-th dimension, then sᵢ will get larger
and larger at each iteration.
192. Training Deep Neural Nets
Faster Optimizers - AdaGrad
How does AdaGrad work ?
● The second step is almost identical to Gradient Descent, but with one
big difference:
○ The gradient vector is scaled down by a factor of √(s + ϵ):
θ ← θ – η∇θJ(θ) ⊘ √(s + ϵ)
○ The ⊘ symbol represents the element-wise division, and ϵ is a
smoothing term to avoid division by zero, typically set to 10⁻¹⁰
193. Training Deep Neural Nets
Faster Optimizers - AdaGrad
How does AdaGrad work ?
● This vectorized form is equivalent to computing
θᵢ ← θᵢ – η (∂J(θ)/∂θᵢ) / √(sᵢ + ϵ) for all parameters θᵢ
● This algorithm decays the learning rate, but it does so faster for steep
dimensions than for dimensions with gentler slopes.
● This is called an adaptive learning rate.
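The two AdaGrad steps can be written out in plain Python as follows. This is a sketch, not the TensorFlow implementation; the function name adagrad_step, the toy gradient and the learning rate of 0.1 are assumptions for illustration.

```python
import math

def adagrad_step(theta, grad, s, eta=0.1, eps=1e-10):
    """One AdaGrad update; returns the new parameters and the new accumulator s."""
    # Step 1: accumulate the squared partial derivatives, element by element.
    s = [si + gi * gi for si, gi in zip(s, grad)]
    # Step 2: Gradient Descent step with each component divided by sqrt(s_i + eps).
    theta = [ti - eta * gi / math.sqrt(si + eps)
             for ti, gi, si in zip(theta, grad, s)]
    return theta, s

# Toy gradient that is steep along dimension 0 and gentle along dimension 1.
theta, s = [1.0, 1.0], [0.0, 0.0]
grad = [10.0, 0.1]
theta, s = adagrad_step(theta, grad, s)
print("s     =", s)
print("theta =", theta)
```

After this first step both parameters have moved by about η, even though one gradient component is 100 times larger than the other: the steep dimension has been scaled down far more than the gentle one.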
194. Training Deep Neural Nets
Faster Optimizers - AdaGrad
Advantages of AdaGrad
● It helps point the resulting updates more directly toward the global
optimum. One additional benefit is that it requires much less tuning of
the learning rate hyperparameter η
195. Training Deep Neural Nets
Faster Optimizers - AdaGrad
Disadvantages of AdaGrad
● AdaGrad often performs well for simple quadratic problems, but
unfortunately it often stops too early when training neural networks
● The learning rate gets scaled down so much that the algorithm ends
up stopping entirely before reaching the global optimum.
● So even though TensorFlow has an AdagradOptimizer, you should
not use it to train deep neural networks
● It may be efficient for simpler tasks such as Linear Regression
196. Training Deep Neural Nets
Faster Optimizers - RMSProp
● AdaGrad slows down a bit too fast and ends up never converging to the
global optimum
● The RMSProp algorithm fixes this by accumulating only the gradients
from the most recent iterations, as opposed to all the gradients since the
beginning of training
● It does so by using exponential decay in the first step
197. Training Deep Neural Nets
Faster Optimizers - RMSProp
● The decay rate β is typically set to 0.9
● It is once again a new hyperparameter, but this default value often works
well, so you may not need to tune it at all
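A plain-Python sketch of the RMSProp update is shown below. It assumes the usual form of the first step, s ← βs + (1 – β)∇θJ(θ) ⊗ ∇θJ(θ), followed by the same scaled update as AdaGrad. The function name and the constant toy gradient are assumptions for illustration.

```python
import math

def rmsprop_step(theta, grad, s, eta=0.01, beta=0.9, eps=1e-10):
    """One RMSProp update; returns the new parameters and the new accumulator s."""
    # Step 1: exponentially decaying average of the squared gradients.
    s = [beta * si + (1.0 - beta) * gi * gi for si, gi in zip(s, grad)]
    # Step 2: same scaled update as AdaGrad.
    theta = [ti - eta * gi / math.sqrt(si + eps)
             for ti, gi, si in zip(theta, grad, s)]
    return theta, s

# Feed a constant gradient of 1.0 for 100 iterations.
theta, s = [1.0], [0.0]
for _ in range(100):
    theta, s = rmsprop_step(theta, [1.0], s)
print("s =", s)
```

With a constant gradient of 1.0, s levels off just below 1, whereas AdaGrad's accumulator would have reached 100 after the same 100 steps. This is why RMSProp's effective learning rate does not keep shrinking.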
198. Training Deep Neural Nets
Faster Optimizers - RMSProp
Implementing RMSProp
>>> optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate,
...                                       momentum=0.9, decay=0.9, epsilon=1e-10)
● Except on very simple problems, this optimizer almost always performs
much better than AdaGrad
● It also generally performs better than Momentum optimization and
Nesterov Accelerated Gradients
● In fact, it was the preferred optimization algorithm of many researchers
until Adam optimization came around
199. Training Deep Neural Nets
Faster Optimizers - Adam Optimization
● Adam which stands for adaptive moment estimation, combines the ideas
of
○ Momentum optimization
○ And RMSProp
● Just like Momentum optimization it keeps track of an exponentially
decaying average of past gradients
● And just like RMSProp it keeps track of an exponentially decaying average
of past squared gradients
201. Training Deep Neural Nets
Faster Optimizers - Adam Optimization
● If you just look at steps 1, 2, and 5, you will notice Adam’s close similarity
to both Momentum optimization and RMSProp.
202. Training Deep Neural Nets
Faster Optimizers - Adam Optimization
● The only difference is that step 1 computes an exponentially decaying
average rather than an exponentially decaying sum
● But these are actually equivalent except for a constant factor, the
decaying average is just 1 – β1 times the decaying sum
203. Training Deep Neural Nets
Faster Optimizers - Adam Optimization
● Steps 3 and 4 are somewhat of a technical detail
○ Since m and s are initialized at 0, they will be biased toward 0 at the
beginning of training
● So these two steps will help boost m and s at the beginning of training.
204. Training Deep Neural Nets
Faster Optimizers - Adam Optimization
● The momentum decay hyperparameter β1 is typically initialized to 0.9,
while the scaling decay hyperparameter β2 is often initialized to 0.999.
● As earlier, the smoothing term ϵ is usually initialized to a tiny number
such as 10⁻⁸
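The five steps referred to above can be sketched in plain Python as follows. The step numbering is an assumption based on the description in these slides: steps 1 and 2 update the decaying averages m and s, steps 3 and 4 correct their bias toward 0, and step 5 updates the weights. The function name adam_step and the toy cost are made up for illustration.

```python
import math

def adam_step(theta, grad, m, s, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    # Step 1: decaying average of past gradients (as in Momentum optimization).
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    # Step 2: decaying average of past squared gradients (as in RMSProp).
    s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, grad)]
    # Steps 3 and 4: boost m and s early in training, since both start at 0.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    s_hat = [si / (1 - beta2 ** t) for si in s]
    # Step 5: scaled weight update.
    theta = [ti - eta * mh / math.sqrt(sh + eps)
             for ti, mh, sh in zip(theta, m_hat, s_hat)]
    return theta, m, s

# Toy cost J(theta) = theta**2, whose gradient is 2 * theta.
theta, m, s = [1.0], [0.0], [0.0]
for t in range(1, 4):
    grad = [2.0 * theta[0]]
    theta, m, s = adam_step(theta, grad, m, s, t)
    print(t, theta)
```

Thanks to the bias correction, the very first step already moves the weight by about η, even though m and s are still close to 0.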
205. Training Deep Neural Nets
Faster Optimizers - Adam Optimization
● Since Adam is an adaptive learning rate algorithm, like AdaGrad and
RMSProp, it requires less tuning of the learning rate hyperparameter η
● We can often use the default value η = 0.001, making Adam even easier
to use than Gradient Descent
206. Training Deep Neural Nets
Faster Optimizers - Adam Optimization
Implementing Adam Optimization in TensorFlow
>>> optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
207. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
How do we find a good learning rate ??
208. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
● Finding a good learning rate can be tricky.
● If we set it way too high,
○ Training may actually diverge
● If you set it too low,
○ Training will eventually converge to the optimum, but it will take a
very long time.
209. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
● If you set it slightly too high,
○ It will make progress very quickly at first,
○ But it will end up dancing around the optimum, never settling down
● Using an adaptive learning rate optimization algorithm such as
AdaGrad, RMSProp, or Adam helps,
○ But even then it may take time to settle
● If you have a limited computing budget, you may have to interrupt
training before it has converged properly, yielding a suboptimal solution
210. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
● We may be able to find a fairly good learning rate by training the
network several times for just a few epochs with various learning
rates and comparing the learning curves
211. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
The ideal learning rate will learn quickly and converge to a good solution
212. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
● We can do better than a constant learning rate:
● If we start with a high learning rate and then reduce it once it stops
making fast progress
● We can reach a good solution faster than with the optimal constant
learning rate.
213. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
● There are many different strategies to reduce the learning rate during
training.
● These strategies are called learning schedules; the most common ones
are discussed next
214. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
Predetermined piecewise constant learning rate
● For example, set the learning rate to η0 = 0.1 at first, then to η1 = 0.001
after 50 epochs.
● Although this solution can work very well, it often requires fiddling
around to figure out the right learning rates and when to use them.
215. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
Performance scheduling
● Measure the validation error every N steps, just like for early stopping,
and reduce the learning rate by a factor of λ when the error stops
dropping.
Exponential scheduling
● Set the learning rate to a function of the iteration number t:
η(t) = η0 · 10^(–t/r). This works great, but it requires tuning η0 and r.
The learning rate will drop by a factor of 10 every r steps.
216. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
Power scheduling
● Set the learning rate to η(t) = η0 (1 + t/r)^(–c).
● The hyperparameter c is typically set to 1.
● This is similar to exponential scheduling, but the learning rate drops much
more slowly.
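The piecewise constant, exponential and power schedules described above can be written as small Python functions. This is a sketch with assumed default values (η0 = 0.1, r = 10,000, c = 1, and a switch at epoch 50 counted from 0); the function names are made up for illustration.

```python
def piecewise_constant(epoch, boundary=50, eta0=0.1, eta1=0.001):
    """eta0 for the first `boundary` epochs (counted from 0), eta1 afterwards."""
    return eta0 if epoch < boundary else eta1

def exponential_schedule(t, eta0=0.1, r=10000):
    """eta(t) = eta0 * 10^(-t/r): drops by a factor of 10 every r steps."""
    return eta0 * 10 ** (-t / r)

def power_schedule(t, eta0=0.1, r=10000, c=1):
    """eta(t) = eta0 * (1 + t/r)^(-c): drops much more slowly."""
    return eta0 * (1 + t / r) ** (-c)

for t in (0, 10000, 20000, 50000):
    print(t, exponential_schedule(t), power_schedule(t))
```

Printing both schedules side by side shows the exponential one falling by a factor of 10 every r steps, while the power schedule has only halved after the first r steps.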
217. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
Implementing a learning schedule with TensorFlow
>>> initial_learning_rate = 0.1
>>> decay_steps = 10000
>>> decay_rate = 0.1  # written as a float: 1/10 evaluates to 0 under Python 2
>>> global_step = tf.Variable(0, trainable=False)
>>> learning_rate = tf.train.exponential_decay(initial_learning_rate,
...                                            global_step, decay_steps, decay_rate)
>>> optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
>>> training_op = optimizer.minimize(loss, global_step=global_step)
Run it on Notebook
218. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
Implementing a learning schedule with TensorFlow
Understanding previous code
● After setting the hyperparameter values, we create a nontrainable
variable global_step (initialized to 0) to keep track of the current training
iteration number.
● Then we define an exponentially decaying learning rate, with η0 = 0.1 and
r = 10,000, using TensorFlow’s exponential_decay() function.
219. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
Implementing a learning schedule with TensorFlow
Understanding previous code
● Next, we create an optimizer, in this example, a MomentumOptimizer
using this decaying learning rate.
● Finally, we create the training operation by calling the optimizer’s
minimize() method; since we pass it the global_step variable, it will
kindly take care of incrementing it.
220. Training Deep Neural Nets
Faster Optimizers - Learning Rate Scheduling
Since AdaGrad, RMSProp, and Adam optimization automatically reduce the learning rate during
training, it is not necessary to add an extra learning schedule.
For other optimization algorithms, using exponential decay or performance scheduling can
considerably speed up convergence.
221. Training Deep Neural Nets
Faster Optimizers
● The conclusion is that we should always use Adam optimization
○ We really do not have to know about internals
○ Simply replace GradientDescentOptimizer with AdamOptimizer
○ With this small change training will be several times faster
223. Training Deep Neural Nets
"With four parameters I can fit an elephant and with five I can make him wiggle his trunk. "
-- John von Neumann, cited by Enrico Fermi in Nature 427
Overfitting
224. Training Deep Neural Nets
Avoid Overfitting Through Regularization
● Deep neural networks may have millions of parameters
● With so many parameters, the network
○ Has a huge amount of freedom
○ And it can fit a wide variety of complex datasets
○ But it also becomes prone to overfitting
225. Training Deep Neural Nets
Avoid Overfitting Through Regularization
● In this section, we will go through
○ Some of the most popular regularization techniques
○ For neural networks and how to implement them with TensorFlow
■ Early stopping
■ ℓ1 and ℓ2 regularization
■ Dropout
■ Max-Norm Regularization and
■ Data augmentation
227. Training Deep Neural Nets
Avoid Overfitting Through Regularization
Early Stopping
228. Training Deep Neural Nets
Early Stopping
● As discussed in the Machine Learning course
○ To avoid overfitting the training set
○ A great solution is early stopping
229. Training Deep Neural Nets
Early Stopping
● Stop training as soon as the validation error reaches a minimum
● This is called early stopping
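In practice the validation error is noisy, so one cannot know that a minimum has been reached until the error has failed to improve for a while. The sketch below assumes a simple "patience" rule for this, which is a common convention and not something prescribed by these slides. The function name and the error values are made up for illustration.

```python
def early_stopping(validation_errors, patience=3):
    """Scan per-epoch validation errors and return (best_epoch, stopped_at).

    Training is stopped once the error has not improved for `patience`
    consecutive epochs; stopped_at is None if that never happens.
    """
    best_error = float("inf")
    best_epoch = None
    stopped_at = None
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            # New minimum: this is where the model would be saved.
            best_error, best_epoch = error, epoch
        elif best_epoch is not None and epoch - best_epoch >= patience:
            stopped_at = epoch
            break
    return best_epoch, stopped_at

# Hypothetical validation errors measured after each epoch.
errors = [0.90, 0.70, 0.50, 0.45, 0.47, 0.46, 0.48, 0.60, 0.70]
print(early_stopping(errors, patience=3))
```

The model saved at best_epoch is the one to keep; the epochs trained between best_epoch and stopped_at are discarded.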
230. Training Deep Neural Nets
Avoid Overfitting Through Regularization
ℓ1 and ℓ2 Regularization
231. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
● Just like we apply ℓ1 and ℓ2 regularization for simple linear models
○ We can apply the same regularization to constrain
○ Neural network’s connection weights (not biases)
● To do so in TensorFlow
○ Simply add the appropriate regularization terms to the cost function
232. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
● For example, suppose
○ We have just one hidden layer with weights weights1 and
○ One output layer with weights weights2
○ Then we can apply ℓ1 regularization like this
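The idea can be illustrated without TensorFlow. The plain-Python sketch below is not the notebook's code: the weight values, the base loss and the regularization scale of 0.001 are hypothetical, and the helper name l1_penalty is made up. It only shows what gets added to the cost function.

```python
def l1_penalty(*weight_matrices):
    """Sum of the absolute values of all connection weights (biases excluded)."""
    return sum(abs(w) for matrix in weight_matrices for row in matrix for w in row)

# Hypothetical weights of one hidden layer and one output layer.
weights1 = [[0.5, -1.0], [0.0, 2.0]]
weights2 = [[-0.25], [0.75]]

base_loss = 0.30   # hypothetical loss value before regularization
scale = 0.001      # regularization hyperparameter (assumed value)

loss = base_loss + scale * l1_penalty(weights1, weights2)
print(loss)
```

Because the penalty grows with the absolute value of every weight, minimizing the total loss pushes unimportant weights toward exactly 0, which is why ℓ1 regularization tends to produce sparse models.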
233. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
Follow the code in the notebook to implement
ℓ1 regularization manually assuming we have
only one hidden layer
234. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
● Manually applying ℓ1 regularization will not be convenient
○ If we have many layers
● In TensorFlow,
○ We can pass a regularization function to the tf.layers.dense()
function
○ Which computes regularization loss
235. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
● This code creates a neural network
○ With two hidden layers and one output layer
○ It also creates nodes in the graph to compute
■ The ℓ1 regularization loss corresponding to each layer’s weights
○ TensorFlow automatically adds these nodes to a
■ Special collection containing all the regularization losses
236. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
● We just need to add
○ These regularization losses to the overall loss, as in the code below
● Important
○ Don’t forget to add the regularization losses to the overall loss
○ Else they will simply be ignored
>>> reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
>>> loss = tf.add_n([base_loss] + reg_losses, name="loss")
237. Training Deep Neural Nets
ℓ1 and ℓ2 Regularization
Follow the code in the notebook to implement
ℓ1 regularization in neural network with two
hidden layers
239. Training Deep Neural Nets
Dropout
● Dropout is the most popular
○ Regularization technique for deep neural networks
● It was proposed by G. E. Hinton in 2012
● Even the state-of-the-art neural networks
○ Got a 1–2% accuracy boost
○ Simply by adding dropout
● 1-2% accuracy boost may not sound like a lot
○ But when a model has 95% accuracy
○ Then 2% accuracy boost means dropping the error rate by 40%
○ (Going from 5% error to roughly 3%)
240. Training Deep Neural Nets
Dropout
● It is a fairly simple algorithm
● At every training step, every neuron
○ Including the input neurons but excluding the output neurons
○ Has a probability p of being temporarily “dropped out”
○ Meaning it will be entirely ignored during this training step
○ But it may be active during the next step
241. Training Deep Neural Nets
Dropout
● The hyperparameter p is called the dropout rate
○ And it is typically set to 50%
● After training, neurons don’t get dropped anymore
● Let’s understand this technique with an example
242. Training Deep Neural Nets
Dropout
Question
Would a company perform better if its
employees were told to toss a coin every
morning to decide whether or not to go to
work?
244. Training Deep Neural Nets
Dropout
● In that case company would be forced to adapt its organization
○ No single person will be responsible for filling the coffee machine
○ Or cleaning the office
○ Or performing any other critical tasks
● So this expertise would have to be spread across many people
● Employees would have to learn to
○ Cooperate with many of their coworkers
245. Training Deep Neural Nets
Dropout
Question
What will be the advantages of such a system?
246. Training Deep Neural Nets
Dropout
● The company would become much more resilient
● If one person quits, it would not make much difference
● Not sure if this idea will work for companies
○ But it definitely works for neural networks
247. Training Deep Neural Nets
Dropout
● Neurons trained with dropout
○ Cannot co-adapt with their neighbouring neurons
○ They have to be as useful as possible on their own
○ They also cannot rely excessively on just a few input neurons
○ They must pay attention to each of their input neurons
○ As a result of this
■ They end up being less sensitive to slight changes in the inputs
● In the end we get a more robust network that generalizes better
248. Training Deep Neural Nets
Dropout
● To implement dropout using TensorFlow
○ Just apply the dropout() function to the
○ Input layer and the output of every hidden layer
● During training, the dropout() function randomly drops some items
● After training, this function does nothing at all
>>> hidden1_drop = tf.layers.dropout(hidden1, dropout_rate,
...                                  training=training)
Just like with batch normalization, set training to
True during training and to False when testing
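What such a dropout function does can be sketched in plain Python. The sketch assumes the common "inverted dropout" convention, in which the items that are kept are divided by the keep probability so that the expected sum of the layer's outputs stays the same. The function below is an illustration, not the TensorFlow implementation.

```python
import random

def dropout(values, rate, training, rng=random):
    """Drop each item with probability `rate` during training; do nothing otherwise."""
    if not training:
        return list(values)
    keep_prob = 1.0 - rate
    # Dropped items become 0; kept items are scaled up by 1 / keep_prob.
    return [v / keep_prob if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(42)
hidden1 = [1.0] * 8   # hypothetical outputs of a hidden layer
print("training:", dropout(hidden1, rate=0.5, training=True, rng=rng))
print("testing: ", dropout(hidden1, rate=0.5, training=False))
```

With a dropout rate of 50%, roughly half of the items are zeroed at each training step and the rest are doubled; at test time the values pass through unchanged.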
249. Training Deep Neural Nets
Dropout
Follow the code in the notebook to apply
dropout regularization to a three-layer neural
network
250. Training Deep Neural Nets
Dropout
● If you observe that the model is overfitting
○ Then increase the dropout rate
● Else if the model is underfitting
○ Then decrease the dropout rate
● It can also help to
○ Increase the dropout rate for large layers, and
○ Reduce it for small ones
251. Training Deep Neural Nets
Dropout
● Please note that dropout does
○ Tend to slow down convergence
○ But it results in a much better model when tuned properly
○ It is worth the extra time
252. Training Deep Neural Nets
Avoid Overfitting Through Regularization
Data Augmentation
253. Training Deep Neural Nets
Data Augmentation
● Data augmentation consists of
○ Generating new training instances from existing ones
○ Thereby increasing the size of the training set
● Let’s understand this with an example
● Let’s say we have to train a model to classify pictures of mushrooms
● Then we can slightly shift, rotate and resize
○ Every picture in the training set and
○ Add the resulting pictures to the training set
○ Thereby increasing the size of the training set
254. Training Deep Neural Nets
Data Augmentation
Generating new training instances of mushrooms from existing ones
255. Training Deep Neural Nets
Data Augmentation
● The trick is to generate realistic training instances
● A human should not be able to tell
○ Which instances were generated and which ones were not
● Moreover the modifications we apply should be learnable
256. Training Deep Neural Nets
Data Augmentation
● These newly added pictures
○ Force the model to be more tolerant to the
■ Position,
■ Orientation, and
■ Size of the mushrooms in the picture
257. Training Deep Neural Nets
Data Augmentation
● If we want the model to be more tolerant to lighting conditions
○ We can also generate images with various contrasts and
○ Add them to the training set
258. Training Deep Neural Nets
Data Augmentation
● It is preferable to generate new images on the fly during training
○ Rather than wasting
■ Storage space and
■ Network bandwidth
259. Training Deep Neural Nets
Data Augmentation
● TensorFlow offers several image manipulation operations such as
○ Transposing (shifting)
○ Rotating
○ Resizing
○ Flipping
○ Cropping
○ Adjusting the brightness, contrast, saturation, and hue
● These operations make it easy to implement data augmentation for
image datasets
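To show the idea of generating variants on the fly, here is a small plain-Python sketch that shifts and flips an image stored as a list of pixel rows. It deliberately avoids TensorFlow's image operations; the helper names and the tiny 3x3 "image" are assumptions for illustration.

```python
import random

def shift_right(image, pixels, fill=0):
    """Shift every row to the right by `pixels`, padding on the left with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def augment(image, rng):
    """Return a randomly shifted and possibly flipped copy of `image`."""
    out = shift_right(image, rng.randint(0, 1))
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    return out

# Hypothetical 3x3 grayscale image.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

rng = random.Random(0)
for _ in range(3):
    # A new variant is generated each time, so nothing needs to be stored on disk.
    print(augment(image, rng))
```

Calling augment() inside the training loop produces a fresh variant for every batch, which is the on-the-fly approach recommended above.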
261. Training Deep Neural Nets
Practical Guidelines
● In this topic we have covered a wide range of techniques
● A common question is which ones to use
● The configuration shown below works fine in most cases
Default DNN Configuration
262. Training Deep Neural Nets
Practical Guidelines
● Also we should always look for a pretrained neural network that solves a
similar problem
● The default configuration shown on the last slide may be
tweaked as per the problem statement
○ If the training set is too small then implement data augmentation
○ If we can’t find a good learning rate then try adding a
■ Learning schedule such as exponential decay
○ If we need a lightning-fast model at run time
■ Then drop batch normalization and
■ Replace ELU with leaky ReLU
263. Training Deep Neural Nets
Practical Guidelines
● If we need a sparse model
○ Add some ℓ1 regularization
● With these guidelines
○ We can train deep neural networks
○ But if we use a single machine then
○ It may take days or months for training to complete
○ So be patient :)
○ Else train the model across many servers and GPUs