Deep learning is a subset of machine learning in artificial intelligence (AI) whose networks are capable of learning, without supervision, from data that is unstructured or unlabeled. It is also known as deep neural learning or deep neural networks.
Deep learning networks
1. 1/67
CS7015 (Deep Learning) : Lecture 9
Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization
Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
2. 2/67
Module 9.1 : A quick recap of training deep neural networks
12. 3/67
[Figure: a single sigmoid neuron with input x, weight w, and output y = σ(wx); and a wider neuron with inputs x1, x2, x3 and weights w1, w2, w3]
We already saw how to train this network
w = w − η∇w where,
∇w = ∂L(w)/∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x
What about a wider network with more inputs:
w1 = w1 − η∇w1
w2 = w2 − η∇w2
w3 = w3 − η∇w3
where, ∇wi = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ xi
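To make the update rule concrete, here is a minimal NumPy sketch (not from the lecture) of gradient descent on a single sigmoid neuron with three inputs; the data point, initial weights, and learning rate eta are illustrative values, and the squared-error loss L(w) = ½(f(x) − y)² is assumed so that the gradient matches the formula above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# illustrative data point, target, initial weights, and learning rate
x = np.array([0.5, -1.0, 2.0])    # inputs x1, x2, x3
y = 1.0                           # target output
w = np.array([0.1, 0.2, -0.3])    # weights w1, w2, w3
eta = 0.5                         # learning rate

for step in range(200):
    f = sigmoid(w @ x)                     # prediction f(x) = sigma(w . x)
    grad = (f - y) * f * (1 - f) * x       # grad_i = (f(x) - y) * f(x) * (1 - f(x)) * x_i
    w = w - eta * grad                     # w_i <- w_i - eta * grad_i

print("final weights:", w, "prediction:", sigmoid(w @ x))
```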
19. 4/67
[Figure: a deep chain of sigmoid neurons x = h0 → a1 → h1 → a2 → h2 → a3 → y, with weights w1, w2, w3]
ai = wi ∗ hi−1;  hi = σ(ai)
a1 = w1 ∗ x = w1 ∗ h0
What if we have a deeper network?
We can now calculate ∇w1 using the chain rule:
∂L(w)/∂w1 = ∂L(w)/∂y · ∂y/∂a3 · ∂a3/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂w1
          = ∂L(w)/∂y ∗ ... ∗ h0
In general,
∇wi = ∂L(w)/∂y ∗ ... ∗ hi−1
Notice that ∇wi is proportional to the corresponding input hi−1 (we will use this fact later)
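The same chain rule can be written out factor by factor for this three-layer chain. Below is a hedged NumPy sketch (illustrative values; the squared-error loss ½(ŷ − y)² and ŷ = σ(a3) are assumptions); note that the last factor is h0, matching the observation that ∇wi is proportional to hi−1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.5, 0.0                      # illustrative input and target
w1, w2, w3 = 0.4, -0.6, 0.8          # illustrative weights

# forward pass: a_i = w_i * h_{i-1}, h_i = sigma(a_i)
h0 = x
a1 = w1 * h0; h1 = sigmoid(a1)
a2 = w2 * h1; h2 = sigmoid(a2)
a3 = w3 * h2; y_hat = sigmoid(a3)    # network output

# backward pass: each factor of the chain rule, with L = 0.5 * (y_hat - y)^2
dL_dy   = y_hat - y
dy_da3  = y_hat * (1 - y_hat)
da3_dh2 = w3
dh2_da2 = h2 * (1 - h2)
da2_dh1 = w2
dh1_da1 = h1 * (1 - h1)
da1_dw1 = h0                         # the trailing "* h0" factor

grad_w1 = dL_dy * dy_da3 * da3_dh2 * dh2_da2 * da2_dh1 * dh1_da1 * da1_dw1
print("grad w1:", grad_w1)
```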
26. 5/67
[Figure: a deep and wide network of sigmoid neurons with inputs x1, x2, x3, several hidden layers, output y, and weights w1, w2, w3]
What happens if we have a network which is deep and wide?
How do you calculate ∇w2?
It will be given by the chain rule applied across multiple paths (we saw this in detail when we studied backpropagation)
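For a deep and wide network, the gradient of a weight sums the chain rule over every path from that weight to the loss. Here is a small illustrative sketch (not the lecture's example, and with a squared-error loss assumed): a single hidden unit h1 feeds two units in the next layer, so ∂L/∂w collects two path contributions; a finite-difference check confirms the sum.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# illustrative scalars: h1 = sigma(w*x) feeds two units h2a, h2b, which feed the output
x, y = 1.0, 1.0
w, u1, u2, v1, v2 = 0.3, 0.5, -0.7, 0.9, 0.2

def forward(w):
    h1 = sigmoid(w * x)
    h2a, h2b = sigmoid(u1 * h1), sigmoid(u2 * h1)
    y_hat = sigmoid(v1 * h2a + v2 * h2b)
    return h1, h2a, h2b, y_hat

h1, h2a, h2b, y_hat = forward(w)
common = (y_hat - y) * y_hat * (1 - y_hat)         # shared head of both paths (L = 0.5*(y_hat - y)^2)
path_a = common * v1 * h2a * (1 - h2a) * u1        # contribution through h2a
path_b = common * v2 * h2b * (1 - h2b) * u2        # contribution through h2b
grad_w = (path_a + path_b) * h1 * (1 - h1) * x     # both paths then pass through h1

eps = 1e-6                                          # numerical check of the sum over paths
loss = lambda w: 0.5 * (forward(w)[3] - y) ** 2
print(grad_w, (loss(w + eps) - loss(w - eps)) / (2 * eps))
```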
29. 6/67
Things to remember
Training Neural Networks is a Game of Gradients (played using any of the existing gradient based approaches that we discussed)
The gradient tells us the responsibility of a parameter towards the loss
The gradient w.r.t. a parameter is proportional to the input to the parameter (recall the "... ∗ x" term or the "... ∗ hi" term in the formula for ∇wi)
34. 7/67
[Figure: the same deep and wide sigmoid network]
Backpropagation was made popular by Rumelhart et al. in 1986
However, when used for really deep networks it was not very successful
In fact, till 2006 it was very hard to train very deep networks
Typically, even after a large number of epochs the training did not converge
38. 9/67
What has changed now? How did Deep Learning become so popular despite this problem with training large networks?
Well, until 2006 it wasn't so popular
The field got revived after the seminal work of Hinton and Salakhutdinov in 2006¹
¹ G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
40. 10/67
Let's look at the idea of unsupervised pre-training introduced in this paper ...
(note that in this paper they introduced the idea in the context of RBMs but we will discuss it in the context of Autoencoders)
50. 11/67
Consider the deep neural network shown in this figure
Let us focus on the first two layers of the network (x and h1)
We will first train the weights between these two layers using an unsupervised objective:
[Figure: the hidden representation h1 reconstructs the input x as x̂]
min (1/m) Σ_{i=1..m} Σ_{j=1..n} (x̂ij − xij)²
Note that we are trying to reconstruct the input (x) from the hidden representation (h1)
We refer to this as an unsupervised objective because it does not involve the output label (y) and only uses the input data (x)
At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x
We now fix the weights in layer 1 and repeat the same process with layer 2:
[Figure: the hidden representation h2 reconstructs h1 as ĥ1]
min (1/m) Σ_{i=1..m} Σ_{j=1..n} (ĥ1ij − h1ij)²
At the end of this step, the weights in layer 2 are trained such that h2 captures an abstract representation of h1
We continue this process till the last hidden layer (i.e., the layer before the output layer) so that each successive layer captures an abstract representation of the previous layer
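A minimal NumPy sketch of this greedy procedure is shown below (an autoencoder version, as in this lecture, not the RBM formulation of the original paper). The layer sizes, learning rate, epochs, and synthetic data are illustrative assumptions; each call trains one layer to minimize the squared reconstruction error from the slide, then its code is passed on as the input for the next layer.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = rng.random((100, 6))            # m = 100 unlabeled inputs with n = 6 features (synthetic)

def pretrain_layer(H_in, n_hidden, lr=0.5, epochs=200):
    """Train one layer as an autoencoder: reconstruct H_in from its hidden code."""
    n_in = H_in.shape[1]
    W_enc = rng.normal(0, 0.1, (n_in, n_hidden))
    W_dec = rng.normal(0, 0.1, (n_hidden, n_in))
    for _ in range(epochs):
        H = sigmoid(H_in @ W_enc)                 # hidden code
        R = sigmoid(H @ W_dec)                    # reconstruction of H_in
        E = R - H_in                              # reconstruction error
        dR = E * R * (1 - R)                      # backprop through the decoder sigmoid
        dH = (dR @ W_dec.T) * H * (1 - H)         # backprop through the encoder sigmoid
        W_dec -= lr * H.T @ dR / len(H_in)
        W_enc -= lr * H_in.T @ dH / len(H_in)
    return W_enc, sigmoid(H_in @ W_enc)

W1, H1 = pretrain_layer(X, 4)       # layer 1: reconstruct x from h1
W2, H2 = pretrain_layer(H1, 3)      # layer 2 (layer 1 frozen): reconstruct h1 from h2
print(H2.shape)                      # (100, 3)
```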
52. 12/67
[Figure: the full network with inputs x1, x2, x3 and the added output layer]
min_θ (1/m) Σ_{i=1..m} (yi − f(xi))²
After this layerwise pre-training, we add the output layer and train the whole network using the task-specific objective
Note that, in effect, we have initialized the weights of the network using the greedy unsupervised objective and are now fine-tuning these weights using the supervised objective
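Continuing the sketch above (same illustrative variables X, W1, W2, rng, sigmoid), fine-tuning could look like the following: an output layer W3 is added, and all weights are updated on the supervised objective (1/m) Σ (yi − f(xi))², starting from the pretrained values.

```python
# continuation of the previous sketch: supervised fine-tuning of all layers
y = rng.random((100, 1))                       # synthetic targets (illustrative)
W3 = rng.normal(0, 0.1, (3, 1))                # newly added output layer
lr = 0.5
for _ in range(200):
    H1 = sigmoid(X @ W1); H2 = sigmoid(H1 @ W2); F = sigmoid(H2 @ W3)
    dF = (F - y) * F * (1 - F)                 # supervised error signal
    dH2 = (dF @ W3.T) * H2 * (1 - H2)
    dH1 = (dH2 @ W2.T) * H1 * (1 - H1)
    W3 -= lr * H2.T @ dF / len(X)              # fine-tune every layer,
    W2 -= lr * H1.T @ dH2 / len(X)             # starting from the pretrained
    W1 -= lr * X.T @ dH1 / len(X)              # initialization
```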
56. 13/67
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?
Let's see what these two questions mean and try to answer them based on some (among many) existing studies¹,²
¹ The difficulty of training deep architectures and effect of unsupervised pre-training, Erhan et al., 2009
² Exploring Strategies for Training Deep Neural Networks, Larochelle et al., 2009
57. 14/67
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?
61. 15/67
What is the optimization problem that we are trying to solve?
minimize L(θ) = (1/m) Σ_{i=1..m} (yi − f(xi))²
Is it the case that in the absence of unsupervised pre-training we are not able to drive L(θ) to 0 even for the training data (hence poor optimization)?
Let us see this in more detail ...
66. 16/67
The error surface of the supervised objective of a Deep Neural Network is highly non-convex
With many hills and plateaus and valleys
Given the large capacity of DNNs it is still easy to land in one of these 0 error regions
Indeed, Larochelle et al.¹ show that if the last layer has large capacity then L(θ) goes to 0 even without pre-training
However, if the capacity of the network is small, unsupervised pre-training helps
¹ Exploring Strategies for Training Deep Neural Networks, Larochelle et al., 2009
67. 17/67
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?
71. 18/67
What does regularization do? It constrains the weights to certain regions of the parameter space
L1 regularization: constrains most weights to be 0
L2 regularization: prevents most weights from taking large values
¹ Image Source: The Elements of Statistical Learning, T. Hastie, R. Tibshirani, and J. Friedman, Pg 71
76. 19/67
Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space
Specifically, it constrains the weights to lie in regions where the characteristics of the data are captured well (as governed by the unsupervised objective)
We can think of this unsupervised objective as an additional constraint on the optimization problem:
Unsupervised objective:  Ω(θ) = (1/m) Σ_{i=1..m} Σ_{j=1..n} (xij − x̂ij)²
Supervised objective:  L(θ) = (1/m) Σ_{i=1..m} (yi − f(xi))²
This unsupervised objective ensures that the learning is not greedy w.r.t. the supervised objective (and also satisfies the unsupervised objective)
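One hedged way to write this "additional constraint" down (this exact form is not stated on the slide) is as a penalized objective that combines the two terms, which is the usual way a regularizer enters an optimization problem:

```latex
% Combined view (illustrative): the unsupervised term acts like a regularizer
\min_{\theta}\; \mathcal{L}(\theta) \;+\; \lambda\,\Omega(\theta), \qquad \lambda \ge 0
% \mathcal{L}(\theta): supervised loss,  \Omega(\theta): reconstruction error,
% larger \lambda keeps \theta in regions that model the input data well.
```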
78. 20/67
Some other experiments have also shown that pre-training is more robust to random initializations
One accepted hypothesis is that pre-training leads to better weight initializations (so that the layers capture the internal characteristics of the data)
¹ The difficulty of training deep architectures and effect of unsupervised pre-training, Erhan et al., 2009
79. 21/67
So what has happened since 2006-2009?
83. 22/67
Deep Learning has evolved
Better optimization algorithms
Better regularization methods
Better activation functions
Better weight initialization strategies
86. 25/67
Before we look at activation functions, let’s try to answer the following question:
“What makes Deep Neural Networks powerful?”
92. 26/67
[Figure: a deep chain network h0 = x → a1 → h1 → a2 → h2 → a3 → y with weights w1, w2, w3 and sigmoid activations]
Consider this deep neural network
Imagine if we replace the sigmoid in each layer by a simple linear transformation
y = (w4 ∗ (w3 ∗ (w2 ∗ (w1x))))
Then we will just learn y as a linear transformation of x
In other words we will be constrained to learning linear decision boundaries
We cannot learn arbitrary decision boundaries
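A tiny illustrative check of this collapse (made-up weight matrices, not from the lecture): composing several linear layers is exactly one linear map, so a deep "linear" network can only produce y = Wx for some single matrix W.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3, W4 = (rng.normal(size=(3, 3)) for _ in range(4))   # illustrative square layers
x = rng.normal(size=3)

deep_linear = W4 @ (W3 @ (W2 @ (W1 @ x)))     # layer-by-layer, no non-linearity
collapsed   = (W4 @ W3 @ W2 @ W1) @ x         # a single equivalent linear map
print(np.allclose(deep_linear, collapsed))    # True
```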
94. 26/67
In particular, a deep linear neural network cannot learn such boundaries
But a deep non-linear neural network can indeed learn such boundaries (recall the Universal Approximation Theorem)
95. 27/67
Now let's look at some non-linear activation functions that are typically used in deep neural networks (much of this material is taken from Andrej Karpathy's lecture notes¹)
¹ http://cs231n.github.io
100. 28/67
Sigmoid
σ(x) = 1 / (1 + e^(−x))
As is obvious, the sigmoid function compresses all its inputs to the range [0, 1]
Since we are always interested in gradients, let us find the gradient of this function:
∂σ(x)/∂x = σ(x)(1 − σ(x))
(you can easily derive it)
Let us see what happens if we use sigmoid in a deep network
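A quick numerical sketch of the sigmoid and its derivative (illustrative inputs) makes the next few slides concrete: the derivative never exceeds 0.25 and is essentially 0 for large |x|.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)          # derivative: sigma(z) * (1 - sigma(z))

for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"z = {z:6.1f}   sigma(z) = {sigmoid(z):.5f}   sigma'(z) = {dsigmoid(z):.5f}")
# sigma'(z) peaks at 0.25 (at z = 0) and is ~0 once the unit saturates
```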
104. 29/67
[Figure: a deep chain of sigmoid units h0 = x → a1 → h1 → a2 → h2 → a3 → h3 → a4 → h4]
a3 = w2h2
h3 = σ(a3)
While calculating ∇w2, at some point in the chain rule we will encounter
∂h3/∂a3 = ∂σ(a3)/∂a3 = σ(a3)(1 − σ(a3))
What is the consequence of this?
To answer this question let us first understand the concept of a saturated neuron
108. 30/67
[Figure: plot of the sigmoid function, flat near 0 for large negative inputs and flat near 1 for large positive inputs]
A sigmoid neuron is said to have saturated when σ(x) = 1 or σ(x) = 0
What would the gradient be at saturation?
Well it would be 0 (you can see it from the plot or from the formula that we derived)
Saturated neurons thus cause the gradient to vanish.
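A small illustrative calculation (made-up pre-activations, not from the lecture) shows why saturation makes the gradient vanish: the chain rule multiplies one σ′(ai) factor per layer, and saturated layers contribute factors that are nearly 0.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# product of sigma'(a_i) factors that the chain rule accumulates across four layers
for label, pre_activations in [("mildly active units", [0.5, -0.8, 1.0, -0.3]),
                               ("saturated units",     [6.0, -7.0, 8.0, -6.5])]:
    factors = [sigmoid(a) * (1 - sigmoid(a)) for a in pre_activations]
    print(f"{label}: product of sigma' factors = {np.prod(factors):.2e}")
# the saturated product is orders of magnitude smaller -> the gradient vanishes
```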
112. 31/67
[Figure: a sigmoid unit computing σ(Σ_{i=1..4} wixi), and the sigmoid plotted against Σ_{i=1..4} wixi]
Saturated neurons thus cause the gradient to vanish.
But why would the neurons saturate?
Consider what would happen if we use sigmoid neurons and initialize the weights to very high values?
The neurons will saturate very quickly
The gradients will vanish and the training will stall (more on this later)
120. 32/67
Saturated neurons cause the gradient to vanish
Sigmoids are not zero centered
Why is this a problem?
[Figure: a network with h0 = x, hidden layers h1 and h2; the top unit computes a3 = w1 ∗ h21 + w2 ∗ h22 from the two outputs h21, h22 of layer h2 and produces y]
Consider the gradients w.r.t. w1 and w2:
∇w1 = ∂L(w)/∂y · ∂y/∂h3 · ∂h3/∂a3 · h21    (since ∂a3/∂w1 = h21)
∇w2 = ∂L(w)/∂y · ∂y/∂h3 · ∂h3/∂a3 · h22    (since ∂a3/∂w2 = h22)
Note that h21 and h22 are between [0, 1] (i.e., they are both positive)
So if the first common term (in red on the slide) is positive (negative) then both ∇w1 and ∇w2 are positive (negative)
Essentially, either all the gradients at a layer are positive or all the gradients at a layer are negative
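The sign argument can be checked numerically. In this small illustrative sketch (made-up values), h21 and h22 are positive because they are sigmoid outputs, so whatever sign the shared term takes, ∇w1 and ∇w2 always agree in sign.

```python
import numpy as np

rng = np.random.default_rng(0)
h21, h22 = rng.uniform(0.01, 0.99, size=2)    # sigmoid outputs: always positive

for trial in range(5):
    common = rng.normal()                     # the shared term dL/dy * dy/dh3 * dh3/da3
    grad_w1, grad_w2 = common * h21, common * h22
    print(f"trial {trial}: same sign = {np.sign(grad_w1) == np.sign(grad_w2)}")
# prints True every time: all gradients at the layer share a sign
```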
123. 33/67
Saturated neurons cause the gradient to vanish
Sigmoids are not zero centered
This restricts the possible update directions
[Figure: the (w1, w2) plane; the quadrant in which all gradients are +ve and the quadrant in which all gradients are -ve are allowed, the other two quadrants are not possible; an optimal w is marked]
131. 34/67
Saturated neurons cause the gradient to vanish
Sigmoids are not zero centered
[Figure: in the (w1, w2) plane, starting from the initial position, the only way to reach the optimal w is by taking a zigzag path]
And lastly, sigmoids are computationally expensive (because of exp(x))
132. 35/67
tanh(x)

[Plot: f(x) = tanh(x) for x in [-4, 4]; the output saturates at -1 and +1.]

f(x) = tanh(x)
Compresses all its inputs to the range [-1, 1]
Zero centered
What is the derivative of this function?

∂tanh(x)/∂x = 1 − tanh²(x)

The gradient still vanishes at saturation
Also computationally expensive
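A quick NumPy check (a sketch, not part of the lecture) of the derivative formula and of how the gradient dies once tanh saturates:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
grad = 1.0 - np.tanh(x) ** 2          # d tanh(x)/dx = 1 - tanh^2(x)
print(grad)                            # roughly [1.0, 0.42, 0.07, 0.0013, 4.5e-07]
```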
ReLU

f(x) = max(0, x)

Is this a non-linear function? Indeed it is!
In fact we can combine two ReLU units to recover a piecewise linear approximation of the sigmoid function:

f(x) = max(0, x + 1) − max(0, x − 1)
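A small sketch (illustrative values) of the two-ReLU combination above: it produces a hard, saturating ramp that is flat outside [-1, 1], i.e., a piecewise linear stand-in for a (rescaled) sigmoid.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-3, 3, 7)
ramp = relu(x + 1) - relu(x - 1)   # two ReLU units combined
print(ramp)                         # 0 for x <= -1, rises linearly, flat at 2 for x >= 1
```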
Advantages of ReLU
Does not saturate in the positive region
Computationally efficient
In practice converges much faster than sigmoid/tanh¹

¹ ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012
Consider a small network with inputs x1, x2 and a bias input 1, with weights w1, w2, b into a pre-activation a1, a ReLU hidden unit h1 = ReLU(a1), and a weight w3 into the output pre-activation a2 that produces y.

In practice there is a caveat. Let's see what the derivative of ReLU(x) is:

∂ReLU(x)/∂x = 0 if x < 0
            = 1 if x > 0

Now consider the given network. What would happen if at some point a large gradient causes the bias b to be updated to a large negative value?

w1x1 + w2x2 + b < 0   [if b << 0]

The neuron would output 0 [dead neuron]. Not only would the output be 0, but during backpropagation even the gradient ∂h1/∂a1 would be zero. The weights w1, w2 and b will not get updated [there will be a zero term in the chain rule]:

∇w1 = ∂L(θ)/∂y · ∂y/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂w1

The neuron will now stay dead forever!!

In practice a large fraction of ReLU units can die if the learning rate is set too high.
It is advised to initialize the bias to a small positive value (e.g., 0.01).
Use other variants of ReLU (as we will soon see).
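A tiny sketch of the dead-neuron situation with made-up weights and inputs: once b is very negative the ReLU outputs 0 and its local derivative is 0, so every gradient flowing through this unit picks up a zero factor.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

w = np.array([0.5, -0.3])
b = -10.0                              # bias pushed to a large negative value
x = np.array([1.2, 0.7])               # a typical input (made up)
a1 = w @ x + b                         # w1*x1 + w2*x2 + b  < 0
h1 = relu(a1)                          # 0  -> dead neuron
dh1_da1 = 1.0 if a1 > 0 else 0.0       # local derivative of the ReLU
print(a1, h1, dh1_da1)                 # the gradients of w1, w2, b all contain this 0 factor
```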
Leaky ReLU

f(x) = max(0.01x, x)

No saturation
Will not die (the 0.01x ensures that at least a small gradient will flow through)
Computationally efficient
Close to zero centered outputs

Parametric ReLU

f(x) = max(αx, x)

α is a parameter of the model
α will get updated during backpropagation
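A minimal NumPy sketch of Leaky ReLU and Parametric ReLU (the inputs and the value of α here are hypothetical; in a real network α would be a trained parameter):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # max(0.01x, x): identity for x > 0, small negative slope otherwise
    return np.maximum(slope * x, x)

def prelu(x, alpha):
    # same form, but alpha is learnable and updated during backpropagation
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))          # [-0.02, -0.005, 0., 1.5]
print(prelu(x, alpha=0.2))    # [-0.4, -0.1, 0., 1.5]
```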
Exponential Linear Unit

f(x) = x             if x > 0
     = a(eˣ − 1)     if x ≤ 0

All benefits of ReLU
a(eˣ − 1) ensures that at least a small gradient will flow through
Close to zero centered outputs
Expensive (requires computation of exp(x))
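A minimal sketch of the ELU, assuming a = 1 (the values of x are illustrative):

```python
import numpy as np

def elu(x, a=1.0):
    # x for x > 0, a*(exp(x) - 1) for x <= 0: small but non-zero gradient on the left
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(elu(x))   # [-0.95, -0.632, 0., 2.]
```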
Maxout Neuron

max(w1ᵀx + b1, w2ᵀx + b2)

Generalizes ReLU and Leaky ReLU
No saturation! No death!
Doubles the number of parameters
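A minimal sketch of a layer of maxout units with random (hypothetical) parameters; each unit keeps two weight vectors and two biases, which is why the parameter count doubles.

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # elementwise max of two affine pieces; ReLU is the special case W2 = 0, b2 = 0,
    # and Leaky ReLU the case W2 = 0.01 * W1 with zero biases
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, W2 = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
b1, b2 = rng.standard_normal(3), rng.standard_normal(3)
print(maxout(x, W1, b1, W2, b2))       # one output per maxout unit
```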
Things to Remember

Sigmoids are bad
ReLU is more or less the standard unit for Convolutional Neural Networks
Can explore Leaky ReLU/Maxout/ELU
tanh and sigmoid are still used in LSTMs/RNNs (we will see more on this later)
Deep Learning has evolved
Better optimization algorithms
Better regularization methods
Better activation functions
Better weight initialization strategies
Consider a network with inputs x1, x2, a hidden layer of three sigmoid neurons (pre-activations a11, a12, a13 with outputs h11, h12, h13), and a sigmoid output neuron (pre-activation a21, output h21 = y).

What happens if we initialize all weights to 0?

a11 = w11x1 + w12x2
a12 = w21x1 + w22x2
∴ a11 = a12 = 0
∴ h11 = h12

All neurons in layer 1 will get the same activation.

Now what will happen during back propagation?

∇w11 = ∂L(w)/∂y · ∂y/∂h11 · ∂h11/∂a11 · x1
∇w21 = ∂L(w)/∂y · ∂y/∂h12 · ∂h12/∂a12 · x1

but h11 = h12 and a11 = a12
∴ ∇w11 = ∇w21

Hence both the weights will get the same update and remain equal. In fact this symmetry will never break during training. The same is true for w12 and w22, and for all weights in layer 2 (in fact, work out the math and convince yourself that all the weights in this layer will remain equal).

This is known as the symmetry breaking problem: it will happen if all the weights in a network are initialized to the same value.
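A small NumPy sketch of the symmetry problem on a hypothetical 2-3-1 sigmoid network with a squared-error-style loss: because every weight starts at the same value, all hidden units compute the same activation and the rows of the layer-1 gradient come out identical, so the units stay identical after every update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, t = np.array([0.9, -0.4]), 1.0     # one (made-up) training example and target
W1 = np.full((3, 2), 0.5)             # all layer-1 weights equal
W2 = np.full((1, 3), 0.5)             # all layer-2 weights equal

h1 = sigmoid(W1 @ x)                  # h11 = h12 = h13
y = sigmoid(W2 @ h1)

delta2 = (y - t) * y * (1 - y)        # output-layer error term
dW2 = np.outer(delta2, h1)            # identical entries
delta1 = (W2.T @ delta2).ravel() * h1 * (1 - h1)
dW1 = np.outer(delta1, x)             # identical rows -> the symmetry never breaks
print(dW1)
print(dW2)
```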
We will now consider a feedforward network with:
input: 1000 points, each ∈ R⁵⁰⁰
input data drawn from a unit Gaussian
the network has 5 layers
each layer has 500 neurons
we will run forward propagation on this network with different weight initializations
Let's try to initialize the weights to small random numbers. We will see what happens to the activations across different layers.

[Figures: per-layer activation histograms, with tanh activation functions and with sigmoid activation functions.]

What will happen during back propagation?
Recall that ∇w1 is proportional to the activation passing through it.
If all the activations in a layer are very close to 0, what will happen to the gradient of the weights connecting this layer to the next layer?
They will all be close to 0 (vanishing gradient problem).
Let us try to initialize the weights to large random numbers.

[Figures: per-layer activation histograms with tanh and with sigmoid activations under large weights.]

Most activations have saturated.
What happens to the gradients at saturation?
They will all be close to 0 (vanishing gradient problem).
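A sketch of the forward-propagation experiment described above (sizes as in the lecture: unit-Gaussian data in R⁵⁰⁰, 5 tanh layers of 500 neurons each; the weight scales 0.01 and 1.0 are illustrative choices for "small" and "large"):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 500))          # unit-Gaussian input data

for scale, name in [(0.01, "small weights"), (1.0, "large weights")]:
    h = x
    stds = []
    for _ in range(5):                        # 5 layers of 500 neurons each
        W = scale * rng.standard_normal((500, 500))
        h = np.tanh(h @ W)
        stds.append(h.std())
    print(name, [round(s, 3) for s in stds])
# small weights: the spread of the activations shrinks layer after layer (-> 0)
# large weights: activations saturate near +/-1, so the spread stays close to 1
```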
Let us try to arrive at a more principled way of initializing weights.

Consider a single layer with inputs x1, ..., xn and pre-activations s11, ..., s1n, where

s11 = Σᵢ w1i xi    (sum over i = 1, ..., n)

Then

Var(s11) = Var(Σᵢ w1i xi) = Σᵢ Var(w1i xi)    [assuming the terms w1i xi are independent]
         = Σᵢ [ (E[w1i])² Var(xi) + (E[xi])² Var(w1i) + Var(xi) Var(w1i) ]
         = Σᵢ Var(xi) Var(w1i)                [assuming zero-mean inputs and weights]
         = (n Var(w)) (Var(x))                [assuming Var(xi) = Var(x) ∀ i and Var(w1i) = Var(w) ∀ i]
In general,

Var(s1i) = (n Var(w)) (Var(x))

What would happen if n Var(w) ≫ 1? The variance of s1i will be large.
What would happen if n Var(w) → 0? The variance of s1i will be small.
Let us see what happens if we add one more layer (a pre-activation s21 computed from s11, ..., s1n).
Using the same procedure as above we arrive at

Var(s21) = Σᵢ Var(s1i) Var(w2i) = n Var(s1i) Var(w2)

and since Var(s1i) = n Var(w1) Var(x),

Var(s21) ∝ [n Var(w2)] [n Var(w1)] Var(x)
         ∝ [n Var(w)]² Var(x)    [assuming weights across all layers have the same variance]
In general,

Var(ski) = [n Var(w)]ᵏ Var(x)

To ensure that the variance in the output of any layer does not blow up or shrink we want:

n Var(w) = 1

If we draw the weights from a unit Gaussian and scale them by 1/√n, then we have:

n Var(w) = n Var(z/√n) = n · (1/n) Var(z) = 1    [z ∼ unit Gaussian; using Var(az) = a² Var(z)]
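A small numerical check of the 1/√n scaling (linear layers only, ignoring the non-linearity, with illustrative sizes): since n · Var(w) = 1, the pre-activation variance stays roughly constant across layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
h = rng.standard_normal((1000, n))                  # unit-Gaussian inputs

for layer in range(5):
    W = rng.standard_normal((n, n)) / np.sqrt(n)    # Var(w) = 1/n  =>  n*Var(w) = 1
    h = h @ W                                       # pre-activations only
    print(layer, round(h.var(), 3))                 # stays close to 1 at every layer
```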
Let's see what happens if we use this initialization.

[Figure: per-layer activation histograms with tanh activation under the 1/√n initialization.]

However, this does not work for ReLU neurons. Why?
Intuition: He et al. argue that a factor of 2 is needed when dealing with ReLU neurons.
Intuitively this happens because the range of ReLU neurons is restricted only to the positive half of the space.
Indeed, when we account for this factor of 2 we see better performance.
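A sketch comparing the two scalings on ReLU layers (illustrative sizes, not the lecture's exact experiment): with 1/√n the activation variance keeps roughly halving per layer, while the He et al. factor of 2, i.e., scaling by √(2/n), keeps it roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal((1000, n))

for scale, name in [(np.sqrt(1.0 / n), "1/sqrt(n) scaling"),
                    (np.sqrt(2.0 / n), "sqrt(2/n) (He) scaling")]:
    h = x
    for _ in range(5):
        W = scale * rng.standard_normal((n, n))
        h = np.maximum(0.0, h @ W)               # ReLU keeps only the positive half
    print(name, round(h.var(), 3))
# 1/sqrt(n): the variance decays roughly as (1/2)^k; sqrt(2/n): stays roughly constant
```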
We will now see a method called batch normalization which allows us to be less
careful about initialization
To understand the intuition behind Batch Normalization let us consider a deep network with inputs x1, x2, x3 and hidden layers h0, h1, h2, h3, h4.
Let us focus on the learning process for the weights between two of these layers, say h3 and h4.
Typically we use mini-batch algorithms.
What would happen if there is a constant change in the distribution of h3? In other words, what would happen if across mini-batches the distribution of h3 keeps changing?
Would the learning process be easy or hard?
It would help if the pre-activations at each layer were unit Gaussians.
Why not explicitly ensure this by standardizing the pre-activation?

ŝik = (sik − E[sik]) / √var(sik)

But how do we compute E[sik] and Var[sik]? We compute them from a mini-batch.
Thus we are explicitly ensuring that the distribution of the inputs at different layers does not change across batches.
This is what the deep network will look like with Batch Normalization: a BN step is inserted before each tanh layer.
Is this legal?
Yes, it is: just as the tanh layer is differentiable, the Batch Normalization layer is also differentiable.
Hence we can backpropagate through this layer.
Catch: Do we necessarily want to force a unit Gaussian input to the tanh layer?
Why not let the network learn what is best for it?
After the Batch Normalization step add the following step:

y⁽ᵏ⁾ = γ⁽ᵏ⁾ ŝik + β⁽ᵏ⁾

where γ⁽ᵏ⁾ and β⁽ᵏ⁾ are additional parameters of the network.
What happens if the network learns

γ⁽ᵏ⁾ = √var(sik) and β⁽ᵏ⁾ = E[sik]?

We will recover sik. In other words, by adjusting these additional parameters the network can learn to recover sik if that is more favourable.
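A minimal NumPy sketch of the Batch Normalization forward pass (batch size, number of units and the statistics here are hypothetical; the backward pass and the train/inference distinction are omitted):

```python
import numpy as np

def batchnorm_forward(s, gamma, beta, eps=1e-5):
    """BN applied to a mini-batch of pre-activations s (shape: batch x units).
    The statistics are computed from the mini-batch itself."""
    mu = s.mean(axis=0)                      # E[s_k] per unit, from the batch
    var = s.var(axis=0)                      # Var[s_k] per unit, from the batch
    s_hat = (s - mu) / np.sqrt(var + eps)    # standardize: zero mean, unit variance
    return gamma * s_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
s = 3.0 * rng.standard_normal((64, 5)) + 2.0              # a mini-batch of pre-activations
y = batchnorm_forward(s, gamma=np.ones(5), beta=np.zeros(5))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))    # ~0 and ~1 per unit

# If gamma = sqrt(Var[s]) and beta = E[s], the layer (approximately) recovers s:
y_rec = batchnorm_forward(s, gamma=s.std(axis=0), beta=s.mean(axis=0))
print(np.allclose(y_rec, s, atol=1e-3))
```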
We will now compare the performance with and without batch normalization on
MNIST data using 2 layers....
2016-17: Still exciting times
Even better optimization methods
Data driven initialization methods
Beyond batch normalization