This document provides an overview of techniques for training deep neural networks. It discusses neural network parameters and hyperparameters, regularization strategies like L1 and L2 norm penalties, dropout, batch normalization, and optimization methods like mini-batch gradient descent. The key aspects covered are distinguishing parameters from hyperparameters, techniques for reducing overfitting like regularization and early stopping, and batch normalization for reducing internal covariate shift during training.
Slide 1: Tips for Training Deep Neural Networks
by Dr. Vikas Kumar
Department of Data Science and Analytics
Central University of Rajasthan, India
Email: vikas@curaj.ac.in
Slide 2: Outline
Neural Network Parameters
Parameters vs Hyperparameters
How to set network parameters
Bias / Variance Trade-off
Regularization Strategies
Batch normalization
Vanishing / Exploding gradients
Gradient Descent
Mini-batch Gradient Descent
Deep Learning
Slide 3: Neural Network Parameters
[Figure: a 16 x 16 = 256-pixel image of a handwritten digit is fed into a multi-layer network; each pixel is one input x1, x2, …, x256 (ink → 1, no ink → 0), and the outputs y1, y2, … score the possible digits, e.g. (0, 0.1, 0.7, 0.2, …).]
Set the network parameters 𝜃 = 𝑊1, 𝑏1, 𝑊2, 𝑏2, ⋯ 𝑊𝐿, 𝑏𝐿 such that:
– if the input is a "1", y1 has the maximum value;
– if the input is a "2", y2 has the maximum value.
How do we let the neural network achieve this?
Slide 4: Parameters vs Hyperparameters
A model parameter is a variable of the selected model that is estimated by fitting the given data to the model.
A hyperparameter is a parameter from a prior distribution; it captures the prior belief before the data is observed.
– Hyperparameters are the parameters that control the model parameters.
– In any machine learning algorithm, hyperparameters must be set before a model is trained.
Image Source: https://www.slideshare.net/AliceZheng3/evaluating-machine-learning-models-a-beginners-guide
Slide 5: Deep Neural Network: Parameters vs Hyperparameters
Parameters:
– 𝑊1, 𝑏1, 𝑊2, 𝑏2, ⋯ 𝑊𝐿, 𝑏𝐿
Hyperparameters:
– Learning rate 𝜶 in gradient descent
– Number of iterations in gradient descent
– Number of layers in the neural network
– Number of neurons per layer in the neural network
– Activation functions
– Mini-batch size
– Regularization parameters
Image Source: https://www.slideshare.net/AliceZheng3/evaluating-machine-learning-models-a-beginners-guide
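The distinction on this slide can be sketched in plain numpy. The toy task, network size, and all names below are illustrative assumptions, not from the slides: the hyperparameters are fixed before training, while the parameters (W1, b1, W2, b2) are the quantities gradient descent estimates from data.

```python
import numpy as np

# Hyperparameters: chosen before training begins.
learning_rate = 0.1   # alpha in gradient descent
n_iterations = 500    # number of gradient descent iterations
n_hidden = 32         # neurons in the single hidden layer

# Toy data (hypothetical): 100 points, label = sign of the feature sum.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X.sum(axis=1) > 0).astype(float)

# Parameters: estimated by fitting the data to the model.
W1 = rng.normal(scale=0.1, size=(4, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1)); b2 = np.zeros(1)

for _ in range(n_iterations):
    # Forward pass: ReLU hidden layer, sigmoid output.
    h = np.maximum(0, X @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))
    # Backward pass: binary cross-entropy gradient is (p - y).
    dz2 = (p - y[:, None]) / len(X)
    dW2 = h.T @ dz2; db2 = dz2.sum(axis=0)
    dh = dz2 @ W2.T * (h > 0)
    dW1 = X.T @ dh; db1 = dh.sum(axis=0)
    # Gradient descent updates touch only the parameters, never the hyperparameters.
    W1 -= learning_rate * dW1; b1 -= learning_rate * db1
    W2 -= learning_rate * dW2; b2 -= learning_rate * db2
```

Changing `learning_rate` or `n_hidden` changes how the parameters are found, not the data-fitting role of the parameters themselves.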
Slide 6: Train / Dev / Test Sets
Hyperparameter tuning is a highly iterative process, where you
– start with an idea, i.e. a certain number of hidden layers, a certain learning rate, etc.;
– try the idea by implementing it;
– evaluate how well the idea has worked;
– refine the idea and iterate this process.
Now how do we identify whether the idea is working? This is where the train / dev / test sets come into play:
– We train the model on the training set.
– After training the model, we check how well it performs on the dev set.
– When we have a final model, we evaluate it on the test set in order to get an unbiased estimate of how well our algorithm is doing.
Slide 7: Train / Dev / Test Sets
[Figure: typical data splits — Training Set (60%) / Dev Set (20%) / Test Set (20%); Training Set (70%) / Test Set (20%); and Training Set (98%) / Dev Set (1%) / Test Set (1%).]
Previously, when we had small datasets, the distribution across the different sets was most often 60/20/20. As the availability of data has increased in recent years, we can use a huge slice of it, e.g. 98%, for training the model.
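The splits above can be sketched as a small helper. The function name, fractions, and toy arrays are illustrative assumptions, not from the slides:

```python
import numpy as np

def train_dev_test_split(X, y, dev_frac=0.20, test_frac=0.20, seed=0):
    """Shuffle the data once, then slice it into train/dev/test partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_dev = int(len(X) * dev_frac)
    n_test = int(len(X) * test_frac)
    dev, test, train = idx[:n_dev], idx[n_dev:n_dev + n_test], idx[n_dev + n_test:]
    return (X[train], y[train]), (X[dev], y[dev]), (X[test], y[test])

# Hypothetical dataset of 50 examples: a 60/20/20 split yields 30/10/10.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)
train, dev, test = train_dev_test_split(X, y)
```

For a 98/1/1 split on a very large dataset, you would simply pass dev_frac=0.01 and test_frac=0.01.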
Slide 8: Bias / Variance Trade-off
Make sure the distribution of the dev/test sets is the same as that of the training set:
– Divide the training, dev and test sets in such a way that their distributions are similar.
– Alternatively, skip the test set and validate the model using the dev set only.
We want our model to be just right, which means having low bias and low variance.
– Overfitting: if the dev set error is much higher than the train set error, the model is overfitting and has high variance.
– Underfitting: when both the train and dev set errors are high, the model is underfitting and has high bias.
Image Source: https://www.analyticsvidhya.com/blog/2018/11/neural-networks-hyperparameter-tuning-regularization-deeplearning/
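The two diagnostic rules above can be written down directly. The tolerances here are illustrative assumptions, not values from the slides:

```python
def diagnose(train_err, dev_err, gap_tol=0.02, bias_tol=0.05):
    """Rough bias/variance read-out from train and dev set errors
    (tolerances are hypothetical and problem-dependent)."""
    high_bias = train_err > bias_tol            # underfitting signal: train error itself is high
    high_variance = (dev_err - train_err) > gap_tol  # overfitting signal: large train-dev gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "underfitting (high bias)"
    if high_variance:
        return "overfitting (high variance)"
    return "just right (low bias, low variance)"

print(diagnose(0.01, 0.12))   # large train-dev gap -> overfitting
print(diagnose(0.15, 0.16))   # both errors high -> underfitting
```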
Slide 9: Overfitting in Deep Neural Nets
Deep neural networks contain multiple non-linear hidden layers.
– This makes them very expressive models that can learn very complicated relationships between their inputs and outputs.
– In other words, the model learns even the tiniest details present in the data.
But with limited training data, many of these complicated relationships will be the result of sampling noise.
– They will exist in the training set but not in real test data, even if it is drawn from the same distribution.
– So after learning all the patterns it can find, the model tends to perform extremely well on the training set but fails to produce good results on the dev and test sets.
Slide 10: Regularization
Regularization is "any modification to a learning algorithm to reduce its generalization error but not its training error."
– It reduces generalization error even at the expense of increased training error.
– E.g., limiting model capacity is a regularization method.
Source: https://cedar.buffalo.edu/~srihari/CSE574/Chap5/Chap5.5-Regularization.pdf
Slide 12: Parameter Norm Penalties
The most traditional form of regularization applicable to deep learning is the parameter norm penalty.
This approach limits the capacity of the model by adding a penalty Ω(𝜃) to the objective function, resulting in:
min𝜃 𝐽 = ℓ(𝜃) + 𝜆 Ω(𝜃)
𝜆 ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty to the value of the objective function.
Slide 13: L2 Norm Parameter Regularization
Using the L2 norm, we add a constraint to the original loss function so that the weights of the network don't grow too large:
Ω(𝜃) = ||𝜃||₂²
Assuming there are no bias parameters, only weights:
Ω(𝑤) = ||𝑤||₂² = 𝑤₁₁² + 𝑤₁₂² + ⋯
By adding the regularization term, we prevent the model from driving the training error all the way to zero, which in turn reduces the effective complexity of the model.
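A minimal numpy sketch of the L2 penalty on a linear model, assuming a squared-error loss (the toy data, weights, and step counts are illustrative, not from the slides). The penalty 𝜆||𝑤||² adds 2𝜆𝑤 to the gradient, which shrinks every weight toward zero at each update:

```python
import numpy as np

def l2_penalized_loss_and_grad(w, X, y, lam):
    """J(w) = ||Xw - y||^2 / (2n) + lam * ||w||_2^2, with its gradient."""
    n = len(X)
    resid = X @ w - y
    loss = (resid @ resid) / (2 * n) + lam * (w @ w)
    grad = X.T @ resid / n + 2 * lam * w   # the 2*lam*w term decays the weights
    return loss, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

def fit(lam, steps=2000, lr=0.1):
    w = np.zeros(5)
    for _ in range(steps):
        _, g = l2_penalized_loss_and_grad(w, X, y, lam)
        w -= lr * g
    return w

w_free = fit(lam=0.0)   # unregularized fit
w_reg = fit(lam=0.5)    # heavily regularized fit: weights kept small
```

With the penalty switched on, the learned weight vector has a visibly smaller norm, exactly the "weights don't grow too large" effect the slide describes.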
14. - 14 -
L1 Norm Parameter Regularization
L1 norm is another option that can be used to penalize the size
of model parameters.
L1 regularization on the model parameters w is:
Ω 𝑤 = ||𝑤||1 =
𝑖
|𝑤𝑖|
The L2 Norm penalty decays the components of the vector w
that do not contribute much to reducing the objective function.
On the other hand, the L1 norm penalty provides solutions that
are sparse.
This sparsity property can be thought of as a feature selection
mechanism.
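A hedged illustration of the sparsity property: with an L1 penalty, the proximal update is soft-thresholding, which sets small components exactly to zero (the function name and the λ value in the usage note are ours, not from the slides):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam * ||w||_1: shrinks every component
    toward zero and zeroes out anything with magnitude below lam,
    which is the mechanism behind L1-induced sparsity."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
```

For example, `soft_threshold(np.array([0.05, -0.8, 0.3]), 0.1)` zeroes the smallest component while only shrinking the others, acting like a feature selector.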
15. - 15 -
Early Stopping
When training models with sufficient representational
capacity to overfit the task, we often observe that training
error decreases steadily over time, while the error on the
validation set eventually begins to rise again (or plateaus
for many iterations). At that point, there is no benefit in
training the model further.
This means we can obtain a model with better validation set
error (and thus, hopefully, better test set error) by returning
to the parameter setting at the point in time with the lowest
validation set error.
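The idea can be sketched as a simple loop (illustrative only; `train_step` and `val_loss` stand in for whatever the actual training code provides):

```python
def train_with_early_stopping(train_step, val_loss, max_iters=1000, patience=10):
    """Stop when validation loss has not improved for `patience` checks,
    and return the parameters from the best validation point."""
    best_loss, best_params, bad_checks = float("inf"), None, 0
    for step in range(max_iters):
        params = train_step()           # one training update, returns params
        loss = val_loss(params)         # evaluate on the validation set
        if loss < best_loss:
            best_loss, best_params, bad_checks = loss, params, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:  # validation stopped improving
                break
    return best_params, best_loss
```

Returning `best_params` rather than the final parameters is the key point: we roll back to the lowest-validation-error setting.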
16. - 16 -
Parameter Tying
Sometimes we might not know which region the
parameters should lie in, but we do know that there are
some dependencies between them.
Parameter Tying refers to explicitly forcing the parameters
of two models to be close to each other, through the norm
penalty.
Ω = ||𝑾⁽𝑨⁾ − 𝑾⁽𝑩⁾||
Here, 𝑾(𝑨) refers to the weights of the first model while
𝑾(𝑩) refers to those of the second one.
17. - 17 -
Dropout
Dropout can be viewed as a bagging method
– Bagging is a method of averaging over several
models to improve generalization
– Training many separate neural networks is impractical,
since it is expensive in time and memory
– Dropout is bagging applied (approximately) to neural
networks at low cost
Dropout is an inexpensive but powerful method
of regularizing a broad family of models.
Specifically, dropout trains the ensemble
consisting of all sub-networks that can be
formed by removing non-output units from an
underlying base network.
18. - 18 -
Dropout - Intuitive Reason
When people work in teams, if everyone expects their
partner to do the work, nothing gets done in the end.
However, if you know your partner might drop out,
you work harder yourself.
At test time, nobody actually drops out, so every
unit is "working harder" and we obtain good results.
20. - 20 -
Dropout
Training:
– Before each gradient computation, each neuron is
dropped with probability p%
– The new, thinner network is used for that training step
– The structure of the network therefore changes
at every step
21. - 21 -
Dropout
Testing:
– No dropout is applied
– If the dropout rate at training time is p%,
multiply all the weights by (1 − p)%
Example: assume the dropout rate is 50%.
If a weight is w = 1 after training, set w = 0.5 for testing.
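The two regimes can be sketched as follows (a minimal NumPy illustration; the function name and the fixed seed are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p, train):
    """Dropout as described above: drop each unit with probability p
    during training; at test time keep all units and scale by (1 - p)."""
    if train:
        mask = rng.random(x.shape) >= p   # keep each unit with prob 1 - p
        return x * mask                   # dropped units output exactly 0
    return x * (1.0 - p)                  # test time: rescale instead
```

Scaling by (1 − p) at test time makes the expected test-time activation match the average activation seen during training, per the ensemble-averaging argument below.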
22. - 22 -
Why should the weights be multiplied by (1 − p)% (the dropout
rate) when testing?
Consider a unit z = w₁x₁ + w₂x₂ with a 50% dropout rate. The
four equally likely sub-networks compute z = w₁x₁ + w₂x₂,
z = w₁x₁, z = w₂x₂, and z = 0. Averaging them gives

z = ½ w₁x₁ + ½ w₂x₂

which is exactly what the full network computes at test time
when each weight is multiplied by (1 − p) = ½.
23. - 23 -
Dropout is a kind of ensemble.
An ensemble trains a bunch of networks with different
structures (Network 1, …, Network 4), each on a different
subset of the training set (Set 1, …, Set 4).
24. - 24 -
Dropout is a kind of ensemble.
At test time, the ensemble feeds the testing data x to every
network (Network 1, …, Network 4) and averages their
outputs y₁, …, y₄.
25. - 25 -
Setting up your Optimization Problem
26. - 26 -
Normalizing Inputs
The range of values of raw training data often varies widely
– Example: "has kids" feature in {0, 1}
– Value of car: $500 up to hundreds of thousands of dollars
If one feature has a much broader range of values than the
others, the distance will be governed by that particular feature.
– After normalization, each feature contributes approximately
proportionately to the final distance.
In general, gradient descent converges much faster with
feature scaling than without it.
It is also good practice for numerical stability, and to avoid
ill-conditioning when solving systems of equations.
27. - 27 -
Feature Scaling
Given training examples x¹, x², x³, …, xʳ, …, xᵐ, for each
dimension i compute the mean mᵢ and the standard deviation σᵢ,
then standardize:

xᵢʳ ← (xᵢʳ − mᵢ) / σᵢ

After scaling, the means of all dimensions are 0,
and the variances are all 1.
In general, gradient descent converges much
faster with feature scaling than without it.
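The standardization step above can be sketched in NumPy (an illustration, not code from the slides; rows are examples, columns are dimensions):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) of X to zero mean and unit variance,
    i.e. x <- (x - m_i) / sigma_i per dimension i."""
    mean = X.mean(axis=0)   # m_i for each dimension
    std = X.std(axis=0)     # sigma_i for each dimension
    return (X - mean) / std
```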
28. - 28 -
Internal Covariate Shift
• The first guy tells the second guy, "go water
the plants"; the second guy tells the third
guy, "got water in your pants"; and so on,
until the last guy hears something totally
wrong, like "kite bang eat face monkey".
• Suppose the problems are entirely
systemic and due entirely to faulty red cups.
Then the situation is analogous to forward
propagation.
• If we have to fix the problem by finding new
cups through trial and error, it would help to
pass the messages in a consistent, controlled,
standardized ("normalized") way, e.g. same
volume, same language, etc.
“First layer parameters change and
so the distribution of the input to
your second layer changes”
31. - 31 -
Batch normalization
For a mini-batch of inputs x¹, x², x³, the layer first computes
z¹ = W¹x¹, z² = W¹x², z³ = W¹x³. Batch normalization then
computes the mean μ and standard deviation σ of the batch
(so μ and σ depend on all the zⁱ) and normalizes

z̃ⁱ = (zⁱ − μ) / (σ + ε)

before the activation (e.g. sigmoid) produces a¹, a², a³.
Batch norm happens between computing Z and computing A. The intuition is
that, instead of using the un-normalized value Z, you use the normalized value Z̃.
32. - 32 -
Batch normalization
Setting the mean to μ = 0 and the variance to σ = 1 works
for most applications, but in an actual implementation we
don't want the hidden units to always have
mean 0 and variance 1.
So after computing ẑⁱ = (zⁱ − μ)/(σ + ε), we replace it with

z̃ⁱ = γ ẑⁱ + β

where γ and β are learnable parameters.
The un-normalized zⁱ is the special case of z̃ⁱ = γ ẑⁱ + β
at γ = σ + ε and β = μ.
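A minimal sketch of the normalize-then-scale-and-shift computation (illustrative NumPy; the σ + ε denominator follows the slides' convention rather than the more common √(σ² + ε)):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Batch-norm forward pass over a mini-batch (rows = examples).
    Normalize with the batch statistics, then scale by gamma and
    shift by beta (both learnable)."""
    mu = z.mean(axis=0)            # batch mean, per unit
    sigma = z.std(axis=0)          # batch standard deviation, per unit
    z_hat = (z - mu) / (sigma + eps)
    return gamma * z_hat + beta
```

With γ = 1 and β = 0 this reduces to plain normalization; learning γ and β lets the network undo the normalization when that helps.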
33. - 33 -
Batch normalization at testing time
During training, the pipeline is x → W¹ → z → normalize → scale and shift:

ẑ = (z − μ) / σ,  z̃ = γ ⊙ ẑ + β

where μ and σ come from the batch, and γ and β are network
parameters.
We do not have a batch at the testing stage.
Ideal solution:
Compute μ and σ using the whole training dataset.
Practical solution:
Compute the moving average of the μ and σ of the
batches during training (accumulating μ¹, μ¹⁰⁰, μ³⁰⁰, …
across updates).
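The practical solution can be sketched as an exponential moving average (illustrative; the `momentum` value is an assumption, not from the slides):

```python
def update_running_stats(running_mu, running_sigma, batch_mu, batch_sigma,
                         momentum=0.99):
    """Exponential moving average of the batch statistics mu and sigma,
    accumulated during training for use at test time, when no batch
    is available."""
    new_mu = momentum * running_mu + (1.0 - momentum) * batch_mu
    new_sigma = momentum * running_sigma + (1.0 - momentum) * batch_sigma
    return new_mu, new_sigma
```

At test time the stored running μ and σ replace the batch statistics in the normalization formula.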
34. - 34 -
Why does normalizing the data make the algorithm faster?
In the case of unnormalized data, the scales of the
features vary, and hence so do the magnitudes of the
parameters learned for each feature. This makes the
contours of the cost function J over the parameters
(w, b) elongated and asymmetric.
Whereas, in the case of normalized data, the scales
match and the cost function contours are symmetric.
This makes it easier for the gradient descent
algorithm to find the global minimum quickly.
And this, in turn, makes the algorithm run much faster.
Image Source: https://www.analyticsvidhya.com/blog/2018/11/neural-networks-hyperparameter-tuning-regularization-deeplearning/
35. - 35 -
Vanishing / Exploding gradients
When you're training a very deep network,
sometimes the derivatives can get either very, very
big or very, very small, and this makes training difficult.
Consider a deep network with inputs x₁, x₂ and weight
matrices W¹, W², …, W^(L−1), W^L producing output ŷ.
For simplicity, we assume the bias is b = 0 at every layer
and the activation function is linear. Then

Z¹ = W¹x,  Z² = W²Z¹,  …,  Z^(L−1) = W^(L−1)Z^(L−2),  ŷ = W^L Z^(L−1)

so

ŷ = W^L W^(L−1) W^(L−2) ⋯ W² W¹ x.

Assuming the weight matrices all have the form

W^(L−1) = W^(L−2) = ⋯ = W² = W¹ = [p 0; 0 p]

then

ŷ = W^L × [p 0; 0 p]^(L−1) × x.
Source: https://www.coursera.org/learn/deep-neural-network/lecture/C9iQO/vanishing-exploding-gradients
If p > 1 and the number of layers in the
network is large, the value of ŷ will explode.
Similarly, if p < 1, the value of ŷ will be very
small and the gradients will vanish; gradient
descent will then take very tiny steps.
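The effect can be sketched numerically (a toy deep linear network with every weight matrix equal to p·I, matching the setup above; function name is ours):

```python
import numpy as np

def forward_linear(p, num_layers, x):
    """Deep linear network where every 2x2 weight matrix is p * I,
    so the output scales as p ** num_layers."""
    W = p * np.eye(2)
    z = np.asarray(x, dtype=float)
    for _ in range(num_layers):
        z = W @ z
    return z
```

With 50 layers, p = 1.5 blows the activations up past 10⁸, while p = 0.5 shrinks them below 10⁻⁸; the gradients behave the same way.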
37. - 37 -
Solutions: Vanishing / Exploding gradients
Use a good initialization
– Random Initialization
The primary reason behind initializing the weights
randomly is to break symmetry.
We want to make sure that different hidden units
learn different patterns.
Do not use sigmoid for deep networks
– Problem: saturation
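One possible sketch of such an initialization (the 2/fan_in variance scaling is "He initialization", which the slides do not name; treat the specific scaling as an illustrative choice — the essential point is that the draws are random, which breaks symmetry):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    """Random weight initialization: breaks symmetry so different hidden
    units learn different patterns; the sqrt(2 / fan_in) scale keeps
    activation magnitudes roughly constant across layers."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
```

Initializing all weights to the same constant instead would make every hidden unit compute, and keep computing, the same function.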
Image Source: Pattern Recognition and Machine Learning, Bishop
38. - 38 -
ReLU
Rectified Linear Unit (ReLU)
a = z for z > 0, a = 0 for z ≤ 0
Reasons:
1. Fast to compute
2. Alleviates the vanishing gradient
problem (unlike the sigmoid σ(z), it does
not saturate for z > 0)
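In code, ReLU is a one-liner (NumPy illustration):

```python
import numpy as np

def relu(z):
    """ReLU activation: a = z if z > 0, else 0, applied elementwise."""
    return np.maximum(z, 0.0)
```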
43. - 43 -
Gradient Descent
Assume there are only two parameters w₁ and w₂ in the
network, θ = (w₁, w₂), and the colors of the error surface
represent the value of the cost C.
– Randomly pick a starting point θ⁰
– Compute the negative gradient at θ⁰, −∇C(θ⁰), where

∇C(θ⁰) = (∂C(θ⁰)/∂w₁, ∂C(θ⁰)/∂w₂)

– Multiply it by the learning rate η to get the step −η∇C(θ⁰),
which moves θ⁰ toward the optimum θ*
44. - 44 -
Gradient Descent
From the random starting point θ⁰, repeatedly compute the
negative gradient, scale it by the learning rate η, and step:
θ¹ = θ⁰ − η∇C(θ⁰), θ² = θ¹ − η∇C(θ¹), and so on.
Eventually, we reach a minimum.
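The update rule above can be sketched as follows (illustrative NumPy; `grad_C` stands in for whatever computes ∇C):

```python
import numpy as np

def gradient_descent(grad_C, theta0, eta=0.1, num_steps=100):
    """Plain gradient descent: theta <- theta - eta * grad_C(theta),
    repeated for num_steps updates from the starting point theta0."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - eta * grad_C(theta)
    return theta
```

On a convex surface such as C(θ) = w₁² + w₂² (gradient 2θ), this converges to the global minimum at the origin.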
45. - 45 -
Gradient Descent
Gradient descent
– Pros
Guaranteed to converge to the global minimum for a convex error surface
Converges to a local minimum for a non-convex error surface
– Cons
Very slow
Intractable for datasets that do not fit in memory
On a non-convex error surface, different initial points θ⁰
can reach different minima, and so give different results.
47. - 47 -
Mini-batch
– Randomly initialize θ⁰
– Pick the 1st mini-batch (e.g. examples x¹, x³¹, …), compute
its cost C = L¹ + L³¹ + ⋯, and update θ¹ ← θ⁰ − η∇C(θ⁰)
– Pick the 2nd mini-batch (e.g. examples x², x¹⁶, …), compute
C = L² + L¹⁶ + ⋯, and update θ² ← θ¹ − η∇C(θ¹)
– …
C is different each time we update the parameters!
48. - 48 -
Mini-batch
Repeat the update for each mini-batch in turn until all
mini-batches have been picked: that is one epoch. Then
repeat the whole process for further epochs.
Compared with full-batch gradient descent, this is faster,
and often gives better results!
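The whole procedure (shuffle, pick mini-batches in turn, one full pass = one epoch, repeat) can be sketched as follows; the linear-regression gradient used in the test of this sketch is our own illustrative choice:

```python
import numpy as np

def minibatch_sgd(grad_loss, theta0, X, y, batch_size=32, eta=0.1,
                  epochs=10, seed=0):
    """Mini-batch gradient descent: shuffle once per epoch, then update
    the parameters on each mini-batch in turn; one pass over all
    mini-batches is one epoch."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                  # new batches each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # C differs per batch: gradient uses only this batch's examples
            theta = theta - eta * grad_loss(theta, X[idx], y[idx])
    return theta
```

Note that the cost (and hence the gradient) is recomputed from a different subset of examples at every update, exactly as in the slide.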
49. - 49 -
How can we choose a mini-batch size?
If the mini-batch size = m
– This is batch gradient descent, where all the
training examples are used in each iteration. It
takes too much time per iteration.
If the mini-batch size = 1
– This is called stochastic gradient descent, where
each training example is its own mini-batch.
– Since every iteration uses just a single example,
the updates can become extremely noisy, and it
takes much more time to reach the global
minimum.
If the mini-batch size is between 1 and m
– This is mini-batch gradient descent. The size of the
mini-batch should be neither too large nor too small.
Source: https://www.coursera.org/learn/deep-neural-network/lecture/qcogH/mini-batch-gradient-descent
50. - 50 -
Acknowledgement
http://wavelab.uwaterloo.ca/wp-content/uploads/2017/04/Lecture_3.pdf
https://heartbeat.fritz.ai/deep-learning-best-practices-regularization-techniques-for-better-performance-of-neural-network-94f978a4e518
https://cedar.buffalo.edu/~srihari/CSE676/7.12%20Dropout.pdf
http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2017/Lecture/DNN%20tip.pptx
Batch Normalization: Accelerating Deep Network Training by Reducing Internal
Covariate Shift, Jude W. Shavlik
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/ForDeep.pptx
Deep Learning Tutorial. Prof. Hung-yi Lee, NTU.
On Predictive and Generative Deep Neural Architectures, Prof. Swagatam Das,
ISICAL