Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter 8 (Hakky St)
This is the documentation of a study meeting in our lab.
The book title is "Hands-On Machine Learning with Scikit-Learn and TensorFlow", and this covers chapter 8.
Slides explaining the distinction between bagging and boosting through the bias-variance trade-off, followed by some lesser-known aspects of supervised learning: the effect of the tree-split metric on feature importance, the effect of the decision threshold on classification accuracy, and how to adjust a model's classification threshold.
Note: Limitations of the accuracy metric (baseline accuracy) and alternative metrics, with their use cases, advantages, and limitations, were briefly discussed.
Abstract: This PDSG workshop introduces basic concepts of splitting a dataset for training a model in machine learning. Concepts covered are training, test and validation data, serial and random splitting, data imbalance and k-fold cross validation.
Level: Fundamental
Requirements: No prior programming or statistics knowledge required.
This presentation was prepared as part of the curriculum studies for the CSCI-659 Topics in Artificial Intelligence course - Machine Learning in Computational Linguistics.
It was prepared under the guidance of Prof. Sandra Kubler.
A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.
Get to know in detail the terminology of Random Forests, the types of algorithms used in the workflow, and their advantages and disadvantages relative to their predecessors.
Thanks for your time. If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy
Presentation given at the Vietnam Japan AI Community on 2019-05-26.
The presentation summarizes what I've learned about Regularization in Deep Learning.
Disclaimer: The presentation was given at a community event, so it wasn't thoroughly reviewed or revised.
Machine Learning - Accuracy and Confusion Matrix (Andrew Ferlitsch)
Abstract: This PDSG workshop introduces basic concepts on measuring accuracy of your trained model. Concepts covered are loss functions and confusion matrices.
Level: Fundamental
Requirements: No prior programming or statistics knowledge required.
In this tutorial, we will learn the following topics:
+ Voting Classifiers
+ Bagging and Pasting
+ Random Patches and Random Subspaces
+ Random Forests
+ Boosting
+ Stacking
Introduction to linear regression and the maths behind it, like the line of best fit and regression metrics. Other concepts include the cost function, gradient descent, overfitting and underfitting, and R-squared.
Basics of Decision Tree Learning. This slide includes the definition of a decision tree, a basic example, the basic construction of a decision tree, and a MATLAB example.
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ... (Simplilearn)
This Deep Learning interview questions and answers presentation will help you prepare for Deep Learning interviews. It is ideal for both beginners and professionals who are appearing for Deep Learning, Machine Learning, or Data Science interviews. Learn the most important Deep Learning interview questions and answers and know what will set you apart in the interview process.
Some of the important Deep Learning interview questions are listed below:
1. What is Deep Learning?
2. What is a Neural Network?
3. What is a Multilayer Perceptron (MLP)?
4. What is Data Normalization and why do we need it?
5. What is a Boltzmann Machine?
6. What is the role of Activation Functions in a neural network?
7. What is a cost function?
8. What is Gradient Descent?
9. What do you understand by Backpropagation?
10. What is the difference between Feedforward Neural Network and Recurrent Neural Network?
11. What are some applications of Recurrent Neural Network?
12. What are Softmax and ReLU functions?
13. What are hyperparameters?
14. What will happen if learning rate is set too low or too high?
15. What is Dropout and Batch Normalization?
16. What is the difference between Batch Gradient Descent and Stochastic Gradient Descent?
17. Explain Overfitting and Underfitting and how to combat them.
18. How are weights initialized in a network?
19. What are the different layers in CNN?
20. What is Pooling in CNN and how does it work?
Simplilearn’s Deep Learning course will transform you into an expert in deep learning techniques using TensorFlow, the open-source software library designed to conduct machine learning and deep neural network research. With our deep learning course, you’ll master deep learning and TensorFlow concepts, learn to implement algorithms, build artificial neural networks, and traverse layers of data abstraction to understand the power of data, preparing you for your new role as a deep learning scientist.
Why Deep Learning?
TensorFlow is one of the most popular software platforms used for deep learning and contains powerful tools to help you build and implement artificial neural networks.
Advancements in deep learning are being seen in smartphone applications, creating efficiencies in the power grid, driving advancements in healthcare, improving agricultural yields, and helping us find solutions to climate change.
There is booming demand for skilled deep learning engineers across a wide range of industries, making this deep learning course with TensorFlow training well-suited for professionals at the intermediate to advanced level of experience. We recommend this deep learning online course particularly for the following professionals:
1. Software engineers
2. Data scientists
3. Data analysts
4. Statisticians with an interest in deep learning
Learn more at: https://www.simplilearn.com
Deep learning (also known as deep structured learning or hierarchical learning) is the application of artificial neural networks (ANNs) to learning tasks that contain more than one hidden layer. Deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, partially supervised or unsupervised.
Basic know-how of several techniques commonly used in deep learning and neural networks -- activation functions, cost functions, optimizers, regularization, parameter initialization, normalization, data handling, hyperparameter selection. Presented as lecture material for the course EE599 Deep Learning in Spring 2019 at the University of Southern California.
This presentation is Part 2 of my September Lisp NYC presentation on Reinforcement Learning and Artificial Neural Nets. We will continue from where we left off by covering Convolutional Neural Nets (CNN) and Recurrent Neural Nets (RNN) in depth.
Time permitting I also plan on having a few slides on each of the following topics:
1. Generative Adversarial Networks (GANs)
2. Differentiable Neural Computers (DNCs)
3. Deep Reinforcement Learning (DRL)
Some code examples will be provided in Clojure.
After a very brief recap of Part 1 (ANN & RL), we will jump right into CNN and their appropriateness for image recognition. We will start by covering the convolution operator. We will then explain feature maps and pooling operations and then explain the LeNet 5 architecture. The MNIST data will be used to illustrate a fully functioning CNN.
Next we cover Recurrent Neural Nets in depth and describe how they have been used in Natural Language Processing. We will explain why gated networks and LSTM are used in practice.
Please note that some exposure or familiarity with Gradient Descent and Backpropagation will be assumed. These are covered in the first part of the talk for which both video and slides are available online.
A lot of material will be drawn from the new Deep Learning book by Goodfellow & Bengio, as well as Michael Nielsen's online book on Neural Networks and Deep Learning and several other online resources.
Bio
Pierre de Lacaze has over 20 years of industry experience with AI and Lisp-based technologies. He holds a Bachelor of Science in Applied Mathematics and a Master’s Degree in Computer Science.
https://www.linkedin.com/in/pierre-de-lacaze-b11026b/
2. Agenda
❑ Introduction to Artificial Neural Networks.
❑ Training Deep Neural Nets.
❑ Convolutional Neural Networks.
❑ Recurrent Neural Networks.
❑ Reinforcement Learning.
3. Introduction to ANN
• First introduced in 1943 by Warren McCulloch and Walter Pitts.
• ANNs saw early successes until the 1960s.
• In the early 1980s there was a revival of interest in ANNs as new network architectures appeared.
• By the 1990s, powerful alternative Machine Learning techniques had overshadowed them.
4. Reasons why ANNs now have a much more profound impact
❑ There is now a huge quantity of data.
❑ The tremendous increase in computing power.
❑ The training algorithms have been improved.
❑ Theoretical limitations of ANNs have turned out to be benign.
❑ A virtuous circle of funding, progress, and products.
7. The Perceptron
• One of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt.
• It is based on a linear threshold unit (LTU):
z = w1·x1 + w2·x2 + ⋯ + wn·xn = wᵀ·x
hw(x) = step(z) = step(wᵀ·x)
8. Multioutput perceptron
❑ A Perceptron with two inputs and three outputs.
Note: there are no hidden layers in a Perceptron.
9. Training Algorithm
While epoch produces an error
    Present network with next inputs from epoch
    Err = T - O
    If Err <> 0 Then
        Wj_new = Wj_old + LR * Ij * Err
    End If
End While
• T: target (actual) output, O: predicted output
• LR: learning rate, Ij: the j-th input
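A minimal sketch of this training rule in Python/NumPy, trained here on the linearly separable AND function; the function names and data are illustrative, not from the original deck:

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1, 0)

def perceptron_train(X, T, lr=0.1, epochs=100):
    """Single LTU trained with the rule from slide 9: Wj <- Wj + LR * Ij * Err."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, t in zip(X, T):
            o = step(w @ x + b)      # O: predicted output
            err = t - o              # Err = T - O
            if err != 0:
                w += lr * err * x    # weight update on the inputs
                b += lr * err        # bias treated as a weight on a constant input 1
                errors += 1
        if errors == 0:              # "while epoch produces an error"
            break
    return w, b

# Logical AND: linearly separable, so the Perceptron converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, T)
print(step(X @ w + b))               # -> [0 0 0 1]
```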
10. XOR classification problem and an MLP that solves it
XOR Function: X1 XOR X2 = (X1 AND NOT X2) OR (X2 AND NOT X1)
[Figure: a two-layer MLP that solves XOR, with inputs X1 and X2, hidden units Z1 and Z2, and output Y; the figure labels connection weights of 2 and thresholds of -1.]
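As a quick check, here is a small Python sketch of such a network. The specific weights and thresholds below are one possible assignment chosen for illustration and may differ from the ones in the original figure:

```python
def step(z):
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer implements (X1 AND NOT X2) and (X2 AND NOT X1).
    z1 = step(x1 - x2 - 0.5)
    z2 = step(x2 - x1 - 0.5)
    # Output layer implements Z1 OR Z2.
    return step(z1 + z2 - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_mlp(x1, x2))
# 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```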
11. Multi-Layer Perceptron and Backpropagation
• An MLP is composed of one input layer, one or more layers of LTUs (called hidden layers), and one final output layer.
• When an ANN has two or more hidden layers, it is called a deep neural network (DNN).
12. A modern MLP (including ReLU and softmax) for classification
13. Deep Learning Problems
• The vanishing gradients problem (or the related exploding gradients problem) makes lower layers very hard to train.
• Second, with such a large network, training would be extremely slow.
• Third, a model with millions of parameters would severely risk overfitting the training set.
14. Gradient Problems
• Gradients often get smaller as the algorithm progresses down to the lower layers.
• The Gradient Descent update leaves the lower layers' weights unchanged, and training never converges to a good solution. This is called the vanishing gradients problem.
• Conversely, the gradients can grow bigger and bigger, so many layers get insanely large weight updates and the algorithm diverges. This is the exploding gradients problem.
15. Solving the first problem (vanishing gradients)
A paper titled "Understanding the Difficulty of Training Deep Feedforward Neural Networks" by Xavier Glorot and Yoshua Bengio pointed at:
1. The then-popular logistic sigmoid activation function.
2. The then-standard weight initialization, using a normal distribution with a mean of 0 and a standard deviation of 1.
3. The hyperbolic tangent function has a mean of 0 and behaves slightly better than the logistic function in DNNs.
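For illustration, a sketch of the Xavier (Glorot) initialization proposed in that line of work, which scales the initialization standard deviation by the layer's fan-in and fan-out to keep activation variances roughly constant across layers; the function name is mine:

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng=None):
    # Glorot & Bengio (2010): std = sqrt(2 / (fan_in + fan_out))
    rng = rng or np.random.default_rng(42)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = glorot_normal(300, 100)
print(W.std())   # close to sqrt(2 / 400) ~= 0.0707
```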
16. Sigmoid activation function: σ(x) = 1 / (1 + e^(-x))
You can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0.
17. The problem with ReLU: max(0, z)
• It suffers from a problem known as dying ReLUs: during training, some neurons effectively die, meaning they stop outputting anything other than 0.
• In some cases, you may find that half of your network’s neurons are dead during training.
• To solve this problem, you may want to use a variant of the ReLU function, such as the leaky ReLU.
18. Leaky ReLU, RReLU, and PReLU
• In experiments, leaky variants always outperformed the strict ReLU activation function. In fact, setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01 (small leak).
• The same study also evaluated the randomized leaky ReLU (RReLU) and the parametric leaky ReLU (PReLU).
19. Exponential Linear Unit (ELU)
• It outperformed all the ReLU variants in the authors' experiments: training time was reduced and the neural network performed better on the test set.
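A small NumPy sketch of the activation variants discussed on slides 17-19; the function names are mine:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # A small slope alpha for z < 0 keeps "dead" units trainable.
    return np.where(z < 0, alpha * z, z)

def elu(z, alpha=1.0):
    # Smooth negative saturation at -alpha; mean output is closer to 0.
    return np.where(z < 0, alpha * (np.exp(z) - 1.0), z)

z = np.linspace(-3, 3, 7)
print(relu(z))
print(leaky_relu(z, alpha=0.2))
print(elu(z))
```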
20. Batch Normalization
• The technique consists of adding an operation in the model just before the activation function of each layer.
• It simply zero-centers and normalizes the inputs, then scales and shifts the result using two new parameters per layer (one for scaling, the other for shifting).
• In other words, this operation lets the model learn the optimal scale and mean of the inputs for each layer.
• γ is the scaling parameter for the layer; β is the shifting parameter (offset) for the layer.
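A sketch of the batch normalization forward pass at training time, assuming per-feature statistics computed over the mini-batch (the inference-time use of running averages is omitted here):

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    # Zero-center and normalize over the mini-batch, then let the model
    # scale (gamma) and shift (beta) the result, one pair per feature.
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta

X = np.random.randn(32, 4) * 5 + 10        # mini-batch of 32, 4 features
out = batch_norm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6))           # ~0 per feature
print(out.std(axis=0).round(3))            # ~1 per feature
```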
22. Reusing Pretrained Layers
• It is generally not a good idea to train a very large DNN from scratch.
• Instead, try to find an existing neural network that accomplishes a similar task and reuse the lower layers of this network.
• This is called transfer learning.
23. Example
• Suppose you have a DNN that was trained to classify pictures into 100 different categories, and you now want to train a DNN to classify specific types of vehicles.
• Freeze the lower layers' weights.
• Then tweak, drop, or replace the upper layers.
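A hedged Keras sketch of this recipe, assuming the original 100-class network was built as a Sequential model and saved at the placeholder path "my_model_100.h5" (exact calls vary by Keras version):

```python
from tensorflow import keras

# Load the existing network trained on the original 100-class task.
base_model = keras.models.load_model("my_model_100.h5")

# Reuse everything except the old output layer, then add a new head
# (e.g. 10 vehicle types).
new_model = keras.models.Sequential(base_model.layers[:-1])
new_model.add(keras.layers.Dense(10, activation="softmax"))

# Freeze the reused lower layers so only the new head trains at first.
for layer in new_model.layers[:-1]:
    layer.trainable = False

new_model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="sgd", metrics=["accuracy"])
```

After the new head has settled, the upper reused layers can be unfrozen and fine-tuned with a lower learning rate.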
25. Faster Optimizers
• Five ways to speed up training (and reach a better solution):
➢ Apply a good initialization strategy for the connection weights.
➢ Use a good activation function.
➢ Use Batch Normalization.
➢ Reuse parts of a pretrained network.
➢ Use a faster optimizer than regular Gradient Descent.
• The most popular ones: Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam optimization.
26. Momentum Optimization Algorithm
• Gradient Descent simply updates the weights θ by directly subtracting the gradient of the cost function J(θ) with regard to the weights (∇θJ(θ)) multiplied by the learning rate η (equation 1).
• Momentum optimization cares a great deal about what previous gradients were: it accumulates them in a momentum vector and updates the weights by simply subtracting this momentum vector.
• A new hyperparameter β, simply called the momentum, must be set between 0 and 1; a typical value is 0.9 (equation 2).
Gradient Descent (1): θ ← θ − η ∇θJ(θ)
Momentum Optimization (2): m ← β·m + η ∇θJ(θ); θ ← θ − m
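A toy NumPy sketch of the momentum update above on a simple quadratic cost; the function names are mine:

```python
import numpy as np

def momentum_step(theta, grad, m, lr=0.01, beta=0.9):
    # Equation 2: accumulate past gradients in m, then subtract m.
    m = beta * m + lr * grad
    theta = theta - m
    return theta, m

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
for _ in range(200):
    grad = 2 * theta                  # gradient of J(theta) = ||theta||^2
    theta, m = momentum_step(theta, grad, m)
print(theta)                          # approaches the minimum at [0, 0]
```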
28. Nesterov Momentum Optimization
▪ The only difference from vanilla Momentum optimization is that the gradient is measured at θ + βm rather than at θ.
▪ This small tweak works because in general the momentum vector will be pointing in the right direction.
▪ Here ∇1 represents the gradient of the cost function measured at the starting point θ, and ∇2 represents the gradient at the point located at θ + βm.
29. RMSProp Optimization
• It accumulates only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training).
• It does so by using exponential decay in the first step.
• It generally performs better than Momentum optimization and Nesterov Accelerated Gradients.
• In fact, it was the preferred optimization algorithm of many researchers until Adam optimization came around.
30. Adam Optimization
• Stands for adaptive moment estimation.
• Combines the ideas of Momentum optimization and RMSProp.
• Steps 3 and 4 are somewhat of a technical detail: since m and s are initialized at 0, they will be biased toward 0 at the beginning of training, so these two steps help boost m and s at the beginning of training.
• Typical initialization: β1 = 0.9, β2 = 0.999, η = 0.001, and the smoothing term ϵ is initialized to a tiny number such as 10^-8 to avoid division by 0.
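A sketch of the Adam update with the bias-correction steps 3 and 4 made explicit; for the toy quadratic below, a larger learning rate than the 0.001 default is used so it converges in a few hundred steps:

```python
import numpy as np

def adam_step(theta, grad, m, s, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # momentum-like first moment
    s = beta2 * s + (1 - beta2) * grad**2    # RMSProp-like second moment
    m_hat = m / (1 - beta1**t)               # steps 3-4: undo the bias
    s_hat = s / (1 - beta2**t)               #   toward 0 early in training
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
s = np.zeros_like(theta)
for t in range(1, 301):
    grad = 2 * theta                         # J(theta) = ||theta||^2
    theta, m, s = adam_step(theta, grad, m, s, t, lr=0.05)
print(theta)                                 # near [0, 0]
```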
33. Learning rate techniques
❑ Predetermined piecewise constant learning rate: for example, set the learning rate to η0 = 0.1 at first, then to η1 = 0.001 after 50 epochs.
❑ Performance scheduling: measure the validation error every N steps (just like for early stopping) and reduce the learning rate by a factor of λ when the error stops dropping.
❑ Exponential scheduling: set the learning rate to a function of the iteration number t: η(t) = η0 · 10^(−t/r). This works great, but it requires tuning η0 and r. The learning rate will drop by a factor of 10 every r steps.
❑ Power scheduling: set the learning rate to η(t) = η0 · (1 + t/r)^(−c). The hyperparameter c is typically set to 1. This is similar to exponential scheduling, but the learning rate drops much more slowly.
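The two closed-form schedules above as small Python helpers (names are mine):

```python
def exponential_schedule(t, eta0=0.1, r=1000):
    # Drops by a factor of 10 every r steps.
    return eta0 * 10 ** (-t / r)

def power_schedule(t, eta0=0.1, r=1000, c=1):
    # Decays much more slowly than the exponential schedule.
    return eta0 * (1 + t / r) ** (-c)

for t in (0, 1000, 2000):
    print(t, exponential_schedule(t), power_schedule(t))
```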
34. Dropout
❑ It is a fairly simple algorithm: at every training step, every neuron (including the input neurons but excluding the output neurons) has a probability p of being temporarily "dropped out," meaning it will be entirely ignored during this training step, but it may be active during the next step.
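A sketch of dropout at training time, using the common "inverted dropout" variant that rescales kept activations by 1/(1−p) so nothing needs to change at inference (the original formulation instead scales weights at test time):

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    # Each unit is dropped with probability p during a training step.
    if not training:
        return activations
    rng = rng or np.random.default_rng()
    keep = rng.random(activations.shape) >= p
    return activations * keep / (1.0 - p)   # rescale the survivors

h = np.ones((2, 8))
print(dropout(h, p=0.5))
```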
35. Data Augmentation
❑ Consists of generating new training instances from existing ones (rotating, resizing, flipping, and cropping), artificially boosting the size of the training set.
❑ This reduces overfitting, making it a regularization technique. The trick is to generate realistic training instances.
36. Convolutional Neural Networks
❑ A convolutional neural network (or ConvNet) is a type of feed-forward artificial neural network.
❑ The architecture of a ConvNet is designed to take advantage of the 2D structure of an input image.
❑ A ConvNet is comprised of one or more convolutional layers (often with a pooling step) followed by one or more fully connected layers, as in a standard multilayer neural network.
37. How a CNN works
• For example, a ConvNet takes an image as input, which can then be classified as ‘X’ or ‘O’.
38. ConvNet Layers
▪ CONV layer: computes the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and the small region they are connected to in the input volume.
▪ RELU layer: applies an elementwise activation function, such as max(0, x) thresholding at zero. This leaves the size of the volume unchanged.
▪ POOL layer: performs a downsampling operation along the spatial dimensions (width, height).
▪ FC (fully connected) layer: computes the class scores, resulting in a volume of size [1x1xN], where each of the N numbers corresponds to a class score among the N categories.
39. Convolutional Layer - Filters
▪ The CONV layer’s parameters consist of a set of learnable filters.
▪ Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
▪ During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at each position.
40. Convolutional Layer - Filters (continued)
• Sliding the filter over the width and height of the input gives a 2-dimensional activation map that responds to that filter at every spatial position.
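A minimal NumPy sketch of this sliding dot product (stride 1, "valid" padding). Note that, like most deep learning libraries, it actually computes cross-correlation, i.e. the kernel is not flipped:

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the filter over the input and take a dot product at each
    # position, producing a 2-D activation map.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.random.rand(6, 6)
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])   # a hand-made edge filter
print(conv2d_valid(image, vertical_edge).shape)   # (4, 4) activation map
```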
45. Pool Layer
▪ The pooling layers downsample the previous layer's feature map.
▪ Their function is to progressively reduce the spatial size of the representation, to reduce the amount of parameters and computation in the network.
▪ The pooling layer often uses the max operation to perform the downsampling.
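A matching sketch of max pooling over non-overlapping 2x2 windows:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Keep only the strongest activation in each window, halving the
    # spatial dimensions for size = stride = 2.
    H, W = feature_map.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size,
                                 j*stride:j*stride+size]
            out[i, j] = window.max()
    return out

fm = np.arange(16).reshape(4, 4).astype(float)
print(max_pool(fm))   # [[ 5.  7.] [13. 15.]]
```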
47. Fully connected layer
❑ Fully connected layers are the normal flat feed-forward neural network layers.
❑ These layers may have a non-linear activation function or a softmax activation in order to predict classes.
❑ To compute the output, we simply rearrange the output matrices as a 1-D array (flattening).
48. Softmax operation
❑ A special kind of activation layer, usually applied to the outputs at the end of the FC layer.
❑ Can be viewed as a fancy normalizer (a.k.a. normalized exponential function).
❑ Produces a discrete probability distribution vector.
❑ Very convenient when combined with cross-entropy loss.
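A sketch of softmax (with the usual max-subtraction for numerical stability) combined with a cross-entropy loss; function names are mine:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is a discrete
    # probability distribution (non-negative, sums to 1).
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, target_index):
    return -np.log(probs[target_index])

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p, p.sum())              # probabilities summing to 1
print(cross_entropy(p, 0))     # loss when class 0 is the true label
```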
49. Recurrent Neural Networks
❑ Some problems require previous history/context in order to give proper output (speech recognition, stock forecasting, target tracking, etc.).
❑ One way to do that is to just provide all the necessary context in one "snapshot" and use standard learning.
➢ How big should the snapshot be? It varies for different instances of the problem.
✓ If the input sequences are of fixed length, or can easily be padded to a fixed length, they can be collapsed into a single input vector and handled by any of the standard pattern classification algorithms.
50. Sequential data
❑ There are many tasks that require learning a temporal sequence of events.
❑ These problems can be broken into 3 distinct types of tasks:
➢ Sequence Recognition: produce a particular output pattern when a specific input sequence is seen. Applications: sentiment analysis, handwriting recognition.
➢ Sequence Reproduction: generate the rest of a sequence when the network sees only part of the sequence. Applications: time series prediction (stock market, sun spots, etc.), language modeling.
➢ Temporal Association: produce a particular output sequence in response to a specific input sequence. Applications: machine translation, speech generation.
✓ Recurrent networks are flexible enough to solve these problems.
51. Recurrent Networks offer a lot of flexibility:
(1) Fixed-sized input to fixed-sized output (e.g. image classification).
(2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
(3) Sequence input (e.g. sentiment analysis, where a given sentence is classified as expressing positive or negative sentiment).
(4) Sequence input and sequence output (e.g. machine translation: an RNN reads a sentence in English and then outputs a sentence in French).
(5) Synced sequence input and output (e.g. video classification, where we wish to label each frame of the video).
52. Recurrent Neural Networks
❑ A recurrent neural network lets the network dynamically learn how much context it needs in order to solve the problem.
❑ An RNN is a multilayer NN with the previous set of hidden unit activations feeding back into the network along with the inputs.
❑ RNNs have a "memory" which captures information about what has been calculated so far.
53. Recurrent neural networks
❑ Parameter sharing makes it possible to extend and apply the model to examples of different lengths and to generalize across them.
❑ It means local connections are shared (same weights) across different temporal instances of the hidden units.
❑ If we had to define a different function Gt for each possible sequence length, each with its own parameters, we would not get any generalization to sequences of a size not seen in the training set.
54. Dynamic systems
❑ A means of describing how one state develops into another state over the course of time.
❑ Consider the classical form of a dynamical system: s_t = f_θ(s_{t−1})
✓ where s_t is the system state at time t and f_θ is a mapping function.
❑ The same parameters (the same function f_θ) are used for all time steps.
❑ Unfolding the flow graph of such a system makes this repeated application of f_θ explicit.
55. Dynamic systems
❑ Now consider a dynamical system driven by an external signal x_t: s_t = f_θ(s_{t−1}, x_t).
❑ The state s_t now contains information about the whole past sequence.
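A sketch of this driven dynamical system as a simple RNN forward pass, with the same parameters shared across all time steps; shapes and names are illustrative:

```python
import numpy as np

def rnn_forward(xs, s0, Wx, Ws, b):
    # Unfold s_t = f_theta(s_{t-1}, x_t) over the sequence, sharing
    # (Wx, Ws, b) across every time step.
    s = s0
    states = []
    for x in xs:
        s = np.tanh(Wx @ x + Ws @ s + b)
        states.append(s)
    return states

rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]   # a sequence of 5 inputs
Wx = rng.normal(size=(4, 3))
Ws = rng.normal(size=(4, 4))
states = rnn_forward(xs, np.zeros(4), Wx, Ws, np.zeros(4))
print(len(states), states[-1].shape)          # 5 states, each of shape (4,)
```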
57. Cost function
❑ The total loss for a given input/target sequence pair (x, y), measured in cross entropy:
L(y, ŷ) = Σt Lt = −Σt yt log ŷt
• where yt is the category that should be associated with time step t in the output sequence, and ŷt is the predicted output.
58. Computing the gradient in RNNs
Using generalized back-propagation one can obtain the so-called Back-Propagation Through Time (BPTT) algorithm. We can then iterate backwards in time to back-propagate gradients through time, from t = T − 1 down to t = 1, noting that st (for t < T) has as descendants both ot and st+1.
59. Exploding or vanishing gradients
❑ In recurrent nets (as in very deep nets), the final output is the composition of a large number of non-linear transformations.
❑ Even if each of these non-linear transformations is smooth, their composition might not be.
❑ The derivative (i.e. the Jacobian matrix) through the whole composition will tend to be either very small or very large.
❑ Example: suppose all factors in the product are scalars with the same value α. As the number of time steps T goes to ∞, α^T → ∞ if α > 1 and α^T → 0 if α < 1.
60. Gradient clipping
❑ Once the gradient value grows extremely large, it causes an overflow (i.e. NaN) which is easily detectable at runtime.
❑ A simple heuristic solution clips gradients whenever they explode: whenever they reach a certain threshold, they are set back to a small number (see the sketch below).
[Figure: error surface of a single-hidden-unit RNN, illustrating the cliff that clipping protects against.]
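A sketch of norm-based clipping, one common form of this heuristic:

```python
import numpy as np

def clip_gradient(grad, threshold=1.0):
    # If the gradient norm exceeds the threshold, rescale the gradient
    # back to the threshold while keeping its direction.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([300.0, -400.0])              # norm 500: "exploded"
print(clip_gradient(g, threshold=5.0))     # [3. -4.], norm rescaled to 5
```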
61. Facing the vanishing gradient problem
❑ Echo State Networks (ESN)
❑ Long delays
❑ Leaky Units
❑ Gated Recurrent Neural Networks
62. Echo State Networks (ESN)
❑ How do we set the input and recurrent weights so that a rich set of histories can be represented in the recurrent neural network state?
❑ Answer: make the dynamical system associated with the recurrent net nearly be on the edge of stability, i.e., more precisely, with values around 1 for the leading eigenvalue of the Jacobian of the state-to-state transition function.
❑ ESNs fix the weights of the input→hidden and hidden→hidden connections at carefully chosen random values to make the Jacobians slightly contractive. This is achieved by making the leading eigenvalue λ of the weight matrix large but slightly less than 1.
❑ ESNs learn only the hidden→output connections.
63. Skip Connections (Long Delays)
❑ Adding longer-delay connections allows us to connect past states to future states through short paths.
❑ If we have a connection at every time step, the gradients vanish or explode over T time steps as O(λ^T).
❑ Instead, if we have recurrent connections with a time-delay of D, gradients grow as O(λ^(T/D)), vanishing far more slowly, though they may still explode at large T.
❑ Because the number of effective steps is T/D, this allows the learning algorithm to capture longer dependencies.
64. Gated Recurrent Neural Networks
❑ GRNNs are a special kind of RNN, capable of learning long-term dependencies by having more persistent memory. Two popular architectures:
➢ Long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997].
➢ Gated recurrent unit (GRU) [Cho et al., 2014].
❑ Applications: handwriting recognition (Graves et al., 2009), speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), handwriting generation (Graves, 2013), machine translation (Sutskever et al., 2014a), image-to-text conversion (captioning) (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et al., 2015b), and parsing (Vinyals et al., 2014a).
65. Long Short-Term Memory (LSTM)
❑ Standard RNNs have a very simple repeating module structure, such as a single tanh layer.
❑ LSTMs also have this chain-like structure, but the repeating module has a different structure: instead of a single neural network layer, there are four, interacting in a very special way.
66. Generating image captions
❑ Vinyals et al., "Show and Tell: A Neural Image Caption Generator", arXiv 2014.
❑ Uses a CNN as an image encoder to transform the image into a fixed-length vector.
❑ This vector is used as the initial hidden state of a "decoder" RNN that generates the target sequence.
67. Translating videos to sentences
❑ Venugopalan et al., arXiv 2014.
❑ The challenge is to capture the joint dependencies of a sequence of frames and a corresponding sequence of words.
68. Reinforcement Learning
❑ One of the most exciting fields of Machine Learning today, and also one of the oldest.
❑ It has been around since the 1950s, producing many interesting applications over the years, in particular in games (e.g., TD-Gammon, a Backgammon-playing program).
❑ A revolution took place in 2013, when researchers from an English startup called DeepMind demonstrated a system that could learn to play just about any Atari game from scratch.
❑ DeepMind was bought by Google for over 500 million dollars in 2014.
69. Learning to Optimize Rewards
❑ In Reinforcement Learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards.
❑ Its objective is to learn to act in a way that will maximize its expected long-term rewards.
❑ The agent acts in the environment and learns by trial and error to maximize its pleasure and minimize its pain.
71. Policy Search
❑ The algorithm used by the software agent to determine its actions is called its policy.
❑ For example, the policy could be a neural network taking observations as inputs and outputting the action to take.
72. Stochastic policy
❑ The policy can be any algorithm you can think of, and it does not even have to be deterministic.
❑ For example, consider a robotic vacuum cleaner whose reward is the amount of dust it picks up in 30 minutes. Its policy could be to move forward with some probability p every second, or randomly rotate left or right with probability 1 − p.
❑ The rotation angle would be a random angle between −r and +r. Since this policy involves some randomness, it is called a stochastic policy.
73. Introduction to OpenAI Gym
❑ One of the challenges of Reinforcement Learning is that in order to train an agent, you first need to have a working environment.
❑ If you want to program an agent that will learn to play an Atari game, you will need an Atari game simulator.
❑ If you want to program a walking robot, then the environment is the real world and you can directly train your robot in that environment.
74. Example of an environment
❑ The CartPole environment: a 2D simulation in which a cart can be accelerated left or right in order to balance a pole placed on top of it.
75. Neural Network Policies
❑ In the case of the CartPole environment, there are just two possible actions (left or right), so the policy network needs only one output: the probability of action 0 (left).
❑ For example, if it outputs 0.7, then we will pick action 0 with 70% probability, and action 1 with 30% probability.
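A hedged sketch of one CartPole episode with an untrained stochastic policy of exactly this form (one sigmoid output giving the probability of action 0). It assumes the older Gym API, where reset() returns just the observation and step() returns a 4-tuple; newer versions return (obs, info) and a 5-tuple:

```python
import numpy as np
import gym   # OpenAI Gym; API details vary between versions

env = gym.make("CartPole-v1")
obs = env.reset()

# A tiny hand-rolled "policy network": one linear layer whose sigmoid
# output is the probability of pushing left. Weights are random, i.e.
# this policy is untrained.
rng = np.random.default_rng(42)
w = rng.normal(size=4)

total_reward = 0.0
done = False
while not done:
    p_left = 1.0 / (1.0 + np.exp(-obs @ w))   # e.g. 0.7 -> action 0 with 70%
    action = 0 if rng.random() < p_left else 1
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)
```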
76. Markov Decision Processes
❑ In the early 20th century, the mathematician Andrey Markov studied stochastic processes with no memory, called Markov chains.
❑ Such a process has a fixed number of states, and it randomly evolves from one state to another at each step.
❑ The probability for it to evolve from a state s to a state s′ is fixed, and it depends only on the pair (s, s′), not on past states (the system has no memory).
❑ Markov chains can have very different dynamics, and they are heavily used in thermodynamics, chemistry, statistics, and much more.
77. MDP Example
❑ Suppose that the process starts in state s0, and there is a 70% chance that it will remain in that state at the next step.
❑ Eventually it is bound to leave that state and never come back, since no other state points back to s0.
❑ If it goes to state s1, it will then most likely go to state s2 (90% probability), then immediately back to state s1 (with 100% probability).
79. Example: Grid World
❑ Noisy movement: actions do not always go as planned.
❑ 80% of the time, the action North takes the agent North (if there is no wall there).
❑ 10% of the time, North takes the agent West; 10% East.
❑ If there is a wall in the direction the agent would have been taken, the agent stays put.
❑ The agent receives rewards each time step:
▪ a small "living" reward each step (can be negative);
▪ big rewards come at the end (good or bad).
❑ Goal: maximize the sum of rewards.
81. Markov Decision Processes
❑ An MDP is defined by:
▪ a set of states s ∈ S;
▪ a set of actions a ∈ A;
▪ a transition function T(s, a, s′): the probability that a from s leads to s′, i.e., P(s′ | s, a), also called the model or the dynamics;
▪ a reward function R(s, a, s′), sometimes just R(s) or R(s′);
▪ a start state;
▪ maybe a terminal state.
82. What is Markov about MDPs?
❑ "Markov" generally means that given the present state, the future and the past are independent.
❑ For Markov decision processes, "Markov" means action outcomes depend only on the current state.
❑ This is just like search, where the successor function could only depend on the current state (not the history).
[Portrait: Andrey Markov (1856-1922)]
84. Policies
❑ In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from the start to a goal.
❑ For MDPs, we want an optimal policy π*: S → A.
▪ A policy π gives an action for each state.
▪ An optimal policy is one that maximizes expected utility if followed.
▪ An explicit policy defines a reflex agent.
[Figure: the optimal policy when R(s, a, s′) = −0.03 for all non-terminal states s.]
86. Utilities of Sequences
▪ What preferences should an agent have over reward sequences?
▪ More or less? [1, 2, 2] or [2, 3, 4]?
▪ Now or later? [0, 0, 1] or [1, 0, 0]?
87. Discounting
▪ It’s reasonable to maximize the sum of rewards.
▪ It’s also reasonable to prefer rewards now to rewards later.
▪ One solution: values of rewards decay exponentially; a reward is worth 1 now, γ one step from now, and γ² two steps from now.
88. Discounting
▪ How to discount? Each time we descend a level, we multiply in the discount once.
▪ Why discount? Sooner rewards probably do have higher utility than later rewards; it also helps our algorithms converge.
▪ Example with a discount of 0.5 (see the snippet below):
U([1, 2, 3]) = 1·1 + 0.5·2 + 0.25·3 = 2.75
U([3, 2, 1]) = 3·1 + 0.5·2 + 0.25·1 = 4.25, so U([1, 2, 3]) < U([3, 2, 1]).
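The same computation as a tiny helper:

```python
def discounted_utility(rewards, gamma=0.5):
    # U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ...
    return sum(r * gamma**t for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3]))   # 1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1]))   # 3 + 0.5*2 + 0.25*1 = 4.25
```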
89. Infinite Utilities?!
▪ Problem: what if the game lasts forever? Do we get infinite rewards?
▪ Solutions:
▪ Finite horizon (similar to depth-limited search): terminate episodes after a fixed number of steps T (e.g. a lifetime). This gives nonstationary policies (π depends on the time left).
▪ Discounting: use 0 < γ < 1. A smaller γ means a smaller "horizon", i.e. a shorter-term focus.
▪ Absorbing state: guarantee that for every policy, a terminal state will eventually be reached.