2. Machine Learning?
● Used when it is difficult to program the correct behaviour by hand.
● Program an algorithm to automatically learn from data or experience.
● All ML algorithms try to find some set of parameters that helps to solve some specific task of interest.
3. Everything is a Search Problem.
All (or most) ML algorithms are doing the same thing, in different ways. They are searching for something, something that helps us solve the task. That something is what we refer to as "parameters" or "weights".
But good parameters are closely related to good features, and features are human-designed. So all (or most) ML algorithms are searching only for "parameters", which means human involvement is inevitable.
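A minimal sketch of this "search" view (all numbers here are made up for illustration): gradient descent searching for a single weight w that maps x to y.

    import numpy as np

    # Toy "search for parameters": find the weight w that maps x to y.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = 2.0 * x                              # the parameter we hope to recover is 2.0

    w = 0.0                                  # start the search somewhere
    lr = 0.01                                # step size of the search
    for _ in range(200):
        grad = np.mean(2 * (w * x - y) * x)  # gradient of the mean squared error
        w -= lr * grad                       # move against the gradient
    print(w)                                 # ends up close to 2.0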
4. Represent the data (these are intuitions)
How are we able to classify a laptop as a laptop and a mobile as a mobile? Our brain represents something about the laptop as something (we don't know exactly what that is; say they are "signals" of some form). That is how we are able to classify these things.
Are we classifiers? I assume so. Every human can be considered a classifier. They classify the same things, but with different representations. I might classify a laptop as such because of its shape and colour; someone else might do so because of the screen size. So, representations differ.
5. Intelligence (one form) is learning representations.
Consider the following two questions. Which is easier to answer?
a.) What is the most common word in a Shakespeare drama?
b.) If Alice liked Harry Potter, will she like Hunger Games?
So the idea is that we need a good representation, one expressive enough to represent the data. (Think about dimensionality reduction, and why it works at least sometimes; see the sketch below.)
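A hedged illustration of why a compact representation can be enough: toy data with made-up sizes, projected down with scikit-learn's PCA (used here only as one example of dimensionality reduction).

    import numpy as np
    from sklearn.decomposition import PCA

    # Toy data: 100 points in 10 dimensions, but the real variation
    # lives along a single underlying direction plus a little noise.
    rng = np.random.default_rng(0)
    t = rng.normal(size=(100, 1))
    data = t @ rng.normal(size=(1, 10)) + 0.01 * rng.normal(size=(100, 10))

    pca = PCA(n_components=1)
    z = pca.fit_transform(data)           # a one-number representation per point
    print(pca.explained_variance_ratio_)  # ~[0.99...]: one dimension is enough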
8. Neural Networks
Another set of ML techniques used to find a set of parameters to perform the task of interest.
A simple neural network has only one hidden layer: the input is mapped to a hidden layer, and the hidden layer is mapped to the output layer. When we have more than one hidden layer, we call it a deep neural network (not deep learning).
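A minimal sketch of that one-hidden-layer mapping, with made-up sizes and random weights:

    import numpy as np

    # A simple neural network: input -> hidden -> output.
    def forward(x, W1, b1, W2, b2):
        h = np.tanh(x @ W1 + b1)  # input mapped to the hidden layer
        return h @ W2 + b2        # hidden layer mapped to the output layer

    rng = np.random.default_rng(0)
    x = rng.normal(size=(1, 4))                    # 4 input features
    W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)  # 8 hidden units
    W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)  # 1 output
    print(forward(x, W1, b1, W2, b2))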
9. So what is the difference between NN & conventional ML?
Isn't it better to do a task with the help of more workers? Think of using more CPUs or more power to do some task on a computer.
Relate these workers to "parameters". What if we can use or have more parameters to apply to the data we have? Isn't that better? More parameters help us build a better representation, which in turn helps perform the task of interest.
Most ML algorithms (I don't know if there is any exception) learn a set of "parameters" whose size is the same as the number of features we have.
10. Contd....
The number of parameters learned by a neural network is larger than in other algorithms; it depends on the number of hidden nodes you have, as the rough count below shows.
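A rough count, ignoring bias terms and using made-up sizes:

    # For the same 100 input features:
    n_features, n_hidden = 100, 50

    linear_model_params = n_features                   # one weight per feature
    mlp_params = n_features * n_hidden + n_hidden * 1  # input->hidden plus hidden->output
    print(linear_model_params)  # 100
    print(mlp_params)           # 5050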
In applications of "usual" machine learning, there is typically a strong focus on the feature engineering part; the model learned by an algorithm can only be as good as its input data. Of course, there must be sufficient discriminatory information in our dataset; however, the performance of machine learning algorithms can suffer substantially when the information is buried in meaningless features.
12. Deep learning?
Deep neural networks are hard to train because of the "vanishing gradient" problem (we will come to that later). The more layers we add, the harder it becomes to "update" our weights, because the signal becomes weaker and weaker. Since our network's weights can be terribly off in the beginning (random initialization), it can become almost impossible to parameterize a "deep" neural network with backpropagation.
13. Contd....
Deep learning can be described as the "clever" tricks or algorithms that help with the training of such "deep" neural network structures, which in turn act as feature detectors. The only algorithm (to my knowledge) that automatically learns features efficiently from data is the deep neural network.
i.e. Deep Neural Networks = Feature Detectors + Classifier
14. Training a Neural Network
Many techniques exist, but the most widely accepted is backpropagation, which is a form of gradient descent.
How well does your neural network perform? It's all about how good your loss function is. What is a loss function? Loss function = cost function. (Read up on the pros and cons of different loss functions; two common ones are sketched below.)
BP is a way to search for weights in a high-dimensional space (not a random search) that minimize our cost function.
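As a sketch, here are two common loss functions (the test values are made up):

    import numpy as np

    def mse(y, p):
        return np.mean((y - p) ** 2)  # typical for regression

    def binary_cross_entropy(y, p, eps=1e-12):
        p = np.clip(p, eps, 1 - eps)  # avoid log(0)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # typical for classification

    y = np.array([1.0, 0.0, 1.0])  # targets
    p = np.array([0.9, 0.2, 0.7])  # predictions
    print(mse(y, p), binary_cross_entropy(y, p))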
15. Backpropagation is a "clever" use of the chain rule
Assume f(x) and x(t) are univariate functions; then df/dt = (df/dx) · (dx/dt).
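A quick numeric check of the chain rule, for the made-up choice f(x) = x**2 and x(t) = sin(t):

    import numpy as np

    # df/dt should equal (df/dx) * (dx/dt) = 2*sin(t) * cos(t).
    t, eps = 0.7, 1e-6
    f = lambda t: np.sin(t) ** 2

    numeric = (f(t + eps) - f(t - eps)) / (2 * eps)  # direct estimate of df/dt
    chained = 2 * np.sin(t) * np.cos(t)              # chain-rule answer
    print(numeric, chained)                          # the two values agree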
19. Deep learning for NLP
All or most state-of-the-art models in NLP are based on deep learning. The reason is that anyone can do NLP with the help of deep learning. That said, it is always a plus to have linguistic knowledge, but one doesn't have to worry about the grammatical patterns in a language, because we have deep nets to capture features automatically for us.
The most widely used models are RNNs, CNNs, Variational Autoencoders, etc.
20. Used CNN. Widely accepted paper: "Text Understanding from Scratch"
27. Recurrent Neural Networks
Sequence-to-sequence learning is a very complicated task, because the length of the sequence may vary and the model needs to have "memory".
Markov models are mainly used for this task, but they have the constraint of a specific window size (looking back some x steps). They generate a lot of alternatives and score them, e.g. {I have, I had, I has, me have, me had}.
An RNN does not have a window constraint (you can add one if you want).
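A minimal sketch of one recurrent step (names and sizes are made up). The same weights are reused at every position, and the hidden state h carries the "memory", so any sequence length works:

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, b):
        return np.tanh(x_t @ W_xh + h_prev @ W_hh + b)

    rng = np.random.default_rng(0)
    W_xh, W_hh, b = rng.normal(size=(3, 5)), rng.normal(size=(5, 5)), np.zeros(5)

    h = np.zeros(5)                       # h summarizes everything seen so far
    for x_t in rng.normal(size=(10, 3)):  # a sequence of 10 inputs; no fixed window
        h = rnn_step(x_t, h, W_xh, W_hh, b)
    print(h)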
34. Vanishing Gradient
Assume a network with two hidden layers and the other parameters fixed:
‖δ1‖ = 0.07… and ‖δ2‖ = 0.31…
Assume a network with 3 hidden layers:
0.012, 0.060, and 0.283
Assume a network with 4 hidden layers:
0.003, 0.017, 0.070, and 0.285
The pattern holds: early layers learn slower than later layers.
36. Why?
Assume a very basic neural network and apply backpropagation to it.
39. Contd....
Extreme case: assume the derivative is 0.9 (for tanh). What if the sequence is 50 steps long?
0.9 * 0.9 * … * 0.9 = 0.9 ** 50 = 0.00515377520732012
The gradients start vanishing. This is called the vanishing gradient problem.
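The same arithmetic, repeated for a few chain lengths:

    # Multiplying by a derivative below 1 shrinks the gradient exponentially.
    for steps in (10, 25, 50):
        print(steps, 0.9 ** steps)  # 0.348..., 0.071..., 0.005...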
Will gradients explode? Will we get large values?
40. Exploding gradients
Assume the derivative is more than 1 (this cannot come from the derivative of these nonlinear functions themselves: tanh's derivative is at most 1 and sigmoid's at most 0.25, so large factors come from the weights). What happens over a 50-step-long sequence?
1.1 ** 50 = 117.39085287969579
To my knowledge this happens only when your weight initialization is bad. Solution: clip the gradients. Do not let your gradients explode. It is a hack, but it works and is widely used in practice.
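As a sketch of that hack, PyTorch ships a gradient-clipping utility; the tiny model and loss below are made up for illustration.

    import torch

    model = torch.nn.Linear(10, 1)                  # a made-up toy model
    loss = model(torch.randn(4, 10)).pow(2).mean()  # a made-up loss
    loss.backward()

    # Rescale all gradients so their combined norm is at most 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)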
But what about vanishing gradients? What if we use an "identity" function instead of tanh or sigmoid?
43. LSTM intuitions
Do we have to write everything?
What if we keep everything in the hidden units? Will we be able to decode the useful information when we need it?
What if we avoid the nonlinearity and allow the gradient to flow back?
44. LSTM (Long Short-Term Memory)
Writing (the input gate decides what gets written to memory)
Reading (the output gate decides what gets read out)
Forgetting (the forget gate decides what gets erased)
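A minimal sketch of one LSTM step built around these three operations (all weight names and sizes are made up):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # f gates forgetting, i and g gate writing, o gates reading.
    def lstm_step(x, h, c, W, U, b):
        f = sigmoid(x @ W["f"] + h @ U["f"] + b["f"])  # forgetting
        i = sigmoid(x @ W["i"] + h @ U["i"] + b["i"])  # writing: how much
        g = np.tanh(x @ W["g"] + h @ U["g"] + b["g"])  # writing: what
        o = sigmoid(x @ W["o"] + h @ U["o"] + b["o"])  # reading
        c = f * c + i * g   # cell state update is mostly additive,
        h = o * np.tanh(c)  # which lets gradients flow back more easily
        return h, c

    rng = np.random.default_rng(0)
    d, k = 3, 5  # input and hidden sizes (arbitrary)
    W = {g: rng.normal(size=(d, k)) for g in "figo"}
    U = {g: rng.normal(size=(k, k)) for g in "figo"}
    b = {g: np.zeros(k) for g in "figo"}

    h, c = np.zeros(k), np.zeros(k)
    for x in rng.normal(size=(4, d)):
        h, c = lstm_step(x, h, c, W, U, b)
    print(h)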