Neural Networks for Natural Language Processing - 1
A general overview
Machine Learning?
● Used when it is difficult to program the correct behaviour by hand.
● Program an algorithm to automatically learn from data or experience.
● All ML algorithms try to find some set of parameters that helps to solve some specific task of interest.
Everything is a Search Problem
All (or most) ML algorithms do the same thing in different ways: they search for something, something that helps us solve the task. That something is what we refer to as "parameters" or "weights".
But good parameters are closely related to good features, and features are human-designed. So most ML algorithms search only for "parameters", which means human involvement is inevitable.
Represent the data (these are intuitions)
How are we able to classify a laptop as a laptop and a mobile as a mobile? Our brain represents something about a laptop as something (we don't know exactly what that is; say it is some form of "signals"). That is why we are able to classify these things.
Are we classifiers? I assume so. Every human can be considered a classifier. We classify the same things, but with different representations: I might classify a laptop as such because of its shape and colour, while someone else might do so because of the screen size. So the representations differ.
Intelligence (one form of it) is learning representations
Consider the following two questions. Which is easier to answer?
a.) What is the most common word in a Shakespeare drama?
b.) If Alice liked Harry Potter, will she like The Hunger Games?
So the idea is that we need a good representation, one able to represent the data well. (Think about dimensionality reduction and why it works, at least sometimes; a sketch follows.)
Representation examples
Word embeddings (hard to construct by hand; learned from data)
Image representations
Neural Networks
Another class of ML techniques, used to find a set of parameters to perform the task of interest.
A simple neural network has only one hidden layer: the input is mapped to a hidden layer, and the hidden layer to the output layer.
When we have more than one hidden layer, we call it a deep neural network. (Not the same thing as deep learning.)
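A minimal numpy sketch of that structure (layer sizes and names are my own, not from the slides):

```python
# A simple neural network: one hidden layer, input -> hidden -> output.
import numpy as np

def forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)      # input mapped to the hidden layer
    y = W2 @ h + b2               # hidden layer mapped to the output
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # 3 input features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 2 outputs
print(forward(x, W1, b1, W2, b2))
# Stacking more than one hidden layer gives a "deep" neural network.
```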
So what is the difference between NN & conventional ML?
Isn't it better to do a task with the help of more workers? Think of using more CPUs or more power to do some task on a computer.
Relate those workers to "parameters". What if we can use or have more parameters for the task on the data we have? Isn't that better? More parameters help us build a better representation, which in turn helps perform the task of interest.
Most ML algorithms (I don't know if there is any exception) try to find a set of "parameters" whose size is the same as the number of features we have.
Contd.
The number of parameters learned by a neural network is larger than in other algorithms; it depends on the number of hidden nodes you have.
In applications of "usual" machine learning, there is typically a strong focus on the feature engineering part; the model learned by an algorithm can only be as good as its input data. Of course, there must be sufficient discriminatory information in our dataset; however, the performance of machine learning algorithms can suffer substantially when the information is buried in meaningless features.
Deep Neural Networks
Deep learning?
Deep neural networks are hard to train because of the "vanishing gradient" problem (we will come to that later):
the more layers we add, the harder it becomes to "update" our weights, because the signal becomes weaker and weaker. Since our network's weights can be terribly off in the beginning (random initialization), it can become almost impossible to parameterize a "deep" neural network with backpropagation.
Contd.
Deep learning can be described as the "clever" tricks or algorithms that help with training such "deep" neural network structures, which in turn act as feature detectors. The only algorithm (to my knowledge) that automatically learns features efficiently from data is the deep neural network.
i.e. Deep Neural Network = Feature Detectors + Classifier
Training a Neural Network
Many techniques exist, but the most widely accepted is backpropagation, which is a form of gradient descent.
How well does your neural network perform? It is largely about how good your loss function is. What is a loss function? Loss function = cost function. (Read up on the pros and cons of different loss functions.)
Backpropagation is a way to search for weights in a high-dimensional space (not a random search) that minimize our cost function.
Backpropagation is a "clever" use of the chain rule
Assume f(x) and x(t) are univariate functions; then df/dt = (df/dx) · (dx/dt).
Assume a simple Loss Function
1-layer NN
Use calculus to find parameters “w” and “b”
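A minimal sketch of that calculation, assuming a squared loss and a sigmoid unit (the slides show this as worked equations; the concrete loss, data, and numbers here are mine): the chain rule gives the gradients for w and b, and gradient descent does the search.

```python
# 1-layer NN y = sigmoid(w*x + b), squared loss L = 0.5*(y - t)^2.
# Gradients for w and b come from the chain rule; then gradient descent.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, t = 2.0, 1.0          # one training example (input, target)
w, b = 0.1, 0.0          # the parameters we are searching for
lr = 0.5                 # learning rate

for step in range(100):
    y = sigmoid(w * x + b)
    # chain rule: dL/dw = dL/dy * dy/dz * dz/dw
    dL_dy = y - t
    dy_dz = y * (1 - y)          # derivative of the sigmoid
    dL_dw = dL_dy * dy_dz * x
    dL_db = dL_dy * dy_dz
    w -= lr * dL_dw              # move against the gradient
    b -= lr * dL_db

print(w, b, 0.5 * (sigmoid(w * x + b) - t) ** 2)  # loss is now small
```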
Reverse Mode Differentiation (Chain Rule Application)
Deep Learning for NLP
All or most state-of-the-art models in NLP are based on deep learning.
The reason is that anyone can do NLP with the help of deep learning. That said, it is always a plus to have linguistic knowledge, but one doesn't have to worry about the grammatical patterns in a language, because we have deep nets to capture features for us automatically.
The most widely used models are RNNs, CNNs, variational autoencoders, etc.
CNNs for text: see the widely cited paper "Text Understanding from Scratch".
Word2vec (Mikolov, Google)
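A hedged sketch of training word2vec with gensim (the library choice and the toy corpus are mine; gensim >= 4.0 API):

```python
# Train skip-gram word2vec on a toy corpus and inspect the embeddings.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "laptop", "has", "a", "screen"],
]
model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=1)            # sg=1 -> skip-gram
print(model.wv["king"].shape)                  # a learned 50-D representation
print(model.wv.similarity("king", "queen"))    # similar contexts -> similar vectors
```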
Skip-thought vectors
Neural Storyteller
Machine Translation (attention mechanism)
Breakthrough paper for attention modelling
Show, Attend and Tell
Attention
Recurrent Neural Networks
Sequence-to-sequence learning is a very complicated task, because the length of the sequence may vary and the model needs to have "memory".
Markov models are mainly used for this task, but they are constrained to a specific window size (look back some x steps). They generate many alternatives, e.g. {I have, I had, I has, me have, me had}, and score them.
An RNN does not have a window constraint (though you can impose one if you want).
Contd.
Forward Pass
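A minimal numpy sketch of that forward pass for a vanilla RNN (weight names and sizes are my own): the same weights are applied at every time step, and the hidden state h carries the "memory" along the sequence.

```python
import numpy as np

def rnn_forward(xs, Wxh, Whh, Why, bh, by):
    h = np.zeros(Whh.shape[0])
    ys = []
    for x in xs:                                  # one step per token
        h = np.tanh(Wxh @ x + Whh @ h + bh)       # new hidden state
        ys.append(Why @ h + by)                   # output at this step
    return ys, h

rng = np.random.default_rng(0)
xs = [rng.normal(size=5) for _ in range(4)]       # a length-4 sequence
Wxh, Whh = rng.normal(size=(8, 5)), rng.normal(size=(8, 8))
Why = rng.normal(size=(3, 8))
ys, h = rnn_forward(xs, Wxh, Whh, Why, np.zeros(8), np.zeros(3))
print(len(ys), ys[0].shape)                       # 4 outputs of size 3
```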
Backpropagation
This also makes sense (but is confusing to me).
Two Issues
Vanishing gradient
Large vocabulary (the softmax is computationally costly; see the sketch below)
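On the second issue: the softmax denominator sums over the whole vocabulary at every time step, so the cost is O(|V|) per step. A toy sketch (the vocabulary size and random scores are mine):

```python
# Why a large vocabulary hurts: normalizing requires touching every word.
import numpy as np

V = 100_000
logits = np.random.default_rng(0).normal(size=V)   # one score per word
shifted = logits - logits.max()                    # for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum()    # O(|V|) work, every step
print(probs.sum())                                 # 1.0
```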
Vanishing Gradient
Assume a network with two hidden layers and the other parameters fixed (here δl is the gradient-based "speed of learning" of hidden layer l):
‖δ1‖ = 0.07… and ‖δ2‖ = 0.31…
Assume a network with 3 hidden layers:
0.012, 0.060, and 0.283
Assume a network with 4 hidden layers:
0.003, 0.017, 0.070, and 0.285
The pattern holds: early layers learn more slowly than later layers.
Why?
Assume a very basic neural network.
Backpropagation
Derivatives of activation functions
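A quick numerical check of those derivatives (my own illustration): sigmoid′(z) peaks at 0.25 and tanh′(z) at 1.0, both at z = 0, and backpropagation multiplies one such factor per layer, so the product shrinks fast.

```python
import numpy as np

z = np.linspace(-5, 5, 1001)
sig = 1 / (1 + np.exp(-z))
dsig = sig * (1 - sig)            # sigmoid'(z) = sig(z) * (1 - sig(z))
dtanh = 1 - np.tanh(z) ** 2       # tanh'(z) = 1 - tanh(z)^2
print(dsig.max())                 # ~0.25
print(dtanh.max())                # ~1.0
```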
Contd.
Extreme case: assume the derivative is 0.9 (for tanh). What if the sequence is 50 steps long?
0.9 * 0.9 * … * 0.9 = 0.9 ** 50 = 0.00515377520732012
The gradients start vanishing. This is called the vanishing gradient problem.
Will gradients explode? Will we get large values?
Exploding Gradients
Assume the effective per-step derivative is more than 1 (the activation derivative itself cannot exceed 1 for tanh or sigmoid, but the product with the weights can be). What does a 50-step-long sequence give?
1.1 ** 50 = 117.39085287969579. To my knowledge this happens mainly when your weight initialization is bad. Solution: clip the gradients; do not let them explode. It is a hack, but it works and is widely used in practice (a sketch follows).
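A minimal sketch of clipping by global norm (my own numpy illustration; PyTorch ships the same idea as torch.nn.utils.clip_grad_norm_): if the gradient's overall norm exceeds a threshold, rescale it, keeping the direction but capping the magnitude.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    total = np.sqrt(sum((g ** 2).sum() for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

grads = [np.full((3, 3), 100.0)]         # an "exploded" gradient
clipped = clip_by_global_norm(grads)
print(np.sqrt((clipped[0] ** 2).sum()))  # 5.0: capped at max_norm
```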
But what about vanishing gradients? What if we used an "identity" function instead of tanh or sigmoid?
Where did LSTM come from?
What is an LSTM?
LSTM intuitions
Do we have to write everything?
What if we keep everything in the hidden units? Will we be able to decode the useful information back out when we need it?
What if we avoid the nonlinearity and allow the gradient to flow back?
LSTM (Long Short-Term Memory)
Writing
Reading
Forgetting
How do the gates affect the cell state?
Why does LSTM work?
Some reasons:
LSTM - Forward Propagation
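A minimal sketch of one forward step of an LSTM cell (the fused-gate layout and names are my own choice, not the slides'): the input gate i controls writing, the output gate o controls reading, and the forget gate f controls forgetting. Note that the cell update c = f*c + i*g is additive, which is what lets the gradient flow back.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    z = W @ np.concatenate([x, h]) + b        # all four gates at once
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)                            # candidate content to write
    c = f * c + i * g        # forget old content, write new content
    h = o * np.tanh(c)       # read out (expose) part of the cell state
    return h, c

rng = np.random.default_rng(0)
nx, nh = 5, 8
W = rng.normal(size=(4 * nh, nx + nh))
b = np.zeros(4 * nh)
h, c = np.zeros(nh), np.zeros(nh)
h, c = lstm_step(rng.normal(size=nx), h, c, W, b)
print(h.shape, c.shape)
```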
LSTM - Backpropagation
Other Ways
Input reversal in the RNN encoder-decoder framework
Use Leaky ReLU
GRU (Gated Recurrent Units)
Language Modeling
Neural Machine Translation
Trending
RNN/LSTM with attention mechanism
Facebook's ConvNet for machine translation
Neural Turing Machines and Memory Networks
GANs, CNNs to ResNets, Highway Networks, coupled LSTMs, etc.
References