Chatbot
Sequence to Sequence Learning
29 Mar 2017
Presented By:
Jin Zhang
Yang Zhou
Fred Qin
Liam Bui
Overview
Network Architecture
Loss Function
Improvement Techniques
Chatbot Concept
Deep Learning for Chatbot: http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/
LSTM for Language Model
• Language Model
Predicts the next word given the previous words
• RNN
In practice, unable to learn long-term dependencies, so not well suited for language modelling
• LSTM
Three sigmoid gates control the information flow
Understanding LSTM Networks: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM for Language Model
• First step: decide which previous information to throw away from the cell state
• Second step: decide what new information to store in the cell state
- A sigmoid layer decides which values to update
- A tanh layer creates new candidate values C~t that could be added to the state
- Combine these two to create an update to the state
• Third step: filter Ct and output only the parts we want to output
Understanding LSTM Networks: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
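To make the three gates concrete, here is a minimal NumPy sketch of a single LSTM cell step (a sketch for illustration, not the presenters' code); the weight/bias names and the toy sizes are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. Each W[k] maps the concatenation [h_prev; x_t] to a gate pre-activation."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate: what to throw away from the cell state
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate: which values to update
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate values C~t that could be added to the state
    c_t = f_t * c_prev + i_t * c_tilde      # combine the two to update the cell state
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate: which parts of the state to expose
    h_t = o_t * np.tanh(c_t)                # filtered output
    return h_t, c_t

# Toy sizes, purely illustrative.
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((hidden_size, hidden_size + input_size)) for k in "fico"}
b = {k: np.zeros(hidden_size) for k in "fico"}
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.standard_normal(input_size), h, c, W, b)
```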
A Seq2Seq model comprises two language models:
• Encoder: a language model that encodes the input sequence into a fixed-length vector (the "thought vector")
• Decoder: another language model that looks at both the thought vector and the previously generated output to generate the next word
Sequence To Sequence Model
Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473
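To make the encoder/decoder split concrete, here is a minimal sketch using PyTorch's built-in LSTM; the vocabulary size, dimensions, and variable names are assumptions for illustration, not details from the slides. The encoder's final (h, c) state plays the role of the thought vector and initializes the decoder.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encoder: compress the input sequence into a final (h, c) state -- the "thought vector".
        _, thought = self.encoder(self.embed(src_ids))
        # Decoder: conditioned on the thought vector and the previous output tokens.
        dec_out, _ = self.decoder(self.embed(tgt_ids), thought)
        return self.out(dec_out)               # logits over the vocabulary at every step

model = Seq2Seq()
src = torch.randint(0, 1000, (2, 7))           # batch of 2 input sentences, 7 tokens each
tgt = torch.randint(0, 1000, (2, 5))           # previously generated tokens fed to the decoder
logits = model(src, tgt)                       # shape: (2, 5, vocab_size)
```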
Which “crane”?
I like crane because …
Sequence To Sequence Model
Sequence To Sequence Model
Sequence Model with Neural Network: https://indico.io/blog/sequence-modeling-neuralnets-part1/
Loss Function
Generating a word is a multi-class classification task over all possible words, i.e. the vocabulary:
W* = argmax_W P(W | previous words)
Example:
I always order pizza with cheese and ……
mushrooms 0.15
pepperoni 0.12
anchovies 0.01
….
rice 0.0001
and 1e-100
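A tiny illustration of W* = argmax_W P(W | previous words), reusing the probabilities listed above:

```python
# Next-word probabilities from the pizza example (illustrative numbers).
p_next = {"mushrooms": 0.15, "pepperoni": 0.12, "anchovies": 0.01, "rice": 0.0001, "and": 1e-100}
w_star = max(p_next, key=p_next.get)   # W* = argmax_W P(W | previous words)
print(w_star)                          # -> "mushrooms"
```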
Cross Entropy Loss:
Cross-Entropy: H(p, m) = −Σx p(x) log2 m(x)
Cross-Entropy for a sentence w1, w2, …, wn: −log2 m(w1, w2, …, wn), usually normalized by the sentence length n to give a per-word value (see the chain-rule expansion in the Appendix)
Evaluating Language Model: https://courses.engr.illinois.edu/cs498jh/Slides/Lecture04.pdf
Perplexity:
In practice, a variant called perplexity (perplexity = 2^cross-entropy) is usually used as the metric to evaluate language models.
• Cross-entropy can be seen as a measure of uncertainty
• Perplexity can be seen as the "number of choices"
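A small sketch of how per-word cross-entropy and perplexity would be computed from a model's step-by-step conditional probabilities; the probability values are made up for illustration.

```python
import math

# m(w_i | w_1 ... w_{i-1}) for each word of a sentence (illustrative values).
step_probs = [0.2, 0.5, 0.1, 0.4]

cross_entropy = -sum(math.log2(p) for p in step_probs) / len(step_probs)  # bits per word
perplexity = 2 ** cross_entropy                                           # "number of choices"
print(round(cross_entropy, 3), round(perplexity, 3))
```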
Cross-entropy loss vs. perplexity:
• Example: a fair six-sided die, with faces numbered 1 to 6, so we have
- Entropy: ~2.58 bits (log2 6)
- Perplexity: 6 choices
• Which statement do you prefer?
- The die has 6 faces
- The die has 2.58 bits of entropy
• We can see perplexity as the average number of choices at each step: the higher it is, the more "choices" of words the model has, and the more uncertain the language model is.
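The die example, worked out numerically (a fair six-sided die, each face with probability 1/6):

```python
import math

p = [1 / 6] * 6
entropy = -sum(pi * math.log2(pi) for pi in p)   # log2(6) ≈ 2.585 bits
perplexity = 2 ** entropy                        # back to 6 "choices"
print(round(entropy, 3), round(perplexity, 3))
```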
Problem:
- The last state of the encoder contains mostly information from the last elements of the input sequence
- Reversing the input sequence helps in some cases
How are you ?
I am fine .
Attention Mechanism:
- Allows each decoder stage to look at any of the encoder stages
- The decoder understands the input sentence better and attends to suitable positions when generating each word
Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473
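A minimal NumPy sketch of the attention idea: the current decoder state is scored against every encoder state, and the weighted average of the encoder states becomes the context for that decoding step. The dot-product scoring below is a simplification; the cited paper computes the scores with a small feed-forward network.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """decoder_state: (d,), encoder_states: (T, d). Returns (context, attention weights)."""
    scores = encoder_states @ decoder_state    # one score per encoder position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over encoder positions
    context = weights @ encoder_states         # weighted average of encoder states
    return context, weights

rng = np.random.default_rng(0)
enc = rng.standard_normal((4, 8))              # 4 input positions ("How are you ?"), dim 8
dec = rng.standard_normal(8)                   # current decoder state
context, weights = attend(dec, enc)            # weights show where the decoder "looks"
```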
BLEU score on an English-French translation corpus:
                        Seq2Seq    Seq2Seq with attention
Sentence length 30      13.93      21.50
Sentence length 50      17.82      28.45
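For reference, sentence-level BLEU can be computed with, for example, NLTK (assuming it is installed); a toy usage sketch:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["i", "am", "fine", "."]
candidate = ["i", "am", "fine", "."]
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(score)   # 1.0 for an exact match
```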
Problem:
- Maximizing the conditional probability at each stage might not lead to the maximum full-joint probability.
- Storing all possible generated sentences is not feasible due to resource limitations.
Beam Search:
- At each stage in the decoder, keep only the best M possible outputs.
Example, for the input "How are you ?":
- Possible output 1: "I am fine ." with conditional probabilities 0.6, 0.4, 1 at each step, giving a full-joint probability of 0.24
- Possible output 2: "Never been better" with conditional probabilities 0.4, 0.9, 1, giving a full-joint probability of 0.36
- … possible output M
Greedy decoding would commit to output 1 (0.6 > 0.4 at the first step), even though output 2 has the higher full-joint probability.
Sequence to Sequence Learning: https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
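A minimal beam-search sketch in plain Python; next_word_probs is a stand-in for the trained decoder, and the tiny hand-made distributions mirror the example above. With beam_size=1 it reduces to greedy decoding and returns "I am fine", while beam_size=2 also keeps "Never been better", which ends up with the higher joint probability.

```python
import math

def beam_search(next_word_probs, beam_size, max_len):
    """Keep only the best `beam_size` partial outputs at every decoding step."""
    beams = [([], 0.0)]                                  # (tokens, log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens and tokens[-1] == "</s>":          # finished hypothesis, keep as is
                candidates.append((tokens, logp))
                continue
            for word, p in next_word_probs(tokens).items():
                candidates.append((tokens + [word], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

def next_word_probs(tokens):
    """Toy decoder mirroring the slide's example; probabilities are illustrative."""
    table = {
        (): {"I": 0.6, "Never": 0.4},
        ("I",): {"am": 0.4},
        ("I", "am"): {"fine": 1.0},
        ("Never",): {"been": 0.9},
        ("Never", "been"): {"better": 1.0},
    }
    return table.get(tuple(tokens), {"</s>": 1.0})

print(beam_search(next_word_probs, beam_size=2, max_len=4))
```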
BLEU score on an English-French translation corpus (max sentence length 50):
Seq2Seq with beam size = 1     28.45
Seq2Seq with beam size = 12    30.59
…
APPENDIX
Cross Entropy Loss:
Cross-Entropy:
Cross-Entropy for a sentence w1, w2, …, wn:
= −log2 m(x*)
= −log2 m(w1*, …, wn*)
= −[ log2 m(wn* | w1*, …, wn−1*) + log2 m(wn−1* | w1*, …, wn−2*) + … + log2 m(w1*) ]
i.e. the sum of the log-probabilities over the decoding steps
1. Reinforcement Learning:
Longer sentences are usually more interesting, so we can use sentence length as a reward to further train the model:
• Action: word choice
• State: the sentence generated so far
• Reward: sentence length
2. Adversarial Training:
Make generated sentences look real using adversarial training:
• Generative model: generates responses based on the inputs
• Discriminator model: tries to tell whether a sentence is a true response or a generated one
• Objective: train the generative model to "fool" the discriminator
Adversarial Learning for Neural Dialogue Generation: https://arxiv.org/abs/1701.06547
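A sketch of the policy-gradient (REINFORCE) idea with sentence length as the reward; the log-probabilities below are stand-ins for what the decoder would produce while sampling a response, not actual training code.

```python
import torch

# Log-probability of each word the decoder sampled for one response (illustrative values).
log_probs = torch.tensor([-0.5, -1.2, -0.3, -0.9], requires_grad=True)
reward = float(len(log_probs))     # reward = sentence length (longer responses preferred)

# REINFORCE: increase the log-probability of the sampled words in proportion to the reward.
loss = -reward * log_probs.sum()
loss.backward()                    # in real training, these gradients update the generator
print(log_probs.grad)
```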
Editor's Notes
1. First of all, let's see a demo. This is a customer-service chatbot demo: it can find an order as easily as chatting with a person. That is why chatbots are such a hot topic; many companies are working on various kinds of chatbots, including travel search engines, personal health companions, and so on. There are three dimensions along which to compare chatbots.
Retrieval-based vs. generative: retrieval-based models (easier) don't generate any new text; they pick a response from a repository of predefined responses based on the input and context, using a heuristic that can be as simple as a rule-based expression match or as complex as an ensemble of machine-learning classifiers. Because the responses are handcrafted, retrieval-based methods don't make grammatical mistakes, but they may be unable to handle unseen cases for which no appropriate predefined response exists, and for the same reason they can't refer back to contextual entity information like names mentioned earlier in the conversation. Generative models (harder) don't rely on predefined responses; they generate new responses from scratch. They are "smarter": they can refer back to entities in the input and give the impression that you're talking to a human. However, these models are hard to train, are quite likely to make grammatical mistakes (especially on longer sentences), and typically require huge amounts of training data.
Short vs. long conversations: chatbots can be built to support short-text conversations (easier), where the goal is to create a single response to a single input, such as an FAQ chatbot answering a specific question, or long conversations (harder), such as customer support, where you go through multiple turns and need to keep track of what has been said.
Closed vs. open domain: in a closed-domain setting (easier) the space of possible inputs and outputs is limited because the system is trying to achieve a very specific goal; technical customer support and shopping assistants are examples, and the customer-service demo above is one of them. In an open-domain setting (harder), such as Siri, the user can take the conversation anywhere; there isn't necessarily a well-defined goal or intention, and the infinite number of topics plus the amount of world knowledge required to create reasonable responses make this a hard problem.
2. The foundation of building a chatbot is language modelling. Generally speaking, a language model takes in a sequence of inputs, looks at each element of the sequence, and tries to predict the next element. In theory, RNNs are absolutely capable of handling such long-term dependencies: a human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don't seem to be able to learn them. LSTMs are explicitly designed to avoid the long-term dependency problem. The key to the LSTM is the cell state, along which information can flow largely unchanged. Each sigmoid gate layer outputs numbers between zero and one, describing how much of each component should be let through: a value of zero means "let nothing through," while a value of one means "let everything through." An LSTM has three of these gates to protect and control the cell state.
3. The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer." It looks at ht-1 and xt and outputs a number between 0 and 1 for each number in the cell state Ct-1: a 1 represents "completely keep this" while a 0 represents "completely get rid of this." Let's go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject. The next step is to decide what new information we're going to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values we'll update. Next, a tanh layer creates a vector of new candidate values, C~t, that could be added to the state. In the next step, we'll combine these two to create an update to the state. In the example of our language model, we'd want to add the gender of the new subject to the cell state, to replace the old one we're forgetting. It's now time to update the old cell state Ct-1 into the new cell state Ct. The previous steps already decided what to do; we just need to actually do it. We multiply the old state by ft, forgetting the things we decided to forget earlier. Then we add it * C~t, the new candidate values scaled by how much we decided to update each state value. In the case of the language model, this is where we'd actually drop the information about the old subject's gender and add the new information, as we decided in the previous steps. Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to. For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what is coming next; for example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that's what follows next. RNNs can be used as language models for predicting future elements of a sequence given prior elements of the sequence. However, we are still missing the components necessary for building translation models, since we can only operate on a single sequence, while translation operates on two sequences: the input sequence and the translated sequence.
  4. Sequence to sequence models build on top of language models by adding an encoder step and a decoder step. In the encoder step, a model converts an input sequence into a thought vector. In the decoder step, a language model is trained on both the output sequence as well as the thought vector from the encoder. Since the decoder model sees an encoded representation of the input sequence as well as the output sequence, it can make more intelligent predictions about future words based on the current word.
  5. For example, in a standard language model, we might see the word “crane” and not be sure if the next word should be about the bird or heavy machinery. However, if we also pass an encoder context, the decoder might realize that the input sequence was about construction, not flying animals. Given the context, the decoder can choose the appropriate next word and provide more accurate reply.
6. Now that we understand the basics of sequence-to-sequence modelling, we can consider how to build one. We will use LSTMs as the encoder and decoder. The encoder takes a sequence (sentence) as input and processes one symbol (word) at each time step. Its objective is to convert a sequence of symbols into a fixed-size feature vector that encodes only the important information in the sequence while losing the unnecessary information. Each hidden state influences the next hidden state, and the final hidden state can be seen as the summary of the sequence. This state is called the context or thought vector, as it represents the intention of the sequence. From the context, the decoder generates another sequence, one symbol (word) at a time. At each time step, the decoder is influenced by the context and the previously generated symbols. The context can be provided as the initial state of the decoder RNN, or it can be connected to the hidden units at each time step. We train the model with a gradient-based algorithm, updating the parameters of the encoder and decoder to jointly maximize the log probability of the output sequence conditioned on the input sequence. Once the model is trained, we can make predictions.
7. Whichever model gives us the highest probability for all the words should be our model.
8. Evaluate per-word perplexity. For the probability, cross-entropy is the natural choice (see https://courses.engr.illinois.edu/cs498jh/Slides/Lecture04.pdf). By applying the chain rule, we can get per-word perplexity.
9. Compressing an entire input sequence into a single fixed vector is challenging: the last state of the encoder contains mostly information from the last elements of the encoder sequence. The attention mechanism holds onto all states from the encoder and gives the decoder a weighted average of the encoder states for each element of the decoder sequence. During the decoding phase, we take the state of the decoder network, combine it with the encoder states, and pass this combination to a feedforward network. The feedforward network returns weights for each encoder state. We multiply the encoder states by these weights and then compute their weighted average.
10. BLEU (bilingual evaluation understudy) measures the correspondence between a machine's output and that of a human. A simplified, unigram version: for each word, take min(its count in the generated sentence, its count in the reference sentence), sum over words, and divide by the total length of the generated sentence.
  11. Maximizing conditional probabilities at each stage might not lead to maximum full-joint probability. We could store all possible generated sentences so that we always find the maximum full-joint probability, but it would not be feasible. A practical solution would be something in between.