This document summarizes a workshop on trimming and training data for smart conversations, delivered by Ines Lin. The workshop covers identifying the right data for chatbot training, the natural language processing pipeline, the underestimated importance of data preprocessing, and machine learning algorithms. It also discusses continuous improvement of chatbots through data optimization and the use of bot analytics tools.
2. www.vanillanbanana.com
WELCOME!
After this workshop, you will be able to:
1. Identify the chatbot for your company
2. Select the right data for your chatbot’s NLU* machine learning model
3. Know how to deploy your chatbots and data flow infrastructure in the backend
* Natural Language Understanding
WHO I AM
Ines Lin, PMP
Project Manager, Chatbot at ASUS
9 years: Technical & Copy Writer
6 years: System & Cloud Project Manager
Applied Linguistics Major
PMP Certification
DL Specialization
AGENDA
1. HOW MUCH DATA IS ENOUGH TO TRAIN A CHATBOT
THE MORE, THE BETTER? THE UNDERESTIMATED DATA PRE-PROCESSING
2. IDENTIFYING THE RIGHT DATA, CHANNELS, AND SOURCES
WITH ALL THE DATA COLLECTED BY THE COMPANY, WHICH TYPES TO USE, AND HOW TO USE THEM?
3. UNDERSTANDING BETTER THE BIG WORDS
WHAT ALGORITHMS, MACHINE LEARNING MODELS, NATURAL LANGUAGE UNDERSTANDING, AND ARTIFICIAL INTELLIGENCE EXACTLY ARE.
4. CONTINUOUS IMPROVEMENT AND DATA OPTIMIZATION
READY SOLUTION, OR MAKE YOUR OWN? AND OTHER TOOLS TO HELP YOU MONITOR AND IMPROVE YOUR CHATBOT
TAKEAWAYS
1. QUANTITY
• How much data is enough to train a chatbot
• Recipe for machine learning
• Structuring your project
2. DATA
• Identify the right data for conversational model – QA pairing model
• The NLP Pipeline
• The underestimated data pre-processing
3. BIG WORDS
• Machine Learning Algorithms: what they do
• Natural Language Understanding: approaches, how to do it
• Natural Language Processing: techniques
4. JOURNEY
• The journey towards continuous improvement and data optimization
• Ecosystem among your model, ready solution, and corporate data
• Utilize bot analytics for constant monitoring
SUPERVISED
3. CLASSIFICATION & REGRESSION TREE (BINARY DECISION TREE)
4. NAIVE BAYES (BINARY)
Assumptions:
• All variables are independent
• Data attributes do not interact
• Data fits a normal distribution
These assumptions rarely hold exactly, but it works very well in real practice
IS THE PASSENGER GOING TO SURVIVE?
• Real-time prediction: it’s fast
• Multi-class prediction: calculating the probabilities
• Text classification / spam filtering / sentiment analysis
• Recommendation system
ENSEMBLE
9. SEQUENTIAL
• Base learners are generated sequentially (e.g. AdaBoost = Adaptive Boosting)
• Motivation: to exploit DEPENDENCE between the base learners
• Overall performance: can be boosted by giving previously mislabeled examples a higher weight
10. PARALLEL
• Base learners are generated in parallel (e.g. Random Forest)
• Motivation: to exploit INDEPENDENCE between the base learners
• Overall performance: error can be reduced by averaging the predictions of the base learners
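The reweighting idea behind sequential ensembles can be sketched as a single AdaBoost round. This is a minimal illustration, assuming labels in {-1, +1} and example weights that sum to 1. The function name and the toy data are hypothetical and do not come from the workshop.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return (alpha, new_weights) for labels in {-1, +1}."""
    # Weighted error of the current base learner (weights are assumed to sum to 1).
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
    # Learner weight: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Mislabeled examples (t * p == -1) are scaled up, correct ones are scaled down.
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]  # the second example is mislabeled
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
```

After this round the mislabeled example carries more weight than the others, so the next base learner is pushed to get it right.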
FRAME-BASED
“A frame is a data-structure for representing a stereotyped situation.”
“Think of frames as a canonical representation for which specifics can be interchanged.”
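One way to picture a frame is as a record with named slots that different conversations fill with different specifics. The sketch below is only an illustration. The class name and slot names are hypothetical, and the example values are borrowed from the support dialogue elsewhere in this deck.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RepairRequestFrame:
    """Hypothetical frame for the stereotyped situation 'customer reports a fault'."""
    device_model: Optional[str] = None
    component: Optional[str] = None
    symptom: Optional[str] = None

    def missing_slots(self):
        """Names of slots that have not been filled yet."""
        return [name for name, value in vars(self).items() if value is None]

# The same frame, with different specifics swapped in.
bluetooth_case = RepairRequestFrame("ZH008", "Bluetooth", "stopped working")
partial_case = RepairRequestFrame(component="screen")
```

A chatbot built this way could keep asking questions until `missing_slots()` returns an empty list.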
MODEL-THEORETICAL
Linguistic Approach
Model theory refers to the idea that sentences refer to the world, as is the case with grounded language (e.g. “the block is blue”).
In compositionality, the meanings of the parts of a sentence can be combined to deduce the whole meaning.
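Both ideas can be shown with a toy example: a sentence is true or false relative to a small world, and the meaning of a compound sentence is computed from its parts. The world contents and function names below are assumptions made for illustration only.

```python
# A tiny "world": each object has properties.
world = {"block": {"color": "blue"}, "ball": {"color": "red"}}

def is_color(obj, color):
    """Meaning of 'the <obj> is <color>': true if and only if it holds in the world."""
    return world.get(obj, {}).get("color") == color

def conj(p, q):
    """Meaning of 'P and Q', composed from the meanings of P and Q."""
    return p and q

block_is_blue = is_color("block", "blue")
both_blue = conj(is_color("block", "blue"), is_color("ball", "blue"))
```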
IMPORTANT CONCEPTS IN NLP
For Chatbot
RECURRENT NEURAL NETWORK (RNN)
A class of neural networks where there are loops in the network graph, and the output of one unit may go back to one of the already visited units.
LSTM
LSTMs are a special type of RNN, where you connect these units in a specific way to avoid some problems that arise in regular RNNs (vanishing and exploding gradients).
SEQUENCE TO SEQUENCE
Problem setting: input = sequence, output = sequence
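The "loop" in the RNN definition can be sketched with a single recurrent unit whose previous output is fed back in at every step. The weights here are arbitrary illustrative values, not trained parameters, and the function name is hypothetical.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit RNN over a sequence and return the hidden state at each step."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # The previous output h is fed back in: this is the loop in the network graph.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.0, 0.0])
```

Even though the second and third inputs are zero, the later states are non-zero because the unit still carries information from the first input.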
SEQUENCE TO SEQUENCE
Learning for Language
PROBLEM SETTING
Input is a sequence and output is also a sequence, e.g. machine translation, question answering, generating natural language descriptions of videos, automatic summarization, etc.
SEQUENCE TO SEQUENCE
Model with RNN Modules Inside
A sequence-to-sequence model that uses LSTMs/RNNs as modules inside to solve a sequence-to-sequence task, e.g. chatbot Q-A capability.
Encoders are neural networks whose parameters are trained.
COMPANY COLLECTED DATA
ICR RECORDS
Style: fixed question descriptions
Language: descriptive and instructive
Characteristics: one final question to answer, set in the same domain knowledge
CHAT LOGS
Style: short Q & A
Language: conversational
Characteristics: most similar to human conversations, not only Q & A but also other interruptions
FAQ-KNOWLEDGE BANK
Style: paragraphs
Language: descriptive and instructive
Characteristics: many to many in the same domain knowledge
eMAIL
Style: paragraphs
Language: descriptive and instructive
Characteristics: many to many in the same domain knowledge
WHAT WE EXPECT THEM TO BE LIKE
Something went wrong with my phone
Would you please let me know the model of your phone? What’s the exact problem you are having?
It’s ZH008. I only used it for 3 months. The Bluetooth just stopped working
Have you tried to reboot your phone?
No, I haven’t. Will that help?
Usually it solves 90% of these cases. Can you try to reboot and see how it works?
Hold on a second… Oh the BT icon works again.
Okay thanks
No problem. Have a nice day
WHAT THEY ACTUALLY ARE
Hi?
Yes, this is Michael. How can I help you?
Oh, ok…Umm, I think something is wrong with my Bluetooth. I cannot
Can you describe more on the Bluetooth problem?
turn it on or off anymore. It just doesn’t respond at all. It’s till in the warranty period, so I am thinking to send to your service center. Can
Have you tried to reboot the phone?
Reboot? What do you mean?
How many service centers do you have in Taipei?
RAW DATA PROBLEM 1
THE IDEAL DATA
• One-on-one Q & A relationship
• Q has only one intent
• Q doesn’t have typos or irrelevant information
OBSTACLES IN REALITY
• Q and A don’t always carry on in order
• There are typos
• More than one intent in one Q
• There is irrelevant information in Q
NLP PIPELINE WITH ML
• Raw data preprocessing
• Tokenization for language units
• Mathematical representation of language units
• Deciding training/test data
• Train model using training data
• Test the model with the test data
• Building ML model
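The early pipeline stages can be sketched in a few lines: tokenize the text, turn tokens into a numeric vector, and split the labeled data into training and test sets. This is a simplified sketch that uses a bag-of-words representation as one possible "mathematical representation". The utterances, intent labels, and function names are hypothetical.

```python
import random
import re
from collections import Counter

def tokenize(text):
    """Split raw text into lowercase word tokens (the language units)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(tokens, vocab):
    """Mathematical representation: one count per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def train_test_split(rows, test_ratio=0.25, seed=0):
    """Shuffle a copy of the labeled rows and split it into training and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

# Hypothetical labeled utterances: (text, intent).
data = [
    ("My Bluetooth stopped working", "bluetooth_issue"),
    ("Bluetooth will not turn on", "bluetooth_issue"),
    ("Where is your service center", "service_center"),
    ("How many service centers in Taipei", "service_center"),
]
vocab = sorted({tok for text, _ in data for tok in tokenize(text)})
vectors = [(bag_of_words(tokenize(text), vocab), label) for text, label in data]
train, test = train_test_split(vectors)
```

The resulting `train` and `test` lists hold (vector, label) pairs that any classifier could consume.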
DATA PRE-PROCESSING
Data Cleaning: from raw data to clean data
• Remove test data
• Remove sessions with one person
• Remove agents’ talk
• Remove system announcements
• Concatenate sessions (within x mins)
• Remove duplicates
• Remove nulls
• Remove non-text info (video, image, audio)
Data Summarization
Feature extraction, such as:
• Dialogue avg. session
• FAQ
• Keywords
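A few of the cleaning steps can be sketched as a single pass over chat sessions. The data layout (a session as a list of sender/text tuples), the `system` sender name, and the function name are assumptions made for this illustration. Real chat-log exports will differ.

```python
def clean_sessions(sessions, system_senders=("system",)):
    """Apply a few of the cleaning steps to a list of chat sessions.

    Each session is a list of (sender, text) tuples; text may be None.
    """
    cleaned, seen = [], set()
    for session in sessions:
        # Remove nulls, empty messages, and system announcements.
        msgs = [(s, t.strip()) for s, t in session
                if t and t.strip() and s not in system_senders]
        # Remove sessions where only one person is talking.
        if len({s for s, _ in msgs}) < 2:
            continue
        # Remove duplicate sessions.
        key = tuple(msgs)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(msgs)
    return cleaned

raw = [
    [("customer", "Hi?"), ("system", "Agent joined"), ("agent", "How can I help you?")],
    [("customer", "Hello"), ("customer", None)],              # one person only
    [("customer", "Hi?"), ("agent", "How can I help you?")],  # duplicate after cleaning
]
clean = clean_sessions(raw)
```

Steps such as concatenating sessions within a time window or dropping non-text attachments would need timestamps and message types, which this sketch leaves out.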
RECAP
• KNOW WHICH KIND OF DATA TO USE
• KNOW HOW TO CONNECT/UTILIZE CORPORATE DATA WITH CURRENT SOLUTION
• DEMYSTIFY ML, NLP, ALGORITHMS AND CHATBOT
• KNOW HOW TO DESIGN THE DATA FLOW FOR ML ECOSYSTEM
Stanford CS professor Percy Liang’s talk on NLU
Source: https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/
Machine learning is about prediction
Many techniques are borrowed from other fields, including statistics and neural networks
Source: https://www.kdnuggets.com/2017/10/top-10-machine-learning-algorithms-beginners.html
Source: https://towardsdatascience.com/a-tour-of-the-top-10-algorithms-for-machine-learning-newbies-dde4edffae11
https://machinelearningmastery.com/naive-bayes-for-machine-learning/
Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems.
https://chrisalbon.com/images/machine_learning_flashcards/Gaussian_Naive_Bayes_Classifier_print.png
https://machinelearningmastery.com/naive-bayes-for-machine-learning/
It is called naive Bayes or idiot Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of the attribute values, P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, and the probability is calculated as P(d1|h) * P(d2|h) and so on.
This is a very strong assumption that is most unlikely in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold.
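The product P(d1|h) * P(d2|h) * ... can be sketched as a small word-count classifier. This is an illustrative multinomial variant with add-one (Laplace) smoothing, which is an assumption added here and is not described in the notes. The training utterances and intent labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (tokens, label). Returns class counts, word counts, vocabulary."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {tok for tokens, _ in examples for tok in tokens}
    return class_counts, word_counts, vocab

def predict_nb(tokens, model):
    """Return the label h that maximizes P(h) * P(d1|h) * P(d2|h) * ..."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Work in log space: log P(h) + sum of log P(d_i | h).
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [
    ("bluetooth stopped working".split(), "bluetooth_issue"),
    ("bluetooth will not turn on".split(), "bluetooth_issue"),
    ("where is your service center".split(), "service_center"),
    ("how many service centers".split(), "service_center"),
]
model = train_nb(examples)
label = predict_nb("my bluetooth is not working".split(), model)
```

Each word contributes its own factor independently of the others, which is exactly the "naive" assumption described above.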
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
https://machinelearningmastery.com/classification-and-regression-trees-for-machine-learning
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained
https://www.slideshare.net/TessFerrandez/notes-from-coursera-deep-learning-courses-by-andrew-ng?from_m_app=ios
Source: https://medium.com/@datamonsters/artificial-neural-networks-for-natural-language-processing-part-1-64ca9ebfa3b2
Source: https://towardsdatascience.com/a-tour-of-the-top-10-algorithms-for-machine-learning-newbies-dde4edffae11
https://wechat.kanfb.com/archives/172723
https://en.wikipedia.org/wiki/Ensemble_learning
Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.
Emotion recognition
While speech recognition is mainly based on deep learning, because most of the industry players in this field, like Google, Microsoft and IBM, reveal that the core technology of their speech recognition is based on this approach, speech-based emotion recognition can also achieve satisfactory performance with ensemble learning.
It is also being successfully used in facial emotion recognition.
Fraud detection
Fraud detection deals with the identification of bank fraud, such as money laundering, credit card fraud and telecommunication fraud, which have vast domains of research and applications of machine learning. Because ensemble learning improves the robustness of the normal behavior modelling, it has been proposed as an efficient technique to detect such fraudulent cases and activities in banking and credit card systems.
Stanford CS professor Percy Liang’s talk on NLU
Source: https://www.topbots.com/4-different-approaches-natural-language-processing-understanding/
You can have sequence-to-sequence models that use LSTMs/RNNs as modules inside them, where a “sequence-to-sequence model” is just a model that works for sequence to sequence tasks.
Source: https://www.quora.com/What-is-the-difference-between-LSTM-RNN-and-sequence-to-sequence
Source: https://www.slideshare.net/TessFerrandez/notes-from-coursera-deep-learning-courses-by-andrew-ng?from_m_app=ios
There is no need for a huge amount of data, but the data has to be processed
What matters is the capability to identify the cause of an error and adjust accordingly
Last but not least, a chatbot is not so much about training NLU as it is about Q&A pairing
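Q&A pairing can be sketched as retrieval: compare the incoming question with stored questions and return the answer attached to the closest one. The sketch below uses Jaccard word overlap as a deliberately simple similarity measure. That choice, the function names, and the placeholder FAQ answers are all assumptions, not the method used in the workshop.

```python
import re

def tokens(text):
    """Lowercase word set for a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def best_answer(question, qa_pairs):
    """Return the answer whose stored question has the highest Jaccard overlap."""
    q = tokens(question)

    def score(pair):
        stored = tokens(pair[0])
        union = q | stored
        return len(q & stored) / len(union) if union else 0.0

    return max(qa_pairs, key=score)[1]

# Hypothetical FAQ entries with placeholder answers.
faq = [
    ("How do I reboot my phone", "Placeholder answer about rebooting."),
    ("How many service centers do you have in Taipei", "Placeholder answer about service centers."),
]
answer = best_answer("service centers in Taipei?", faq)
```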