Language Identification to Improve Interactive Voice Response Services Using ConvNet
1. Language Identification to Improve Interactive Voice Response Services Using ConvNet
Advisor : Worapol Pongpech
Rattanawadee Waipatin
Major : BA&I
2. Agenda
• Objective
• Literature review
• Methodology
• Feature Extraction - Why MFCC and Filter Bank?
• Result analysis
• Conclusion
3. Objective
Customer aspect from global customer service report 2019
[Bar chart] How important is customer service in your choice of, or loyalty to, a brand? (Very important / Somewhat important / Not important)
[Bar chart] Have you ever stopped doing business with a brand due to a poor customer service experience? (Yes / No)
[Bar chart] Do you feel the process of engaging with customer service organizations and getting your questions answered is getting easier? (Yes / No / About the same)
https://info.microsoft.com/rs/157-GQE-382/images/EN-US-CNTNT-ebook-2018-State-of-Global-Customer-Service.pdf
4. Objective
Customer aspect from global customer service report 2019
[Bar chart] Which of the following customer service channels do you prefer? (Phone/Voice / Email / Self service / Live chat / Support ticket / Social media)
[Bar chart] When engaging with customer service, do you try to use self-service first, or do you immediately engage with an agent? (Begin with self service / Engage Agent)
[Bar chart] What is the most important aspect of a good customer service experience? (Resolving issue in one interaction / Knowledgeable agent / Not repeat information / Finding information myself)
https://info.microsoft.com/rs/157-GQE-382/images/EN-US-CNTNT-ebook-2018-State-of-Global-Customer-Service.pdf
5. Objective
Call center aspect from call center data in Thailand 2019
[Bar chart] Contacting an agent vs. using interactive voice response (IVR) in the call center. (Interactive voice response / Contact Agent)
[Bar chart] Half of the calls that use the IVR service are unsuccessful. (Unsuccessful IVR / Successful IVR / Contact Agent)
[Bar chart] 41% of unsuccessful IVR calls are followed by a repeat call within 1 day that tries the IVR again. (Contact Agent / Successful IVR / Unsuccessful IVR / Repeat call)
https://info.microsoft.com/rs/157-GQE-382/images/EN-US-CNTNT-ebook-2018-State-of-Global-Customer-Service.pdf
6. Objective
[Flow diagram] Call -> Language selection -> Topic selection -> End call
Press the number to select language
AIS and Krungsri are using speech recognition
[Flow diagram] Call -> Language selection -> Topic selection -> End call
Press the number to select language and topic
Most existing IVR
New IVR
Replace with model
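As a rough illustration of the "replace with model" idea, the sketch below swaps the keypad language prompt for a language-identification call. Every name here (`select_language`, `handle_call`, the callback signatures) is hypothetical, and the confidence threshold with a keypad fallback is an assumption added for the example, not something the slides specify.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical type aliases: none of these names come from the slides.
Audio = Sequence[float]
LanguageModel = Callable[[Audio], Tuple[str, float]]  # returns (language, confidence)


def select_language(
    audio: Audio,
    identify_language: LanguageModel,
    ask_keypad: Callable[[], str],
    min_confidence: float = 0.8,  # assumed threshold, not from the source
) -> str:
    """Pick the call language from speech, falling back to the keypad menu."""
    language, confidence = identify_language(audio)
    if confidence >= min_confidence:
        return language
    # Low confidence: fall back to the existing "press a number" step.
    return ask_keypad()


def handle_call(
    audio: Audio,
    identify_language: LanguageModel,
    ask_keypad: Callable[[], str],
    select_topic: Callable[[str], str],
) -> Tuple[str, str]:
    """Call -> language selection (model) -> topic selection -> end call."""
    language = select_language(audio, identify_language, ask_keypad)
    topic = select_topic(language)
    return language, topic
```

In a real system `identify_language` would wrap the trained classifier; here it is left as a callback so the flow can be exercised with stubs.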
7. Literature review
Year | Author | Model basis | Features | Languages | Accuracy (%) | Remarks
2019 | Sarthak, Shikhar Shukla and Govind Mittal | 1D ConvNet | Raw | En, Fr, De, Es, Ru, It | 93.7 | Compares 1D and 2D ConvNets with raw and log-Mel feature extraction
2019 | Sarthak, Shikhar Shukla and Govind Mittal | 2D ConvNet | log-Mel | En, Fr, De, Es, Ru, It | 95.4 | Compares 1D and 2D ConvNets with raw and log-Mel feature extraction
2019 | Sarthak, Shikhar Shukla and Govind Mittal | 2D ConvNet | log-Mel | En, Fr, De, Es | 96.3 | Compares 1D and 2D ConvNets with raw and log-Mel feature extraction
2019 | Shauna Revay and Matthew Teschke | ResNet50 | log-Mel | En, Fr, De, Es, Ru, It | 89.0 | Uses a pretrained ResNet50 architecture and a cyclic learner to identify the language
2018 | Valentin Gazeau and Cihan Varol | SVM-HMM | - | En, Fr, De, Es | 70.0 | HMM was used to encode speech into sequences of vectors, which were then fed into a neural network
2017 | Christian Bartz, Tom Herold, Haojin Yang and Christoph Meinel | CRNN | log-Mel | En, Fr, De, Es | 91.0 | A new architecture that extracts spatial features with a CNN and temporal features with an RNN
2010 | Pawan Kumar, Astik Biswas, A. N. Mishra and Mahesh Chandra | Gaussian mixture model | Perceptual linear prediction | En, Fr, De, Es, Ru, It, Dut, Ben, Hi, Tel | 88.8 | Used a GMM whose features were prepared using PLP
9. Feature Extraction - Why MFCC and Filter Bank?
Mel-Frequency Cepstral Coefficients (MFCCs) are coefficients that collectively represent the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
A filter bank is an array of band-pass filters that separates the input signal into multiple components. The main use of filter banks is to divide a signal into several separate frequency bands. Here, filters on the mel scale are applied to the power spectrum to extract the frequency bands.
https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
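For reference, the two definitions above are commonly written as follows. This is the widely used textbook form, not a formula taken from the slides: $f$ is frequency in Hz, $m$ the corresponding mel value, $E_k$ the energy in the $k$-th of $K$ mel filters, and $c_n$ the $n$-th cepstral coefficient. The constants 2595 and 700 are one common convention for the mel scale.

```latex
m = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right),
\qquad
c_n = \sum_{k=1}^{K} \log(E_k)\,
      \cos\!\left[\frac{\pi n}{K}\left(k - \tfrac{1}{2}\right)\right]
```

The first expression spaces the filters more densely at low frequencies; the second is the cosine transform that turns log filter-bank energies into MFCCs.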
[Pipeline diagram] Pre-Emphasis -> Framing -> Fourier-Transform -> Filter Banks -> MFCCs -> Mean Normalization
Pre-Emphasis: (1) balance the frequency spectrum, (2) avoid numerical problems (FT), (3) improve the Signal-to-Noise Ratio (SNR)
Framing: Obtain a good approximation of the frequency contours
Mean Normalization: Balance spectrum and improve SNR
[Figure] Raw vs. after MFCCs
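The pipeline stages named above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the presenter's code: the function names are made up for the example, and the parameter values (25 ms frames, 10 ms step, 0.97 pre-emphasis, 512-point FFT, 40 filters, 12 coefficients) are common defaults rather than settings confirmed by the slides. It also assumes the frame length does not exceed `n_fft`.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filter_bank(sample_rate, n_fft, n_filters):
    """Triangular filters evenly spaced on the mel scale: (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):       # rising edge of the triangle
            bank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):      # falling edge of the triangle
            bank[m - 1, k] = (right - k) / (right - center)
    return bank


def mfcc(signal, sample_rate, frame_ms=25.0, step_ms=10.0, n_fft=512,
         n_filters=40, n_ceps=12, pre_emphasis=0.97):
    """Return (log filter-bank features, MFCCs), both mean-normalized per coefficient."""
    signal = np.asarray(signal, dtype=float)
    # 1) Pre-emphasis: boost high frequencies to balance the spectrum.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # 2) Framing: overlapping short frames, each tapered by a Hamming window.
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    if len(emphasized) < frame_len:
        emphasized = np.pad(emphasized, (0, frame_len - len(emphasized)))
    n_frames = 1 + (len(emphasized) - frame_len) // step
    idx = np.arange(frame_len)[None, :] + step * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # 3) Fourier transform -> power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # 4) Mel filter-bank energies on a log (dB) scale.
    energies = power @ mel_filter_bank(sample_rate, n_fft, n_filters).T
    energies = np.where(energies == 0, np.finfo(float).eps, energies)
    log_fbank = 10.0 * np.log10(energies)
    # 5) DCT-II of the log energies; keep coefficients 1..n_ceps (c0 is dropped).
    j = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), 2 * j + 1) / (2.0 * n_filters))
    ceps = log_fbank @ basis.T
    # 6) Mean normalization across frames.
    log_fbank = log_fbank - log_fbank.mean(axis=0)
    ceps = ceps - ceps.mean(axis=0)
    return log_fbank, ceps
```

Calling `mfcc(samples, 8000)` on one second of telephone-rate audio yields one row per 10 ms step; either output could serve as the 2D input to a ConvNet.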
10. Result analysis
[Training curves for each setting: accuracy and loss vs. epoch]
2 languages
• Thai
• English
Accuracy: 99.2%
[Confusion matrix: true label vs. predicted label; classes Eng, Thai] Cell values: 354 (49.17%), 360 (50.00%), 0 (0.00%), 6 (0.83%)
3 languages
• Chinese
• English
• Thai
Accuracy: 97.8%
[Confusion matrix: true label vs. predicted label; classes Thai, Chi, Eng] Cell values: 360 (33.33%), 347 (32.12%), 349 (32.31%), 0 (0.00%), 0 (0.00%), 5 (0.01%), 5 (0.01%), 6 (0.01%), 8 (0.01%)
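The reported accuracies can be checked from the confusion-matrix counts. The short sketch below assumes the large cells (354 and 360; 360, 347 and 349) are the correctly classified diagonal entries, which is an inference from the numbers rather than something the extracted slide states. The helper name `accuracy` is made up for the example.

```python
def accuracy(correct_counts, all_counts):
    """Fraction of samples that were classified correctly."""
    return sum(correct_counts) / sum(all_counts)


# Cell counts as they appear on the slide.
two_lang_cells = [354, 360, 0, 6]
three_lang_cells = [360, 347, 349, 0, 0, 5, 5, 6, 8]

# Assumed diagonal (correct) cells: the large counts in each matrix.
acc_two = accuracy([354, 360], two_lang_cells)
acc_three = accuracy([360, 347, 349], three_lang_cells)

print(f"2 languages: {acc_two:.1%}")    # 2 languages: 99.2%
print(f"3 languages: {acc_three:.1%}")  # 3 languages: 97.8%
```

Both values agree with the accuracies shown on the slide, with 720 and 1,080 test samples respectively.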
11. Conclusion / How to improve
• Whether to use 2 or 3 language classes depends on the nationalities of your customers.
• Adding more training data from call center recordings would make the model more robust when it is used with your call center system.