Improving classification accuracy for
customer contact transcriptions
Data Science Lab – Text Analytics Group
Maria Vechtomova
Why do we want to automatically classify calls and chats?
● We receive 7M service calls per year
● Calls are expensive
● Service call reduction is an everlasting target
● Calls/chats are manually categorized by call agents
● Classification tree has 3 levels (with ~10, 70 and >400
categories on each level respectively)
● Manual call/chat classification is used to spot issues
affecting a large number of customers, distribute the
workload and plan capacity
Manual classification has a number of problems:
● Only 60% of the calls are labelled (much less for chat)
● Agents are not particularly motivated to do proper
labelling -> we doubt the accuracy of their
classification
Classification categories 1st level
Classification categories 2nd level
Where did we start?
We had data available:
● ~180k labeled call transcripts
● ~40k labeled chat transcripts
Supervised algorithms we tried:
● Naive Bayes (as baseline algorithm)
● CNN (similar to Kim Yoon’s Convolutional Neural
Network for Sentence Classification, 2014)
● LSTM
● Hybrid model (CNN+LSTM)
Baselines we compared the performance with:
● Agents’ accuracy
● Accuracy of a CNN on Amazon reviews classifying the
categories and subcategories of the reviewed
products (6 categories, 64 subcategories)
Supervised learning
Chat example
Cat1: Cancellation
Cat2: Cancel subscription
Cat1: Order
Cat2: Subscription renewal
Cat1: Order
Cat2: New subscription
Text classification with CNN
● The model architecture is close to Kim Yoon’s Convolutional Neural Network for Sentence
Classification https://arxiv.org/abs/1408.5882
● Inspired by the blog post by Denny Britz:
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
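For illustration, a minimal sketch of this kind of architecture in Keras (not our exact production model); the vocabulary size, sequence length, filter sizes and number of classes below are assumptions:

import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 50000    # illustrative vocabulary size
MAX_LEN = 400         # transcripts are padded/truncated to a fixed length
EMBED_DIM = 128
NUM_CLASSES = 10      # e.g. the ~10 first-level categories

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# Parallel convolutions with different filter widths, max-pooled over time (Kim, 2014)
pooled = []
for width in (3, 4, 5):
    conv = layers.Conv1D(filters=100, kernel_size=width, activation="relu")(embedded)
    pooled.append(layers.GlobalMaxPooling1D()(conv))

features = layers.Concatenate()(pooled)
features = layers.Dropout(0.5)(features)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(features)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])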
Supervised learning performance
Comparison of model accuracy (cat1; cat2)
Performance does not seem very promising.
Possible reasons:
1. Customer interaction data is more complex than the review data:
● it is a conversation, so it involves multiple people
● a customer is likely to talk about multiple problems at the same time, whereas people most likely
describe only one product in their reviews
● a customer interaction is longer (in terms of number of words) than a review
● for calls, quality is lost when the audio is converted into a transcript
2. Labeling quality is significantly worse:
● agents often do not see the benefit of labeling and select the first category in a drop-down menu
● each chat/call has only one label while the conversation often covers multiple topics; and
that label is based on the agent’s interpretation
However, model performance for chats is very close to human accuracy:
61.3-71.3% on the 1st level;
52.1-69% on the 2nd level
How did we proceed?
Automatic chat classification (TensorFlow) has started:
● model performance is close to human accuracy
● creating a new classification tree is not feasible
● creating new labels is costly
Investigating ‘cheaper’ labeling/tagging alternatives:
● LDA
● tf-idf + k-means (a minimal sketch follows below)
● doc2vec + k-means
● avg word2vec + k-means
When evaluating the techniques we paid attention to:
● interpretability
● stability
● ‘fitness’ to buckets (the classification tree mapped to
smaller groups defined by business stakeholders)
Unsupervised learning
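As a reference for the tf-idf + k-means alternative listed above, a minimal scikit-learn sketch; the preprocessing, feature limits and k are illustrative, not our exact setup:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# `transcripts` is assumed to be a list of preprocessed call/chat transcripts (strings)
vectorizer = TfidfVectorizer(max_features=20000, min_df=5)
X = vectorizer.fit_transform(transcripts)

kmeans = KMeans(n_clusters=10, init="k-means++", n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X)

# Inspect the heaviest terms per cluster centroid to interpret the clusters
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(kmeans.cluster_centers_):
    top = centroid.argsort()[::-1][:10]
    print(f"cluster {i}:", ", ".join(terms[j] for j in top))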
Unsupervised learning performance
LDA
● Used the gensim LDA implementation, based on 180k call transcripts (minimal sketch below)
● After finding the most typical words per topic, we ‘map’ topics to business ‘buckets’
Buckets:
● Assurance
● Fulfillment - Order
● Fulfillment - Installation
● Internet & Telephony
● Sales
● Save
● TV
● Device & Proposition
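A hedged sketch of the gensim LDA step described above; the number of topics, passes and the dictionary filtering are assumptions:

from gensim import corpora
from gensim.models import LdaModel

# `tokenized_docs` is assumed to be a list of token lists, one per call transcript
dictionary = corpora.Dictionary(tokenized_docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop very rare / very common words
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=30, passes=5, random_state=42)

# Most typical words per topic; these lists were then mapped manually to the business buckets
for topic_id, words in lda.show_topics(num_topics=-1, num_words=10):
    print(topic_id, words)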
Unsupervised learning performance
LDA
● Mapping to ‘buckets’ is disappointing
● Some of the ‘buckets’ could not be mapped
● Some of the created topics are not easy to interpret
(Diagram: LDA topics mapped to buckets)
Unsupervised learning performance
LDA
● LDA seems to produce unstable results
● Even though stability can be improved by increasing the number of documents, this issue is
hard to fix
Unsupervised learning performance
doc2vec + k-means
● Used the gensim implementation of doc2vec to generate document vectors (sketch below)
● k-means (k-means++ initialization, k=10) groups the documents by similarity of the
document vectors generated by doc2vec
● To understand what the clusters are, we:
- used a heatmap
- used TF-IDF features to predict the clusters (decision tree) & measured feature importance
● Naturally created clusters are different from the buckets
● However, there are a few similarities:
- Cluster 7 has something to do with Assurance
- Cluster 9: with Device & Proposition
- Cluster 0: with TV / Internet & Telephony
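A minimal sketch of this doc2vec + k-means step (gensim 4.x API; vector size, epochs and k are illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans
import numpy as np

# `tokenized_docs` is assumed to be a list of token lists, one per transcript
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(tokenized_docs)]

d2v = Doc2Vec(vector_size=100, min_count=5, epochs=20, workers=4)
d2v.build_vocab(tagged)
d2v.train(tagged, total_examples=d2v.corpus_count, epochs=d2v.epochs)

doc_vectors = np.vstack([d2v.dv[i] for i in range(len(tagged))])

kmeans = KMeans(n_clusters=10, init="k-means++", n_init=10, random_state=42)
clusters = kmeans.fit_predict(doc_vectors)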
Another drawback: inferring document vectors from
doc2vec for new, unseen documents is unstable.
This can be solved by creating document vectors as a
(weighted) average of word vectors, as sketched below.
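A sketch of that more stable alternative, here with a plain (unweighted) average of word2vec vectors; tf-idf weights per word could be added the same way. Parameters are illustrative:

import numpy as np
from gensim.models import Word2Vec

# `tokenized_docs` is assumed to be a list of token lists
w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, min_count=5, workers=4)

def doc_vector(tokens, model):
    # average the vectors of the words that are in the vocabulary
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

doc_vectors = np.vstack([doc_vector(doc, w2v) for doc in tokenized_docs])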
Unsupervised learning: main results & next steps
● Unsupervised learning techniques failed to come up
with categories that are meaningful and useful to the business
● Stability & interpretability are an issue
Possible improvements:
● Guided LDA
● Semi-supervised clustering (seeding)
Investigating ontology tagging:
● together with business stakeholders, define ontology
dimensions and ontology terms
● use word embeddings to create ontology term
synonyms, ontology & document vectors
● validate tagging techniques using a manually tagged
validation set
Semi-supervised learning: ontology
Klachtenformulier (‘complaint form’) similar words: formulier (0.77),
formuliertje (0.75), formulieren (0.69), klacht (0.68),
overlast (0.65), ticket (0.64), …
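For illustration, similar-word lists like the one above can be generated from a word2vec model trained on the transcripts (here reusing the `w2v` model from the earlier averaging sketch; the query term is taken from this slide):

for word, score in w2v.wv.most_similar("klachtenformulier", topn=6):
    print(f"{word} ({score:.2f})")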
Semi-supervised learning: ontology
1. Vector similarity approach
● Generate word vectors (gensim word2vec / GloVe)
● Create document/ontology vectors by computing the (weighted) average of the vectors of
words belonging to the document/ontology
● Compute the similarity between document and ontology vectors; tag the document with the
ontology category if the similarity exceeds a threshold
2. Topic word count approach
● Generate similar words for the ontology terms
● Count the number of ontology terms (and similar words) in a document
● Normalize the count
● Assign the topic to the document if the normalized count exceeds a threshold
Ontology tagging approaches
Semi-supervised learning: ontology
Ontology: vector similarity approach
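A hedged sketch of the vector similarity approach; the ontology terms, threshold and the `doc_vector` averaging helper (from the earlier word2vec sketch) are assumptions/illustrative:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Illustrative ontology: category name -> list of ontology terms (and their synonyms)
ontology = {
    "CANCELLATION": ["opzeggen", "annuleren", "contract", "beeindigen"],
    "INTERNET": ["internet", "wifi", "storing", "modem"],
}

SIM_THRESHOLD = 0.5   # illustrative

def tag_document(tokens, model, ontology, threshold=SIM_THRESHOLD):
    doc_vec = doc_vector(tokens, model)        # averaged document vector
    tags = []
    for category, terms in ontology.items():
        ont_vec = doc_vector(terms, model)     # averaged ontology vector
        if cosine(doc_vec, ont_vec) > threshold:
            tags.append(category)
    return tags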
Semi-supervised learning: ontology
Ontology: topic word count approach
Example - Customer Journey: Cancel (2 exact term matches + 1 similar word weighted 0.8,
normalized by a document length of 1500 words):
word_count = (2 + 0.8) / 1500
score = 2.8 / 1500 * topic_weight
score > threshold -> CANCELLATION
Example - Service: Internet (3 exact term matches + 1 similar word weighted 0.8 + 4 weighted 0.7):
word_count = (3 + 0.8 + 4 * 0.7) / 1500
score = 6.6 / 1500 * topic_weight
score > threshold -> INTERNET
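A minimal sketch of the scoring illustrated in the examples above; the similar-word weights, topic_weight and threshold are illustrative:

def topic_word_count_score(tokens, exact_terms, similar_terms, topic_weight=1.0):
    # exact ontology terms count as 1; similar words contribute their similarity weight
    count = sum(1.0 for t in tokens if t in exact_terms)
    count += sum(similar_terms[t] for t in tokens if t in similar_terms)
    return count / max(len(tokens), 1) * topic_weight

THRESHOLD = 0.001   # illustrative

def tag_by_word_count(tokens, ontology_topics):
    # ontology_topics: topic name -> (set of exact terms, {similar word: weight}, topic_weight)
    return [name for name, (exact, similar, weight) in ontology_topics.items()
            if topic_word_count_score(tokens, exact, similar, weight) > THRESHOLD]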
Vector similarity: results
Avg F1 score: 0.33
Topic word count: results
Avg F1 score: 0.38
Ontology: main learnings & next steps
● Ontology does not give the results we were hoping for
● Ontology categories are still quite vague and it is very
hard to come up with good (mutually exclusive) ones!
Possible improvements:
● Using POS parsers (Frog, spaCy)
● Using phrase embeddings
● Improving the ontology categories and ontology terms
Advantages & when it can actually work
● it requires 10 times less labeled data (5k vs 50k) than a
supervised learning algorithm
● it is important to define the categories well (high
human accuracy)
● start by labeling the data and go on with the ontology approach;
if that is not successful, label more data and proceed to a CNN
