Improving classification accuracy for customer contact transcriptions


Reducing the number of customer contacts by making processes more efficient is an everlasting target for KPN. Business users can get insights into why people call and can spot areas for improvement using call classification. KPN currently has a classification tree in place, and calls and chats are classified manually. We try to use this input to classify calls and chats using supervised machine learning techniques (CNN, LSTM, hybrid models). The dataset is challenging in many ways, and human accuracy on it is low; poorly separated classes in the classification tree play a big role in this. Unfortunately, creating a new classification tree does not seem feasible because of the high costs involved, so we tried 'cheaper' unsupervised and semi-supervised techniques, which were not successful. We discuss the reasons for failure and possible next steps.


Improving classification accuracy for customer contact transcriptions
Data Science Lab – Text Analytics Group
Maria Vechtomova
Why do we want to automatically classify calls and chats?
● We receive 7M service calls per year
● Calls are expensive
● Service call reduction is an everlasting target
● Calls/chats are manually categorized by call agents
● The classification tree has 3 levels (with ~10, 70 and >400 categories per level, respectively)
● Manual call/chat classification is used to spot issues affecting a large number of customers, distribute the workload and plan capacity

Manual classification has a number of problems:
● Only 60% of the calls are labelled (far fewer for chat)
● Agents are not particularly motivated to label properly -> we doubt the accuracy of their classification
Classification categories 1st level
Classification categories 2nd level
Where did we start? (Supervised learning)

We had data available:
● ~180k labeled call transcripts
● ~40k labeled chat transcripts

Supervised algorithms we tried (a baseline sketch follows below):
● Naive Bayes (as baseline algorithm)
● CNN (similar to Kim Yoon's Convolutional Neural Network for Sentence Classification, 2014)
● LSTM
● Hybrid model (CNN+LSTM)

Baselines we compared performance against:
● Agents' accuracy
● Accuracy of a CNN on Amazon reviews classifying categories and subcategories of the reviewed products (6 categories, 64 subcategories)
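A Naive Bayes baseline of this kind can be as small as a tf-idf pipeline. This is a minimal sketch; the example texts and labels below are illustrative assumptions, not KPN data:

```python
# Minimal sketch of the Naive Bayes baseline as a tf-idf pipeline.
# The texts and labels below are illustrative assumptions, not KPN data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["ik wil mijn abonnement opzeggen", "mijn internet doet het niet"] * 10
labels = ["Cancellation", "Assurance"] * 10

baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
baseline.fit(texts, labels)
print(baseline.predict(["internet storing"]))  # -> ['Assurance'] on this toy data
```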
Chat example
● Cat1: Cancellation, Cat2: Cancel subscription
● Cat1: Order, Cat2: Subscription renewal
● Cat1: Order, Cat2: New subscription
Text classification with CNN
● The model architecture is close to Kim Yoon's Convolutional Neural Network for Sentence Classification: https://arxiv.org/abs/1408.5882
● Inspired by Denny Britz's blog: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
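In outline, this architecture embeds the tokens, applies parallel convolutions with several word-window widths, max-pools each over time, and classifies from the concatenated features. A minimal Keras sketch follows; the vocabulary size, sequence length, embedding dimension, filter settings and class count are illustrative assumptions, not the values used in this work:

```python
# Minimal Keras sketch of a Kim (2014)-style text CNN. Vocabulary size,
# sequence length, embedding dim, filter settings and the class count are
# illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN, EMB_DIM, NUM_CLASSES = 50000, 400, 128, 10

inputs = layers.Input(shape=(SEQ_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)

# Parallel convolutions over several word-window widths, each followed by
# max-over-time pooling, as in the Kim Yoon architecture.
pooled = [layers.GlobalMaxPooling1D()(layers.Conv1D(100, w, activation="relu")(x))
          for w in (3, 4, 5)]

x = layers.Dropout(0.5)(layers.Concatenate()(pooled))
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```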
Supervised learning performance
Comparison of model accuracy (cat1; cat2)

Performance does not seem very promising. Possible reasons:

1. Customer interaction data is more complex than review data:
● it is a conversation, so multiple people are involved
● a customer is likely to talk about several problems at the same time, while reviewers most likely describe only one product
● a customer interaction is longer (in number of words) than a review
● for calls, quality is lost when creating the transcripts

2. Labeling quality is significantly worse:
● agents often see no benefit in labeling and select the first category in the drop-down menu
● each chat/call gets only one label while the conversation often covers multiple things, and that label is based on the agent's interpretation

However, model performance for chats is very close to human accuracy: 61.3-71.3% on the 1st level; 52.1-69% on the 2nd level.
How did we proceed? (Unsupervised learning)

Automatic chat classification (TensorFlow) has started:
● model performance is close to human accuracy
● creating a new classification tree is not feasible
● creating new labels is costly

Investigating 'cheaper' labeling/tagging alternatives:
● LDA
● tf-idf + k-means
● doc2vec + k-means
● avg word2vec + k-means

When evaluating techniques we paid attention to:
● interpretability
● stability
● 'fitness' to buckets (the classification tree mapped to smaller groups defined by business stakeholders)
LDA (unsupervised learning performance)
● Used the gensim LDA implementation on 180k call transcripts (a minimal sketch follows below)
● After finding the most typical words per topic, we 'map' the topics to business 'buckets'

Buckets:
● Assurance
● Fulfillment - Order
● Fulfillment - Installation
● Internet & Telephony
● Sales
● Save
● TV
● Device & Proposition
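A minimal gensim sketch of this step; the toy transcripts and the topic count below are illustrative assumptions (the real input was the 180k call transcripts):

```python
# Minimal gensim LDA sketch. The toy transcripts and num_topics are
# illustrative assumptions; the real input was 180k call transcripts.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

transcripts = [["internet", "storing", "modem"],
               ["abonnement", "opzeggen", "contract"],
               ["factuur", "betaling", "vraag"]] * 20

dictionary = Dictionary(transcripts)
corpus = [dictionary.doc2bow(doc) for doc in transcripts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=5)

# Most typical words per topic: the input for the manual mapping to buckets.
for topic_id, words in lda.show_topics(num_topics=-1, num_words=5, formatted=False):
    print(topic_id, [word for word, _ in words])
```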
LDA
● Mapping to 'buckets' is disappointing
● Some of the 'buckets' could not be mapped
● Some of the created topics are not easy to interpret
(figure: LDA topics mapped to buckets)
LDA
● LDA seems to produce unstable results
● Even though stability can be improved by increasing the number of documents, this issue is hard to fix
doc2vec + k-means
● Used the gensim implementation of doc2vec to generate document vectors
● k-means (k++ initialization, k=10) attempts to group the documents by similarity of the document vectors generated by doc2vec
● To understand what the clusters are, we:
  - used a heatmap
  - used tf-idf features to predict the clusters (decision tree) and measured feature importance

● The naturally created clusters are different from the buckets
● However, there are a few similarities:
  - Cluster 7 has something to do with Assurance
  - Cluster 9: with Toestel & Propositie (Device & Proposition)
  - Cluster 0: with TV / Internet & Bellen (TV / Internet & Calling)

Another drawback: inferring document vectors from doc2vec for new documents is unstable. This can be solved by creating document vectors as the (weighted) average of word vectors. A minimal sketch of the pipeline follows below.
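This sketch follows the slide's setup (k=10 with k-means++ initialization); the toy corpus and doc2vec hyperparameters are illustrative assumptions:

```python
# Minimal doc2vec + k-means sketch. The toy corpus and doc2vec
# hyperparameters are illustrative assumptions; k=10 with k-means++
# initialization follows the slide.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

docs = [["internet", "storing", "modem"],
        ["abonnement", "opzeggen", "contract"],
        ["factuur", "betaling", "vraag"]] * 4   # 12 toy "transcripts"
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]

d2v = Doc2Vec(tagged, vector_size=50, window=5, min_count=1, epochs=20)
vectors = [d2v.dv[i] for i in range(len(docs))]

km = KMeans(n_clusters=10, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(vectors)
print(labels)
```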
Unsupervised learning: main results & next steps
● Unsupervised learning techniques failed to come up with categories that are meaningful and useful for the business
● Stability & interpretability are an issue

Possible improvements:
● Guided LDA
● Semi-supervised clustering (seeding)

Investigating ontology tagging:
● together with business stakeholders, define ontology dimensions and ontology terms
● use word embeddings to create ontology term synonyms and ontology & document vectors
● validate tagging techniques using a manually tagged validation set
Semi-supervised learning: ontology

Words similar to 'klachtenformulier' (complaint form): formulier (form, 0.77), formuliertje (little form, 0.75), formulieren (forms, 0.69), klacht (complaint, 0.68), overlast (nuisance, 0.65), ticket (0.64), ...
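Similar-word lists like this can be generated with gensim word2vec's most_similar; a minimal sketch, where the toy corpus is an illustrative assumption:

```python
# Minimal sketch of generating similar words for an ontology term with
# gensim word2vec; the toy corpus below is an illustrative assumption.
from gensim.models import Word2Vec

sentences = [["klachtenformulier", "formulier", "indienen"],
             ["klacht", "overlast", "ticket", "formulieren"]] * 50
w2v = Word2Vec(sentences, vector_size=50, min_count=1, epochs=10)

print(w2v.wv.most_similar("klachtenformulier", topn=6))
```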
Semi-supervised learning: ontology tagging approaches (sketches of both approaches follow below)

1. Vector similarity approach
● Generate word vectors (gensim word2vec / GloVe)
● Create document/ontology vectors by computing the (weighted) average of the vectors of words belonging to the document/ontology
● Compute the similarity between document and ontology vectors; tag the document with the ontology category if the similarity exceeds a threshold

2. Topic word count approach
● Generate similar words for the ontology terms
● Count the number of ontology terms (and similar words) in a document
● Normalize the count
● Assign the topic to the document if the normalized count exceeds a threshold
Ontology: vector similarity approach (worked examples in figures)
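A minimal sketch of the vector similarity approach under stated assumptions: the toy corpus, the single hypothetical ontology category and the 0.4 threshold are all illustrative, and the averaging here is unweighted:

```python
# Minimal sketch of the vector similarity approach. The toy corpus, the
# single ontology category and the 0.4 threshold are illustrative assumptions.
import numpy as np
from gensim.models import Word2Vec

sentences = [["internet", "storing", "modem", "opzeggen", "abonnement"]] * 50
w2v = Word2Vec(sentences, vector_size=50, min_count=1, epochs=5)

def avg_vector(words, model):
    """(Unweighted) average of the vectors of the words the model knows."""
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

ontology = {"Cancellation": ["opzeggen", "abonnement"]}
doc = ["modem", "opzeggen", "abonnement"]

doc_vec = avg_vector(doc, w2v)
for category, terms in ontology.items():
    ont_vec = avg_vector(terms, w2v)
    sim = float(np.dot(doc_vec, ont_vec) /
                (np.linalg.norm(doc_vec) * np.linalg.norm(ont_vec)))
    if sim > 0.4:  # assumed threshold
        print("tag:", category, round(sim, 2))
```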
Ontology: topic word count approach

Example (Customer Journey: Cancel): two exact ontology-term matches plus one similar word with similarity 0.8, normalized by document length (1500 words here):
word_count = (2 + 0.8)/1500
score = 2.8/1500 * topic_weight
score > threshold -> tag: CANCELLATION

Example (Service: Internet): three exact matches, one similar word at 0.8 and four at 0.7:
word_count = (3 + 0.8 + 4*0.7)/1500
score = 6.6/1500 * topic_weight
score > threshold -> tag: INTERNET
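The same scoring in a minimal sketch; the terms, similarity weights, document, topic_weight and threshold are illustrative assumptions:

```python
# Minimal sketch of the topic word count approach. Exact ontology terms
# count as 1; similar words count by their similarity. Terms, weights,
# document, topic_weight and threshold are illustrative assumptions.
ontology_terms = {
    "Internet": {"internet": 1.0, "wifi": 1.0, "modem": 1.0,  # ontology terms
                 "router": 0.8, "storing": 0.7},              # similar words
}

doc = "de internet verbinding valt weg en de modem en router doen niks".split()

topic_weight, threshold = 1.0, 0.001
for topic, weights in ontology_terms.items():
    word_count = sum(weights.get(w, 0.0) for w in doc)
    score = word_count / len(doc) * topic_weight  # normalize by doc length
    if score > threshold:
        print("tag:", topic.upper(), round(score, 4))
```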
Vector similarity: results. Avg F1 score: 0.33
Topic word count: results. Avg F1 score: 0.38
Ontology: main learnings & next steps
● The ontology does not give the results we were hoping for
● Ontology categories are still quite vague, and it is very hard to come up with good (mutually exclusive) ones!

Possible improvements:
● Using POS parsers (Frog, spaCy)
● Using phrase embeddings
● Improving the ontology categories and ontology terms

Advantages & when it can actually work:
● it requires 10 times less labeled data (5k vs 50k) than a supervised learning algorithm
● it is important to define the categories well (high human accuracy)
● start labeling the data and go on with the ontology; if that is not successful, label more data and proceed to a CNN

