Improving classification accuracy for
customer contact transcriptions
Data Science Lab – Text Analytics Group
Maria Vechtomova
Why do we want to automatically classify calls and chats?
● We receive 7M service calls per year
● Calls are expensive
● Service call reduction is an everlasting target
● Calls/chats are manually categorized by call agents
● Classification tree has 3 levels (with ~10, 70 and >400
categories on each level respectively)
● Manual call/chat classification is used to spot issues
affecting a large number of customers, distribute the
workload and plan capacity
Manual classification has a number of problems:
● Only 60% of the calls are labelled (much less for chat)
● Agents are not particularly motivated to do proper
labelling -> we doubt the accuracy of their
classification
Classification categories 1st level
Classification categories 2nd level
Where did we start?
We had data available:
● ~180k labeled call transcripts
● ~40k labeled chat transcripts
Supervised algorithms we tried:
● Naive Bayes (as baseline algorithm)
● CNN (similar to Kim Yoon’s Convolutional Neural
Network for Sentence Classification, 2014)
● LSTM
● Hybrid model (CNN+LSTM)
Baselines we compared the performance with:
● Agents’ accuracy
● Accuracy of a CNN on Amazon reviews classifying the
categories and subcategories of the reviewed
products (6 categories, 64 subcategories)
Supervised learning
Chat example
Cat1: Cancellation
Cat2: Cancel subscription
Cat1: Order
Cat2: Subscription renewal
Cat1: Order
Cat2: New subscription
Text classification with CNN
● The model architecture is close to Kim Yoon’s Convolutional Neural Network for Sentence
Classification https://arxiv.org/abs/1408.5882
● Inspired by the blog post by Denny Britz:
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
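For illustration, a minimal sketch of this kind of architecture in Keras (not our exact production model); the vocabulary size, sequence length, filter sizes and number of classes below are assumptions:

import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 50000    # illustrative vocabulary size
MAX_LEN = 400         # transcripts are padded/truncated to a fixed length
EMBED_DIM = 128
NUM_CLASSES = 10      # e.g. the ~10 first-level categories

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# Parallel convolutions with different filter widths, max-pooled over time (Kim, 2014)
pooled = []
for width in (3, 4, 5):
    conv = layers.Conv1D(filters=100, kernel_size=width, activation="relu")(embedded)
    pooled.append(layers.GlobalMaxPooling1D()(conv))

features = layers.Concatenate()(pooled)
features = layers.Dropout(0.5)(features)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(features)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])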
Supervised learning performance
Comparison of model accuracy (cat1; cat2)
Performance does not seem very promising.
Possible reasons:
1. Customer interaction data is more complex than the review data:
● it is a conversation, so it involves multiple people
● a customer is likely to talk about multiple problems at the same time, whereas people most likely
describe only one product in their reviews
● a customer interaction is longer (in terms of number of words) than a review
● for calls, quality is lost when the audio is converted into a transcript
2. Labeling quality is significantly worse:
● agents often do not see the benefit of labeling and select the first category in a drop-down menu
● each chat/call has only one label while the conversation often covers multiple topics; and
that label is based on the agent’s interpretation
However, model performance for chats is very close to human accuracy:
61.3-71.3% on the 1st level;
52.1-69% on the 2nd level
How did we proceed?
Automatic chat classification (TensorFlow) has started:
● model performance is close to human accuracy
● creating a new classification tree is not feasible
● creating new labels is costly
Investigating ‘cheaper’ labeling/tagging alternatives:
● LDA
● tf-idf + k-means (a minimal sketch follows below)
● doc2vec + k-means
● avg word2vec + k-means
When evaluating the techniques we paid attention to:
● interpretability
● stability
● ‘fitness’ to buckets (the classification tree mapped to
smaller groups defined by business stakeholders)
Unsupervised learning
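As a reference for the tf-idf + k-means alternative listed above, a minimal scikit-learn sketch; the preprocessing, feature limits and k are illustrative, not our exact setup:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# `transcripts` is assumed to be a list of preprocessed call/chat transcripts (strings)
vectorizer = TfidfVectorizer(max_features=20000, min_df=5)
X = vectorizer.fit_transform(transcripts)

kmeans = KMeans(n_clusters=10, init="k-means++", n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X)

# Inspect the heaviest terms per cluster centroid to interpret the clusters
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(kmeans.cluster_centers_):
    top = centroid.argsort()[::-1][:10]
    print(f"cluster {i}:", ", ".join(terms[j] for j in top))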
Unsupervised learning performance
LDA
● Used the gensim LDA implementation, based on 180k call transcripts (minimal sketch below)
● After finding the most typical words per topic, we ‘map’ topics to business ‘buckets’
Buckets:
● Assurance
● Fulfillment - Order
● Fulfillment - Installation
● Internet & Telephony
● Sales
● Save
● TV
● Device & Proposition
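A hedged sketch of the gensim LDA step described above; the number of topics, passes and the dictionary filtering are assumptions:

from gensim import corpora
from gensim.models import LdaModel

# `tokenized_docs` is assumed to be a list of token lists, one per call transcript
dictionary = corpora.Dictionary(tokenized_docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop very rare / very common words
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=30, passes=5, random_state=42)

# Most typical words per topic; these lists were then mapped manually to the business buckets
for topic_id, words in lda.show_topics(num_topics=-1, num_words=10):
    print(topic_id, words)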
Unsupervised learning performance
LDA
● Mapping to ‘buckets’ is disappointing
● Some of the ‘buckets’ could not be mapped
● Some of the created topics are not easy to interpret
(Diagram: LDA topics mapped to buckets)
Unsupervised learning performance
LDA
● LDA seems to produce unstable results
● Even though stability can be improved by increasing the number of documents, this issue is
hard to fix
Unsupervised learning performance
doc2vec + k-means
● Used the gensim implementation of doc2vec to generate document vectors (sketch below)
● k-means (k-means++ initialization, k=10) groups the documents by similarity of the
document vectors generated by doc2vec
● To understand what the clusters are, we:
- used a heatmap
- used TF-IDF features to predict the clusters (decision tree) & measured feature importance
● Naturally created clusters are different from the buckets
● However, there are a few similarities:
- Cluster 7 has something to do with Assurance
- Cluster 9: with Device & Proposition
- Cluster 0: with TV / Internet & Telephony
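A minimal sketch of this doc2vec + k-means step (gensim 4.x API; vector size, epochs and k are illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans
import numpy as np

# `tokenized_docs` is assumed to be a list of token lists, one per transcript
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(tokenized_docs)]

d2v = Doc2Vec(vector_size=100, min_count=5, epochs=20, workers=4)
d2v.build_vocab(tagged)
d2v.train(tagged, total_examples=d2v.corpus_count, epochs=d2v.epochs)

doc_vectors = np.vstack([d2v.dv[i] for i in range(len(tagged))])

kmeans = KMeans(n_clusters=10, init="k-means++", n_init=10, random_state=42)
clusters = kmeans.fit_predict(doc_vectors)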
Another drawback: inferring document vectors from
doc2vec for new, unseen documents is unstable.
This can be solved by creating document vectors as a
(weighted) average of word vectors, as sketched below.
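A sketch of that more stable alternative, here with a plain (unweighted) average of word2vec vectors; tf-idf weights per word could be added the same way. Parameters are illustrative:

import numpy as np
from gensim.models import Word2Vec

# `tokenized_docs` is assumed to be a list of token lists
w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, min_count=5, workers=4)

def doc_vector(tokens, model):
    # average the vectors of the words that are in the vocabulary
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

doc_vectors = np.vstack([doc_vector(doc, w2v) for doc in tokenized_docs])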
Unsupervised learning: main results & next steps
● Unsupervised learning techniques failed to come up
with categories that are meaningful and useful to the business
● Stability & interpretability are an issue
Possible improvements:
● Guided LDA
● Semi-supervised clustering (seeding)
Investigating ontology tagging:
● together with business stakeholders, define ontology
dimensions and ontology terms
● use word embeddings to create ontology term
synonyms, ontology & document vectors
● validate tagging techniques using a manually tagged
validation set
Semi-supervised learning: ontology
Klachtenformulier (‘complaint form’) similar words: formulier (0.77),
formuliertje (0.75), formulieren (0.69), klacht (0.68),
overlast (0.65), ticket (0.64), …
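For illustration, similar-word lists like the one above can be generated from a word2vec model trained on the transcripts (here reusing the `w2v` model from the earlier averaging sketch; the query term is taken from this slide):

for word, score in w2v.wv.most_similar("klachtenformulier", topn=6):
    print(f"{word} ({score:.2f})")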
Semi-supervised learning: ontology
1. Vector similarity approach
● Generate word vectors (gensim word2vec / GloVe)
● Create document/ontology vectors by computing the (weighted) average of the vectors of
words belonging to the document/ontology
● Compute the similarity between document and ontology vectors; tag the document with the
ontology category if the similarity exceeds a threshold
2. Topic word count approach
● Generate similar words for the ontology terms
● Count the number of ontology terms (and similar words) in a document
● Normalize the count
● Assign the topic to the document if the normalized count exceeds a threshold
Ontology tagging approaches
Semi-supervised learning: ontology
Ontology: vector similarity approach
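A hedged sketch of the vector similarity approach; the ontology terms, threshold and the `doc_vector` averaging helper (from the earlier word2vec sketch) are assumptions/illustrative:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Illustrative ontology: category name -> list of ontology terms (and their synonyms)
ontology = {
    "CANCELLATION": ["opzeggen", "annuleren", "contract", "beeindigen"],
    "INTERNET": ["internet", "wifi", "storing", "modem"],
}

SIM_THRESHOLD = 0.5   # illustrative

def tag_document(tokens, model, ontology, threshold=SIM_THRESHOLD):
    doc_vec = doc_vector(tokens, model)        # averaged document vector
    tags = []
    for category, terms in ontology.items():
        ont_vec = doc_vector(terms, model)     # averaged ontology vector
        if cosine(doc_vec, ont_vec) > threshold:
            tags.append(category)
    return tags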
Semi-supervised learning: ontology
Ontology: topic word count approach
Example - Customer Journey: Cancel (2 exact term matches + 1 similar word weighted 0.8,
normalized by a document length of 1500 words):
word_count = (2 + 0.8) / 1500
score = 2.8 / 1500 * topic_weight
score > threshold -> CANCELLATION
Example - Service: Internet (3 exact term matches + 1 similar word weighted 0.8 + 4 weighted 0.7):
word_count = (3 + 0.8 + 4 * 0.7) / 1500
score = 6.6 / 1500 * topic_weight
score > threshold -> INTERNET
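A minimal sketch of the scoring illustrated in the examples above; the similar-word weights, topic_weight and threshold are illustrative:

def topic_word_count_score(tokens, exact_terms, similar_terms, topic_weight=1.0):
    # exact ontology terms count as 1; similar words contribute their similarity weight
    count = sum(1.0 for t in tokens if t in exact_terms)
    count += sum(similar_terms[t] for t in tokens if t in similar_terms)
    return count / max(len(tokens), 1) * topic_weight

THRESHOLD = 0.001   # illustrative

def tag_by_word_count(tokens, ontology_topics):
    # ontology_topics: topic name -> (set of exact terms, {similar word: weight}, topic_weight)
    return [name for name, (exact, similar, weight) in ontology_topics.items()
            if topic_word_count_score(tokens, exact, similar, weight) > THRESHOLD]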
Vector similarity: results
Avg F1 score: 0.33
Topic word count: results
Avg F1 score: 0.38
Ontology: main learnings & next steps
● Ontology does not give the results we were hoping for
● Ontology categories are still quite vague and it is very
hard to come up with good (mutually exclusive) ones!
Possible improvements:
● Using POS parsers (Frog, spaCy)
● Using phrase embeddings
● Improving the ontology categories and ontology terms
Advantages & when it can actually work
● it requires 10 times less labeled data (5k vs 50k) than a
supervised learning algorithm
● it is important to define the categories well (high
human accuracy)
● start by labeling the data and go on with the ontology approach;
if that is not successful, label more data and proceed to a CNN
