CAPSTONE - NLP
Automatic Ticket Assignment
1. Amit (Mentor)
2. Pranov (Student)
3. Hari (Student)
4. Bhargav (Student)
5. Nikhil (Student)
6. Vamsi (Student)
Abstract
Automating the task of Ticket Assignment by using advanced Machine Learning algorithms.
Table of Contents
Summary
Problem Statement
Primary Objective
Significant Findings
Benchmark
Overview and Walkthrough of solution
Machine Learning methodology
EDA and data pre-processing
Basic EDA
Data Pre-processing
Model Evaluation
Observations & Next Steps
Comparison to Benchmark
Visualisations
Implications
Limitations
Closing Reflections
Summary
Problem Statement
An incident is an unplanned interruption to an IT service, or a reduction in the quality of an IT service, that affects users and the business. The main goal of the Incident Management process is to provide a quick fix, workaround, or solution that resolves the interruption and restores the service to its full capacity, ensuring no business impact.
In the support process, incoming incidents are analysed and assessed by the organization's support teams to fulfil the request. In many organizations, better allocation and more effective usage of valuable support resources directly results in substantial cost savings.
Powerful AI techniques that classify incidents to the right functional groups can help organizations reduce issue resolution time and free support staff to focus on more productive tasks.
Primary Objective
Automate incident assignment to the right functional groups, based on the provided data, to a level of accuracy that saves time and effort for the L1/L2 groups.
The effort and time saved by automation should result in quicker and better issue resolutions.
Significant Findings
Based on the problem description and information provided:
1. L1/L2 spend a lot of time (almost 1 FTE) assigning issues to L3/functional groups. Apart from assigning issues, L1/L2 also resolve issues.
2. There were multiple instances of incidents getting assigned to the wrong functional groups.
3. Additional effort was needed by functional teams to re-assign incidents to the right groups.
4. As a consequence, some incidents sat in the queue and were not addressed in a timely manner, resulting in poor customer service.
Benchmark
In a real-world scenario:
1. Even if 50% of the assignment work is automated, it will save a lot of time and effort for L1/L2.
2. It will also improve assignment accuracy and speed up issue resolution.
3. In other words, the time and effort saved translate into cost savings and customer satisfaction.
Overview and Walkthrough of solution
Our objective in this project is to automate ticket assignment to the respective groups.
Machine Learning methodology for automatic assignment solution
Below are the key steps followed to implement the solution.
1. Understand the problem and identify the goal.
a. We identified that the major issues with the existing solution (manual assignment of incidents to functional groups) are that L1/L2 groups have to put a lot of time and effort into incident assignments, and that there are also instances of wrong assignments.
b. The objective is to automate incident assignments to the right functional groups, based on the provided data, to a level of accuracy that saves time and effort for the L1/L2 groups.
2. EDA (Exploratory Data Analysis) and data pre-processing of given dataset
a. Understanding the characteristics of the data in the given dataset.
i. In the given dataset, we have 8500 rows and 4 columns: Short
description, Description, Caller and Assignment group.
ii. There are 74 unique teams/assignment groups to which the received
tickets are assigned.
iii. GRP_0 has nearly half of the total number of tickets received.
iv. Many assignment groups receive fewer than 100 tickets.
v. Many groups have only one ticket assigned, which does not add much value to any machine learning framework.
So, based on the analysis, all groups with fewer than 100 (or 200) tickets have been aggregated into one group to make the data a little less imbalanced.
b. Data pre-processing was done in this project to convert the raw data into model-consumable data.
i. Tokenisation of the descriptions to prepare them for processing.
ii. Removal of stop words and other commonly occurring words.
iii. N-gram models to improve classification.
iv. TF-IDF embeddings for traditional models.
v. GloVe embeddings for neural-network-based models.
A few other techniques, such as polarity analysis and text readability analysis, were applied to check for correlation between these metrics and the classification.
Finally, the given dataset, after aggregation of groups and consolidation of
text, was divided into Train/Validation and Test datasets.
The final accuracy is measured on Test dataset.
3. Model evaluation based on performance metrics done in this project.
For this project, several models as described below were evaluated
a. Traditional models like Naïve Bayes, SVM, Random Forest and Xgboost.
b. Neural Networks like Bi-Directional LSTM, CNN with LSTM and GRU,
Transformer-Attention model and an ensemble Bi-Directional LSTM model.
Models were compared using a classification report with class-wise metrics (Precision, Recall, F1 score and Support). Overall accuracy was also measured.
4. Finalize and tune the model
After trying different traditional models and neural networks, the ensemble model constructed by combining similar Bi-Directional LSTM models gave the best results in this project.
It was able to predict Group 0, which has the majority of assignments, with a precision of 0.83 and an overall accuracy of 0.73.
5. Predict using the model, with a certain confidence level, in real life.
a. The model developed in the project predicts Group 0 assignments with a
precision of 0.83.
b. This has already exceeded the benchmark based on the current manual
process.
c. By using this automation, time and effort on assignments will be reduced.
In the future, with more advanced techniques like SMOTE for data balancing, the model can predict with a better confidence level.
EDA and data pre-processing
Basic EDA
EDA(Exploratory Data Analysis) is a method of analysing a dataset guided by what we look
for, how we look, and how we interpret.
The tools used in this project for EDA are statistical and graphical techniques which are often
simple graphs and plots:
1. Plotting the provided (raw) data as graphs such as histograms, probability plots, etc.
A countplot was used in this project to see the frequency distribution of all ticket assignments.
A similar plot was produced after creation of the "Grouped Assignment" class, which comprises all classes with fewer than 100/200 tickets assigned to them.
2. Verifying presence of NULLs and Identifying rows with missing values
After identifying the missing values, they were handled by replacing missing Descriptions with Short Descriptions and vice versa.
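A minimal pandas sketch of this swap; the file name is illustrative, while the column names are as given in the EDA above:

import pandas as pd

# Load the ticket dataset (file name assumed for illustration).
df = pd.read_excel("input_data.xlsx")

# Fill a missing Description from the Short description, and vice versa.
df["Description"] = df["Description"].fillna(df["Short description"])
df["Short description"] = df["Short description"].fillna(df["Description"])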
Data Pre-processing
Data Pre-processing refers to analysis and activities involved in transforming the data to such
a state that a machine can easily parse it.
Raw text is unstructured, consisting of words, dates, numbers and facts that cannot be directly interpreted by a machine.
The text data is first cleaned and prepared so that it can be consumed by an ML algorithm.
Tokenization is the process of chopping the raw text into pieces, called tokens.
Tokens are usually equated to words, but they can also be small groups of words or character sequences.
In this project, Tokenization has been used in several contexts:
1. Tokenize to remove stop words using NLTK.
2. Tokenize at word level, N-gram level and character level for TF-IDF
3. Tokenize the text for neural network model building.
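A minimal sketch of the first of these, tokenizing and removing English stop words with NLTK; the sample sentence is illustrative:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def clean_tokens(text):
    # Tokenize, lower-case, and drop punctuation and English stop words.
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t.isalpha() and t not in stop_words]

print(clean_tokens("Unable to log in to the ERP system since this morning"))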
Stop words are usually the most common words in a language.
But, depending on the problem and context, a few other commonly used words within the corpus can also be treated as stop words.
Stop words and punctuation are usually removed before giving the text to ML algorithms, to improve performance.
In this project, common English stop words were removed.
There was also an attempt to identify commonly used words within the corpus and remove them, but this did not yield better results.
Word clouds are graphical representations of word frequency that give greater prominence
to most frequent words in the text.
They can be used to identify stop words.
Figures: a word cloud highlighting the stop words in our project data, and the same cloud after stop-word removal.
An N-gram is a sequence of N words. For example, "The Final Report" has three words and so is a trigram.
If we assign a probability to the occurrence of an N-gram, it can help make next-word predictions. We can also check whether there is a relation between certain N-grams and the classification.
Figure: top 20 trigrams in the ticket descriptions.
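A sketch of how such trigram counts can be produced with scikit-learn's CountVectorizer; the sample documents are illustrative:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "unable to login to erp system",
    "password reset not working in erp system",
    "unable to login to vpn",
]

# Count every trigram in the corpus and list the 20 most frequent ones.
vec = CountVectorizer(ngram_range=(3, 3))
counts = vec.fit_transform(docs)
freqs = counts.sum(axis=0).A1  # total frequency of each trigram
top20 = sorted(zip(vec.get_feature_names_out(), freqs), key=lambda x: -x[1])[:20]
for gram, freq in top20:
    print(gram, freq)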
Polarity refers to the emotion expressed in a sentence; it can be positive, negative or neutral.
In this project, an attempt was made to see whether there is a relation between polarity and classification to a certain group.
The TextBlob package in Python can be used to determine polarity.
Figure: distribution of ticket assignments by polarity.
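A minimal sketch of scoring polarity with TextBlob, assuming the DataFrame df from the EDA above:

from textblob import TextBlob

# Polarity ranges from -1.0 (negative) through 0 (neutral) to +1.0 (positive).
df["polarity"] = df["Description"].apply(lambda t: TextBlob(str(t)).sentiment.polarity)
print(df[["Assignment group", "polarity"]].groupby("Assignment group").mean())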
Readability (index) is a metric to quantify how hard a text is to read. There are several
methods to determine readability indices like Flesch Reading Ease, Dale Chall Readability
Score, and Gunning Fog Index.
TextStat library in Python can be used to determine these readability indices.
In this project, an attempt was made to see whether there is a relation between the readability index and the classification (that is, group assignment).
It was observed that the readability of descriptions assigned to Group_0 is higher, going by the Flesch Reading Ease score.
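A small sketch using the TextStat library; the sample description is illustrative:

import textstat

desc = "Unable to connect to the company VPN after the latest update"
print(textstat.flesch_reading_ease(desc))
print(textstat.dale_chall_readability_score(desc))
print(textstat.gunning_fog(desc))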
Aggregations can be done to put the data in a better perspective.
Behaviour of this grouped data is more stable than individual data objects.
In the dataset provided for this project, there are a lot of assignment groups with a frequency of less than 100 tickets. All such groups with fewer than 100/200 tickets have been aggregated into one group to make the data a little less imbalanced.
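A sketch of this aggregation in pandas, assuming the DataFrame df and column names from the EDA above; the threshold is the 100/200 cut-off discussed:

# Map every group with fewer tickets than the threshold into one class.
THRESHOLD = 100  # 200 was also tried in this project

counts = df["Assignment group"].value_counts()
rare = counts[counts < THRESHOLD].index
df["Assignment group"] = df["Assignment group"].replace(
    {g: "Grouped Assignment" for g in rare}
)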
Feature Encoding or Feature extraction is the process of converting text to numbers so that
the Machine Learning algorithm can efficiently process it.
TF-IDF, which stands for term frequency — inverse document frequency, is a scoring
measure intended to reflect how relevant a term is in a given document.
The idea is that if a word occurs frequently in a document, we should boost its relevance, as it should be more meaningful than other words (TF). But if it also occurs frequently in many other documents, maybe it is just a frequent word rather than a relevant or meaningful one (IDF).
How does TF-IDF work?
The product of TF and IDF gives the TF-IDF score.
Consider a term t and assign it a weight in document d such that the weight is:
• highest when t occurs many times within the document;
• lower when the term occurs in many documents;
• lowest when the term occurs in almost all documents.
In this project, TF-IDF has been used to vectorise N-gram datasets (derived from raw data).
This vectorised data is then used as input for traditional ML models.
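A minimal sketch with scikit-learn, assuming train_texts and test_texts hold the consolidated descriptions after the split; parameters such as max_features and the alternative analyzers are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF; analogous vectorizers can be built at the n-gram
# level (ngram_range=(2, 3)) and character level (analyzer="char").
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(train_texts)  # fit on training data only
X_test_tfidf = tfidf.transform(test_texts)        # reuse the fitted vocabulary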
Word embeddings are a type of word representation that allows words with similar meaning
to have a similar representation. They are basically lower dimensional, dense vector
representations for words.
While TF-IDF vectors can also be called embeddings, they are sparse vectors that do not add much semantic meaning to the representation.
GloVe from Stanford and Word2vec from Google are popular word embeddings used for
NLP.
GloVe stands for Global Vectors for Word Representation.
• It is an unsupervised learning algorithm for obtaining vector representations for words.
• It captures word-word co-occurrences across the entire corpus.
• It suppresses rare co-occurrences and prevents frequent co-occurrences from dominating.
• GloVe is often reported to be more accurate than Word2vec.
Figure: GloVe compared with other techniques.
Finally, GloVe pre-trained weights can be used as the embedding for neural network models.
In this project, GloVe pre-trained weights are used as the embedding layer for the different neural-network-based models.
Model Evaluation
Let's try a couple of classification algorithms for our model building.
Classification Algorithms
The classification step usually involves a statistical model like Naïve Bayes, Logistic
Regression, Support Vector Machines, or Neural Networks:
• Naïve Bayes: a family of probabilistic algorithms that uses Bayes' Theorem to predict the category of a text.
• Logistic Regression: a very well-known statistical algorithm used to predict the probability of a class (Y) given a set of features (X).
• Support Vector Machines: a non-probabilistic model that represents text examples as points in a multidimensional space. Examples of different categories are mapped to distinct regions within that space; new texts are then assigned a category based on their similarity to existing texts and the regions they are mapped to.
• Deep Learning: a diverse set of algorithms that attempt to mimic the human brain by employing artificial neural networks to process data.
Traditional ML Models
Machine learning is an algorithm or model that learns patterns in data and then predicts
similar patterns in new data. For example, if you want to classify children’s books, it would
mean that instead of setting up precise rules for what constitutes a children’s book,
developers can feed the computer hundreds of examples of children’s books. The computer
finds the patterns in these books and uses that pattern to identify future books in that
category.
Essentially, ML is a subset of artificial intelligence that enables computers to learn without
being explicitly programmed with predefined rules. It focuses on the development of
computer programs that can teach themselves to grow and change when exposed to new data.
This predictive ability, in addition to the computer’s ability to process massive amounts of
data, enables ML to handle complex business situations with efficiency and accuracy.
Let’s look at the different machine learning models used for this particular problem
statement.
Naïve Bayes
It is a classification technique based on Bayes’ Theorem with an assumption of independence among
predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature
in a class is unrelated to the presence of any other feature.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other or upon the existence of the other features, all
of these properties independently contribute to the probability that this fruit is an apple and that is
why it is known as ‘Naive’.
The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to sometimes outperform even highly sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):
P(c|x) = P(x|c) · P(c) / P(x)
 P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
 P(c) is the prior probability of class.
 P(x|c) is the likelihood which is the probability of predictor given class.
 P(x) is the prior probability of predictor.
See code implementation of Naive Bayes for this particular problem -
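A minimal sketch of such a classifier, using MultinomialNB (a common choice for TF-IDF features) on the vectorised data from the previous section; variable names are assumed:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)   # y_train: encoded assignment groups
print(classification_report(y_test, nb.predict(X_test_tfidf)))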
Results:
The accuracy and precision achieved were far too low for this model to be considered.
Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges, though it is mostly used for classification. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyperplane that best differentiates the classes.
Support vectors are simply the coordinates of individual observations. The SVM classifier is the frontier (hyperplane/line) that best segregates the classes.
Please see code snippet used to implement SVM -
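A minimal sketch along the same lines, using scikit-learn's LinearSVC (a linear-kernel SVM, which suits sparse TF-IDF features); variable names are assumed:

from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

svm = LinearSVC()
svm.fit(X_train_tfidf, y_train)
print(classification_report(y_test, svm.predict(X_test_tfidf)))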
Again, the results were not good enough to be considered.
Random Forest
Random forest is a bootstrapping (bagging) algorithm built on decision tree (CART) models. Say we have 1000 observations in the complete population with 10 variables. Random forest builds multiple CART models with different samples and different initial variables. For instance, it will take a random sample of 100 observations and 5 randomly chosen initial variables to build a CART model. It repeats the process (say) 10 times and then makes a final prediction for each observation. The final prediction is a function of the individual predictions; for classification it is typically the majority vote (for regression, the mean).
Code snippet for Random Forest Model below –
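A minimal sketch of a random forest on the same features; the hyperparameters are illustrative:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_tfidf, y_train)
print(classification_report(y_test, rf.predict(X_test_tfidf)))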
The results with traditional ML models are not encouraging, so let's try deep learning models and see how they perform.
Deep Learning
Deep learning text classification model architectures generally consist of the following
components connected in sequence.
Architecture:
Let's go through them one by one with the code developed for our automated ticket classification problem.
Embedding Layer
Word Embedding is a representation of text where words that have the same
meaning have a similar representation. In other words, it represents words in a
coordinate system where related words, based on a corpus of relationships, are placed
closer together. In deep learning frameworks such as TensorFlow and Keras, this part is usually handled by an embedding layer, which stores a lookup table mapping words (represented by numeric indexes) to their dense vector representations.
Two popular examples of methods of learning word embeddings from text include:
Word2Vec
Word2Vec is the first neural embedding model (or at least the first to gain popularity, in 2013) and is still the one used by most researchers. Doc2Vec, its descendant, is likewise the most popular model for paragraph representation, and was inspired by Word2Vec.
GloVe (Global Vectors for Word Representation)
The global word representation approach captures the meaning of one word embedding using the structure of the whole observed corpus; word frequency and co-occurrence counts are the main measures on which the majority of unsupervised algorithms are based. The GloVe model trains on global co-occurrence counts of words and makes efficient use of statistics by minimizing least-squares error, producing, as a result, a word vector space with meaningful substructure. Such an outline sufficiently preserves word similarities in terms of vector distance.
Please see below code used for glove word embedding.
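A minimal sketch of parsing the GloVe file and building an embedding matrix for our vocabulary; word_index is assumed to come from a fitted Keras Tokenizer:

import numpy as np

EMBEDDING_DIM = 200

# Parse the pre-trained GloVe file into a word -> vector dictionary.
embeddings_index = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Build the embedding matrix for our own vocabulary; rows for words
# not found in GloVe stay as zero vectors.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector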
From the above, we have picked "glove.6B.200d.txt" because it contains GloVe embeddings trained on a corpus of 6 billion tokens, with each word represented as a 200-dimensional vector.
Since the embedding size is defined as 200, we require the 200-dimensional GloVe pre-trained embedding file. Similarly, for any problem, the embedding file should be chosen to match the embedding size defined in the model.
Deep Network
Deep network takes the sequence of embedding vectors as input and converts them to
a compressed representation. The compressed representation effectively captures all
the information in the sequence of words in the text. The deep network part is usually an RNN or some form of it, like LSTM/GRU. Dropout is added to overcome the tendency to overfit, a very common problem with RNN-based networks.
LSTM Networks
Long Short Term Memory networks — usually just called “LSTMs” — are a special
kind of RNN, capable of learning long-term dependencies. They were introduced
by Hochreiter & Schmidhuber (1997), and were refined and popularized by many
people in following work. They work tremendously well on a large variety of
problems, and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem.
Remembering information for long periods of time is practically their default
behaviour, not something they struggle to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural
network. LSTMs also have this chain like structure, but the repeating module has a
different structure. Instead of having a single neural network layer, there are four,
interacting in a very special way.
In the above diagram, each line carries an entire vector, from the output of one node
to the inputs of others. The pink circles represent pointwise operations, like vector
addition, while the yellow boxes are learned neural network layers. Lines merging
denote concatenation, while a line forking denotes its content being copied and the
copies going to different locations.
LSTM Model
Here we have imported the necessary libraries for LSTM model building
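A minimal sketch of such a model in Keras, using the GloVe embedding matrix built earlier. The 200-dimensional embedding and the 64-unit and 9-unit dense layers match the description that follows, while the sequence length, LSTM width, dropout rate and loss are illustrative assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

MAX_LEN = 100      # padded sequence length (assumed)
NUM_CLASSES = 9    # target groups after aggregation

model = Sequential([
    # GloVe weights loaded earlier; kept frozen during training.
    Embedding(input_dim=embedding_matrix.shape[0], output_dim=200,
              weights=[embedding_matrix], input_length=MAX_LEN,
              trainable=False),
    Bidirectional(LSTM(128)),          # reads the sequence in both directions
    Dropout(0.3),                      # regularization against overfitting
    Dense(64, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",  # integer-encoded labels assumed
              optimizer="adam", metrics=["accuracy"])
model.summary()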
To use Keras on text data, we first have to pre-process it. For this, we have used GloVe word embeddings to embed the text into 200-dimensional vectors. In the above code, we have used a Bidirectional LSTM, an extension of the traditional LSTM that can improve model performance on sequence classification problems.
The following is a graphical representation of how the Bidirectional LSTM in the above code works internally.
Bidirectional recurrent neural networks(RNN) are really just putting two independent
RNNs together. The input sequence is fed in normal time order for one network, and
in reverse time order for another. The outputs of the two networks are usually
concatenated at each time step, though there are other options, e.g. summation.
This structure allows the networks to have both backward and forward information
about the sequence at every time step.
Now let's look at the fully connected layer and why we require it.
Fully Connected Layer
The fully connected layer takes the deep representation from the RNN/LSTM/GRU
and transforms it into the final output classes or class scores. This component comprises fully connected layers along with batch normalization and, optionally, dropout layers for regularization.
Why do we need dense layers in the above model?
After introducing neural networks and linear layers, and after stating the limitations of
linear layers, we introduce here the dense (non-linear) layers.
In general, they have the same formula as the linear layers, wx + b, but the end result is passed through a non-linear function called an activation function.
The “Deep” in deep-learning comes from the notion of increased complexity resulting
by stacking several consecutive (hidden) non-linear layers. Here are some graphs of
the most famous activation functions:
In the model above, the first dense layer has 64 units with "relu" as the activation function, since ReLU keeps outputs on the positive side and suits multi-layer neural networks.
Output Layer
Based on the problem at hand, this layer can use either sigmoid for binary classification or softmax for binary as well as multi-class classification output.
Finally, the second dense layer, i.e. the output layer, has 9 units with "softmax" as the activation function; the number of units must be 9 because our target classification has 9 groups, as defined during data pre-processing.
With above, we have completed the model building and now let’s understand the
performance of the model on train & test data.
Model Scoring & Selection
Our model scoring and selection is based on the standard evaluation metrics Accuracy,
Precision, and F1 score, which are defined as follows:
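Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)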
where:
· TP represents the number of true positive classifications, i.e., the records with the actual label A that have been correctly classified ("predicted") as label A.
· TN is the number of true negative classifications, i.e., the records with an actual label not equal to A that have been correctly classified as not belonging to label A.
· FP is the number of false positive classifications, i.e., records with an actual label other than A that have been incorrectly classified as belonging to category A.
· FN is the number of false negatives, i.e., records with a label equal to A that have been incorrectly classified as not belonging to category A.
Now let's look at the performance of our model.
Let's examine the precision, recall, etc. of each individual group. Before that, let's understand what precision, recall, etc. mean.
The combination of predicted and actual classifications is known as the "confusion matrix". Against the background of these definitions, the various evaluation metrics provide the following insights:
· Accuracy: The proportion of the total number of model predictions that were
correct.
· Precision (also called positive predictive value): The proportion of correct
predictions relative to all predictions for that specific class.
· Recall (also called true positive rate, TPR, hit rate or sensitivity): The proportion of
true positives relative to all the actual positives.
· False positive rate (also called false alarm rate): The proportion of false positives
relative to all the actual negatives (FPR).
· F1: (The more robust) Harmonic mean of Precision and Recall.
Let's have a look at the confusion matrix as well.
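A minimal sketch of producing the confusion matrix and per-group classification report with scikit-learn; the padded test sequences and encoded labels are assumed from the earlier pre-processing:

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

probs = model.predict(X_test)          # softmax probabilities per group
y_pred = np.argmax(probs, axis=1)      # most likely group per ticket
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))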
The Bi-Directional LSTM, attempted on the dataset after aggregating all groups with a frequency of less than 200, gave an accuracy of 71% and a precision of 71%.
Now, let's try a CNN model and examine its performance.
CNN(Convolutional Neural Network)
A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm
which can take in an input image, assign importance (learnable weights and biases) to various
aspects/objects in the image and be able to differentiate one from the other. The pre-
processing required in a ConvNet is much lower as compared to other classification
algorithms. While in primitive methods filters are hand-engineered, with enough training,
ConvNets have the ability to learn these filters/characteristics.
The architecture of a ConvNet is analogous to that of the connectivity pattern of Neurons in
the Human Brain and was inspired by the organization of the Visual Cortex. Individual
neurons respond to stimuli only in a restricted region of the visual field known as the
Receptive Field. A collection of such fields overlaps to cover the entire visual area.
From the above, we can see that the convolution and max-pooling layers are the major parts of a CNN. Let's understand them one by one.
Convolutional Layer
Convolutional layers are the major building blocks used in convolutional neural networks.
A convolution is the simple application of a filter to an input that results in an activation.
Repeated application of the same filter to an input results in a map of activations called a
feature map, indicating the locations and strength of a detected feature in an input, such as an
image.
Max Pooling Layer
Maximum pooling, or max pooling, is a pooling operation that calculates the maximum, or
largest, value in each patch of each feature map. We can make the max pooling operation concrete by applying it to the output feature map of a convolution and manually calculating the first row of the pooled feature map.
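A minimal sketch of a Conv1D text classifier in Keras, reusing the embedding matrix and constants from the LSTM section; the filter counts and kernel sizes are illustrative assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     GlobalMaxPooling1D, Dense)

cnn = Sequential([
    Embedding(input_dim=embedding_matrix.shape[0], output_dim=200,
              weights=[embedding_matrix], input_length=MAX_LEN,
              trainable=False),
    Conv1D(filters=128, kernel_size=5, activation="relu"),  # learn n-gram-like filters
    MaxPooling1D(pool_size=2),                              # keep the strongest activations
    Conv1D(filters=64, kernel_size=5, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(64, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),
])
cnn.compile(loss="sparse_categorical_crossentropy",
            optimizer="adam", metrics=["accuracy"])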
From the above, we can see that the validation accuracy is 60%, so the CNN is not performing well on this classification problem.
Let's take stock of our observations so far and the next course of action.
Observations & Next Steps:
• Traditional ML models with TF-IDF embeddings did not give encouraging results.
• Multiple deep learning sequential models with GloVe embeddings were attempted and their results compared to arrive at the best model. The two best models are highlighted below through their results.
• The Bi-Directional LSTM, attempted on the dataset after aggregating all groups with a frequency of less than 200, gave an accuracy of 71% and a precision of 71%.
• The accuracy and precision improved further to 73% and 76% respectively when an ensemble of 7 Bi-LSTMs was built (see the sketch after this list).
• In both of the above attempts, the precision for Group_0 was even higher: 82% and 86% respectively.
• Since Group_0 constitutes close to 50% of the total population, with 82% precision we are likely to achieve automation of close to 40% with Group_0 alone.
• With an average precision of 76%, we expect to save around 269 minutes of reading time (for reading 8500 manually raised tickets), considering each document takes 2.5 seconds to read.
• The team will continue experimenting to improve the precision further, and with it the level of automation.
• Considering that this is a highly imbalanced dataset, there is always a trade-off between automating the routing of tickets to more groups and routing tickets more accurately to fewer groups.
• The ensemble model gives us the best result and is recommended for deployment in the production environment.
• Future work includes applying SMOTE to balance the data and re-running the models to see whether the results improve.
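The report does not specify how the 7 Bi-LSTMs were combined; below is a minimal sketch of one common combination rule, averaging the members' softmax outputs (variable names assumed):

import numpy as np

def ensemble_predict(models, X):
    # Average the softmax outputs of all members, then take the argmax.
    avg_probs = np.mean([m.predict(X) for m in models], axis=0)
    return np.argmax(avg_probs, axis=1)

# models: list of 7 independently trained Bi-Directional LSTM models
y_pred = ensemble_predict(models, X_test)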
Comparison to Benchmark
With the models we have built, we have been able to achieve roughly 75% accuracy and precision. This means that about 75% of our predictions are accurate, which is a massive benefit to the company in reducing manual routing of tickets. We started the project with the assumption that if we could automate even 50% of the task, the company would be much better off. With the application of AI, we have gone well past 50% automation, reaching around 75%. Another important point is that for Group_0, which constitutes close to 50% of total tickets, we have achieved more than 85% precision. Putting all this together, we can confidently say that we have exceeded the benchmark laid out at the outset.
Visualisations
The visualizations supporting the insights gleaned from the data (count plots of assignment-group frequencies, word clouds before and after stop-word removal, top-trigram charts, polarity and readability distributions, and confusion matrices) are presented in the EDA and model evaluation sections above.
Implications
Classification of text is a very common problem in the corporate world. Traditionally, assigning tickets, customer queries, etc. to the appropriate team was largely done manually, which is cumbersome and adds no value. With the advancements in AI, all corporations should look to apply NLP solutions to automate these tasks with fairly good levels of accuracy. This automates redundant, non-value-added activities, releasing staff for more complex work.
In this project, we were able to achieve automated classification of tickets to the appropriate teams with an accuracy of 73% and a precision of 75%. Across the deep learning models built for this classification, accuracy ranged from 70% to 73%. We achieved close to 86% precision for Group_0, which helps us automate more than 40% of the total tickets.
The bottom line for this project is that we may sacrifice automation of rare ticket assignments, but we will automate the assignment of the more frequent issues, which will still be a great help to the company.
Since Group_0 constitutes close to 50% of the total population, with 86% precision we are likely to achieve automation of close to 40% with Group_0 alone.
With an average precision of 76%, we expect to save around 269 minutes of reading time (for reading 8500 manually raised tickets), considering each document takes 2.5 seconds to read.
Limitations
The dataset had only 8500 observations and was highly imbalanced. There were 74 classes in the target variable, with a single class constituting about 50% of the total observations.
There are 55 groups whose observations each amount to less than 1% of the total. This is essentially a problem of high cardinality, where the majority of classes have too few observations for any machine learning algorithm to learn the patterns and predict or classify accurately on unseen data.
A lot of ticket assignments have a frequency of less than 100. We end up with only 9 classes when we aggregate all groups with fewer than 200 assignments.
This was our best approach given the above limitation, and it gives any machine learning algorithm a better chance to learn the underlying patterns in the data.
The category holding all the aggregated tickets will still need to be reviewed manually, but ticket assignment for the remaining categories can potentially be automated.
However, all the tickets tagged to the common group should be monitored and information collected until we have enough observations for all the underlying classes.
Once we have a good number of observations for all classes, we can recommend that the company revisit the solution for a more inclusive one.
Closing Reflections
"Not everything that counts can be counted, and not everything that can be counted counts" -
Albert Einstein
Natural Language Processing (NLP) is a subfield of artificial intelligence that helps
computers understand human language. Using NLP, machines can make sense of
unstructured data so that we can gain valuable insights.
We use explicit and implicit rules to convey an idea during conversations. Most of the words we use flow out simply because of the probability that a given word comes after or before another word we are using. This probability can be reverse-engineered from a corpus of things we say or write.
Deep Learning and NLP algorithms help us achieve this, and with them we can train machines to understand text. This helps automate many manual tasks, such as reading texts and classifying them by theme. Companies use resources to read complaints received from customers and route them to the specific teams that address them. We may have to read reported issues and classify them by root cause, such as accuracy, completeness or timeliness, for further action. Companies may have to read through customer reviews to identify dissatisfied customers.
In this project, we learnt to apply various NLP techniques to generate insights from text data, to identify the problems associated with the dataset, and then to leverage multiple NLP and deep learning techniques to solve the problem. We calculated the readability standard of the texts, the reading time per description, and the polarity of each description, all of which were useful insights.
We then converted the plain text to a sequential numeric format via embeddings (using pre-computed models) so that it could be fed to our deep learning models. This project gave us the opportunity to experiment with multiple approaches and algorithms to arrive at the solution that works best given the limitations. The use of Bi-Directional LSTM and Transformer layers seemed to give us the best results, and reiterated the common-sense observation that words with context convey more than words alone. We also saw that hybrid models, such as combinations of CNN with LSTM/GRU/Bi-LSTM, can sometimes give good results, though not in this case.
Lastly, we saw that the ensemble of Bi-LSTMs gave us the best results, indicating that synergy among resources can achieve more than any single good resource.
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 

Recently uploaded (20)

How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 

Automation of IT Ticket Assignment using NLP and Deep Learning

The effort and time saved by automation should result in quicker and better issue resolutions.

Significant Findings
Based on the problem description and available information:
1. L1/L2 teams spend a lot of time (almost 1 FTE) assigning issues to L3/functional groups. Apart from assigning issues, L1/L2 also resolve issues.
2. There were multiple instances of incidents being assigned to the wrong functional groups.
3. Additional effort was needed from functional teams to re-assign incidents to the right functional groups.
4. As a consequence, some incidents sat in a queue and were not addressed in time, resulting in poor customer service.

Benchmark
In a real-world scenario:
1. Even if 50% of the assignment work is automated, it will save a lot of time and effort for L1/L2.
2. Automation will also improve assignment accuracy and speed up issue resolution.
3. In other words, the time and effort saved translate into cost savings and customer satisfaction.

Overview and Walkthrough of Solution
Our objective in this project is to automate ticket assignment to the respective groups.

Machine Learning Methodology for the Automatic Assignment Solution
Below are the key steps followed to implement the solution.
1. Understand the problem and identify the goal.
   a. The major issues with the existing solution (manual assignment of incidents to functional groups) are that L1/L2 groups put a lot of time and effort into incident assignment, and that there are also instances of wrong assignments.
   b. The objective is to automate incident assignment to the right functional groups, based on the provided data, to a level of accuracy that saves time and effort for the L1/L2 groups.
2. EDA (Exploratory Data Analysis) and data pre-processing of the given dataset.
   a. Understanding the characteristics of the data:
      i. The dataset has 8500 rows and 4 columns: Short description, Description, Caller and Assignment group.
      ii. There are 74 unique teams/assignment groups to which received tickets are assigned.
      iii. GRP_0 accounts for nearly half of all tickets received.
      iv. Many groups receive fewer than 100 tickets.
      v. Many groups have only one ticket assigned, which does not add much value to any machine learning framework.
Based on this analysis, all groups with fewer than 100 (or 200) tickets were aggregated into one group to make the data a little less imbalanced.
   b. Data pre-processing performed in this project to convert the raw data into model-consumable data:
      i. Tokenisation of the descriptions to prepare them for processing.
      ii. Removal of stop words and other commonly occurring words.
      iii. N-gram models to improve classification.
      iv. TF-IDF embeddings for traditional models.
      v. GloVe embeddings for neural-network-based models.
   A few other techniques, such as polarity analysis and text-readability analysis, were applied to check for correlation between these metrics and the classification. Finally, the dataset, after aggregation of groups and consolidation of text, was divided into train/validation and test sets. The final accuracy is measured on the test set.
3. Model evaluation based on performance metrics. Several models were evaluated:
   a. Traditional models such as Naïve Bayes, SVM, Random Forest and XGBoost.
   b. Neural networks such as a Bi-Directional LSTM, CNN combined with LSTM and GRU, a Transformer-Attention model, and an ensemble of Bi-Directional LSTMs.
   Models were compared using a classification report with class-wise metrics (Precision, Recall, F1 score and Support); overall accuracy was also measured.
4. Finalise and tune the model. After trying the different traditional models and neural networks, the ensemble constructed by combining similar Bi-Directional LSTM models gave the best results in this project. It was able to predict Group 0, which has the majority of assignments, with a precision of 0.83 and an overall accuracy of 0.73.
5. Predict using the model, with a certain confidence level, in real life.
   a. The model developed in this project predicts Group 0 assignments with a precision of 0.83.
   b. This already exceeds the benchmark based on the current manual process.
   c. By using this automation, time and effort spent on assignments will be reduced. In future, with techniques such as SMOTE for data balancing, the model may predict with a better confidence level.
EDA and data pre-processing

Basic EDA
EDA (Exploratory Data Analysis) is a method of analysing a dataset guided by what we look for, how we look, and how we interpret. The tools used for EDA in this project are statistical and graphical techniques, often simple graphs and plots:
1. Plotting the provided (raw) data as graphs such as histograms and probability plots. A countplot was used in this project to see the frequency distribution of ticket assignments across all groups, and a similar plot was drawn after creating the "Grouped Assignment" class, which comprises all classes with fewer than 100/200 tickets assigned to them.
2. Verifying the presence of NULLs and identifying rows with missing values.
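A minimal sketch of this basic EDA, assuming pandas and seaborn; the input file name is hypothetical, while the column names follow the dataset description:

```python
# Basic EDA sketch: frequency of assignment groups and missing values.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_excel("input_data.xlsx")   # Short description, Description, Caller, Assignment group

# Countplot: frequency distribution of ticket assignments across the 74 groups
plt.figure(figsize=(14, 4))
sns.countplot(x="Assignment group", data=df,
              order=df["Assignment group"].value_counts().index)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

# Verify presence of NULLs and identify rows with missing values
print(df.isnull().sum())
print(df[df["Short description"].isnull() | df["Description"].isnull()])
```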
After identifying the missing values, they were handled by replacing missing Descriptions with Short Descriptions and vice versa.

Data Pre-processing
Data pre-processing refers to the analysis and activities involved in transforming the data into a state that a machine can easily parse. Raw text is unstructured, consisting of words, dates, numbers and facts that cannot be interpreted by a machine as-is. The text data is first cleaned and prepared so that it can be consumed by an ML algorithm.

Tokenization is the process of chopping raw text into pieces called tokens. Tokens are usually equated with words, but they can also be small groups of words or character sequences. In this project, tokenization has been used in several contexts:
1. Tokenize to remove stop words using NLTK.
2. Tokenize at word level, N-gram level and character level for TF-IDF.
3. Tokenize the text for neural-network model building.

Stop words are usually the most common words in a language, but depending on the problem and context, other words commonly used within the corpus can also be treated as stop words. Stop words and punctuation are usually removed before feeding the text to ML algorithms, to improve performance. In this project, common English stop words were removed. An attempt was also made to identify and remove words commonly used within the corpus, but this did not yield better results.
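A sketch of the tokenisation and stop-word removal step with NLTK, reusing df from the EDA sketch above (the cleaned column name is illustrative):

```python
# Tokenise the descriptions and drop English stop words with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def clean_text(text):
    tokens = word_tokenize(str(text).lower())
    # keep alphabetic tokens that are not stop words
    return " ".join(t for t in tokens if t.isalpha() and t not in stop_words)

df["clean_description"] = df["Description"].apply(clean_text)
```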
Word clouds are graphical representations of word frequency that give greater prominence to the most frequent words in a text. They can be used to identify stop words. A word cloud was drawn highlighting the stop words in our project data.
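A sketch of how such a word cloud can be produced, assuming the wordcloud package:

```python
# Word cloud over the raw descriptions; the most frequent words stand out.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = " ".join(df["Description"].astype(str))
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```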
A second word cloud was drawn after removal of stop words.

An N-gram is a sequence of N words. For example, "The Final Report" has three words and so is a trigram. If we assign a probability to the occurrence of an N-gram, it can help make next-word predictions. We can also check whether there is a relation between certain N-grams and the classification.
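A sketch of extracting the most frequent trigrams with scikit-learn, assuming the cleaned text column from the earlier sketch:

```python
# Count trigrams and list the 20 most frequent ones.
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(3, 3))
X = vec.fit_transform(df["clean_description"])

counts = X.sum(axis=0).A1                 # total occurrences of each trigram
trigrams = vec.get_feature_names_out()
top20 = sorted(zip(trigrams, counts), key=lambda p: -p[1])[:20]
for gram, n in top20:
    print(f"{gram}: {n}")
```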
The top 20 trigrams in the ticket descriptions were plotted.

Polarity is the emotion expressed in a sentence; it can be positive, negative or neutral. In this project, an attempt was made to see whether there is a relation between polarity and classification to a certain group. The TextBlob package in Python can be used to determine polarity.
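A sketch of the polarity computation with TextBlob (polarity ranges from -1 for negative to +1 for positive):

```python
# Compute a polarity score per description and compare across groups.
from textblob import TextBlob

df["polarity"] = df["Description"].astype(str).apply(
    lambda t: TextBlob(t).sentiment.polarity)

print(df.groupby("Assignment group")["polarity"].mean().sort_values())
```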
The assignments were then represented by polarity.

Readability (index) is a metric that quantifies how hard a text is to read. There are several readability indices, such as the Flesch Reading Ease, the Dale-Chall Readability Score, and the Gunning Fog Index; the TextStat library in Python can be used to compute them. In this project, an attempt was made to see whether there is a relation between readability index and classification (that is, group assignment). It was observed that, by the Flesch Reading Ease score, the readability of descriptions assigned to Group_0 is higher.
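A sketch of the readability computation with the textstat package:

```python
# Flesch Reading Ease per description; higher scores mean easier text.
import textstat

df["flesch_score"] = df["Description"].astype(str).apply(
    textstat.flesch_reading_ease)

print(df.groupby("Assignment group")["flesch_score"].mean().sort_values(ascending=False))
```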
Aggregations can be performed to put the data in better perspective; the behaviour of grouped data is more stable than that of individual data objects. In the dataset provided for this project, many assignment groups have a frequency of less than 100 tickets. All groups with fewer than 100/200 tickets were therefore aggregated into one group to make the data a little less imbalanced.

Feature encoding (or feature extraction) is the process of converting text to numbers so that a machine learning algorithm can process it efficiently. TF-IDF, which stands for term frequency - inverse document frequency, is a scoring measure intended to reflect how relevant a term is in a given document. The idea is that if a word occurs frequently in a document, we should boost its relevance, as it should be more meaningful than other words (TF). But if it also occurs frequently in many other documents, it is probably just a frequent word rather than a relevant or meaningful one (IDF).

How does TF-IDF work? The product of TF and IDF gives the TF-IDF score. For a term t in document d, the weight is assigned such that it is:
• highest when t occurs many times within the document;
• lower when the term occurs in many documents;
• lowest when the term occurs in almost all documents.

In this project, TF-IDF has been used to vectorise the N-gram datasets derived from the raw data. This vectorised data is then used as input for the traditional ML models.
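A sketch of the TF-IDF vectorisation at word, N-gram and character level, assuming scikit-learn (feature counts are illustrative):

```python
# Three TF-IDF views of the cleaned text, used as inputs to the traditional models.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_word  = TfidfVectorizer(analyzer="word", max_features=5000)
tfidf_ngram = TfidfVectorizer(analyzer="word", ngram_range=(2, 3), max_features=5000)
tfidf_char  = TfidfVectorizer(analyzer="char", ngram_range=(2, 3), max_features=5000)

X_word  = tfidf_word.fit_transform(df["clean_description"])
X_ngram = tfidf_ngram.fit_transform(df["clean_description"])
X_char  = tfidf_char.fit_transform(df["clean_description"])
```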
Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They are lower-dimensional, dense vector representations of words. While TF-IDF can also be considered an embedding, it consists of sparse vectors which do not add much meaning to the representation. GloVe from Stanford and Word2vec from Google are popular word embeddings used in NLP. GloVe stands for Global Vectors for Word Representation:
• It is an unsupervised learning algorithm for obtaining vector representations of words.
• It captures word-word co-occurrences across the entire corpus well.
• It suppresses rare co-occurrences and prevents frequent co-occurrences from dominating.
• GloVe is often more accurate than Word2vec.

In this project, GloVe pre-trained weights are used as the embedding layer for the different neural-network-based models.

Model Evaluation
Let's try a couple of classification algorithms for our model building.

Classification Algorithms
The classification step usually involves a statistical model like Naïve Bayes, Logistic Regression, Support Vector Machines, or Neural Networks:
• Naïve Bayes: a family of probabilistic algorithms that uses Bayes' Theorem to predict the category of a text.
• Logistic Regression: a very well-known statistical algorithm used to predict the probability of an outcome (Y) given a set of features (X).
• Support Vector Machines: a non-probabilistic model which represents text examples as points in a multidimensional space. Examples of different categories are mapped to distinct regions within that space; new texts are then assigned a category based on their similarity to existing texts and the regions they map to.
• Deep Learning: a diverse set of algorithms that attempt to mimic the human brain by employing artificial neural networks to process data.

Traditional ML Models
Machine learning is an algorithm or model that learns patterns in data and then predicts similar patterns in new data. For example, to classify children's books, instead of setting up precise rules for what constitutes a children's book, developers can feed the computer hundreds of examples of children's books. The computer finds the patterns in these books and uses those patterns to identify future books in that category. Essentially, ML is a subset of artificial intelligence that enables computers to learn without being explicitly programmed with predefined rules. It focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. This predictive ability, together with the computer's ability to process massive amounts of data, enables ML to handle complex business situations with efficiency and accuracy. Let's look at the different machine learning models used for this particular problem statement.

Naïve Bayes
This is a classification technique based on Bayes' Theorem, with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of other features, all of these properties independently contribute to the probability that the fruit is an apple; that is why it is known as 'naive'. A Naive Bayes model is easy to build and particularly useful for very large datasets. Along with its simplicity, Naive Bayes is known to sometimes outperform even highly sophisticated classification methods. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):

P(c|x) = P(x|c) · P(c) / P(x)
• P(c|x) is the posterior probability of class c (target) given predictor x (attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood: the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.

A code sketch of Naive Bayes for this particular problem is shown below.
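A minimal sketch, assuming scikit-learn and the word-level TF-IDF features built earlier (the train/test split shown is illustrative):

```python
# Multinomial Naive Bayes on TF-IDF features.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

y = df["Assignment group"]
X_train, X_test, y_train, y_test = train_test_split(
    X_word, y, test_size=0.2, random_state=42, stratify=y)

nb = MultinomialNB()
nb.fit(X_train, y_train)
print(classification_report(y_test, nb.predict(X_test)))
```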
Results: the accuracy and precision achieved were too low to be considered.

Support Vector Machine
A Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges, though it is mostly used for classification. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyper-plane that best differentiates the classes. Support vectors are simply the coordinates of individual observations; the SVM classifier is the frontier (hyper-plane/line) that best segregates the classes.

A code sketch used to implement the SVM is shown below.
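A sketch reusing the split from the Naive Bayes example; LinearSVC is a common choice for sparse TF-IDF features, though the exact SVM variant used in the project may differ:

```python
# Linear SVM on the same TF-IDF features.
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

svm = LinearSVC(C=1.0)
svm.fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))
```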
Again, the results were not good enough to be considered.

Random Forest
Random forest is a bootstrapping-style (bagging) algorithm built on decision-tree (CART) models. Say we have 1000 observations in the complete population with 10 variables. Random forest builds multiple CART models with different samples and different initial variables. For instance, it may take a random sample of 100 observations and 5 randomly chosen initial variables to build one CART model, repeat the process (say) 10 times, and then make a final prediction for each observation. The final prediction is a function of the individual predictions; it can simply be their mean, or a majority vote for classification. (Illustrative diagram omitted.)

A code sketch for the Random Forest model is shown below.
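A sketch of the Random Forest model on the same features (hyperparameters are illustrative):

```python
# Random Forest: an ensemble of decision trees trained on bootstrapped samples.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))
```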
The results with traditional ML models are not encouraging, so let's try deep learning models and compare performance.

Deep Learning
Deep learning text classification architectures generally consist of the following components connected in sequence. Let's go through them one by one, with the code developed for our automated ticket classification problem.

Embedding Layer
A word embedding is a representation of text where words with the same meaning have a similar representation. In other words, it represents words in a coordinate system where related words, based on a corpus of relationships, are placed closer together. In deep learning frameworks such as TensorFlow and Keras, this part is usually handled by an embedding layer which stores a lookup table mapping words, represented by numeric indexes, to their dense vector representations. Two popular methods of learning word embeddings from text are:

Word2Vec
Word2Vec is the first neural embedding model (or at least the first to gain popularity, in 2013) and is still one of the most widely used by researchers. Doc2Vec, its descendant, is the most popular model for paragraph representation and was inspired by Word2Vec.

GloVe (Global Vectors for Word Representation)
The global word representation approach captures the meaning of a word embedding using the structure of the whole observed corpus; word frequency and co-occurrence counts are the main measures on which most unsupervised algorithms are based. The GloVe model trains on global co-occurrence counts of words and makes sufficient use of statistics by minimising least-squares error, producing a word vector space with meaningful substructure. Such an outline sufficiently preserves word similarities in terms of vector distance.

The code below loads the GloVe word embeddings.
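A sketch of loading the GloVe vectors and building the embedding matrix; it assumes glove.6B.200d.txt is available locally and that `tokenizer` is a fitted Keras Tokenizer:

```python
# Parse the GloVe file into a word -> vector dict, then build the embedding matrix.
import numpy as np

EMBEDDING_DIM = 200
embeddings_index = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:          # words not in GloVe stay all-zero
        embedding_matrix[i] = vector
```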
We picked "glove.6B.200d.txt" because it contains GloVe embeddings trained on a corpus of 6 billion tokens, with each word represented as a 200-dimensional vector. Since the embedding size is defined as 200, we need the 200-dimensional GloVe file; in general, the embedding file should match the embedding size defined for the problem.

Deep Network
The deep network takes the sequence of embedding vectors as input and converts it into a compressed representation that effectively captures the information in the sequence of words in the text. The deep network part is usually an RNN or one of its variants, such as LSTM/GRU. Dropout is added to counteract the tendency to overfit, a very common problem with RNN-based networks.
LSTM Networks
Long Short-Term Memory networks, usually just called "LSTMs", are a special kind of RNN capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularised by many people in subsequent work. They work tremendously well on a large variety of problems and are now widely used. LSTMs are explicitly designed to avoid the long-term dependency problem; remembering information for long periods of time is practically their default behaviour, not something they struggle to learn.

All recurrent neural networks have the form of a chain of repeating modules of neural network. LSTMs also have this chain-like structure, but the repeating module is different: instead of a single neural network layer, there are four, interacting in a very special way. In the usual diagram of an LSTM cell, each line carries an entire vector from the output of one node to the inputs of others; the pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.
LSTM Model
We first import the necessary libraries for building the LSTM model. To use Keras on text data, we pre-process it: the text is tokenized, padded to a fixed length, and embedded into 200-dimensional vectors using the GloVe word embeddings. In the code below, we use a Bidirectional LSTM, an extension of traditional LSTMs that can improve model performance on sequence classification problems.
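A sketch of the Bi-Directional LSTM classifier in Keras, assuming the embedding matrix from the GloVe sketch and inputs padded to MAX_LEN (layer sizes are illustrative):

```python
# Embedding (frozen GloVe weights) -> Bi-LSTM -> dense -> softmax over the 9 groups.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

MAX_LEN = 200
NUM_CLASSES = 9

model = Sequential([
    Embedding(vocab_size, EMBEDDING_DIM, weights=[embedding_matrix],
              input_length=MAX_LEN, trainable=False),
    Bidirectional(LSTM(128)),
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
model.summary()
```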
Bidirectional recurrent neural networks (RNNs) are really just two independent RNNs put together. The input sequence is fed in normal time order to one network and in reverse time order to the other. The outputs of the two networks are usually concatenated at each time step, though other options such as summation exist. This structure gives the network both backward and forward information about the sequence at every time step.

Now let's understand the fully connected layer and why we need it.

Fully Connected Layer
The fully connected layer takes the deep representation from the RNN/LSTM/GRU and transforms it into the final output classes or class scores. This component comprises fully connected layers along with batch normalisation and, optionally, dropout layers for regularisation.

Why do we need dense layers in this model? Dense (non-linear) layers have the same form as linear layers, wx + b, but the result is passed through a non-linear function called an activation function. The "deep" in deep learning comes from the increased complexity that results from stacking several consecutive (hidden) non-linear layers. (A chart of the best-known activation functions is omitted here.)
In the Bi-LSTM sketch above, we used 64 units in the first dense layer with ReLU as the activation function, since our outputs always reside on the positive side and this is a multi-layer neural network.

Output Layer
Depending on the problem at hand, this layer uses either sigmoid for binary classification or softmax for binary as well as multi-class classification output.
Finally, we used 9 units in the second (output) dense layer with softmax as the activation function; the number of units is 9 because, after data pre-processing, our target has 9 classification groups. With that, the model building is complete; now let's look at the performance of the model on the train and test data.

Model Scoring & Selection
Our model scoring and selection is based on the standard evaluation metrics Accuracy, Precision, Recall and F1 score, defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

where:
• TP is the number of true positive classifications: records with actual label A that have been correctly classified ("predicted") as label A.
• TN is the number of true negative classifications: records with an actual label not equal to A that have been correctly classified as not belonging to label A.
• FP is the number of false positive classifications: records with an actual label other than A that have been incorrectly classified as belonging to category A.
• FN is the number of false negatives: records with label A that have been incorrectly classified as not belonging to category A.

Now let's look at the performance of our model, including the per-group precision and recall. Before that, let's recap what precision, recall and the related metrics mean.
The combination of predicted and actual classifications is known as the "confusion matrix". Against the background of these definitions, the various evaluation metrics provide the following insights:
• Accuracy: the proportion of the total number of model predictions that were correct.
• Precision (also called positive predictive value): the proportion of correct predictions relative to all predictions for that specific class.
• Recall (also called true positive rate, TPR, hit rate or sensitivity): the proportion of true positives relative to all actual positives.
• False positive rate (also called false alarm rate, FPR): the proportion of false positives relative to all actual negatives.
• F1: the (more robust) harmonic mean of precision and recall.

The classification report and confusion matrix can be produced as sketched below.
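A sketch of producing both for the validation data; X_val and y_val are assumed to be padded sequences and integer class labels:

```python
# Per-class precision/recall/F1 and the confusion matrix for the Keras model.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

y_pred = np.argmax(model.predict(X_val), axis=1)
print(classification_report(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
```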
A Bi-Directional LSTM attempted on the dataset, after aggregating all groups with fewer than 200 tickets, gave an accuracy of 71% and a precision of 71%. Now let's try a CNN model and examine its performance.

CNN (Convolutional Neural Network)
A Convolutional Neural Network (ConvNet/CNN) is a deep learning algorithm which can take an input image, assign importance (learnable weights and biases) to various aspects/objects in the image, and differentiate one from another. The pre-processing required by a ConvNet is much lower than for other classification algorithms: while in primitive methods filters are hand-engineered, with enough training ConvNets can learn these filters/characteristics. The architecture of a ConvNet is analogous to the connectivity pattern of neurons in the human brain and was inspired by the organisation of the visual cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field; a collection of such fields overlaps to cover the entire visual area. Convolution and max-pooling layers are the major parts of a CNN network. Let's understand them one by one.
Convolutional Layer
Convolutional layers are the major building blocks of convolutional neural networks. A convolution is the simple application of a filter to an input that results in an activation. Repeated application of the same filter to an input produces a map of activations called a feature map, indicating the locations and strength of a detected feature in the input.

Max Pooling Layer
Maximum pooling, or max pooling, is a pooling operation that calculates the maximum (largest) value in each patch of each feature map. Applying it to the output feature map of a convolution keeps the strongest responses while reducing the size of the representation.

A sketch of the CNN-based text classifier attempted in this project is shown below.
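A sketch of a CNN text classifier along these lines: a 1-D convolution over the embedded sequence followed by max pooling; the exact layers used in the project may differ:

```python
# Embedding -> Conv1D -> max pooling -> dense -> softmax.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     GlobalMaxPooling1D, Dense, Dropout)

cnn = Sequential([
    Embedding(vocab_size, EMBEDDING_DIM, weights=[embedding_matrix],
              input_length=MAX_LEN, trainable=False),
    Conv1D(filters=128, kernel_size=5, activation="relu"),
    MaxPooling1D(pool_size=2),
    Conv1D(filters=64, kernel_size=5, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(64, activation="relu"),
    Dropout(0.3),
    Dense(NUM_CLASSES, activation="softmax"),
])
cnn.compile(loss="sparse_categorical_crossentropy",
            optimizer="adam", metrics=["accuracy"])
```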
In our experiments, the CNN's validation accuracy was only about 60%, so the CNN does not perform well on this classification problem. Let's summarise our observations so far and the next course of action.

Observations & Next Steps
• Traditional ML models with TF-IDF embeddings did not give encouraging results.
• Multiple deep learning sequential models with GloVe embeddings were attempted and their results compared to arrive at the best model. The two best models are highlighted below.
• A Bi-Directional LSTM attempted on the dataset, after aggregating all groups with fewer than 200 tickets, gave an accuracy of 71% and a precision of 71%.
• The accuracy and precision improved further, to 73% and 76% respectively, when an ensemble of 7 Bi-LSTMs was built (a sketch of the ensembling step follows this list).
• In both of the above attempts, the precision for Group_0 was even higher: 82% and 86% respectively.
• Since Group_0 constitutes close to 50% of the total population, with 82% precision we are likely to achieve automation of close to 40% with Group_0 alone.
• With an average precision of 76%, we expect to save around 269 minutes of reading time (for reading the 8500 manually raised tickets), considering each document takes about 2.5 seconds to read.
• The team will continue experimenting to improve precision further, and with it the level of automation.
• Considering that this is a highly imbalanced dataset, there is always a trade-off between automating the routing of tickets to more groups and automating routing more accurately to fewer groups.
• The ensemble model gives the best result and is recommended for installation in the production environment.
• Future work includes applying SMOTE to balance the data and re-running the models to see whether the results improve.
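A minimal sketch of the ensembling step referenced above, assuming a hypothetical build_bilstm() factory that returns a compiled model like the one shown earlier:

```python
# Train 7 Bi-LSTMs and average their softmax outputs before taking the argmax.
import numpy as np

models = [build_bilstm() for _ in range(7)]
for m in models:
    m.fit(X_train_seq, y_train_seq, epochs=5, batch_size=64, verbose=0)  # names illustrative

probs = np.mean([m.predict(X_val) for m in models], axis=0)  # average class probabilities
y_pred = np.argmax(probs, axis=1)
```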
Comparison to Benchmark
With the models we have built, we have been able to achieve more than 75% accuracy and precision. This means that 75% of our predictions are accurate, which is a massive benefit to the company in reducing manual routing of tickets. We started the project with the assumption that if we could automate even 50% of the task, the company would be much better off. With the application of AI we have gone well past that: we are achieving more than 75% automation. Another important point is that for Group_0, which constitutes close to 50% of total tickets, we have achieved more than 85% precision. Putting all this together, we can confidently say that we have exceeded the benchmark laid out at the outset.

Visualisations
The relevant visualisations supporting these insights (the assignment-group count plots, the word clouds before and after stop-word removal, the top-trigram plot, and the polarity and readability views) appear in the EDA and pre-processing sections above.

Implications
Classification of text is a very common problem in the corporate world. Traditionally, assigning tickets, customer queries and the like to the appropriate team was largely done manually, which is cumbersome and non-value-added. With the advancements in AI, corporations should look to NLP solutions to automate these tasks with fairly good accuracy; this automates redundant, non-value-added activities and releases staff for more complex work. In this project, we achieved automated classification of tickets to the appropriate teams with an accuracy of 73% and a precision of 75%. Considering all the deep learning models built for this classification, we have a confidence interval of 70% to 73% for accuracy. We are achieving close to 86% precision on Group_0, which helps us automate more than 40% of the total tickets.
The bottom line for this project is that we may sacrifice automation of rare ticket assignments, but we will automate the assignment of the more frequent issues, which is still a great help to the company. Since Group_0 constitutes close to 50% of the total population, with 86% precision we are likely to achieve automation of close to 40% with Group_0 alone. With an average precision of 76%, we expect to save around 269 minutes of reading time for the 8500 manually raised tickets, considering each document takes about 2.5 seconds to read (8500 × 2.5 s ≈ 354 minutes of total reading; 76% of that is roughly 269 minutes).

Limitations
The dataset had only 8500 observations and was highly imbalanced. The target variable had 74 classes, with a single class constituting about 50% of the total observations; 55 groups each had fewer observations than 1% of the total. This is essentially a problem of high cardinality, where the majority of classes have too few observations for any machine learning algorithm to learn their patterns and classify unseen data accurately. Many ticket assignments have a frequency of less than 100, and we end up with only 9 classes when we aggregate all groups with fewer than 200 assignments. This was our best approach given the limitation, as it gives any machine learning algorithm a better chance of learning the underlying patterns in the data. The category containing all the aggregated tickets will still need to be reviewed manually, but ticket assignment for the remaining categories can potentially be automated. However, all tickets tagged to the common group should be monitored, and information collected, until there are enough observations for all the underlying classes. Once there are enough observations for every class, we would recommend that the company revisit the solution for a more all-inclusive one.

Closing Reflections
"Not everything that counts can be counted, and not everything that can be counted counts" - Albert Einstein

Natural Language Processing (NLP) is a subfield of artificial intelligence that helps computers understand human language. Using NLP, machines can make sense of unstructured data so that we can gain valuable insights. We use explicit and implicit rules to convey an idea during conversation, and most of the words we use flow out simply because of the probability that a given word will come after or before another word. This probability can be reverse-engineered from a corpus of things we say or write. Deep learning and NLP algorithms help us achieve this, and with them we can train machines to understand text. This helps automate many manual tasks, such as reading texts and classifying them to a theme.
Companies use resources to read complaints received from customers and route them to the specific teams that can address them. We may have to read reported issues and classify them to root causes like accuracy, completeness or timeliness for further action. Companies may have to read through customer reviews to identify dissatisfied customers.

In this project, we learnt to apply various NLP techniques to generate insights from text data, identify the problems associated with the dataset, and then leverage multiple NLP and deep learning techniques to solve a problem. We calculated the readability standard of the texts, the reading time per description, and the polarity of each description, all of which were useful insights. We then converted the plain text to a sequential numeric format using an embedding (with pre-computed models) so that it could be fed to our deep learning models.

This project gave us the opportunity to experiment with multiple approaches and algorithms to arrive at the final solution, which gives the best results given the limitations. The use of Bi-Directional LSTM and Transformer layers gave us the best results, and reiterated the common-sense notion that words with context convey more than words by themselves. We saw that hybrid models, such as combinations of CNN and LSTM/GRU/Bi-LSTM, can also be applied and can sometimes give good results, though not in this case. Lastly, the ensemble of Bi-LSTMs gave us the best results, indicating that synergy among resources can give better results than any single good resource.